diff --git a/rfcs/THV-0038-session-scoped-client-lifecycle.md b/rfcs/THV-0038-session-scoped-client-lifecycle.md new file mode 100644 index 0000000..f042b67 --- /dev/null +++ b/rfcs/THV-0038-session-scoped-client-lifecycle.md @@ -0,0 +1,536 @@ +# THV-0038: Session-Scoped MCP Client Lifecycle Management + +- **Status**: Draft +- **Author(s)**: @yrobla, @jerm-dro +- **Created**: 2026-02-04 +- **Last Updated**: 2026-02-04 +- **Target Repository**: `toolhive` +- **Related Issues**: [toolhive#3062](https://github.com/stacklok/toolhive/issues/3062) + +## Summary + +Move MCP backend client lifecycle from per-request creation/destruction to session-scoped management. Clients will be created once during session initialization, reused throughout the session lifetime, and closed during session cleanup. This simplifies the architecture and ensures consistent backend state preservation. + +## Problem Statement + +### Current Limitations + +The current `httpBackendClient` implementation creates and closes MCP clients on every request. Each method (`CallTool`, `ReadResource`, `GetPrompt`, `ListCapabilities`) follows the same pattern: + +1. Create a new client via `clientFactory()` +2. Defer client closure with `c.Close()` +3. Initialize the client with MCP handshake +4. Perform the requested operation +5. Close the client when the function returns + +**Problems with Per-Request Client Lifecycle**: + +1. **Connection Overhead**: Every tool call incurs TCP handshake, TLS negotiation, and MCP protocol initialization overhead. + +2. **State Loss**: MCP backends that maintain stateful contexts (Playwright browser sessions, database transactions, conversation history) lose state between requests. While sessions preserve routing information, the underlying connections are recreated each time. + +3. **Redundant Initialization**: Capability discovery during session creation (`AfterInitialize` hook) establishes which backends exist, but clients are created fresh for every subsequent request. + +4. **Authentication Overhead**: Each client creation re-resolves authentication strategies and validates credentials, even though this information doesn't change within a session. + +5. **Resource Waste**: Creating and destroying clients repeatedly wastes CPU cycles and network bandwidth, especially problematic for high-throughput scenarios. + +### Affected Parties + +- **vMCP Server Performance**: Every tool call creates/closes connections, multiplying latency by the number of requests +- **Backend Servers**: Repeated connection churn increases load on backend MCP servers +- **Stateful Backends**: Backends relying on persistent state (Playwright, databases) lose context between calls +- **High-Throughput Scenarios**: Connection overhead becomes a bottleneck for applications making many tool calls + +### Why This Matters + +MCP backends often maintain stateful contexts (browser sessions, database connections, conversation history). The current per-request pattern means: +- A browser automation workflow must navigate to the same page repeatedly +- Database transactions cannot span multiple tool calls +- Conversation context in AI assistants is lost between interactions + +Sessions already exist and have defined lifetimes (TTL-based expiration). Aligning client lifecycle with session lifecycle is a natural fit that simplifies the architecture while enabling better state preservation. + +## Goals + +1. **Align Lifecycle with Sessions**: Move client lifecycle from per-request to session-scoped +2. **Eager Initialization**: Create all backend clients during session initialization alongside capability discovery +3. **Enable Stateful Workflows**: Allow backends to maintain state (browser contexts, DB transactions) across multiple tool calls within a session +4. **Reduce Overhead**: Eliminate redundant connection creation, TLS negotiation, and MCP initialization +5. **Simplify Code**: Remove repeated client creation/closure boilerplate from every method + +## Non-Goals + +- **Connection Pooling Within Clients**: Individual MCP clients may internally pool connections, but that's outside this RFC's scope +- **Multi-Session Client Sharing**: Clients remain session-scoped and are not shared across sessions +- **Lazy Backend Discovery**: Backend discovery remains eager (current behavior) +- **Client Versioning**: Handling MCP protocol version negotiation is out of scope + +## Proposed Solution + +### High-Level Design + +``` +┌─────────────────────────────────────────────────────────────┐ +│ Session Manager │ +│ │ +│ ┌────────────────────────────────────────────────────┐ │ +│ │ VMCPSession (TTL-scoped) │ │ +│ │ │ │ +│ │ - Session ID │ │ +│ │ - Routing Table │ │ +│ │ - Tools List │ │ +│ │ - Backend Clients Map[WorkloadID]*client.Client ◄├───┼─── Created during AfterInitialize +│ │ │ │ +│ │ Lifecycle: │ │ +│ │ 1. Create session (Generate) │ │ +│ │ 2. Initialize clients (AfterInitialize) │ │ +│ │ 3. Use clients (CallTool/ReadResource/...) │ │ +│ │ 4. Close clients (TTL expiry/Delete/Stop) │ │ +│ └────────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ + ┌────────────────────────┐ + │ Backend Client │ + │ (mcp-go/client.Client) │ + │ │ + │ - Initialized once │ + │ - Reused per session │ + │ - Closed with session │ + └────────────────────────┘ +``` + +### Terminology Clarification + +This RFC uses the following terms consistently: + +- **Initialize request**: HTTP request to `/mcp/initialize` endpoint (no session ID header) +- **Discovery middleware**: HTTP middleware that handles initialize requests and performs backend capability discovery +- **Session initialization**: The complete process from initialize request → discovery → session creation → client storage +- **Backend client**: An MCP client connected to a specific backend workload (instance of `mcp-go/client.Client`) +- **Session-scoped**: Resources that live for the duration of a session (created on init, closed on expiration) + +**Key Changes**: +- Add `backendClients` map to `VMCPSession` for storing initialized clients +- Modify `AfterInitialize` hook to create and initialize clients alongside capability discovery +- Change `httpBackendClient` methods to retrieve clients from session instead of creating them +- Add `Close()` method to `VMCPSession` to close all clients on session expiration +- Remove per-request `defer c.Close()` pattern from all client methods + +### Detailed Design + +#### 1. VMCPSession Structure Changes + +**Add Backend Clients Map**: +- Add `backendClients map[string]*client.Client` field to `VMCPSession` struct +- This map stores initialized MCP clients keyed by workload ID +- Protected by existing mutex for thread-safe access + +**New Methods**: +- `GetBackendClient(workloadID string)`: Retrieves an initialized client for the given backend +- `SetBackendClients(clients map[string]*client.Client)`: Stores the client map during session initialization +- `Close()`: Iterates through all clients and closes them, returning any errors encountered + +**Location**: `pkg/vmcp/session/vmcp_session.go` + +#### 2. Client Initialization During Session Setup + +**Extend Discovery Middleware**: + +The discovery middleware already performs capability discovery for each backend during initialize requests. Restructure this to create clients once and reuse them: + +- During initialize request handling, create and initialize a client for each backend +- Use the same client to perform capability discovery (`ListCapabilities`) +- Store initialized clients in a map keyed by workload ID +- For failed initializations: log warning, skip backend, mark unhealthy, and continue (partial initialization is acceptable) +- Pass the client map through request context to session storage +- Store clients in the session via `SetBackendClients()` +- Subsequent requests to backends not in the map receive "no client found" errors + +**Key Improvement**: By reusing the discovery client as the session-scoped client, we avoid redundant connection setup. The client map flows from discovery middleware → request context → session storage, bridging the gap between client creation and session registration. + +**Key Improvement**: By reusing the discovery client as the session-scoped client, we avoid redundant connection setup and better meet the stated goal of reducing overhead. + +**Error Handling**: +- Individual client initialization failures don't fail session creation +- Failed backends are marked unhealthy via existing health monitoring +- Subsequent requests to failed backends return "no client" errors + +**Location**: `pkg/vmcp/discovery/middleware.go` + +#### 3. Refactored BackendClient Implementation + +**Add Session Manager Reference**: +- Add `sessionManager` field to `httpBackendClient` struct to enable session lookup +- Pass session manager to constructor during server initialization + +**New Helper Methods**: +- `getSessionClient(workloadID)`: Retrieves pre-initialized client from session + - Extracts session ID from request context + - Looks up VMCPSession from manager + - Returns the client for the requested backend +- `createAndInitializeClient(target)`: Creates and initializes a client + - Uses existing `clientFactory` to create the client + - Immediately initializes with MCP handshake + - Returns initialized client ready for use + - Called only during session setup, not per-request + +**Refactor Request Methods**: +- `CallTool`, `ReadResource`, `GetPrompt`, `ListCapabilities`: Replace client creation pattern with: + - Call `getSessionClient()` to retrieve pre-initialized client + - Remove `defer c.Close()` (client managed by session) + - Remove `initializeClient()` call (already initialized) + - Use client directly for the operation +- **Note**: During session initialization in the discovery middleware, `createAndInitializeClient()` is called once per backend, and that same client is used for capability discovery and registered as the session-scoped client, avoiding redundant connection setup + +**Location**: `pkg/vmcp/client/client.go` + +#### 4. Session Cleanup Integration + +**Current State**: The `Session` interface does not currently have a `Close()` method. This RFC will add it. + +**Required Changes**: +1. Add `Close() error` method to the `Session` interface in `pkg/transport/session/manager.go` +2. Implement `Close()` in `VMCPSession` to close all backend clients +3. Update `LocalStorage` to call `session.Close()` before deletion in `Delete()` and `DeleteExpired()` methods + +**Reference**: The storage layer was refactored to use a pluggable interface in [PR #1989](https://github.com/stacklok/toolhive/pull/1989), but session cleanup hooks were not added at that time. + +**VMCPSession.Close() Implementation**: +- Iterate through all clients in the `backendClients` map +- Call `Close()` on each client +- Collect any errors encountered +- Return combined errors if any failures occurred + +**Cleanup Triggers**: + +By adding `Close()` to the `Session` interface and calling it from `LocalStorage`, clients will be closed when: +- **TTL Expiration**: Session manager's cleanup worker calls `DeleteExpired()`, which will call `session.Close()` on expired sessions +- **Explicit Deletion**: Manual session deletion via `Delete()` will call `session.Close()` before removing the session +- **Manager Shutdown**: `Manager.Stop()` will call `Close()` on all remaining sessions + +The session cleanup infrastructure already exists (from [PR #1989](https://github.com/stacklok/toolhive/pull/1989)); we're adding the `Close()` hook to integrate with it. + +#### 5. Error Handling + +**Connection Failures During Session Creation**: +- Log warning and skip that backend (do not add to client map) +- Backend marked unhealthy via existing health monitor +- Session creation succeeds with partial backend connectivity +- Subsequent requests to failed backends return "no client found for backend X" error + +**Connection Failures During Tool Calls**: +- Return error to client (existing behavior) +- Health monitor marks backend unhealthy (existing behavior) +- Client initialization attempts continue for unhealthy backends (health-based filtering is future work, see Phase 4) + +**Client Already Closed**: +- If session expired and client is closed, return clear error +- Client should refresh session via `/initialize` endpoint + +## Security Considerations + +### Threat Model + +No new security boundaries are introduced. This is a refactoring of existing client lifecycle management. + +### Authentication & Authorization + +**No Changes**: Authentication flow remains identical: +1. Client → vMCP: Validate incoming token (existing) +2. vMCP → Backend: Use outgoing auth configured per backend (existing) +3. Backend → External APIs: API-level validation (existing) + +The only difference is **when** clients are created (session init vs first use), not **how** they authenticate. + +### Data Protection + +**Session Isolation**: Each session has its own client map. No cross-session data leakage risk. + +**Connection Security**: TLS configuration and certificate validation remain unchanged. + +### Concurrency & Resource Safety + +**Client Usage During Cleanup**: +- Race condition exists: request may use client while session is being closed +- Mitigation: Session `Touch()` extends TTL on every request, reducing race window +- MCP client library handles `Close()` on active connections gracefully (returns errors) +- Handlers should catch and handle "client closed" errors appropriately +- Future Enhancement: Add reference counting to delay `Close()` until all in-flight requests complete + +### Secrets Management + +**No Changes**: Outgoing auth secrets are still retrieved via `OutgoingAuthRegistry` during client creation. The timing changes (session init vs first request) but the mechanism is identical. + +### Audit Logging + +**New Log Event**: Add audit log entry for client initialization during session setup: +```json +{ + "event": "backend_client_initialized", + "session_id": "sess-123", + "workload_id": "github-mcp", + "timestamp": "2026-02-04T10:30:00Z" +} +``` + +**Existing Events Unchanged**: Tool call logs, authentication logs, and session lifecycle logs remain the same. + +## Alternatives Considered + +### Alternative 1: Keep Per-Request Pattern with Connection Pooling + +**Approach**: Continue creating/closing clients per request but add a connection pool underneath to reuse TCP connections. + +**Pros**: +- Minimal changes to existing code +- Reduces TCP handshake overhead + +**Cons**: +- Doesn't address MCP protocol initialization overhead (still happens per request) +- Doesn't solve state preservation problem (each request still gets fresh MCP client) +- Authentication still resolved and validated per request +- Adds complexity at wrong layer + +**Decision**: Rejected. This addresses symptoms but not the root cause. + +### Alternative 2: Lazy Client Creation on First Use + +**Approach**: Create clients on first tool call to a backend, then store in session for reuse. + +**Pros**: +- Avoids creating clients for unused backends +- Delays initialization until needed + +**Cons**: +- First tool call to each backend still has initialization latency +- More complex state management (need to track which clients exist) +- Doesn't leverage existing capability discovery phase +- Inconsistent performance (first vs subsequent calls) + +**Decision**: Rejected. Complexity outweighs benefits. Capability discovery already knows which backends exist. + +### Alternative 3: Global Client Cache Across Sessions + +**Approach**: Share clients across all sessions using a global cache, keyed by backend + auth identity. + +**Pros**: +- Maximum connection reuse +- Lowest initialization overhead + +**Cons**: +- **State Pollution**: Backend state (Playwright contexts, DB transactions) would leak across sessions +- **Security Risk**: Potential for cross-session data exposure if keying is incorrect +- **Complexity**: Requires complex cache eviction, client sanitization, and identity tracking +- **Violates Session Isolation**: Sessions should be independent + +**Decision**: Rejected. Session isolation is a core requirement for vMCP. The security and complexity risks outweigh performance gains. + +## Compatibility + +### Backward Compatibility + +**External APIs**: No breaking changes. The `/vmcp/v1/*` HTTP API remains unchanged. Clients see identical behavior. + +**Internal APIs**: + +⚠️ **Breaking Change**: Adding `Close() error` to the `Session` interface + +- **Affected Implementations**: All types implementing `pkg/transport/session.Session` must add a `Close()` method: + - `ProxySession` - Add no-op `Close()` (no resources to clean up) + - `StreamableSession` - Add no-op `Close()` (no resources to clean up) + - `SSESession` - Add no-op `Close()` (no resources to clean up) + - `VMCPSession` - Implement `Close()` to close backend clients (this RFC) +- **Migration Path**: All session implementations will be updated atomically in Phase 1 to implement `Close()`. Storage and manager components will be updated to call the new method. No external packages are affected. +- **Impact**: Internal to ToolHive; no external packages implement this interface +- `BackendClient` interface: No signature changes, but behavior changes from "create on demand" to "retrieve from session" +- `httpBackendClient` constructor: Requires session manager parameter (added field) + +### Forward Compatibility + +Future enhancements are easier with this design: + +- **Connection Warming**: Pre-connect to backends during idle time +- **Health-Based Initialization**: Skip unhealthy backends during session setup +- **Client Version Negotiation**: Perform protocol version negotiation once per session + +## Implementation Plan + +### Phase 1: Session Client Storage and Cleanup Hooks + +Add client storage capabilities and cleanup infrastructure: + +**Session Interface Changes**: +- Add `Close() error` method to `Session` interface in `pkg/transport/session/manager.go` +- Implement `Close()` in all session types: + - `ProxySession` - No-op implementation (returns nil) + - `StreamableSession` - No-op implementation (returns nil) + - `SSESession` - No-op implementation (returns nil) + - `VMCPSession` - Full implementation (closes backend clients) +- Update `LocalStorage.Delete()` to call `session.Close()` before deletion +- Update `LocalStorage.DeleteExpired()` to call `session.Close()` on each expired session before deletion +- Update `Manager.Stop()` to close all remaining sessions before closing storage + +**VMCPSession Implementation**: +- Add `backendClients map[string]*client.Client` field to `VMCPSession` struct +- Implement `GetBackendClient(workloadID string)` method for retrieving clients +- Implement `SetBackendClients(clients map[string]*client.Client)` for storing client map +- Implement `Close()` method to iterate through all clients, close them, and collect errors + +**Files to modify**: +- `pkg/transport/session/manager.go` (add Close() to interface, update Manager.Stop()) +- `pkg/transport/session/storage_local.go` (call session.Close() before deletion) +- `pkg/transport/session/proxy_session.go` (add no-op Close()) +- `pkg/transport/session/streamable_session.go` (add no-op Close()) +- `pkg/transport/session/sse_session.go` (add no-op Close()) +- `pkg/vmcp/session/vmcp_session.go` (implement Close() with client cleanup and add client storage) + +**Testing**: +- Unit tests for client storage/retrieval +- Unit tests for Close() error handling +- Tests verifying LocalStorage calls session.Close() before deletion +- Tests for cleanup on TTL expiration and explicit deletion + +### Phase 2: Client Initialization at Session Creation + +Restructure the discovery middleware to create clients once and reuse them: + +- Modify discovery middleware's initialize request handler +- Create and initialize clients for all backends during discovery +- Use the same clients to perform capability discovery +- Pass client map through request context to session storage +- Store clients in session via `SetBackendClients()` +- Mark failed backends as unhealthy via health monitor +- Log warnings for failed initializations but continue (partial initialization acceptable) + +**Files to modify**: +- `pkg/vmcp/discovery/middleware.go` + +**Testing**: +- Integration tests verifying clients created during session init +- Tests for partial backend initialization (some backends fail) +- Tests confirming failed backends don't block session creation + +### Phase 3: Refactor Backend Client Methods + +Modify httpBackendClient to retrieve clients from session instead of creating per-request: + +- Add `sessionManager` field to `httpBackendClient` struct +- Pass session manager to `NewHTTPBackendClient()` constructor +- Create `getSessionClient(ctx, workloadID)` helper method +- Create `createAndInitializeClient(ctx, target)` method for session init use (called by discovery middleware) +- Refactor `CallTool`, `ReadResource`, `GetPrompt`, `ListCapabilities` to use `getSessionClient()` instead of `clientFactory()` +- Remove `defer c.Close()` calls from all methods +- Remove `initializeClient()` calls from all methods + +**Files to modify**: +- `pkg/vmcp/client/client.go` + +**Testing**: +- Integration tests confirming clients reused across multiple tool calls +- Tests verifying no client leaks (clients closed on session expiration) +- Tests for "no client found" errors when backend unavailable + +### Phase 4: Observability & Monitoring + +Add observability for client lifecycle: + +- Add audit log events for client initialization during session setup +- Add metrics for: + - Client initialization success/failure rates per backend + - Client initialization latency + - Client reuse patterns within sessions + - Client lifecycle duration +- Add distributed traces spanning session initialization through first tool call +- Consider health-based filtering: skip unhealthy backends during initialization + +**Files to create/modify**: +- Add logging in discovery middleware +- Add metrics in client initialization path + +**Testing**: +- Verify audit logs contain session ID and backend ID +- Verify metrics show client init success/failure rates +- Verify traces show reduced latency for subsequent tool calls + +### Dependencies + +- Existing transport session cleanup infrastructure from [PR #1989](https://github.com/stacklok/toolhive/pull/1989) (provides deletion hooks where we'll add `Close()` calls) +- Existing discovery middleware (`AfterInitialize` hook) +- Existing health monitoring system + +### Testing Strategy + +**Unit Tests**: +- Session client map operations (get, set, iterate) +- Client closure on session cleanup with error handling +- Error handling for missing/uninitialized clients + +**Integration Tests**: +- Session creation initializes clients for all configured backends +- Multiple tool calls within same session reuse clients +- Session expiration triggers client closure +- Partial initialization doesn't fail session creation +- Failed backend initialization marked in health system + +**End-to-End Tests**: +- Full vMCP workflow with multiple backends +- Backend state preservation across tool calls (e.g., Playwright browser session) +- Client resource cleanup on TTL expiration +- High-throughput scenarios verify no connection leaks + +## Documentation + +**Architecture Documentation**: +- Update `docs/arch/10-virtual-mcp-architecture.md` with session-scoped client lifecycle diagrams +- Document client initialization during session setup phase +- Add sequence diagrams showing client creation timing + +**Code Documentation**: +- Add package-level comments explaining session-client relationship +- Document `VMCPSession.GetBackendClient()` and `SetBackendClients()` methods +- Add examples of client retrieval in method documentation + +**Operational Guides**: +- Update troubleshooting guide with client lifecycle debugging tips +- Document metrics and logs for monitoring client initialization +- Add runbook for investigating client initialization failures + +## Open Questions + +All major design questions have been resolved during RFC development: + +1. ~~Should clients be created eagerly or lazily?~~ → **Resolved: Eager initialization during session setup alongside capability discovery** +2. ~~What happens if backend client initialization fails during session creation?~~ → **Resolved: Log warning, mark backend unhealthy, continue with partial initialization** +3. ~~Should we share clients across sessions for efficiency?~~ → **Resolved: No, session isolation is more important than marginal efficiency gains** +4. ~~How should client closure be triggered?~~ → **Resolved: Leverage existing session cleanup infrastructure (TTL expiration, explicit deletion, server shutdown)** + +## References + +- [MCP Specification](https://modelcontextprotocol.io/specification/2025-06-18) +- [ToolHive Issue #3062](https://github.com/stacklok/toolhive/issues/3062) +- [mark3labs/mcp-go SDK](https://github.com/mark3labs/mcp-go) - Backend client implementation +- [Virtual MCP Architecture Documentation](https://github.com/stacklok/toolhive/blob/main/docs/arch/10-virtual-mcp-architecture.md) +- Related components in ToolHive: + - Session management infrastructure (from [PR #1989](https://github.com/stacklok/toolhive/pull/1989)) + - Health monitoring system + - Backend discovery mechanism + +--- + +## RFC Lifecycle + + + +### Review History + +| Date | Reviewer | Decision | Notes | +|------|----------|----------|-------| +| 2026-02-04 | @yrobla, @jerm-dro | Draft | Initial submission | + +### Implementation Tracking + +| Repository | PR | Status | +|------------|-----|--------| +| toolhive | - | Pending |