Skip to content

improvement(mcp): post-merge hardening — protocol negotiation + distributed OAuth lock + typed errors#4722

Merged
waleedlatif1 merged 9 commits into
stagingfrom
improvement/mcp-hardening
May 22, 2026
Merged

improvement(mcp): post-merge hardening — protocol negotiation + distributed OAuth lock + typed errors#4722
waleedlatif1 merged 9 commits into
stagingfrom
improvement/mcp-hardening

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

  • Inbound MCP server protocol-version negotiation (serve/[serverId]/route.ts): echoes the client's protocolVersion when supported (2025-06-18 / 2025-03-26 / 2024-11-05), falls back to our latest. Previously hardcoded to the oldest spec version, forcing modern clients into legacy behavior.
  • Distributed OAuth refresh lock (oauth/storage.ts): withMcpOauthRefreshLock now wraps fn in a Postgres advisory transaction lock (pg_advisory_xact_lock) in addition to the in-process Promise chain. Multi-task ECS deployments no longer race rotated refresh tokens.
  • Typed-error dispatch in categorizeError (utils.ts): dispatches on McpOauthAuthorizationRequiredError / UnauthorizedError / McpConnectionError first. Adds HTTP 502 for connection failures and 503 for cooldown. Substring matching kept only as a fallback for SDK/third-party errors we don't own a class for.
  • Drop dead allSettled().catch() (service.ts): each deferred side-effect already self-logs; just mark per-promise as handled to silence unhandled-rejection warnings.

Type of Change

  • Improvement

Testing

Tested manually. 298 MCP tests pass (8 new: 4 covering protocol-version negotiation paths, 4 covering typed-error dispatch). `bun run check:api-validation` and lint clean.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link
Copy Markdown

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 22, 2026 5:19pm

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 22, 2026

PR Summary

Medium Risk
Touches MCP protocol handshake and OAuth token-refresh coordination, plus adds Redis-based locking; failures here could impact connectivity/auth flows across deployments, though changes are additive with fallbacks and covered by new tests.

Overview
Adds MCP protocol version negotiation on the workflow MCP serve initialize path, echoing a client-requested version when supported and otherwise falling back to the SDK LATEST_PROTOCOL_VERSION (with new route tests covering supported/unsupported/omitted cases).

Hardens outbound MCP OAuth by upgrading withMcpOauthRefreshLock from in-process serialization to a distributed Redis mutex with TTL watchdog, polling, and bounded wait to avoid refresh-token rotation races across processes (with extensive new unit tests).

Improves reliability/ops behavior by pooling undici pinned-DNS Agents per resolvedIP (LRU-capped, no-close eviction) and by making categorizeError prefer typed MCP/SDK auth/connection errors (returning 401/502/503 accordingly); also removes a redundant allSettled().catch() in discovery deferred side-effects to avoid unhandled-rejection warnings.

Reviewed by Cursor Bugbot for commit 4c05f63. Configure here.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR hardens the MCP integration across four areas: protocol-version negotiation on the inbound serve endpoint (echo client's version if supported, fall back to latest), a two-tier distributed OAuth refresh lock replacing the single-process promise chain (Redis mutex with watchdog TTL extension + in-process chain), typed instanceof dispatch in categorizeError before substring fallback, and an LRU connection-pool for pinned undici Agents to reuse keep-alive sockets across McpClient instances.

  • Protocol negotiation (route.ts): negotiateProtocolVersion() echoes SUPPORTED_PROTOCOL_VERSIONS from the SDK; tests cover all four negotiation paths.
  • Distributed lock (storage.ts): withMcpOauthRefreshLock now chains an in-process serialization queue with a polled Redis mutex and a watchdog setInterval that extends the TTL while fn() runs; falls open (logs a warning) if Redis is unavailable.
  • categorizeError (utils.ts): adds McpConnectionError → 502/503 and UnauthorizedError → 401 paths before the substring fallback; pinnedAgents LRU pool in pinned-fetch.ts reuses undici Agents per resolved IP.

Confidence Score: 5/5

Safe to merge — all four changes are well-bounded, thoroughly tested, and the two-tier OAuth lock correctly preserves the per-caller fn() invariant.

No incorrect behavior in the changed paths. The distributed lock implementation correctly handles Redis unavailability, lock release on fn() throws, watchdog extension for long-running refreshes, and per-row key isolation. The protocol negotiation, typed error dispatch, and agent pool are all verified by the new tests. The one observation is a test-utility export in the production module, which does not affect runtime correctness.

apps/sim/lib/mcp/pinned-fetch.ts — __resetPinnedAgentsForTests is a test-only helper exported from the production module.

Important Files Changed

Filename Overview
apps/sim/lib/mcp/oauth/storage.ts Replaces single-process promise chain with two-tier Redis + in-process lock; watchdog extends TTL, falls open on Redis unavailability, correct cleanup in finally block.
apps/sim/lib/mcp/pinned-fetch.ts Adds LRU agent pool (64 entries, no TTL) to reuse keep-alive connections; exports __resetPinnedAgentsForTests test helper from production module.
apps/sim/lib/mcp/utils.ts Typed instanceof dispatch for McpConnectionError/UnauthorizedError/McpOauthAuthorizationRequiredError added before substring fallback; clean separation of concerns.
apps/sim/app/api/mcp/serve/[serverId]/route.ts negotiateProtocolVersion() echoes SDK-supported versions from the client's request, correctly falls back to LATEST_PROTOCOL_VERSION.
apps/sim/lib/mcp/service.ts Replaces dead Promise.allSettled().catch() with per-promise .catch(()=>{}) — all deferred side-effects already self-log via try/catch in their private methods.

Sequence Diagram

sequenceDiagram
    participant C1 as Caller 1 (in-process)
    participant C2 as Caller 2 (in-process)
    participant CP as Caller P (other process)
    participant IC as inflightChains
    participant R as Redis
    participant F as fn()

    C1->>IC: "get(lockKey) prev=resolved"
    C1->>IC: set(lockKey, next1)
    C2->>IC: "get(lockKey) prev=next1"
    C2->>IC: set(lockKey, next2)
    C1->>R: acquireLock(key, token1, 15s) true
    C1->>C1: start watchdog setInterval(5s)
    C1->>F: fn() runs
    CP->>R: acquireLock(key, tokenP, 15s) false
    CP->>CP: sleep(100ms), poll...
    C1->>R: extendLock(key, token1, 15s)
    F-->>C1: result1
    C1->>C1: clearInterval(watchdog)
    C1->>R: releaseLock(key, token1)
    C1->>IC: cleanup delete(lockKey)
    C2->>R: acquireLock(key, token2, 15s) true
    C2->>F: fn() runs own invocation
    F-->>C2: result2
Loading

Reviews (5): Last reviewed commit: "fix(mcp): throw instead of falling open ..." | Re-trigger Greptile

Comment thread apps/sim/lib/mcp/oauth/storage.ts Outdated
Comment thread apps/sim/app/api/mcp/serve/[serverId]/route.ts Outdated
…ibuted OAuth lock + typed error dispatch

- Inbound MCP server now negotiates protocolVersion per MCP 2025-06-18: echoes
  the client's requested version when supported, falls back to our latest.
  Previously hardcoded to the oldest spec version (2024-11-05).
- withMcpOauthRefreshLock now takes a Postgres advisory transaction lock in
  addition to the in-process Promise chain, so concurrent processes (multi-task
  ECS) serialize on a per-OAuth-row basis. Previously a refresh race across
  processes could rotate a token under another process's feet and force re-auth.
- categorizeError dispatches on McpOauthAuthorizationRequiredError /
  UnauthorizedError / McpConnectionError first, only falling back to substring
  matching for SDK / third-party errors. Adds 502 for connection failures and
  503 for cooldown. Tests cover all four typed cases.
- discoverTools no longer pretends to handle deferred-side-effect rejections
  via a dead allSettled().catch() — each side-effect already self-logs; we just
  swallow per-promise to silence unhandled-rejection warnings.
- Replace `as never` cast in negotiateProtocolVersion with `as readonly string[]`
  on the array — preserves TypeScript narrowing on the comparison while
  satisfying the readonly-tuple `.includes()` constraint properly.
- Document the pg_advisory_xact_lock tradeoff: session-level locks
  (`pg_advisory_lock`) would release the connection earlier, but PgBouncer
  transaction-pooling mode breaks them. xact_lock is the correct choice for
  Sim's deployment; if pool pressure becomes real, the comment notes the
  Redlock escape hatch.
- McpTool now extends `Pick<Tool, 'name' | 'description'>` from
  @modelcontextprotocol/sdk so name/description fields stay in sync with
  the SDK; serverId/serverName remain Sim-specific additions.
- Drop file-header restatements ("MCP Types - for connecting to external
  MCP servers"), one-line wrapper docstrings ("Get connection status"),
  and narrative comment blocks that just restate visible code.
- Keep only comments that document non-obvious "why" — OAuth refresh-lock
  tradeoff, in-flight dedup key composition, SDK Tool.inputSchema typing,
  preregistered-client semantics, postMessage handshake contract.
withMcpOauthRefreshLock now uses `coalesceLocally` + Redis acquireLock/
releaseLock with polling — the same primitives backing regular OAuth
refresh (`app/api/auth/oauth/utils.ts`). No more pinning a Postgres
connection for the duration of the SDK's OAuth HTTP refresh.

- In-process dedup: shared promise via `coalesceLocally`.
- Cross-process: Redis SET NX EX mutex; followers poll until the leader
  releases (30s max wait, 100ms poll), then acquire and run fn().
- Each MCP caller still constructs its own client (semantics preserved).
- Falls open when Redis is unavailable — same behavior as the regular
  OAuth refresh code path.
@waleedlatif1 waleedlatif1 force-pushed the improvement/mcp-hardening branch from 14e22e9 to cdb3e14 Compare May 22, 2026 16:16
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Comment thread apps/sim/lib/mcp/oauth/storage.ts Outdated
Comment thread apps/sim/lib/mcp/oauth/storage.ts Outdated
…ts + cover OAuth lock

- McpClient.SUPPORTED_VERSIONS removed; getVersionInfo() and the inbound
  serve route both import LATEST_PROTOCOL_VERSION / SUPPORTED_PROTOCOL_VERSIONS
  directly from @modelcontextprotocol/sdk/types.js. New protocol revisions
  ship automatically with SDK upgrades.
- pinned-fetch now caches undici Agents in a module-level LRU keyed by
  resolvedIP (max 64). Back-to-back MCP calls to the same server reuse the
  keep-alive connection pool instead of opening fresh TCP + TLS each time.
- New integration tests for withMcpOauthRefreshLock covering: in-process
  dedup via coalesceLocally, cross-process serialization via Redis mutex,
  fall-open on Redis unavailable, lock release on throw, release-failure
  swallow, per-row key isolation.
withMcpOauthRefreshLock previously wrapped fn() in coalesceLocally, which
returns the SAME promise (and the same resolved value) to all in-process
callers. fn() returns a stateful McpClient — sharing it meant whichever
caller finished first would disconnect the client while another was still
mid-call, leaving in-flight RPC on a closed connection.

Swap coalesceLocally for a per-row Promise chain: each caller waits for
the previous to settle, then runs its OWN fn() (gets its own client).
Cross-process Redis mutex semantics unchanged.

The "shareable scalar" assumption that makes coalesceLocally correct for
regular OAuth refresh (returns an access token string) does not hold for
MCP, where each caller needs an independent connection.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Comment thread apps/sim/lib/mcp/oauth/storage.ts
Comment thread apps/sim/lib/mcp/pinned-fetch.ts
…inned agents

- Redis refresh lock now uses a 15s TTL with a watchdog that extends every
  5s while fn() runs. Long-running OAuth refreshes no longer lose the lock
  mid-flight and let another process race the same refresh.
- Pinned-agent LRU eviction no longer calls `agent.close()`. Existing
  `createMcpPinnedFetch` closures hold the dispatcher reference and were
  using a closed Agent after eviction. We drop from the cache and let GC
  release the dispatcher once the last closure dies; undici closes idle
  keep-alive connections via its own internal timeout.
- New tests: watchdog extends while fn() runs and stops once it settles;
  evicted agents are not closed and captured closures still work.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Comment thread apps/sim/lib/mcp/oauth/storage.ts Outdated
…s deadline

When the Redis refresh lock can't be acquired within REFRESH_MAX_WAIT_MS
the previous code ran fn() uncoordinated — but another process can still
be holding the lock (watchdog-extended) and refreshing the same OAuth
row, recreating the exact race the lock prevents.

Throw on deadline. The caller can retry; the Redis-down branch remains
the only path that runs fn() uncoordinated (no coordination is possible
there).
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 4c05f63. Configure here.

…hods

Earlier comment-trim pass went too far on a few exports — restored
the TSDoc that explained non-obvious "why" decisions:

- SimMcpOauthProviderInit.preregistered: when set, the SDK skips DCR.
- McpServerConfig.userId: required for OAuth; selects which user's
  stored tokens to use.
- McpOauthAuthorizationRequiredError: benign pending state vs failure.
- McpToolsChangedCallback / McpClientOptions: notification semantics,
  DNS-rebinding pinning rationale, OAuth provider contract.
- StoredMcpToolReference / StoredMcpTool: minimal vs extended use.
- McpClient.connect: documents listChanged handler registration.
- McpService.executeTool: documents session-error retry behavior.

Pure-restatement comments ("Disconnect from MCP server") stay trimmed.
@waleedlatif1 waleedlatif1 merged commit b2ad5e9 into staging May 22, 2026
9 checks passed
@waleedlatif1 waleedlatif1 deleted the improvement/mcp-hardening branch May 22, 2026 17:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant