Skip to content

improvement(mcp): per-server tool queries + negative cache#4715

Merged
waleedlatif1 merged 10 commits into
stagingfrom
improvement/mcp-per-server-queries
May 22, 2026
Merged

improvement(mcp): per-server tool queries + negative cache#4715
waleedlatif1 merged 10 commits into
stagingfrom
improvement/mcp-per-server-queries

Conversation

@waleedlatif1
Copy link
Copy Markdown
Collaborator

Summary

  • Move MCP tool discovery off the workspace-aggregated Promise.all fan-out onto per-server React Query keys (mcpKeys.serverToolsList(workspaceId, serverId)), matching how Cursor and Claude Code render remote MCP. A slow or hung server can no longer block another server's tools from loading.
  • useMcpToolsQuery is now a useQueries combiner over the live server list. Public shape (data, isLoading, isFetching, error) preserved so every existing consumer compiles unchanged.
  • Add short-TTL negative cache: when listTools fails (timeout, connection error) we mark the server unhealthy for 30s so subsequent discovery short-circuits instead of re-paying the timeout. OAuth-required errors are exempt so re-auth retries immediately.
  • Drop LIST_TOOLS_TIMEOUT_MS 30s → 10s to bound worst-case first failure.
  • Invalidations are per-server where the action is per-server (OAuth popup uses serverId from postMessage; SSE tools_changed uses serverId from payload; refresh/update/delete target a single key). Bulk operations stay workspace-broad.

Type of Change

  • Improvement

Testing

Tested manually. 289 MCP tests pass (incl. 3 new negative-cache tests + updated semantics on the existing "does not poison another server's cache" test). `bun run check:api-validation` passes.

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel
Copy link
Copy Markdown

vercel Bot commented May 22, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped May 22, 2026 5:17am

Request Review

@cursor
Copy link
Copy Markdown

cursor Bot commented May 22, 2026

PR Summary

Medium Risk
Changes MCP discovery caching/querying semantics and error handling, which can affect tool availability and refresh behavior across the UI and API. Risk is moderated by targeted cache keys, explicit force-refresh paths, and expanded tests around negative caching.

Overview
MCP tool discovery is now per-server throughout the stack: the UI switches from a single workspace-level tools query to parallel per-server React Query entries (mcpKeys.serverToolsList), and invalidations (OAuth callback, SSE tools_changed, refresh/update/delete) are updated to target the affected server key or the workspace aggregate.

Adds a short TTL “cooldown” negative cache for failing servers in McpService, so repeated discoveries skip recently-failed servers instead of re-paying timeouts; explicit refresh paths pass forceRefresh=true to bypass both positive and negative caches, and new tests cover the updated failure/skip semantics. Also reduces LIST_TOOLS_TIMEOUT_MS (30s→10s) and maps cooldown errors to HTTP 503.

Includes small icon fixes: unique gradient ids for FindymailIcon, updated HubspotIcon path, and adjusted ResendIcon viewBox sizing.

Reviewed by Cursor Bugbot for commit ed1b970. Configure here.

Comment thread apps/sim/hooks/queries/mcp.ts Outdated
Comment thread apps/sim/hooks/queries/mcp.ts
Comment thread apps/sim/lib/mcp/service.ts Outdated
Comment thread apps/sim/hooks/queries/mcp.ts Outdated
Comment thread apps/sim/hooks/queries/mcp.ts Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 22, 2026

Greptile Summary

This PR refactors MCP tool discovery from a single workspace-aggregated query into N parallel per-server React Query entries (mcpKeys.serverToolsList(workspaceId, serverId)), and adds a server-side negative cache so a failed server is skipped for a short cooldown window instead of re-paying the timeout on every discovery call.

  • Per-server useQueries fan-out replaces the single workspace useQuery; the public shape (data, isLoading, isFetching, error) is preserved so existing consumers compile unchanged. The servers-loading state is folded into the aggregate isLoading to prevent the false "loaded, no tools" flash that was flagged in a prior review.
  • discoverServerTools adds request-level inflight deduplication, cache-aside logic, and a forceRefresh flag used by the refresh button and OAuth callback paths; the DiscoveryOutcome.error variant now carries originalError: unknown so instanceof exemptions (OAuth, Unauthorized) survive the error-to-string round-trip.
  • A negative cache with a 120 s TTL is written on discovery failure and cleared on success; the LIST_TOOLS_TIMEOUT_MS cap is dropped from 30 s to 10 s to shorten worst-case first-failure time.

Confidence Score: 5/5

Safe to merge — the per-server fan-out, negative cache, and forceRefresh bypass all behave as intended and the previous review's concerns have been fully addressed.

The architectural refactor is sound: per-server React Query keys eliminate cross-server blocking, the negative cache is correctly exempted for OAuth errors, and originalError is now threaded through so instanceof checks survive the error-to-outcome round-trip. Tests cover the new cache paths. The only notable discrepancy is between the documented 30 s cooldown in the PR description and the 120 s value in the code — a minor factual mismatch that does not affect runtime correctness since the explicit refresh button bypasses the cache entirely.

No files require special attention. The TTL constant in apps/sim/lib/mcp/utils.ts is worth a second glance to confirm 120 s (vs the PR description's 30s) is intentional.

Important Files Changed

Filename Overview
apps/sim/lib/mcp/service.ts Core service refactored: adds per-server inflight deduplication, negative cache (markServerUnhealthy/isServerUnhealthy/clearServerFailure), forceRefresh bypass, and correct originalError threading through DiscoveryOutcome. Logic is sound with proper OAuth exemptions.
apps/sim/hooks/queries/mcp.ts useMcpToolsQuery replaced with useQueries fan-out; serversLoading folded into aggregate isLoading; useForceRefreshMcpTools updated to fan out per-server and invalidate failed entries; cache key hierarchy is clean.
apps/sim/lib/mcp/utils.ts Adds FAILURE_CACHE_TTL_MS: 120_000 and reduces LIST_TOOLS_TIMEOUT_MS from 30 s to 10 s; categorizeError extended with cooldown → 503 mapping, consistent with existing pattern.
apps/sim/lib/mcp/service.test.ts Three new negative-cache tests added (failure caching, cache clearance on success, OAuth exemption); existing test updated to match new semantics. mockListTools.mockReset() added to beforeEach to prevent queue leakage.
apps/sim/hooks/mcp/use-mcp-tools.ts refreshTools simplified to workspace-wide invalidation; forceRefresh parameter removed (now delegated to useForceRefreshMcpTools). Public interface updated accordingly.
apps/sim/hooks/mcp/use-mcp-oauth-popup.ts OAuth postMessage handler now invalidates only the specific server's tool key when serverId is present, falling back to workspace-wide invalidation.
apps/sim/app/api/mcp/tools/discover/route.ts forceRefresh now threaded through to discoverServerTools on the GET path (per-server) and POST bulk path. Simple, correct change.
apps/sim/components/icons.tsx FindymailIcon SVG gradient IDs made unique per-instance using useId() to prevent ID collisions. HubspotIcon path corrected. ResendIcon viewBox cropped to center content.

Sequence Diagram

sequenceDiagram
    participant UI as React UI
    participant RQ as React Query
    participant API as /api/mcp/tools/discover
    participant SVC as McpService
    participant Cache as Cache Adapter

    UI->>RQ: useMcpToolsQuery(workspaceId)
    RQ->>API: "GET ?workspaceId&serverId (per server)"
    API->>SVC: discoverServerTools(userId, serverId, ws, false)
    SVC->>Cache: get(serverCacheKey)
    alt Cache hit
        Cache-->>SVC: tools[]
        SVC-->>API: tools[]
    else Negative cache hit
        Cache-->>SVC: failure sentinel
        SVC-->>API: throw McpConnectionError (cooldown)
        API-->>RQ: 503
        RQ-->>UI: error state (server skipped)
    else Cache miss
        SVC->>SVC: listTools() via transport
        alt Success
            SVC->>Cache: set(serverCacheKey, tools, TTL)
            SVC->>Cache: delete(failureCacheKey)
            SVC-->>API: tools[]
            API-->>RQ: 200 tools
            RQ-->>UI: data (merged across servers)
        else Failure
            SVC->>Cache: delete(serverCacheKey)
            SVC->>Cache: set(failureCacheKey, sentinel, FAILURE_TTL)
            SVC-->>API: throw error
            API-->>RQ: 4xx/5xx
            RQ-->>UI: per-server error (other servers still render)
        end
    end

    Note over UI,Cache: Force-refresh path (refresh button / OAuth callback)
    UI->>RQ: useForceRefreshMcpTools.mutate(workspaceId)
    RQ->>API: "GET ?workspaceId&serverId&refresh=true (per server)"
    API->>SVC: discoverServerTools(userId, serverId, ws, true)
    SVC->>SVC: skip both caches, listTools()
    SVC->>Cache: clear failureCacheKey on success
    SVC-->>RQ: fresh tools[]
Loading

Reviews (4): Last reviewed commit: "fix(mcp): third-round bugbot review" | Re-trigger Greptile

Comment thread apps/sim/hooks/queries/mcp.ts Outdated
Comment thread apps/sim/lib/mcp/service.ts Outdated
Comment thread apps/sim/hooks/queries/mcp.ts Outdated
Comment thread apps/sim/hooks/queries/mcp.ts Outdated
waleedlatif1 and others added 5 commits May 21, 2026 20:13
…w server can't block the workspace

Move MCP tool discovery off the workspace-aggregated `Promise.all` fan-out and
onto per-server React Query keys, matching how Cursor and Claude Code render
remote MCP. `useMcpToolsQuery` is now a `useQueries` combiner: each server has
its own cache entry, its own loading state, and a slow neighbor never gates the
others. Public shape stays compatible with existing consumers.

Add a short-TTL negative cache: when `listTools` fails (timeout, connection
error, etc.) we mark the server unhealthy for 30s so subsequent discovery calls
short-circuit instead of re-paying the timeout. OAuth-required errors are
exempt so re-auth retries immediately. Drop `LIST_TOOLS_TIMEOUT_MS` from 30s
to 10s to bound the worst-case first failure.

Invalidations are per-server where the action is per-server (OAuth popup,
per-server SSE event, refresh, update, delete). Bulk operations stay
workspace-broad. Adds tests for the negative-cache behavior and the OAuth
exemption.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… window focus

Three small follow-ups on top of per-server tool queries:

- Coalesce concurrent `discoverServerTools(userId, serverId, workspaceId)`
  calls into a single upstream `tools/list`. Races between OAuth-callback
  cache priming, post-OAuth UI refetch, and multi-tab loads no longer
  double-fetch the same server.
- Bump negative-cache TTL from 30s to 2 minutes. Cleared on listChanged,
  OAuth completion, manual refresh, and the next successful discovery, so
  this floor only matters for genuinely dead servers — drops their floor
  traffic by 4x.
- Disable `refetchOnWindowFocus` on per-server tool queries. listChanged
  SSE + mutation invalidations already cover real schema changes; alt-tab
  no longer triggers N parallel `tools/list` calls.
- Workspace-scoped `mcpKeys.serverToolsWorkspace(workspaceId)` prefix for bulk
  invalidations (create-server, refresh-all, SSE workspace fallback, OAuth
  fallback). The previous `mcpKeys.serverTools()` prefix was global and
  invalidated every workspace's tools cache.
- `useMcpToolsQuery` folds `useMcpServers().isLoading` into the aggregate
  `isLoading` so mounting no longer flashes an "empty tools" state during the
  servers-list fetch. Aggregate `error` is suppressed when any per-server
  query already returned data so one slow server can't blank out the others.
- `useForceRefreshMcpTools` invalidates the per-server query keys of servers
  whose force refresh failed, so stale tools don't linger.
- `DiscoveryOutcome` error variant carries the original error, restoring the
  OAuth-exemption check that `getErrorMessage(...)` previously erased.
- `discoverServerTools(userId, serverId, workspaceId, forceRefresh = false)`
  now consults the positive + negative cache by default. Per-server React
  Query refetches hit the cache instead of re-paying the listTools timeout;
  callers that explicitly bypass cache (refresh route, OAuth callback, bulk
  POST refresh) pass `forceRefresh: true`. Negative-cache hits throw a typed
  `McpConnectionError` so the route layer can surface a fast 503.
- Drop unused `mcpKeys.tools()` / `mcpKeys.toolsList()` — replaced by
  per-server keys, no remaining callers.
- Trim narrative comments to keep only the non-obvious "why" notes.
@waleedlatif1 waleedlatif1 force-pushed the improvement/mcp-per-server-queries branch from 02cdc57 to 7ebef04 Compare May 22, 2026 03:13
`McpConnectionError` thrown when a server is in cooldown previously
fell through `categorizeError` to a generic 500. Cooldown is a
transient-unavailability condition, so route it to 503.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Comment thread apps/sim/hooks/queries/mcp.ts
Comment thread apps/sim/hooks/queries/mcp.ts
Comment thread apps/sim/hooks/queries/mcp.ts
Comment thread apps/sim/lib/mcp/service.ts
Comment thread apps/sim/hooks/mcp/use-mcp-tools.ts Outdated
- useMcpToolsQuery serverIds: filter on enabled + workspaceId match.
  Disabled rows no longer trigger discover calls that get negative-cached,
  and keepPreviousData on useMcpServers no longer races a workspace switch
  into cross-workspace discover requests.
- Aggregate skips per-server data when that server's latest refetch errored,
  so a broken server's last-known tools no longer linger in the workspace
  view while its card shows an error.
- discoverServerTools failure path drops the positive cache alongside writing
  the negative-cache marker. A cache-respecting follow-up now fails fast via
  cooldown instead of returning stale tools from a now-broken server.
- useMcpTools.refreshTools drops the dead forceRefresh param — the per-server
  queryFn always sends refresh=false, so the flag was never effective. Callers
  wanting cache-bypass should use useForceRefreshMcpTools.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Comment thread apps/sim/lib/mcp/service.ts Outdated
Comment thread apps/sim/hooks/queries/mcp.ts
Comment thread apps/sim/hooks/queries/mcp.ts Outdated
Comment thread apps/sim/lib/mcp/utils.ts
- discoverTools failure path now drops the per-server positive cache alongside
  writing the negative-cache marker, matching discoverServerTools' behavior so
  a workspace-aggregate failure doesn't leave stale tools cached.
- useForceRefreshMcpTools filters disabled and out-of-workspace rows before
  fan-out so disabled servers don't 404 → negative-cache themselves.
- Remove unused useMcpServerTools export — the aggregate goes through
  useQueries directly, no external consumer exists.
@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@greptile

@waleedlatif1
Copy link
Copy Markdown
Collaborator Author

@cursor review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit ed1b970. Configure here.

@waleedlatif1 waleedlatif1 merged commit 0c96964 into staging May 22, 2026
14 checks passed
@waleedlatif1 waleedlatif1 deleted the improvement/mcp-per-server-queries branch May 22, 2026 05:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant