improvement(mcp): per-server tool queries + negative cache#4715
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryMedium Risk Overview Adds a short TTL “cooldown” negative cache for failing servers in Includes small icon fixes: unique gradient ids for Reviewed by Cursor Bugbot for commit ed1b970. Configure here. |
Greptile SummaryThis PR refactors MCP tool discovery from a single workspace-aggregated query into N parallel per-server React Query entries (
Confidence Score: 5/5Safe to merge — the per-server fan-out, negative cache, and forceRefresh bypass all behave as intended and the previous review's concerns have been fully addressed. The architectural refactor is sound: per-server React Query keys eliminate cross-server blocking, the negative cache is correctly exempted for OAuth errors, and originalError is now threaded through so instanceof checks survive the error-to-outcome round-trip. Tests cover the new cache paths. The only notable discrepancy is between the documented 30 s cooldown in the PR description and the 120 s value in the code — a minor factual mismatch that does not affect runtime correctness since the explicit refresh button bypasses the cache entirely. No files require special attention. The TTL constant in apps/sim/lib/mcp/utils.ts is worth a second glance to confirm 120 s (vs the PR description's 30s) is intentional. Important Files Changed
Sequence DiagramsequenceDiagram
participant UI as React UI
participant RQ as React Query
participant API as /api/mcp/tools/discover
participant SVC as McpService
participant Cache as Cache Adapter
UI->>RQ: useMcpToolsQuery(workspaceId)
RQ->>API: "GET ?workspaceId&serverId (per server)"
API->>SVC: discoverServerTools(userId, serverId, ws, false)
SVC->>Cache: get(serverCacheKey)
alt Cache hit
Cache-->>SVC: tools[]
SVC-->>API: tools[]
else Negative cache hit
Cache-->>SVC: failure sentinel
SVC-->>API: throw McpConnectionError (cooldown)
API-->>RQ: 503
RQ-->>UI: error state (server skipped)
else Cache miss
SVC->>SVC: listTools() via transport
alt Success
SVC->>Cache: set(serverCacheKey, tools, TTL)
SVC->>Cache: delete(failureCacheKey)
SVC-->>API: tools[]
API-->>RQ: 200 tools
RQ-->>UI: data (merged across servers)
else Failure
SVC->>Cache: delete(serverCacheKey)
SVC->>Cache: set(failureCacheKey, sentinel, FAILURE_TTL)
SVC-->>API: throw error
API-->>RQ: 4xx/5xx
RQ-->>UI: per-server error (other servers still render)
end
end
Note over UI,Cache: Force-refresh path (refresh button / OAuth callback)
UI->>RQ: useForceRefreshMcpTools.mutate(workspaceId)
RQ->>API: "GET ?workspaceId&serverId&refresh=true (per server)"
API->>SVC: discoverServerTools(userId, serverId, ws, true)
SVC->>SVC: skip both caches, listTools()
SVC->>Cache: clear failureCacheKey on success
SVC-->>RQ: fresh tools[]
Reviews (4): Last reviewed commit: "fix(mcp): third-round bugbot review" | Re-trigger Greptile |
…w server can't block the workspace Move MCP tool discovery off the workspace-aggregated `Promise.all` fan-out and onto per-server React Query keys, matching how Cursor and Claude Code render remote MCP. `useMcpToolsQuery` is now a `useQueries` combiner: each server has its own cache entry, its own loading state, and a slow neighbor never gates the others. Public shape stays compatible with existing consumers. Add a short-TTL negative cache: when `listTools` fails (timeout, connection error, etc.) we mark the server unhealthy for 30s so subsequent discovery calls short-circuit instead of re-paying the timeout. OAuth-required errors are exempt so re-auth retries immediately. Drop `LIST_TOOLS_TIMEOUT_MS` from 30s to 10s to bound the worst-case first failure. Invalidations are per-server where the action is per-server (OAuth popup, per-server SSE event, refresh, update, delete). Bulk operations stay workspace-broad. Adds tests for the negative-cache behavior and the OAuth exemption. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… window focus Three small follow-ups on top of per-server tool queries: - Coalesce concurrent `discoverServerTools(userId, serverId, workspaceId)` calls into a single upstream `tools/list`. Races between OAuth-callback cache priming, post-OAuth UI refetch, and multi-tab loads no longer double-fetch the same server. - Bump negative-cache TTL from 30s to 2 minutes. Cleared on listChanged, OAuth completion, manual refresh, and the next successful discovery, so this floor only matters for genuinely dead servers — drops their floor traffic by 4x. - Disable `refetchOnWindowFocus` on per-server tool queries. listChanged SSE + mutation invalidations already cover real schema changes; alt-tab no longer triggers N parallel `tools/list` calls.
- Workspace-scoped `mcpKeys.serverToolsWorkspace(workspaceId)` prefix for bulk invalidations (create-server, refresh-all, SSE workspace fallback, OAuth fallback). The previous `mcpKeys.serverTools()` prefix was global and invalidated every workspace's tools cache. - `useMcpToolsQuery` folds `useMcpServers().isLoading` into the aggregate `isLoading` so mounting no longer flashes an "empty tools" state during the servers-list fetch. Aggregate `error` is suppressed when any per-server query already returned data so one slow server can't blank out the others. - `useForceRefreshMcpTools` invalidates the per-server query keys of servers whose force refresh failed, so stale tools don't linger. - `DiscoveryOutcome` error variant carries the original error, restoring the OAuth-exemption check that `getErrorMessage(...)` previously erased. - `discoverServerTools(userId, serverId, workspaceId, forceRefresh = false)` now consults the positive + negative cache by default. Per-server React Query refetches hit the cache instead of re-paying the listTools timeout; callers that explicitly bypass cache (refresh route, OAuth callback, bulk POST refresh) pass `forceRefresh: true`. Negative-cache hits throw a typed `McpConnectionError` so the route layer can surface a fast 503.
- Drop unused `mcpKeys.tools()` / `mcpKeys.toolsList()` — replaced by per-server keys, no remaining callers. - Trim narrative comments to keep only the non-obvious "why" notes.
02cdc57 to
7ebef04
Compare
`McpConnectionError` thrown when a server is in cooldown previously fell through `categorizeError` to a generic 500. Cooldown is a transient-unavailability condition, so route it to 503.
|
@greptile |
|
@cursor review |
- useMcpToolsQuery serverIds: filter on enabled + workspaceId match. Disabled rows no longer trigger discover calls that get negative-cached, and keepPreviousData on useMcpServers no longer races a workspace switch into cross-workspace discover requests. - Aggregate skips per-server data when that server's latest refetch errored, so a broken server's last-known tools no longer linger in the workspace view while its card shows an error. - discoverServerTools failure path drops the positive cache alongside writing the negative-cache marker. A cache-respecting follow-up now fails fast via cooldown instead of returning stale tools from a now-broken server. - useMcpTools.refreshTools drops the dead forceRefresh param — the per-server queryFn always sends refresh=false, so the flag was never effective. Callers wanting cache-bypass should use useForceRefreshMcpTools.
|
@greptile |
|
@cursor review |
- discoverTools failure path now drops the per-server positive cache alongside writing the negative-cache marker, matching discoverServerTools' behavior so a workspace-aggregate failure doesn't leave stale tools cached. - useForceRefreshMcpTools filters disabled and out-of-workspace rows before fan-out so disabled servers don't 404 → negative-cache themselves. - Remove unused useMcpServerTools export — the aggregate goes through useQueries directly, no external consumer exists.
|
@greptile |
|
@cursor review |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit ed1b970. Configure here.
Summary
Promise.allfan-out onto per-server React Query keys (mcpKeys.serverToolsList(workspaceId, serverId)), matching how Cursor and Claude Code render remote MCP. A slow or hung server can no longer block another server's tools from loading.useMcpToolsQueryis now auseQueriescombiner over the live server list. Public shape (data,isLoading,isFetching,error) preserved so every existing consumer compiles unchanged.listToolsfails (timeout, connection error) we mark the server unhealthy for 30s so subsequent discovery short-circuits instead of re-paying the timeout. OAuth-required errors are exempt so re-auth retries immediately.LIST_TOOLS_TIMEOUT_MS30s → 10s to bound worst-case first failure.serverIdfrom postMessage; SSEtools_changedusesserverIdfrom payload; refresh/update/delete target a single key). Bulk operations stay workspace-broad.Type of Change
Testing
Tested manually. 289 MCP tests pass (incl. 3 new negative-cache tests + updated semantics on the existing "does not poison another server's cache" test). `bun run check:api-validation` passes.
Checklist