Skip to content

Commit 1590896

Browse files
authored
fix(coderd/x/chatd): prime workspace MCP cache after create/start (#25298)
## Problem Mid-turn workspace MCP discovery was broken when an agent was still cold-starting. `PrepareTools` in `chatd.go` flipped `workspaceMCPDiscovered = true` *before* calling `discoverWorkspaceMCPTools`, so a failed discovery attempt permanently blocked retries within the turn. Customer-reported repro: - New chat with no pre-selected workspace. - LLM calls `create_workspace` mid-turn at `23:35:05`. - `PrepareTools` fires, dials the agent with a 30s timeout, dial times out at `23:38:15`, `discoverWorkspaceMCPTools` returns empty. - Agent connects at `23:38:29`, 14 seconds later. - `workspaceMCPDiscovered` was already true, so `PrepareTools` never retried for the rest of the turn. MCP tools only appeared on the next user message. A naive retry loop in `PrepareTools` would also miss the bigger picture: a workspace boot can take several minutes (EC2 cold start, 10 min startup scripts), and the chatloop only gets a chance to call `PrepareTools` between LLM steps. ## Fix Do the workspace MCP discovery from inside the tool that already waits for the agent. `chattool.CreateWorkspace` and `chattool.StartWorkspace` call `waitForAgentReady`, which has a 2 min agent-online budget plus a 10 min startup-script budget. By the time they fire `OnChatUpdated`, the agent is `Ready`. The chatd `onChatUpdated` callback now launches an async `primeWorkspaceMCPCache` goroutine on every bind that has a valid workspace ID: - The primer calls `discoverWorkspaceMCPTools` until it returns a non-empty list or `workspaceMCPPrimeMaxWait` (30s) elapses, with a 2s backoff between attempts. The bounded wait handles the short race between agent-online and the agent's MCP `Connect` settling. - The primer runs asynchronously so the tool itself never blocks. Some templates simply do not advertise MCP tools, in which case the primer would otherwise spend its full budget for nothing. - The primer shares the chat `ctx` (not a detached one) so it is canceled together with the chat. A dangling primer would re-dial the workspace conn after `runChat`'s deferred `workspaceCtx.close()` and leak that conn. - `inflight.Add(1)` ensures server shutdown still waits for any in-progress primer. - `PrepareTools` is simplified back to a single discovery call. It now only sets `workspaceMCPDiscovered = true` on success, so an empty result no longer permanently blocks discovery within the turn. The cache hit warmed by the primer makes that call cheap in the common case; the dial fallback handles the rare cache miss. ## Tests All in `coderd/x/chatd/chatd_internal_test.go`: - `TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt` — single `ListMCPTools` call returning tools populates the cache. - `TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear` — first call empty, second returns tools; primer retries past the backoff and writes the cache. Uses `quartz.Mock.Trap` on `NewTimer`. - `TestPrimeWorkspaceMCPCache_GivesUpAfterDeadline` — `ListMCPTools` always empty; primer stops at `workspaceMCPPrimeMaxWait` and refuses to cache the empty result so PrepareTools can retry on the next step. The existing integration test `TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace` continues to pass and now also exercises the async-primer path end-to-end via the create_workspace tool. ``` go test ./coderd/x/chatd/... -count=1 go test ./coderd/x/chatd/ -race -count=1 make pre-commit ``` <details> <summary>Design notes</summary> - The first iteration of this PR added retry+cooldown+failure-cap logic inside `PrepareTools`. It worked for the customer's ~30s race window but did not help workspaces that take several minutes to boot, because `PrepareTools` only fires between LLM steps. Reviewer pointed out the right place to handle this is the tool itself; the current implementation does that. - Why async: a primer that ran synchronously inside the `OnChatUpdated` callback blocked the create_workspace tool from returning for up to `workspaceMCPPrimeMaxWait`, which broke `TestCreateWorkspaceTool_EndToEnd` and would hurt any template that does not expose MCP tools. Decoupling lets the tool return immediately and lets the primer warm the cache concurrently with the next LLM step. - Why share the chat `ctx` rather than `context.WithoutCancel(ctx)` (the title-generation pattern): the primer touches `workspaceCtx.getWorkspaceConn`, which `runChat`'s deferred `workspaceCtx.close()` invalidates. A detached primer outliving the chat would dial a fresh conn and leak it. - The constant naming distinguishes `workspaceMCPDiscoveryTimeout` (35s per-call dial budget, unchanged from #25169) from `workspaceMCPPrimeMaxWait` (30s total budget for the post-ready primer loop) and `workspaceMCPPrimeRetryInterval` (2s between empty-result retries). </details> Follow-up to #25169. --- _This pull request was generated by Coder Agents._
1 parent 7a985f8 commit 1590896

3 files changed

Lines changed: 802 additions & 8 deletions

File tree

coderd/x/chatd/chatd.go

Lines changed: 151 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -72,7 +72,22 @@ const (
7272
// cold-start agent's first MCP reload can settle before
7373
// chatd gives up.
7474
workspaceMCPDiscoveryTimeout = 35 * time.Second
75-
turnStatusLabelWriteTimeout = 5 * time.Second
75+
// workspaceMCPPrimeMaxWait bounds the deadline used by the
76+
// create_workspace / start_workspace post-ready cache primer
77+
// loop. The primer checks the deadline only after each
78+
// discoverWorkspaceMCPTools call returns, so total wall-clock
79+
// time can exceed this by one such call (dialTimeout +
80+
// workspaceMCPDiscoveryTimeout in the worst case). The constant
81+
// caps when new retries can start, not when an in-flight call
82+
// must finish. Empty results usually mean the agent's MCP
83+
// Connect is still racing with agent startup. The agent-side
84+
// budget is agent/x/agentmcp.connectTimeout (30s).
85+
workspaceMCPPrimeMaxWait = 30 * time.Second
86+
// workspaceMCPPrimeRetryInterval is the short backoff between
87+
// re-attempts inside the primer when ListMCPTools returns an
88+
// empty list without error.
89+
workspaceMCPPrimeRetryInterval = 2 * time.Second
90+
turnStatusLabelWriteTimeout = 5 * time.Second
7691
// defaultDialTimeout matches the timeout used by ~8 other
7792
// server-side AgentConn callers.
7893
defaultDialTimeout = 30 * time.Second
@@ -568,6 +583,61 @@ func (p *Server) discoverWorkspaceMCPTools(
568583
return tools
569584
}
570585

586+
// primeWorkspaceMCPCache populates workspaceMCPToolsCache after the
587+
// create_workspace or start_workspace tool finishes waiting for the
588+
// workspace agent to become reachable. By the time it runs the agent
589+
// is already Ready, so a single ListMCPTools call usually succeeds.
590+
// When the agent's MCP server is still racing with agent startup,
591+
// ListMCPTools may return an empty list (no error) on the first call;
592+
// the primer retries with a short backoff up to
593+
// workspaceMCPPrimeMaxWait so the LLM step that follows the tool call
594+
// sees the workspace MCP tools in the cache and PrepareTools does not
595+
// need to dial again.
596+
//
597+
// Returns silently on every failure mode. The chat continues without
598+
// workspace MCP tools when the agent does not advertise any within
599+
// the budget. The next user turn re-runs top-of-turn discovery from
600+
// scratch.
601+
func (p *Server) primeWorkspaceMCPCache(
602+
ctx context.Context,
603+
logger slog.Logger,
604+
chatID uuid.UUID,
605+
workspaceCtx *turnWorkspaceContext,
606+
) {
607+
deadline := p.clock.Now().Add(workspaceMCPPrimeMaxWait)
608+
attempt := 0
609+
for {
610+
attempt++
611+
tools := p.discoverWorkspaceMCPTools(ctx, logger, chatID, workspaceCtx)
612+
if len(tools) > 0 {
613+
logger.Debug(ctx, "primed workspace MCP cache",
614+
slog.F("chat_id", chatID),
615+
slog.F("tool_count", len(tools)),
616+
slog.F("attempts", attempt),
617+
)
618+
return
619+
}
620+
if ctx.Err() != nil {
621+
return
622+
}
623+
if !p.clock.Now().Before(deadline) {
624+
logger.Debug(ctx,
625+
"workspace MCP cache primer gave up waiting for tools",
626+
slog.F("chat_id", chatID),
627+
slog.F("attempts", attempt),
628+
)
629+
return
630+
}
631+
timer := p.clock.NewTimer(workspaceMCPPrimeRetryInterval, "chatd", "workspace-mcp-prime")
632+
select {
633+
case <-timer.C:
634+
case <-ctx.Done():
635+
timer.Stop()
636+
return
637+
}
638+
}
639+
}
640+
571641
type turnWorkspaceContext struct {
572642
server *Server
573643
chatStateMu *sync.Mutex
@@ -6457,6 +6527,11 @@ type rootChatToolsOptions struct {
64576527
resolvePlanPath func(context.Context) (string, string, error)
64586528
storeFile chattool.StoreFileFunc
64596529
isPlanModeTurn bool
6530+
// primerCtx scopes the workspace MCP cache primer goroutines
6531+
// that onChatUpdated launches. runChat cancels it before
6532+
// workspaceCtx.close() so an in-flight primer cannot dial a
6533+
// fresh conn after the cached one was released.
6534+
primerCtx context.Context
64606535
}
64616536

64626537
func (p *Server) loadPlanModeInstructions(
@@ -6520,6 +6595,50 @@ func (p *Server) appendRootChatTools(
65206595
}
65216596
}
65226597
}
6598+
6599+
// Prime the workspace MCP tools cache while the create_workspace
6600+
// or start_workspace tool is still running. The AgentID guard
6601+
// below restricts the primer to the post-ready callback, when
6602+
// the agent is reachable. ListMCPTools may still return an
6603+
// empty list on the first try when the agent's MCP Connect is
6604+
// racing with agent startup; primeWorkspaceMCPCache retries
6605+
// with a short backoff up to workspaceMCPPrimeMaxWait. Priming
6606+
// here lets the next LLM step's PrepareTools hit the cache
6607+
// instead of dialing again on a separate timeout budget.
6608+
//
6609+
// Run asynchronously: the tool itself must not block on the
6610+
// primer because the agent may not advertise any MCP tools at
6611+
// all (e.g. minimal templates), in which case the primer waits
6612+
// the full budget before giving up. PrepareTools on the next
6613+
// step covers the cache miss path; the primer is purely an
6614+
// optimization that warms the cache while the LLM is thinking.
6615+
// inflight tracking ensures server shutdown still waits for any
6616+
// in-progress primer.
6617+
//
6618+
// Guard on both WorkspaceID and AgentID being valid:
6619+
// create_workspace and start_workspace each fire onChatUpdated
6620+
// twice for a new build (binding before waitForAgentReady;
6621+
// post-ready after it), and stop_workspace fires it with a nil
6622+
// agent. Only the post-ready callback has a live AgentID, so
6623+
// the pre-build and stop-side firings would otherwise spawn a
6624+
// primer goroutine that dials a missing or dying agent and
6625+
// burns the full budget for nothing.
6626+
//
6627+
// Read the snapshot from workspaceCtx rather than the
6628+
// updatedChat parameter: persistInstructionFiles above runs
6629+
// ensureWorkspaceAgent which calls persistBuildAgentBinding and
6630+
// setCurrentChat, so by the time we get here the in-memory
6631+
// snapshot has the freshly bound AgentID even when the
6632+
// updatedChat parameter (read from the DB before the binding
6633+
// was persisted) does not.
6634+
snapshot := opts.workspaceCtx.currentChatSnapshot()
6635+
if snapshot.WorkspaceID.Valid && snapshot.AgentID.Valid {
6636+
p.inflight.Add(1)
6637+
go func() {
6638+
defer p.inflight.Done()
6639+
p.primeWorkspaceMCPCache(opts.primerCtx, p.logger, snapshot.ID, opts.workspaceCtx)
6640+
}()
6641+
}
65236642
}
65246643

65256644
tools = append(tools,
@@ -6852,7 +6971,16 @@ func (p *Server) runChat(
68526971
currentChat: &currentChat,
68536972
loadChatSnapshot: loadChatSnapshot,
68546973
}
6855-
defer workspaceCtx.close()
6974+
// primerCtx scopes the workspace MCP cache primer goroutines that
6975+
// onChatUpdated launches. We cancel it before workspaceCtx.close()
6976+
// so an in-flight primer cannot wake from its retry backoff,
6977+
// observe a cleared cached conn, dial a fresh one, and leak it
6978+
// when no subsequent close() runs.
6979+
primerCtx, primerCancel := context.WithCancel(ctx)
6980+
defer func() {
6981+
primerCancel()
6982+
workspaceCtx.close()
6983+
}()
68566984

68576985
planPathFn := func(ctx context.Context) (string, string, error) {
68586986
conn, err := workspaceCtx.getWorkspaceConn(ctx)
@@ -7435,6 +7563,7 @@ func (p *Server) runChat(
74357563
resolvePlanPath: resolvePlanPathForTools,
74367564
storeFile: storeChatAttachment,
74377565
isPlanModeTurn: isPlanModeTurn,
7566+
primerCtx: primerCtx,
74387567
})
74397568
}
74407569

@@ -7766,25 +7895,39 @@ func (p *Server) runChat(
77667895
},
77677896
PrepareTools: func(currentTools []fantasy.AgentTool) []fantasy.AgentTool {
77687897
// Mid-turn workspace MCP discovery for chats that bind a
7769-
// workspace via create_workspace or start_workspace
7770-
// after the turn has already started. The top-of-turn
7771-
// discovery path is gated on chat.WorkspaceID.Valid; this
7772-
// callback bridges the gap so the LLM sees workspace MCP
7773-
// tools on the very next step instead of the turn after.
7898+
// workspace via create_workspace or start_workspace after the
7899+
// turn has already started. The top-of-turn discovery path is
7900+
// gated on chat.WorkspaceID.Valid; this callback bridges the
7901+
// gap so the LLM sees workspace MCP tools on the very next
7902+
// step instead of the turn after.
7903+
//
7904+
// create_workspace and start_workspace prime
7905+
// workspaceMCPToolsCache via onChatUpdated after
7906+
// waitForAgentReady returns, so the call below is almost
7907+
// always a cache hit. The primer's bounded wait means the
7908+
// dial fallback here only runs when priming itself failed.
77747909
if workspaceMCPDiscovered || isExploreSubagent {
77757910
return nil
77767911
}
77777912
snapshot := workspaceCtx.currentChatSnapshot()
77787913
if !snapshot.WorkspaceID.Valid {
77797914
return nil
77807915
}
7781-
workspaceMCPDiscovered = true
77827916
discovered := p.discoverWorkspaceMCPTools(
77837917
ctx, loopLogger, chat.ID, &workspaceCtx,
77847918
)
77857919
if len(discovered) == 0 {
7920+
// Leave workspaceMCPDiscovered false so a subsequent
7921+
// step retries discovery. PrepareTools fires once per
7922+
// LLM step, so retries are unbounded for the rest of
7923+
// the turn. Per-step cost is one
7924+
// GetWorkspaceAgentsInLatestBuildByWorkspaceID query
7925+
// plus one ListMCPTools RPC, both fast against a live
7926+
// conn. The primer's 30s budget applies to its own
7927+
// loop only.
77867928
return nil
77877929
}
7930+
workspaceMCPDiscovered = true
77887931
return append(slices.Clone(currentTools), discovered...)
77897932
},
77907933
PrepareMessages: func(msgs []fantasy.Message) []fantasy.Message {

0 commit comments

Comments
 (0)