fix(coderd/x/chatd): prime workspace MCP cache after create/start by kylecarbs · Pull Request #25298 · coder/coder

kylecarbs · 2026-05-13T15:42:24Z

Problem

Mid-turn workspace MCP discovery was broken when an agent was still cold-starting. PrepareTools in chatd.go flipped workspaceMCPDiscovered = true before calling discoverWorkspaceMCPTools, so a failed discovery attempt permanently blocked retries within the turn.

Customer-reported repro:

New chat with no pre-selected workspace.
LLM calls create_workspace mid-turn at 23:35:05.
PrepareTools fires, dials the agent with a 30s timeout, dial times out at 23:38:15, discoverWorkspaceMCPTools returns empty.
Agent connects at 23:38:29, 14 seconds later.
workspaceMCPDiscovered was already true, so PrepareTools never retried for the rest of the turn. MCP tools only appeared on the next user message.

A naive retry loop in PrepareTools would also miss the bigger picture: a workspace boot can take several minutes (EC2 cold start, 10 min startup scripts), and the chatloop only gets a chance to call PrepareTools between LLM steps.

Fix

Do the workspace MCP discovery from inside the tool that already waits for the agent. chattool.CreateWorkspace and chattool.StartWorkspace call waitForAgentReady, which has a 2 min agent-online budget plus a 10 min startup-script budget. By the time they fire OnChatUpdated, the agent is Ready. The chatd onChatUpdated callback now launches an async primeWorkspaceMCPCache goroutine on every bind that has a valid workspace ID:

The primer calls discoverWorkspaceMCPTools until it returns a non-empty list or workspaceMCPPrimeMaxWait (30s) elapses, with a 2s backoff between attempts. The bounded wait handles the short race between agent-online and the agent's MCP Connect settling.
The primer runs asynchronously so the tool itself never blocks. Some templates simply do not advertise MCP tools, in which case the primer would otherwise spend its full budget for nothing.
The primer shares the chat ctx (not a detached one) so it is canceled together with the chat. A dangling primer would re-dial the workspace conn after runChat's deferred workspaceCtx.close() and leak that conn.
inflight.Add(1) ensures server shutdown still waits for any in-progress primer.
PrepareTools is simplified back to a single discovery call. It now only sets workspaceMCPDiscovered = true on success, so an empty result no longer permanently blocks discovery within the turn. The cache hit warmed by the primer makes that call cheap in the common case; the dial fallback handles the rare cache miss.

Tests

All in coderd/x/chatd/chatd_internal_test.go:

TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt — single ListMCPTools call returning tools populates the cache.
TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear — first call empty, second returns tools; primer retries past the backoff and writes the cache. Uses quartz.Mock.Trap on NewTimer.
TestPrimeWorkspaceMCPCache_GivesUpAfterDeadline — ListMCPTools always empty; primer stops at workspaceMCPPrimeMaxWait and refuses to cache the empty result so PrepareTools can retry on the next step.

The existing integration test TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace continues to pass and now also exercises the async-primer path end-to-end via the create_workspace tool.

go test ./coderd/x/chatd/... -count=1
go test ./coderd/x/chatd/ -race -count=1
make pre-commit

Design notes

The first iteration of this PR added retry+cooldown+failure-cap logic inside PrepareTools. It worked for the customer's ~30s race window but did not help workspaces that take several minutes to boot, because PrepareTools only fires between LLM steps. Reviewer pointed out the right place to handle this is the tool itself; the current implementation does that.
Why async: a primer that ran synchronously inside the OnChatUpdated callback blocked the create_workspace tool from returning for up to workspaceMCPPrimeMaxWait, which broke TestCreateWorkspaceTool_EndToEnd and would hurt any template that does not expose MCP tools. Decoupling lets the tool return immediately and lets the primer warm the cache concurrently with the next LLM step.
Why share the chat ctx rather than context.WithoutCancel(ctx) (the title-generation pattern): the primer touches workspaceCtx.getWorkspaceConn, which runChat's deferred workspaceCtx.close() invalidates. A detached primer outliving the chat would dial a fresh conn and leak it.
The constant naming distinguishes workspaceMCPDiscoveryTimeout (35s per-call dial budget, unchanged from fix(coderd/x/chatd): discover workspace MCP tools mid-turn after create_workspace #25169) from workspaceMCPPrimeMaxWait (30s total budget for the post-ready primer loop) and workspaceMCPPrimeRetryInterval (2s between empty-result retries).

Follow-up to #25169.

This pull request was generated by Coder Agents.

kylecarbs · 2026-05-13T15:51:04Z

/coder-agents-review

coder-agents-review

The fix is well-targeted: moving workspaceMCPDiscovered to "succeeded" semantics eliminates the class of bug where a failed discovery permanently blocks retries. The cooldown mechanism prevents the obvious follow-on problem of retries stalling every LLM step. Clean scope, no drive-bys, proportional to the reported regression. Pariston tried to build a case against the diagnosis and could not: "I tried to find the right thing being built wrong. I could not."

1 P2, 3 P3, 3 Nit, 3 Note.

The P2 is a test coverage gap: the top-of-turn cold-start retry path (workspace exists at turn start, agent cold) has zero test coverage, so the original bug could be silently reintroduced for that code path without any test catching it. The three P3s address cooldown enforcement testing, retry capping, and operational logging. DEREM-1 and DEREM-2 could be addressed by a single test with a mock clock and non-zero cooldown starting from a valid workspace.

coderd/x/chatd/chatloop/chatloop.go:182

Note [DEREM-8] The comment says "The chatloop tracks whether tools have already been replaced so PrepareTools is not retried on subsequent steps once it has returned a non-nil slice." The chatloop code at line 409 calls opts.PrepareTools(opts.Tools) unconditionally on every step with no tracking flag. The "not retried" behavior comes entirely from the callback's internal state (workspaceMCPDiscovered in chatd.go). This PR's retry mechanism depends on PrepareTools being called on every step even after returning nil. If someone reads this comment and adds the tracking it describes, the retry breaks silently.

(Bisky, Ryosuke)

kylecarbs · 2026-05-13T16:55:11Z

DEREM-8: Updated the PrepareTools doc comment in coderd/x/chatd/chatloop/chatloop.go to say it is invoked unconditionally before every step and that callers must track stop-conditions inside their callback. This protects the new retry mechanism from anyone adding tracking that matches the old (incorrect) doc.

kylecarbs · 2026-05-13T16:55:16Z

/coder-agents-review

coder-agents-review

All 8 fixed findings from R1 verified against HEAD. The renames, retry cap, cooldown test, top-of-turn test, diagnostic log, and comment fixes all address their root causes. The panel found three new items introduced by the fix commit, all minor.

1 P3, 2 Nit.

The P3 is a fix-chain regression: the Debug log added for DEREM-4 says "will retry after cooldown" on the terminal failure that reaches the cap, promising a retry that never happens. The two Nits are test comments that describe scenarios the test code does not implement.

kylecarbs · 2026-05-13T17:48:43Z

/coder-agents-review

coder-agents-review

13 findings across 3 rounds: 1 P2, 4 P3, 5 Nit, 3 Note. 11 fixed, 2 acknowledged. No open findings.

The fix correctly moves workspaceMCPDiscoverySucceeded from "attempted" to "succeeded" semantics, eliminating the class of bug where a failed discovery permanently blocks retries. The cooldown (10s) prevents per-step stalls, the failure cap (3) bounds worst-case retry cost, and the Debug log distinguishes terminal exhaustion from transient retries. Four tests cover all behavioral properties: mid-turn retry, top-of-turn cold-start retry, cooldown blocking, and failure cap exhaustion. Each test pins one property and would fail if that property regressed.

Pariston tried three alternative framings (async discovery, agent-readiness gate, chatloop contract fix) and concluded the chosen approach is at the right causal level. Bisky confirmed mutation coverage: inverting the cooldown comparison, the cap comparison, or the flag derivation each breaks exactly one test.

Replaces the mid-turn retry+cooldown loop in PrepareTools with an async cache primer that runs from the create_workspace and start_workspace tools' OnChatUpdated callback. Those tools already wait for the workspace agent to become reachable via waitForAgentReady, so by the time the primer runs the agent is Ready: a single ListMCPTools call usually populates the cache, and the very next PrepareTools step is a cache hit. When the agent's MCP Connect is still racing with agent startup ListMCPTools returns an empty list (no error). The primer retries with a short backoff up to workspaceMCPPrimeMaxWait so the LLM step that follows the tool call sees the workspace MCP tools in the cache. The primer runs asynchronously: the tool itself must not block because some templates do not advertise any MCP tools at all, in which case the primer waits the full budget before giving up. The existing PrepareTools fallback covers the cache miss path; the primer is purely an optimization that warms the cache while the LLM is thinking. The primer shares the chat ctx so it is canceled with the chat, preventing a dangling primer from re-dialing the workspace conn after runChat's deferred close. inflight tracking ensures server shutdown still waits for any in-progress primer. PrepareTools is simplified: workspaceMCPDiscovered is now only set to true after a non-empty discovery, so an empty result no longer permanently blocks discovery within the turn.

kylecarbs · 2026-05-14T22:50:47Z

Pivoted the implementation after reviewing the customer scenario more carefully. The previous retry/cooldown/failure-cap loop in PrepareTools only handled the ~30s cold-start race; an EC2 workspace that takes several minutes to come up would still miss MCP discovery entirely. The chatloop only invokes PrepareTools between LLM steps, so it cannot keep retrying through a multi-minute boot.

The new approach moves discovery into the tool that already waits for the agent. chattool.CreateWorkspace and chattool.StartWorkspace call waitForAgentReady (2-minute agent-online budget plus 10-minute startup-script budget) and only fire OnChatUpdated afterward, when the agent is Ready. chatd's onChatUpdated callback now launches an async primeWorkspaceMCPCache goroutine that loops discoverWorkspaceMCPTools until it returns a non-empty list or workspaceMCPPrimeMaxWait (30s) elapses. PrepareTools is simplified back to a single cache-hit-or-dial call and only sets workspaceMCPDiscovered = true on success.

Key design choices documented in the PR description and in the source comments:

The primer is async so it cannot block tool completion when the template advertises no MCP tools.
The primer shares the chat ctx (not a detached one) so it is canceled with the chat, preventing a dangling primer from re-dialing the workspace conn after runChat's deferred close.
inflight.Add(1) ensures server shutdown still waits for any in-progress primer.

Force-pushed to replace the old commits. Tests: 3 new unit tests for the primer plus the existing mid-turn integration test, all passing under -race.

/coder-agents-review

This comment was generated by Coder Agents.

kylecarbs · 2026-05-14T23:08:47Z

/coder-agents-review

coder-agents-review

The pivot to an async cache primer launched from the tool's onChatUpdated callback is a better architecture than the PrepareTools retry loop. Moving discovery to the tool that already waits for agent readiness eliminates the fundamental timing problem. Hisoka put it well: "The async primer is the right architecture."

1 P2, 4 P3, 1 Note.

The P2 is a connection leak: the primer can re-dial after workspaceCtx.close() fires but before cancel(nil) in processChat. The fix is a 3-line defer reordering to cancel a child context before closing the workspace conn. Four reviewers independently identified the close-to-cancel ordering issue.

The strongest P3 (DEREM-15) has 4-way convergence: the primer fires on every onChatUpdated callback, including pre-ready callbacks and stop_workspace. The guard updatedChat.WorkspaceID.Valid passes all firings, spawning redundant primer goroutines that waste 30s each. Proposed fix: updatedChat.WorkspaceID.Valid && updatedChat.AgentID.Valid.

DEREM-14 (P2): Plumb a primerCtx through rootChatToolsOptions and cancel it before workspaceCtx.close() runs. Previously the primer captured the chat ctx directly, which is canceled by processChat after runChat returns and workspaceCtx.close has already released the cached agent conn. A primer waking from its retry backoff in that window would observe a nil conn, dial a fresh one, and leak it because no later close() runs. DEREM-15 (P3): Gate the primer goroutine on both WorkspaceID.Valid and AgentID.Valid. The OnChatUpdated callback fires from create_workspace at binding, build, and post-ready, plus once from stop_workspace. Only the post-ready callback has a live AgentID, so the prior guard spawned multiple primer goroutines that wasted their 30s budget dialing missing or dying agents. DEREM-18 (P3): Rewrite the PrepareTools empty-result comment. The primer's 30s budget applies to its own loop; PrepareTools is unbounded but cheap per step (one DB query plus one ListMCPTools RPC against a cached conn). DEREM-19 (Note): Clarify that workspaceMCPPrimeMaxWait caps when new retries can start, not when an in-flight call must finish. Worst-case wall time is workspaceMCPPrimeMaxWait + dialTimeout + workspaceMCPDiscoveryTimeout when the last attempt is slow. DEREM-16 (P3): Add TestRunChat_PrepareToolsRetriesAfterEmptyDiscovery which seeds ListMCPTools to return empty for the first two calls and the workspace tool for the third+. The chat takes multiple steps after create_workspace and the test asserts that a post-step streamed model call eventually advertises the workspace tool. Inverting the workspaceMCPDiscovered flag-flip (back to true before the empty check) breaks this test. DEREM-17 (P3): Add TestPrimeWorkspaceMCPCache_ExitsOnContextCancel which arms the primer's first retry timer, cancels the primer ctx, and asserts the goroutine exits without caching. The case <-ctx.Done() branch in the retry loop is now covered.

kylecarbs · 2026-05-15T00:00:29Z

All six round-1 findings (1 P2, 4 P3, 1 Note) addressed in 9d9fe1d. Resolved each thread with a reply summarizing the fix. Verified the new test TestRunChat_PrepareToolsRetriesAfterEmptyDiscovery fails when the workspaceMCPDiscovered flag-flip is reverted to the original buggy ordering.

/coder-agents-review

This comment was generated by Coder Agents.

coder-agents-review

All 6 R4 fixes verified. The primerCtx lifecycle, AgentID guard, PrepareTools retry test, ctx-cancel test, comment fixes, and constant doc update all address their root causes correctly. One new P3.

1 P3.

The DEREM-15 fix (adding updatedChat.AgentID.Valid to the primer guard) inadvertently made the primer unreachable for the primary scenario: new workspace creation mid-turn. The guard checks the stale updatedChat parameter, but AgentID is set inside the callback body by persistBuildAgentBinding, after the parameter was read from the DB. The system works because PrepareTools retries via the flag-flip fix, so this is a dead optimization, not a correctness issue.

DEREM-20 (P3): The previous DEREM-15 fix narrowed the primer guard to also require updatedChat.AgentID.Valid, but the updatedChat parameter is the row read by the tool before its OnChatUpdated callback runs. create_workspace persists the workspace binding with a null AgentID; persistBuildAgentBinding (which fills in the AgentID) only runs inside the callback, via persistInstructionFiles -> getWorkspaceAgent -> ensureWorkspaceAgent. The stale parameter made the primer unreachable on the primary scenario (new workspace mid-turn). Read the snapshot from workspaceCtx.currentChatSnapshot() instead. ensureWorkspaceAgent calls setCurrentChat after persisting the binding, so the snapshot reflects the freshly bound AgentID by the time the primer guard runs. The chat ID is also taken from the snapshot for consistency. Existing integration tests already exercise the bind-then-prime path; verified all existing chatd tests still pass.

kylecarbs · 2026-05-15T12:57:27Z

DEREM-20 addressed in 31968f5. Switched the primer guard to read the in-memory chat snapshot rather than the stale updatedChat parameter, so the primer now fires on the create_workspace post-ready path as intended. Thread resolved.

/coder-agents-review

This comment was generated by Coder Agents.

coder-agents-review

21 findings across 6 rounds (2 P2, 8 P3, 6 Nit, 4 Note, plus 1 new Nit). 18 fixed, 2 acknowledged (moot after implementation pivot), 1 open Nit.

The async cache primer launched from onChatUpdated after waitForAgentReady is the right architecture. The primer warms the cache concurrently with the next LLM step, PrepareTools falls back to a direct dial on cache miss, and the flag-flip fix ensures empty discovery never permanently blocks retries. The primerCtx lifecycle prevents conn leaks, the AgentID guard prevents premature primer launches, and the snapshot-based guard reads fresh state after persistBuildAgentBinding runs.

Five tests cover the primer (success, retry, deadline, ctx-cancel) plus an integration test that pins the PrepareTools flag-flip regression. Bisky confirmed mutation coverage across all behavioral properties.

DEREM-21 (Nit): The previous comment said create_workspace fires onChatUpdated three times, but no single invocation produces that many. A new-build path fires twice (binding before waitForAgentReady, post-ready after); the existing-workspace branch in handleExistingWorkspace is a separate path that fires once. Rewrite to describe the per-tool firing pattern accurately. Comment-only change. No behavior changes.

kylecarbs · 2026-05-15T13:30:35Z

DEREM-21 Nit fixed in 4a95c94 (comment-only). Thread resolved. All review findings across the six rounds are now closed.

/coder-agents-review

This comment was generated by Coder Agents.

coder-agents-review

22 findings across 7 rounds (2 P2, 8 P3, 7 Nit, 4 Note, 1 P4). 19 fixed, 2 acknowledged (moot after pivot), 1 open P4.

All prior fixes verified. The open P4 is a test assertion tightness suggestion that does not affect correctness. The async cache primer architecture is sound: primer warms the cache concurrently, PrepareTools retries on cache miss via the flag-flip fix, primerCtx prevents conn leaks, AgentID guard prevents premature launches, and the snapshot-based guard reads fresh state.

DEREM-22 (P4): The prior assertion required listCalls > 1, which passes even if the primer exits after a second call (e.g. if the deadline were miscomputed as now + retryInterval rather than now + maxWait). Replace with a high-water-mark assertion using workspaceMCPPrimeMaxWait / workspaceMCPPrimeRetryInterval / 2 as the floor, so the test catches deadline shortenings while leaving slack for off-by-one boundary effects. The 'cache not populated' assertion remains the load-bearing property; this secondary check now reinforces it.

kylecarbs · 2026-05-15T14:06:45Z

DEREM-22 P4 fixed in 89e10e6. Tightened the deadline test assertion to a high-water mark so a primer that exited the loop after only a couple of attempts would now fail the test. Thread resolved.

/coder-agents-review

This comment was generated by Coder Agents.

coder-agents-review

23 findings across 8 rounds (2 P2, 8 P3, 1 P4, 8 Nit, 4 Note). 20 fixed, 2 acknowledged (moot after pivot), 1 open Nit.

DEREM-22 fix verified: assertion tightened from > 1 to >= 7 (half the expected ~15 retries). All prior fixes remain solid. One new Nit: the opening comment at line 6601 says the callback fires only after waitForAgentReady, but the guard explanation at line 6619 correctly describes it firing twice (binding before + post-ready after). The guard logic is correct; the introductory sentence is misleading.

DEREM-23 (Nit): The opening sentence said "The tool only fires this callback after waitForAgentReady returns" which contradicts the later comment block that correctly notes create_workspace and start_workspace each fire onChatUpdated twice (binding + post-ready). The AgentID guard further down is what filters to the post-ready firing, not the tool itself. Comment-only change. No behavior changes.

kylecarbs · 2026-05-15T14:30:33Z

DEREM-23 Nit fixed in 155f8c0 (comment-only). Thread resolved.

/coder-agents-review

This comment was generated by Coder Agents.

coder-agents-review

23 findings across 9 rounds. All closed: 21 fixed, 2 acknowledged.

No new findings. The panel verified all prior fixes hold, the primer lifecycle is correct (primerCancel before close, inflight tracking, mutex-protected conn sharing), the flag-flip fix prevents permanent discovery blocking, and the test suite covers all four primer exit modes plus the end-to-end PrepareTools retry regression.

github-actions Bot assigned kylecarbs May 13, 2026

coder-agents-review Bot reviewed May 13, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd.go Outdated

Comment thread coderd/x/chatd/chatd_test.go Outdated

Comment thread coderd/x/chatd/chatd_test.go Outdated

coder-agents-review Bot approved these changes May 13, 2026

View reviewed changes

kylecarbs force-pushed the fix/prepare-tools-mcp-retry branch from 384a705 to e01f79d Compare May 14, 2026 22:50

kylecarbs changed the title ~~fix(coderd/x/chatd): retry workspace MCP discovery after mid-turn failures~~ fix(coderd/x/chatd): prime workspace MCP cache after create/start May 14, 2026

coder-agents-review Bot reviewed May 14, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd.go

Comment thread coderd/x/chatd/chatd.go Outdated

Comment thread coderd/x/chatd/chatd.go

Comment thread coderd/x/chatd/chatd.go

Comment thread coderd/x/chatd/chatd.go Outdated

Comment thread coderd/x/chatd/chatd.go

coder-agents-review Bot reviewed May 15, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd.go Outdated

coder-agents-review Bot approved these changes May 15, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd.go Outdated

coder-agents-review Bot approved these changes May 15, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd_internal_test.go Outdated

coder-agents-review Bot approved these changes May 15, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd.go Outdated

coder-agents-review Bot approved these changes May 15, 2026

View reviewed changes

kylecarbs requested a review from ibetitsmike May 18, 2026 11:04

ibetitsmike approved these changes May 18, 2026

View reviewed changes

kylecarbs merged commit 1590896 into main May 18, 2026
26 checks passed

kylecarbs deleted the fix/prepare-tools-mcp-retry branch May 18, 2026 11:56

github-actions Bot locked and limited conversation to collaborators May 18, 2026

Conversation

kylecarbs commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

Uh oh!

kylecarbs commented May 13, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kylecarbs commented May 13, 2026

Uh oh!

kylecarbs commented May 13, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kylecarbs commented May 13, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

kylecarbs commented May 14, 2026

Uh oh!

kylecarbs commented May 14, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kylecarbs commented May 15, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kylecarbs commented May 15, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kylecarbs commented May 15, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kylecarbs commented May 15, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kylecarbs commented May 15, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

kylecarbs commented May 13, 2026 •

edited

Loading