Skip to content

fix(coderd/x/chatd): prime workspace MCP cache after create/start#25298

Merged
kylecarbs merged 6 commits into
mainfrom
fix/prepare-tools-mcp-retry
May 18, 2026
Merged

fix(coderd/x/chatd): prime workspace MCP cache after create/start#25298
kylecarbs merged 6 commits into
mainfrom
fix/prepare-tools-mcp-retry

Conversation

@kylecarbs
Copy link
Copy Markdown
Member

@kylecarbs kylecarbs commented May 13, 2026

Problem

Mid-turn workspace MCP discovery was broken when an agent was still cold-starting. PrepareTools in chatd.go flipped workspaceMCPDiscovered = true before calling discoverWorkspaceMCPTools, so a failed discovery attempt permanently blocked retries within the turn.

Customer-reported repro:

  • New chat with no pre-selected workspace.
  • LLM calls create_workspace mid-turn at 23:35:05.
  • PrepareTools fires, dials the agent with a 30s timeout, dial times out at 23:38:15, discoverWorkspaceMCPTools returns empty.
  • Agent connects at 23:38:29, 14 seconds later.
  • workspaceMCPDiscovered was already true, so PrepareTools never retried for the rest of the turn. MCP tools only appeared on the next user message.

A naive retry loop in PrepareTools would also miss the bigger picture: a workspace boot can take several minutes (EC2 cold start, 10 min startup scripts), and the chatloop only gets a chance to call PrepareTools between LLM steps.

Fix

Do the workspace MCP discovery from inside the tool that already waits for the agent. chattool.CreateWorkspace and chattool.StartWorkspace call waitForAgentReady, which has a 2 min agent-online budget plus a 10 min startup-script budget. By the time they fire OnChatUpdated, the agent is Ready. The chatd onChatUpdated callback now launches an async primeWorkspaceMCPCache goroutine on every bind that has a valid workspace ID:

  • The primer calls discoverWorkspaceMCPTools until it returns a non-empty list or workspaceMCPPrimeMaxWait (30s) elapses, with a 2s backoff between attempts. The bounded wait handles the short race between agent-online and the agent's MCP Connect settling.
  • The primer runs asynchronously so the tool itself never blocks. Some templates simply do not advertise MCP tools, in which case the primer would otherwise spend its full budget for nothing.
  • The primer shares the chat ctx (not a detached one) so it is canceled together with the chat. A dangling primer would re-dial the workspace conn after runChat's deferred workspaceCtx.close() and leak that conn.
  • inflight.Add(1) ensures server shutdown still waits for any in-progress primer.
  • PrepareTools is simplified back to a single discovery call. It now only sets workspaceMCPDiscovered = true on success, so an empty result no longer permanently blocks discovery within the turn. The cache hit warmed by the primer makes that call cheap in the common case; the dial fallback handles the rare cache miss.

Tests

All in coderd/x/chatd/chatd_internal_test.go:

  • TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt — single ListMCPTools call returning tools populates the cache.
  • TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear — first call empty, second returns tools; primer retries past the backoff and writes the cache. Uses quartz.Mock.Trap on NewTimer.
  • TestPrimeWorkspaceMCPCache_GivesUpAfterDeadlineListMCPTools always empty; primer stops at workspaceMCPPrimeMaxWait and refuses to cache the empty result so PrepareTools can retry on the next step.

The existing integration test TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspace continues to pass and now also exercises the async-primer path end-to-end via the create_workspace tool.

go test ./coderd/x/chatd/... -count=1
go test ./coderd/x/chatd/ -race -count=1
make pre-commit
Design notes
  • The first iteration of this PR added retry+cooldown+failure-cap logic inside PrepareTools. It worked for the customer's ~30s race window but did not help workspaces that take several minutes to boot, because PrepareTools only fires between LLM steps. Reviewer pointed out the right place to handle this is the tool itself; the current implementation does that.
  • Why async: a primer that ran synchronously inside the OnChatUpdated callback blocked the create_workspace tool from returning for up to workspaceMCPPrimeMaxWait, which broke TestCreateWorkspaceTool_EndToEnd and would hurt any template that does not expose MCP tools. Decoupling lets the tool return immediately and lets the primer warm the cache concurrently with the next LLM step.
  • Why share the chat ctx rather than context.WithoutCancel(ctx) (the title-generation pattern): the primer touches workspaceCtx.getWorkspaceConn, which runChat's deferred workspaceCtx.close() invalidates. A detached primer outliving the chat would dial a fresh conn and leak it.
  • The constant naming distinguishes workspaceMCPDiscoveryTimeout (35s per-call dial budget, unchanged from fix(coderd/x/chatd): discover workspace MCP tools mid-turn after create_workspace #25169) from workspaceMCPPrimeMaxWait (30s total budget for the post-ready primer loop) and workspaceMCPPrimeRetryInterval (2s between empty-result retries).

Follow-up to #25169.


This pull request was generated by Coder Agents.

Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is well-targeted: moving workspaceMCPDiscovered to "succeeded" semantics eliminates the class of bug where a failed discovery permanently blocks retries. The cooldown mechanism prevents the obvious follow-on problem of retries stalling every LLM step. Clean scope, no drive-bys, proportional to the reported regression. Pariston tried to build a case against the diagnosis and could not: "I tried to find the right thing being built wrong. I could not."

1 P2, 3 P3, 3 Nit, 3 Note.

The P2 is a test coverage gap: the top-of-turn cold-start retry path (workspace exists at turn start, agent cold) has zero test coverage, so the original bug could be silently reintroduced for that code path without any test catching it. The three P3s address cooldown enforcement testing, retry capping, and operational logging. DEREM-1 and DEREM-2 could be addressed by a single test with a mock clock and non-zero cooldown starting from a valid workspace.


coderd/x/chatd/chatloop/chatloop.go:182

Note [DEREM-8] The comment says "The chatloop tracks whether tools have already been replaced so PrepareTools is not retried on subsequent steps once it has returned a non-nil slice." The chatloop code at line 409 calls opts.PrepareTools(opts.Tools) unconditionally on every step with no tracking flag. The "not retried" behavior comes entirely from the callback's internal state (workspaceMCPDiscovered in chatd.go). This PR's retry mechanism depends on PrepareTools being called on every step even after returning nil. If someone reads this comment and adds the tracking it describes, the retry breaks silently.

(Bisky, Ryosuke)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd_test.go Outdated
Comment thread coderd/x/chatd/chatd_test.go Outdated
Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Copy link
Copy Markdown
Member Author

DEREM-8: Updated the PrepareTools doc comment in coderd/x/chatd/chatloop/chatloop.go to say it is invoked unconditionally before every step and that callers must track stop-conditions inside their callback. This protects the new retry mechanism from anyone adding tracking that matches the old (incorrect) doc.

Coder Agents reply.

Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 8 fixed findings from R1 verified against HEAD. The renames, retry cap, cooldown test, top-of-turn test, diagnostic log, and comment fixes all address their root causes. The panel found three new items introduced by the fix commit, all minor.

1 P3, 2 Nit.

The P3 is a fix-chain regression: the Debug log added for DEREM-4 says "will retry after cooldown" on the terminal failure that reaches the cap, promising a retry that never happens. The two Nits are test comments that describe scenarios the test code does not implement.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd_test.go Outdated
Comment thread coderd/x/chatd/chatd_test.go Outdated
Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

13 findings across 3 rounds: 1 P2, 4 P3, 5 Nit, 3 Note. 11 fixed, 2 acknowledged. No open findings.

The fix correctly moves workspaceMCPDiscoverySucceeded from "attempted" to "succeeded" semantics, eliminating the class of bug where a failed discovery permanently blocks retries. The cooldown (10s) prevents per-step stalls, the failure cap (3) bounds worst-case retry cost, and the Debug log distinguishes terminal exhaustion from transient retries. Four tests cover all behavioral properties: mid-turn retry, top-of-turn cold-start retry, cooldown blocking, and failure cap exhaustion. Each test pins one property and would fail if that property regressed.

Pariston tried three alternative framings (async discovery, agent-readiness gate, chatloop contract fix) and concluded the chosen approach is at the right causal level. Bisky confirmed mutation coverage: inverting the cooldown comparison, the cap comparison, or the flag derivation each breaks exactly one test.

🤖 This review was automatically generated with Coder Agents.

Replaces the mid-turn retry+cooldown loop in PrepareTools with an
async cache primer that runs from the create_workspace and
start_workspace tools' OnChatUpdated callback. Those tools already
wait for the workspace agent to become reachable via
waitForAgentReady, so by the time the primer runs the agent is
Ready: a single ListMCPTools call usually populates the cache, and
the very next PrepareTools step is a cache hit.

When the agent's MCP Connect is still racing with agent startup
ListMCPTools returns an empty list (no error). The primer retries
with a short backoff up to workspaceMCPPrimeMaxWait so the LLM step
that follows the tool call sees the workspace MCP tools in the
cache.

The primer runs asynchronously: the tool itself must not block
because some templates do not advertise any MCP tools at all, in
which case the primer waits the full budget before giving up. The
existing PrepareTools fallback covers the cache miss path; the
primer is purely an optimization that warms the cache while the LLM
is thinking. The primer shares the chat ctx so it is canceled with
the chat, preventing a dangling primer from re-dialing the workspace
conn after runChat's deferred close. inflight tracking ensures
server shutdown still waits for any in-progress primer.

PrepareTools is simplified: workspaceMCPDiscovered is now only set
to true after a non-empty discovery, so an empty result no longer
permanently blocks discovery within the turn.
@kylecarbs kylecarbs force-pushed the fix/prepare-tools-mcp-retry branch from 384a705 to e01f79d Compare May 14, 2026 22:50
@kylecarbs kylecarbs changed the title fix(coderd/x/chatd): retry workspace MCP discovery after mid-turn failures fix(coderd/x/chatd): prime workspace MCP cache after create/start May 14, 2026
Copy link
Copy Markdown
Member Author

Pivoted the implementation after reviewing the customer scenario more carefully. The previous retry/cooldown/failure-cap loop in PrepareTools only handled the ~30s cold-start race; an EC2 workspace that takes several minutes to come up would still miss MCP discovery entirely. The chatloop only invokes PrepareTools between LLM steps, so it cannot keep retrying through a multi-minute boot.

The new approach moves discovery into the tool that already waits for the agent. chattool.CreateWorkspace and chattool.StartWorkspace call waitForAgentReady (2-minute agent-online budget plus 10-minute startup-script budget) and only fire OnChatUpdated afterward, when the agent is Ready. chatd's onChatUpdated callback now launches an async primeWorkspaceMCPCache goroutine that loops discoverWorkspaceMCPTools until it returns a non-empty list or workspaceMCPPrimeMaxWait (30s) elapses. PrepareTools is simplified back to a single cache-hit-or-dial call and only sets workspaceMCPDiscovered = true on success.

Key design choices documented in the PR description and in the source comments:

  • The primer is async so it cannot block tool completion when the template advertises no MCP tools.
  • The primer shares the chat ctx (not a detached one) so it is canceled with the chat, preventing a dangling primer from re-dialing the workspace conn after runChat's deferred close.
  • inflight.Add(1) ensures server shutdown still waits for any in-progress primer.

Force-pushed to replace the old commits. Tests: 3 new unit tests for the primer plus the existing mid-turn integration test, all passing under -race.

/coder-agents-review


This comment was generated by Coder Agents.

Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pivot to an async cache primer launched from the tool's onChatUpdated callback is a better architecture than the PrepareTools retry loop. Moving discovery to the tool that already waits for agent readiness eliminates the fundamental timing problem. Hisoka put it well: "The async primer is the right architecture."

1 P2, 4 P3, 1 Note.

The P2 is a connection leak: the primer can re-dial after workspaceCtx.close() fires but before cancel(nil) in processChat. The fix is a 3-line defer reordering to cancel a child context before closing the workspace conn. Four reviewers independently identified the close-to-cancel ordering issue.

The strongest P3 (DEREM-15) has 4-way convergence: the primer fires on every onChatUpdated callback, including pre-ready callbacks and stop_workspace. The guard updatedChat.WorkspaceID.Valid passes all firings, spawning redundant primer goroutines that waste 30s each. Proposed fix: updatedChat.WorkspaceID.Valid && updatedChat.AgentID.Valid.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go
DEREM-14 (P2): Plumb a primerCtx through rootChatToolsOptions and
cancel it before workspaceCtx.close() runs. Previously the primer
captured the chat ctx directly, which is canceled by processChat
after runChat returns and workspaceCtx.close has already released
the cached agent conn. A primer waking from its retry backoff in
that window would observe a nil conn, dial a fresh one, and leak
it because no later close() runs.

DEREM-15 (P3): Gate the primer goroutine on both WorkspaceID.Valid
and AgentID.Valid. The OnChatUpdated callback fires from
create_workspace at binding, build, and post-ready, plus once from
stop_workspace. Only the post-ready callback has a live AgentID,
so the prior guard spawned multiple primer goroutines that wasted
their 30s budget dialing missing or dying agents.

DEREM-18 (P3): Rewrite the PrepareTools empty-result comment. The
primer's 30s budget applies to its own loop; PrepareTools is
unbounded but cheap per step (one DB query plus one ListMCPTools
RPC against a cached conn).

DEREM-19 (Note): Clarify that workspaceMCPPrimeMaxWait caps when
new retries can start, not when an in-flight call must finish.
Worst-case wall time is workspaceMCPPrimeMaxWait + dialTimeout +
workspaceMCPDiscoveryTimeout when the last attempt is slow.

DEREM-16 (P3): Add TestRunChat_PrepareToolsRetriesAfterEmptyDiscovery
which seeds ListMCPTools to return empty for the first two calls
and the workspace tool for the third+. The chat takes multiple
steps after create_workspace and the test asserts that a post-step
streamed model call eventually advertises the workspace tool.
Inverting the workspaceMCPDiscovered flag-flip (back to true before
the empty check) breaks this test.

DEREM-17 (P3): Add TestPrimeWorkspaceMCPCache_ExitsOnContextCancel
which arms the primer's first retry timer, cancels the primer ctx,
and asserts the goroutine exits without caching. The case <-ctx.Done()
branch in the retry loop is now covered.
Copy link
Copy Markdown
Member Author

All six round-1 findings (1 P2, 4 P3, 1 Note) addressed in 9d9fe1d. Resolved each thread with a reply summarizing the fix. Verified the new test TestRunChat_PrepareToolsRetriesAfterEmptyDiscovery fails when the workspaceMCPDiscovered flag-flip is reverted to the original buggy ordering.

/coder-agents-review


This comment was generated by Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 6 R4 fixes verified. The primerCtx lifecycle, AgentID guard, PrepareTools retry test, ctx-cancel test, comment fixes, and constant doc update all address their root causes correctly. One new P3.

1 P3.

The DEREM-15 fix (adding updatedChat.AgentID.Valid to the primer guard) inadvertently made the primer unreachable for the primary scenario: new workspace creation mid-turn. The guard checks the stale updatedChat parameter, but AgentID is set inside the callback body by persistBuildAgentBinding, after the parameter was read from the DB. The system works because PrepareTools retries via the flag-flip fix, so this is a dead optimization, not a correctness issue.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go Outdated
DEREM-20 (P3): The previous DEREM-15 fix narrowed the primer guard
to also require updatedChat.AgentID.Valid, but the updatedChat
parameter is the row read by the tool before its OnChatUpdated
callback runs. create_workspace persists the workspace binding
with a null AgentID; persistBuildAgentBinding (which fills in the
AgentID) only runs inside the callback, via persistInstructionFiles
-> getWorkspaceAgent -> ensureWorkspaceAgent. The stale parameter
made the primer unreachable on the primary scenario (new workspace
mid-turn).

Read the snapshot from workspaceCtx.currentChatSnapshot() instead.
ensureWorkspaceAgent calls setCurrentChat after persisting the
binding, so the snapshot reflects the freshly bound AgentID by the
time the primer guard runs. The chat ID is also taken from the
snapshot for consistency.

Existing integration tests already exercise the bind-then-prime
path; verified all existing chatd tests still pass.
Copy link
Copy Markdown
Member Author

DEREM-20 addressed in 31968f5. Switched the primer guard to read the in-memory chat snapshot rather than the stale updatedChat parameter, so the primer now fires on the create_workspace post-ready path as intended. Thread resolved.

/coder-agents-review


This comment was generated by Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

21 findings across 6 rounds (2 P2, 8 P3, 6 Nit, 4 Note, plus 1 new Nit). 18 fixed, 2 acknowledged (moot after implementation pivot), 1 open Nit.

The async cache primer launched from onChatUpdated after waitForAgentReady is the right architecture. The primer warms the cache concurrently with the next LLM step, PrepareTools falls back to a direct dial on cache miss, and the flag-flip fix ensures empty discovery never permanently blocks retries. The primerCtx lifecycle prevents conn leaks, the AgentID guard prevents premature primer launches, and the snapshot-based guard reads fresh state after persistBuildAgentBinding runs.

Five tests cover the primer (success, retry, deadline, ctx-cancel) plus an integration test that pins the PrepareTools flag-flip regression. Bisky confirmed mutation coverage across all behavioral properties.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go Outdated
DEREM-21 (Nit): The previous comment said create_workspace fires
onChatUpdated three times, but no single invocation produces that
many. A new-build path fires twice (binding before
waitForAgentReady, post-ready after); the existing-workspace branch
in handleExistingWorkspace is a separate path that fires once.
Rewrite to describe the per-tool firing pattern accurately.

Comment-only change. No behavior changes.
Copy link
Copy Markdown
Member Author

DEREM-21 Nit fixed in 4a95c94 (comment-only). Thread resolved. All review findings across the six rounds are now closed.

/coder-agents-review


This comment was generated by Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

22 findings across 7 rounds (2 P2, 8 P3, 7 Nit, 4 Note, 1 P4). 19 fixed, 2 acknowledged (moot after pivot), 1 open P4.

All prior fixes verified. The open P4 is a test assertion tightness suggestion that does not affect correctness. The async cache primer architecture is sound: primer warms the cache concurrently, PrepareTools retries on cache miss via the flag-flip fix, primerCtx prevents conn leaks, AgentID guard prevents premature launches, and the snapshot-based guard reads fresh state.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd_internal_test.go Outdated
DEREM-22 (P4): The prior assertion required listCalls > 1, which
passes even if the primer exits after a second call (e.g. if the
deadline were miscomputed as now + retryInterval rather than now +
maxWait). Replace with a high-water-mark assertion using
workspaceMCPPrimeMaxWait / workspaceMCPPrimeRetryInterval / 2 as
the floor, so the test catches deadline shortenings while leaving
slack for off-by-one boundary effects.

The 'cache not populated' assertion remains the load-bearing
property; this secondary check now reinforces it.
Copy link
Copy Markdown
Member Author

DEREM-22 P4 fixed in 89e10e6. Tightened the deadline test assertion to a high-water mark so a primer that exited the loop after only a couple of attempts would now fail the test. Thread resolved.

/coder-agents-review


This comment was generated by Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

23 findings across 8 rounds (2 P2, 8 P3, 1 P4, 8 Nit, 4 Note). 20 fixed, 2 acknowledged (moot after pivot), 1 open Nit.

DEREM-22 fix verified: assertion tightened from > 1 to >= 7 (half the expected ~15 retries). All prior fixes remain solid. One new Nit: the opening comment at line 6601 says the callback fires only after waitForAgentReady, but the guard explanation at line 6619 correctly describes it firing twice (binding before + post-ready after). The guard logic is correct; the introductory sentence is misleading.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go Outdated
DEREM-23 (Nit): The opening sentence said "The tool only fires this
callback after waitForAgentReady returns" which contradicts the
later comment block that correctly notes create_workspace and
start_workspace each fire onChatUpdated twice (binding + post-ready).
The AgentID guard further down is what filters to the post-ready
firing, not the tool itself.

Comment-only change. No behavior changes.
Copy link
Copy Markdown
Member Author

DEREM-23 Nit fixed in 155f8c0 (comment-only). Thread resolved.

/coder-agents-review


This comment was generated by Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

23 findings across 9 rounds. All closed: 21 fixed, 2 acknowledged.

No new findings. The panel verified all prior fixes hold, the primer lifecycle is correct (primerCancel before close, inflight tracking, mutex-protected conn sharing), the flag-flip fix prevents permanent discovery blocking, and the test suite covers all four primer exit modes plus the end-to-end PrepareTools retry regression.

🤖 This review was automatically generated with Coder Agents.

@kylecarbs kylecarbs requested a review from ibetitsmike May 18, 2026 11:04
@kylecarbs kylecarbs merged commit 1590896 into main May 18, 2026
26 checks passed
@kylecarbs kylecarbs deleted the fix/prepare-tools-mcp-retry branch May 18, 2026 11:56
@github-actions github-actions Bot locked and limited conversation to collaborators May 18, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants