fix(coderd/x/chatd): prime workspace MCP cache after create/start#25298
Conversation
|
/coder-agents-review |
There was a problem hiding this comment.
The fix is well-targeted: moving workspaceMCPDiscovered to "succeeded" semantics eliminates the class of bug where a failed discovery permanently blocks retries. The cooldown mechanism prevents the obvious follow-on problem of retries stalling every LLM step. Clean scope, no drive-bys, proportional to the reported regression. Pariston tried to build a case against the diagnosis and could not: "I tried to find the right thing being built wrong. I could not."
1 P2, 3 P3, 3 Nit, 3 Note.
The P2 is a test coverage gap: the top-of-turn cold-start retry path (workspace exists at turn start, agent cold) has zero test coverage, so the original bug could be silently reintroduced for that code path without any test catching it. The three P3s address cooldown enforcement testing, retry capping, and operational logging. DEREM-1 and DEREM-2 could be addressed by a single test with a mock clock and non-zero cooldown starting from a valid workspace.
coderd/x/chatd/chatloop/chatloop.go:182
Note [DEREM-8] The comment says "The chatloop tracks whether tools have already been replaced so PrepareTools is not retried on subsequent steps once it has returned a non-nil slice." The chatloop code at line 409 calls opts.PrepareTools(opts.Tools) unconditionally on every step with no tracking flag. The "not retried" behavior comes entirely from the callback's internal state (workspaceMCPDiscovered in chatd.go). This PR's retry mechanism depends on PrepareTools being called on every step even after returning nil. If someone reads this comment and adds the tracking it describes, the retry breaks silently.
(Bisky, Ryosuke)
🤖
🤖 This review was automatically generated with Coder Agents.
|
DEREM-8: Updated the
|
|
/coder-agents-review |
There was a problem hiding this comment.
All 8 fixed findings from R1 verified against HEAD. The renames, retry cap, cooldown test, top-of-turn test, diagnostic log, and comment fixes all address their root causes. The panel found three new items introduced by the fix commit, all minor.
1 P3, 2 Nit.
The P3 is a fix-chain regression: the Debug log added for DEREM-4 says "will retry after cooldown" on the terminal failure that reaches the cap, promising a retry that never happens. The two Nits are test comments that describe scenarios the test code does not implement.
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
13 findings across 3 rounds: 1 P2, 4 P3, 5 Nit, 3 Note. 11 fixed, 2 acknowledged. No open findings.
The fix correctly moves workspaceMCPDiscoverySucceeded from "attempted" to "succeeded" semantics, eliminating the class of bug where a failed discovery permanently blocks retries. The cooldown (10s) prevents per-step stalls, the failure cap (3) bounds worst-case retry cost, and the Debug log distinguishes terminal exhaustion from transient retries. Four tests cover all behavioral properties: mid-turn retry, top-of-turn cold-start retry, cooldown blocking, and failure cap exhaustion. Each test pins one property and would fail if that property regressed.
Pariston tried three alternative framings (async discovery, agent-readiness gate, chatloop contract fix) and concluded the chosen approach is at the right causal level. Bisky confirmed mutation coverage: inverting the cooldown comparison, the cap comparison, or the flag derivation each breaks exactly one test.
🤖 This review was automatically generated with Coder Agents.
Replaces the mid-turn retry+cooldown loop in PrepareTools with an async cache primer that runs from the create_workspace and start_workspace tools' OnChatUpdated callback. Those tools already wait for the workspace agent to become reachable via waitForAgentReady, so by the time the primer runs the agent is Ready: a single ListMCPTools call usually populates the cache, and the very next PrepareTools step is a cache hit. When the agent's MCP Connect is still racing with agent startup ListMCPTools returns an empty list (no error). The primer retries with a short backoff up to workspaceMCPPrimeMaxWait so the LLM step that follows the tool call sees the workspace MCP tools in the cache. The primer runs asynchronously: the tool itself must not block because some templates do not advertise any MCP tools at all, in which case the primer waits the full budget before giving up. The existing PrepareTools fallback covers the cache miss path; the primer is purely an optimization that warms the cache while the LLM is thinking. The primer shares the chat ctx so it is canceled with the chat, preventing a dangling primer from re-dialing the workspace conn after runChat's deferred close. inflight tracking ensures server shutdown still waits for any in-progress primer. PrepareTools is simplified: workspaceMCPDiscovered is now only set to true after a non-empty discovery, so an empty result no longer permanently blocks discovery within the turn.
384a705 to
e01f79d
Compare
|
Pivoted the implementation after reviewing the customer scenario more carefully. The previous retry/cooldown/failure-cap loop in The new approach moves discovery into the tool that already waits for the agent. Key design choices documented in the PR description and in the source comments:
Force-pushed to replace the old commits. Tests: 3 new unit tests for the primer plus the existing mid-turn integration test, all passing under /coder-agents-review This comment was generated by Coder Agents. |
|
/coder-agents-review |
There was a problem hiding this comment.
The pivot to an async cache primer launched from the tool's onChatUpdated callback is a better architecture than the PrepareTools retry loop. Moving discovery to the tool that already waits for agent readiness eliminates the fundamental timing problem. Hisoka put it well: "The async primer is the right architecture."
1 P2, 4 P3, 1 Note.
The P2 is a connection leak: the primer can re-dial after workspaceCtx.close() fires but before cancel(nil) in processChat. The fix is a 3-line defer reordering to cancel a child context before closing the workspace conn. Four reviewers independently identified the close-to-cancel ordering issue.
The strongest P3 (DEREM-15) has 4-way convergence: the primer fires on every onChatUpdated callback, including pre-ready callbacks and stop_workspace. The guard updatedChat.WorkspaceID.Valid passes all firings, spawning redundant primer goroutines that waste 30s each. Proposed fix: updatedChat.WorkspaceID.Valid && updatedChat.AgentID.Valid.
🤖 This review was automatically generated with Coder Agents.
DEREM-14 (P2): Plumb a primerCtx through rootChatToolsOptions and cancel it before workspaceCtx.close() runs. Previously the primer captured the chat ctx directly, which is canceled by processChat after runChat returns and workspaceCtx.close has already released the cached agent conn. A primer waking from its retry backoff in that window would observe a nil conn, dial a fresh one, and leak it because no later close() runs. DEREM-15 (P3): Gate the primer goroutine on both WorkspaceID.Valid and AgentID.Valid. The OnChatUpdated callback fires from create_workspace at binding, build, and post-ready, plus once from stop_workspace. Only the post-ready callback has a live AgentID, so the prior guard spawned multiple primer goroutines that wasted their 30s budget dialing missing or dying agents. DEREM-18 (P3): Rewrite the PrepareTools empty-result comment. The primer's 30s budget applies to its own loop; PrepareTools is unbounded but cheap per step (one DB query plus one ListMCPTools RPC against a cached conn). DEREM-19 (Note): Clarify that workspaceMCPPrimeMaxWait caps when new retries can start, not when an in-flight call must finish. Worst-case wall time is workspaceMCPPrimeMaxWait + dialTimeout + workspaceMCPDiscoveryTimeout when the last attempt is slow. DEREM-16 (P3): Add TestRunChat_PrepareToolsRetriesAfterEmptyDiscovery which seeds ListMCPTools to return empty for the first two calls and the workspace tool for the third+. The chat takes multiple steps after create_workspace and the test asserts that a post-step streamed model call eventually advertises the workspace tool. Inverting the workspaceMCPDiscovered flag-flip (back to true before the empty check) breaks this test. DEREM-17 (P3): Add TestPrimeWorkspaceMCPCache_ExitsOnContextCancel which arms the primer's first retry timer, cancels the primer ctx, and asserts the goroutine exits without caching. The case <-ctx.Done() branch in the retry loop is now covered.
|
All six round-1 findings (1 P2, 4 P3, 1 Note) addressed in 9d9fe1d. Resolved each thread with a reply summarizing the fix. Verified the new test /coder-agents-review This comment was generated by Coder Agents. |
There was a problem hiding this comment.
All 6 R4 fixes verified. The primerCtx lifecycle, AgentID guard, PrepareTools retry test, ctx-cancel test, comment fixes, and constant doc update all address their root causes correctly. One new P3.
1 P3.
The DEREM-15 fix (adding updatedChat.AgentID.Valid to the primer guard) inadvertently made the primer unreachable for the primary scenario: new workspace creation mid-turn. The guard checks the stale updatedChat parameter, but AgentID is set inside the callback body by persistBuildAgentBinding, after the parameter was read from the DB. The system works because PrepareTools retries via the flag-flip fix, so this is a dead optimization, not a correctness issue.
🤖 This review was automatically generated with Coder Agents.
DEREM-20 (P3): The previous DEREM-15 fix narrowed the primer guard to also require updatedChat.AgentID.Valid, but the updatedChat parameter is the row read by the tool before its OnChatUpdated callback runs. create_workspace persists the workspace binding with a null AgentID; persistBuildAgentBinding (which fills in the AgentID) only runs inside the callback, via persistInstructionFiles -> getWorkspaceAgent -> ensureWorkspaceAgent. The stale parameter made the primer unreachable on the primary scenario (new workspace mid-turn). Read the snapshot from workspaceCtx.currentChatSnapshot() instead. ensureWorkspaceAgent calls setCurrentChat after persisting the binding, so the snapshot reflects the freshly bound AgentID by the time the primer guard runs. The chat ID is also taken from the snapshot for consistency. Existing integration tests already exercise the bind-then-prime path; verified all existing chatd tests still pass.
|
DEREM-20 addressed in 31968f5. Switched the primer guard to read the in-memory chat snapshot rather than the stale /coder-agents-review This comment was generated by Coder Agents. |
There was a problem hiding this comment.
21 findings across 6 rounds (2 P2, 8 P3, 6 Nit, 4 Note, plus 1 new Nit). 18 fixed, 2 acknowledged (moot after implementation pivot), 1 open Nit.
The async cache primer launched from onChatUpdated after waitForAgentReady is the right architecture. The primer warms the cache concurrently with the next LLM step, PrepareTools falls back to a direct dial on cache miss, and the flag-flip fix ensures empty discovery never permanently blocks retries. The primerCtx lifecycle prevents conn leaks, the AgentID guard prevents premature primer launches, and the snapshot-based guard reads fresh state after persistBuildAgentBinding runs.
Five tests cover the primer (success, retry, deadline, ctx-cancel) plus an integration test that pins the PrepareTools flag-flip regression. Bisky confirmed mutation coverage across all behavioral properties.
🤖 This review was automatically generated with Coder Agents.
DEREM-21 (Nit): The previous comment said create_workspace fires onChatUpdated three times, but no single invocation produces that many. A new-build path fires twice (binding before waitForAgentReady, post-ready after); the existing-workspace branch in handleExistingWorkspace is a separate path that fires once. Rewrite to describe the per-tool firing pattern accurately. Comment-only change. No behavior changes.
|
DEREM-21 Nit fixed in 4a95c94 (comment-only). Thread resolved. All review findings across the six rounds are now closed. /coder-agents-review This comment was generated by Coder Agents. |
There was a problem hiding this comment.
22 findings across 7 rounds (2 P2, 8 P3, 7 Nit, 4 Note, 1 P4). 19 fixed, 2 acknowledged (moot after pivot), 1 open P4.
All prior fixes verified. The open P4 is a test assertion tightness suggestion that does not affect correctness. The async cache primer architecture is sound: primer warms the cache concurrently, PrepareTools retries on cache miss via the flag-flip fix, primerCtx prevents conn leaks, AgentID guard prevents premature launches, and the snapshot-based guard reads fresh state.
🤖 This review was automatically generated with Coder Agents.
DEREM-22 (P4): The prior assertion required listCalls > 1, which passes even if the primer exits after a second call (e.g. if the deadline were miscomputed as now + retryInterval rather than now + maxWait). Replace with a high-water-mark assertion using workspaceMCPPrimeMaxWait / workspaceMCPPrimeRetryInterval / 2 as the floor, so the test catches deadline shortenings while leaving slack for off-by-one boundary effects. The 'cache not populated' assertion remains the load-bearing property; this secondary check now reinforces it.
|
DEREM-22 P4 fixed in 89e10e6. Tightened the deadline test assertion to a high-water mark so a primer that exited the loop after only a couple of attempts would now fail the test. Thread resolved. /coder-agents-review This comment was generated by Coder Agents. |
There was a problem hiding this comment.
23 findings across 8 rounds (2 P2, 8 P3, 1 P4, 8 Nit, 4 Note). 20 fixed, 2 acknowledged (moot after pivot), 1 open Nit.
DEREM-22 fix verified: assertion tightened from > 1 to >= 7 (half the expected ~15 retries). All prior fixes remain solid. One new Nit: the opening comment at line 6601 says the callback fires only after waitForAgentReady, but the guard explanation at line 6619 correctly describes it firing twice (binding before + post-ready after). The guard logic is correct; the introductory sentence is misleading.
🤖 This review was automatically generated with Coder Agents.
DEREM-23 (Nit): The opening sentence said "The tool only fires this callback after waitForAgentReady returns" which contradicts the later comment block that correctly notes create_workspace and start_workspace each fire onChatUpdated twice (binding + post-ready). The AgentID guard further down is what filters to the post-ready firing, not the tool itself. Comment-only change. No behavior changes.
|
DEREM-23 Nit fixed in 155f8c0 (comment-only). Thread resolved. /coder-agents-review This comment was generated by Coder Agents. |
There was a problem hiding this comment.
23 findings across 9 rounds. All closed: 21 fixed, 2 acknowledged.
No new findings. The panel verified all prior fixes hold, the primer lifecycle is correct (primerCancel before close, inflight tracking, mutex-protected conn sharing), the flag-flip fix prevents permanent discovery blocking, and the test suite covers all four primer exit modes plus the end-to-end PrepareTools retry regression.
🤖 This review was automatically generated with Coder Agents.
Problem
Mid-turn workspace MCP discovery was broken when an agent was still cold-starting.
PrepareToolsinchatd.goflippedworkspaceMCPDiscovered = truebefore callingdiscoverWorkspaceMCPTools, so a failed discovery attempt permanently blocked retries within the turn.Customer-reported repro:
create_workspacemid-turn at23:35:05.PrepareToolsfires, dials the agent with a 30s timeout, dial times out at23:38:15,discoverWorkspaceMCPToolsreturns empty.23:38:29, 14 seconds later.workspaceMCPDiscoveredwas already true, soPrepareToolsnever retried for the rest of the turn. MCP tools only appeared on the next user message.A naive retry loop in
PrepareToolswould also miss the bigger picture: a workspace boot can take several minutes (EC2 cold start, 10 min startup scripts), and the chatloop only gets a chance to callPrepareToolsbetween LLM steps.Fix
Do the workspace MCP discovery from inside the tool that already waits for the agent.
chattool.CreateWorkspaceandchattool.StartWorkspacecallwaitForAgentReady, which has a 2 min agent-online budget plus a 10 min startup-script budget. By the time they fireOnChatUpdated, the agent isReady. The chatdonChatUpdatedcallback now launches an asyncprimeWorkspaceMCPCachegoroutine on every bind that has a valid workspace ID:discoverWorkspaceMCPToolsuntil it returns a non-empty list orworkspaceMCPPrimeMaxWait(30s) elapses, with a 2s backoff between attempts. The bounded wait handles the short race between agent-online and the agent's MCPConnectsettling.ctx(not a detached one) so it is canceled together with the chat. A dangling primer would re-dial the workspace conn afterrunChat's deferredworkspaceCtx.close()and leak that conn.inflight.Add(1)ensures server shutdown still waits for any in-progress primer.PrepareToolsis simplified back to a single discovery call. It now only setsworkspaceMCPDiscovered = trueon success, so an empty result no longer permanently blocks discovery within the turn. The cache hit warmed by the primer makes that call cheap in the common case; the dial fallback handles the rare cache miss.Tests
All in
coderd/x/chatd/chatd_internal_test.go:TestPrimeWorkspaceMCPCache_SuccessOnFirstAttempt— singleListMCPToolscall returning tools populates the cache.TestPrimeWorkspaceMCPCache_RetriesUntilToolsAppear— first call empty, second returns tools; primer retries past the backoff and writes the cache. Usesquartz.Mock.TraponNewTimer.TestPrimeWorkspaceMCPCache_GivesUpAfterDeadline—ListMCPToolsalways empty; primer stops atworkspaceMCPPrimeMaxWaitand refuses to cache the empty result so PrepareTools can retry on the next step.The existing integration test
TestRunChat_WorkspaceMCPDiscoveryAfterMidTurnCreateWorkspacecontinues to pass and now also exercises the async-primer path end-to-end via the create_workspace tool.Design notes
PrepareTools. It worked for the customer's ~30s race window but did not help workspaces that take several minutes to boot, becausePrepareToolsonly fires between LLM steps. Reviewer pointed out the right place to handle this is the tool itself; the current implementation does that.OnChatUpdatedcallback blocked the create_workspace tool from returning for up toworkspaceMCPPrimeMaxWait, which brokeTestCreateWorkspaceTool_EndToEndand would hurt any template that does not expose MCP tools. Decoupling lets the tool return immediately and lets the primer warm the cache concurrently with the next LLM step.ctxrather thancontext.WithoutCancel(ctx)(the title-generation pattern): the primer touchesworkspaceCtx.getWorkspaceConn, whichrunChat's deferredworkspaceCtx.close()invalidates. A detached primer outliving the chat would dial a fresh conn and leak it.workspaceMCPDiscoveryTimeout(35s per-call dial budget, unchanged from fix(coderd/x/chatd): discover workspace MCP tools mid-turn after create_workspace #25169) fromworkspaceMCPPrimeMaxWait(30s total budget for the post-ready primer loop) andworkspaceMCPPrimeRetryInterval(2s between empty-result retries).Follow-up to #25169.
This pull request was generated by Coder Agents.