fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn#24336
Conversation
mafredri
left a comment
There was a problem hiding this comment.
Clean, well-scoped fix for a real problem (indefinite hang on disconnected agents). The dial-timeout implementation is well done: the capture-before-cancel pattern, the parent-cancel propagation guard, and the regression test for misclassification are all carefully reasoned. The test suite covers the full decision matrix (cache hit, cache miss, connecting, timeout, parent cancel, non-timeout error). Killua confirmed the performance profile is negligible. Mafu-san verified every claim in the PR description against the codebase.
Two P1s, four P3s, six nits, two notes.
Both P1s are on the cache-hit status check at line 576. (1) c.agent and c.agentLoaded are read outside c.mu while executeTools dispatches parallel goroutines, creating a data race on the multi-word WorkspaceAgent struct. Eight reviewers independently flagged this. (2) The cached agent row's LastConnectedAt freezes at fetch time. After ~30s (the default agentInactiveDisconnectTimeout), Status() computes Now() - cachedLastConnectedAt > 30s and classifies any healthy agent as disconnected, tearing down a live connection and telling the user to restart a working workspace.
"Delete
|| status.Status == database.WorkspaceAgentStatusTimeoutfromisAgentDisconnected()and the full suite still passes. The branch is untested dead weight to CI." -- Bisky
🤖 This review was automatically generated with Coder Agents.
Add agent status check and dial timeout to getWorkspaceConn to prevent tool calls from hanging when a workspace agent disconnects. Two changes: 1. Status check: call agent.Status() on every getWorkspaceConn call (both cache-hit and cache-miss paths). When the agent is disconnected or timed out, return a sentinel error immediately instead of attempting to dial a dead agent. The status check uses timestamp arithmetic on the cached agent row (zero cost, no DB re-fetch). Connecting agents are allowed through. 2. Dial timeout: wrap dialWithLazyValidation in a 30s timeout (matching the convention used by 8 other server-side AgentConn callers). When the timeout fires and the parent context is still alive, convert to a dial timeout sentinel. Parent context cancellation propagates unchanged so the chatloop can detect ErrInterrupted. Both sentinels tell the LLM to inform the user that the workspace agent is disconnected and may need to be restarted from the Coder dashboard. Neither suggests start_workspace (which cannot recover a disconnected agent on a running workspace).
- F1: Make isAgentUnreachable a pure function, read agent ID under lock on cache-hit path to eliminate data race. - F2: Re-fetch agent row on cache hit for fresh LastConnectedAt, preventing false-positive disconnect detection on healthy agents. - F3: Add test for WorkspaceAgentStatusTimeout branch. - F4: Strengthen DialTimeoutParentCanceled with positive assertion. - F5: Use context.WithTimeoutCause to eliminate manual capture-before-cancel ordering dependency. - F6: Simplify error messages to match errChatHasNoWorkspaceAgent style. - F7: Rename isAgentDisconnected to isAgentUnreachable. - F8: Remove dialTimedOut variable (superseded by WithTimeoutCause). - F9: Add comment explaining why defaultDialTimeout is 30s. - F10: Use defaultDialTimeout constant in tests. - F11: Replace TDD process labels with behavioral descriptions. - F12: Add dialTimeout to disconnected-agent test Server structs.
64611a0 to
5de390a
Compare
There was a problem hiding this comment.
All 13 R1 findings addressed cleanly. The two P1s (data race, stale cache false-positive) got structural fixes: isAgentUnreachable extracted as a pure function eliminates the race, PK re-fetch on cache hit eliminates the stale-timestamp false positive. context.WithTimeoutCause replaced the fragile capture-before-cancel pattern. Mafu-san confirmed every claim in the PR description matches the delivered code. Mafuuu traced all four modes (cache hit, cache miss, dial timeout, parent cancel) and found no contract violations. Pariston confirmed the diagnosis chain holds.
Four P3s and two nits new this round. The highest-severity new finding is a TOCTOU window on the cache-hit path where concurrent tool goroutines can return a released connection. Blast radius is one failed tool call with an opaque transport error instead of the sentinel; self-corrects on retry. Two test coverage gaps: the cache-hit happy path and the DB error fallthrough branch.
CI: test-go-pg failed. Worth checking if it's related to this change or a flaky base.
"Shall I show you the interleaving?" -- Hisoka, on the TOCTOU race
🤖 This review was automatically generated with Coder Agents.
- DEREM-17: Merge two lock sections on cache-hit path into one, eliminating TOCTOU where concurrent tool goroutines could return a released connection. - DEREM-18: Add test for cache-hit with healthy agent. - DEREM-19: Add test for cache-hit DB error fallthrough. - AMREM-13: Pass clock.Now() to isAgentUnreachable instead of hardcoding dbtime.Now(), enabling deterministic testing. - AMREM-14: Fix stale doc comment. - DEREM-20: Fix misleading comment about dial timeout on cache-hit path.
There was a problem hiding this comment.
Status: APPROVED
All 20 findings from R1 and R2 addressed. Netero and 6 panel reviewers found no new issues.
Three rounds refined this change into a solid fix. R1 caught a data race (8 reviewers converged) and a stale-cache false positive; both got structural fixes. R2 caught a TOCTOU window and two test gaps; all addressed. R3 is clean.
Meruem confirmed the concurrency model is sound: lock held only to snapshot, DB call outside lock, clearCachedWorkspaceState internally acquires its own lock. Pariston tried to build a case against the change and could not. Hisoka pulled every thread and found nothing. Razor verified every claim. Bisky called the test suite "the genuine article."
CI green. No open findings. No contested or deferred items.
"This one held up." -- Hisoka
🤖 This review was automatically generated with Coder Agents.
johnstcn
left a comment
There was a problem hiding this comment.
Nice fix! Some comments below.
- AMREM-15: Log a warning on DB error when re-fetching the agent row on the cache-hit path. - AMREM-16: Add comment explaining why agentID is captured under the same lock as currentConn (TOCTOU prevention). - AMREM-17: Refactor six status-check getWorkspaceConn tests into a single table-driven TestGetWorkspaceConn_StatusCheck.
32582d1 to
6595d1d
Compare
Problem
When a workspace agent disconnects but the workspace remains "running,"
getWorkspaceConneither hangs indefinitely onAwaitReachable(cache miss)or returns a dead cached connection whose transport errors the LLM
misinterprets as transient (cache hit). Both paths leave the chat unusable
until stale recovery kicks in (~5 minutes) or the user clicks interrupt.
Closes https://linear.app/coder/issue/CODAGT-149
Fix
Add two defenses in
getWorkspaceConn:Status check on every call (both cache-hit and cache-miss paths). On
cache miss, the freshly fetched agent row is checked via
isAgentUnreachable(timestamp arithmetic, zero cost). On cache hit, the agent row is re-fetched
by PK to get fresh
LastConnectedAtbefore checking.disconnectedandtimeoutagents return a sentinel immediately;connectingagents proceedto dial normally. Uses the server's
quartz.Clockfor deterministic testing.Dial timeout (30s, matching 8 other server-side
AgentConncallers).Wraps
dialWithLazyValidationin acontext.WithTimeoutCause. Parentcontext cancellation propagates unchanged so the chatloop can still detect
ErrInterrupted.Both sentinels instruct the LLM that the agent is unreachable and the workspace
may need restarting from the dashboard. Neither suggests
start_workspace(which cannot recover a disconnected agent on a running workspace).
Implementation plan and decision log
Plan:
docs/plans/agent-disconnect-hang.mdResearch:
docs/plans/agent-disconnect-hang-research.mdKey decisions:
disconnectedandtimeouttrigger the error;connectingproceeds.Server.dialTimeoutfield for tests.start_workspacebehavior out of scope.R1 amendments:
isAgentUnreachableis a pure function to eliminate data race on concurrent tool dispatch.context.WithTimeoutCausereplaces manual capture-before-cancel pattern.R2 amendments:
isAgentUnreachableacceptstime.Timeand uses server clock for testability.