fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn by mafredri · Pull Request #24336 · coder/coder

mafredri · 2026-04-14T15:28:10Z

Problem

When a workspace agent disconnects but the workspace remains "running,"
getWorkspaceConn either hangs indefinitely on AwaitReachable (cache miss)
or returns a dead cached connection whose transport errors the LLM
misinterprets as transient (cache hit). Both paths leave the chat unusable
until stale recovery kicks in (~5 minutes) or the user clicks interrupt.

Closes https://linear.app/coder/issue/CODAGT-149

Fix

Add two defenses in getWorkspaceConn:

Status check on every call (both cache-hit and cache-miss paths). On
cache miss, the freshly fetched agent row is checked via isAgentUnreachable
(timestamp arithmetic, zero cost). On cache hit, the agent row is re-fetched
by PK to get fresh LastConnectedAt before checking. disconnected and
timeout agents return a sentinel immediately; connecting agents proceed
to dial normally. Uses the server's quartz.Clock for deterministic testing.
Dial timeout (30s, matching 8 other server-side AgentConn callers).
Wraps dialWithLazyValidation in a context.WithTimeoutCause. Parent
context cancellation propagates unchanged so the chatloop can still detect
ErrInterrupted.

Both sentinels instruct the LLM that the agent is unreachable and the workspace
may need restarting from the dashboard. Neither suggests start_workspace
(which cannot recover a disconnected agent on a running workspace).

Implementation plan and decision log

Plan: docs/plans/agent-disconnect-hang.md
Research: docs/plans/agent-disconnect-hang-research.md

Key decisions:

D1: Approaches A + B (status check + dial timeout).
D2: Only disconnected and timeout trigger the error; connecting proceeds.
D3: 30s dial timeout, injectable via Server.dialTimeout field for tests.
D4: Two distinct error messages (disconnected vs timeout).
D5: start_workspace behavior out of scope.

R1 amendments:

Cache-hit re-fetches agent row (PK lookup) to avoid stale-timestamp false positives.
isAgentUnreachable is a pure function to eliminate data race on concurrent tool dispatch.
context.WithTimeoutCause replaces manual capture-before-cancel pattern.

R2 amendments:

Merged two lock sections on cache-hit into one to eliminate TOCTOU.
isAgentUnreachable accepts time.Time and uses server clock for testability.
Added tests for cache-hit happy path and DB error fallthrough.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

mafredri

Clean, well-scoped fix for a real problem (indefinite hang on disconnected agents). The dial-timeout implementation is well done: the capture-before-cancel pattern, the parent-cancel propagation guard, and the regression test for misclassification are all carefully reasoned. The test suite covers the full decision matrix (cache hit, cache miss, connecting, timeout, parent cancel, non-timeout error). Killua confirmed the performance profile is negligible. Mafu-san verified every claim in the PR description against the codebase.

Two P1s, four P3s, six nits, two notes.

Both P1s are on the cache-hit status check at line 576. (1) c.agent and c.agentLoaded are read outside c.mu while executeTools dispatches parallel goroutines, creating a data race on the multi-word WorkspaceAgent struct. Eight reviewers independently flagged this. (2) The cached agent row's LastConnectedAt freezes at fetch time. After ~30s (the default agentInactiveDisconnectTimeout), Status() computes Now() - cachedLastConnectedAt > 30s and classifies any healthy agent as disconnected, tearing down a live connection and telling the user to restart a working workspace.

"Delete || status.Status == database.WorkspaceAgentStatusTimeout from isAgentDisconnected() and the full suite still passes. The branch is untested dead weight to CI." -- Bisky

🤖 This review was automatically generated with Coder Agents.

Add agent status check and dial timeout to getWorkspaceConn to prevent tool calls from hanging when a workspace agent disconnects. Two changes: 1. Status check: call agent.Status() on every getWorkspaceConn call (both cache-hit and cache-miss paths). When the agent is disconnected or timed out, return a sentinel error immediately instead of attempting to dial a dead agent. The status check uses timestamp arithmetic on the cached agent row (zero cost, no DB re-fetch). Connecting agents are allowed through. 2. Dial timeout: wrap dialWithLazyValidation in a 30s timeout (matching the convention used by 8 other server-side AgentConn callers). When the timeout fires and the parent context is still alive, convert to a dial timeout sentinel. Parent context cancellation propagates unchanged so the chatloop can detect ErrInterrupted. Both sentinels tell the LLM to inform the user that the workspace agent is disconnected and may need to be restarted from the Coder dashboard. Neither suggests start_workspace (which cannot recover a disconnected agent on a running workspace).

- F1: Make isAgentUnreachable a pure function, read agent ID under lock on cache-hit path to eliminate data race. - F2: Re-fetch agent row on cache hit for fresh LastConnectedAt, preventing false-positive disconnect detection on healthy agents. - F3: Add test for WorkspaceAgentStatusTimeout branch. - F4: Strengthen DialTimeoutParentCanceled with positive assertion. - F5: Use context.WithTimeoutCause to eliminate manual capture-before-cancel ordering dependency. - F6: Simplify error messages to match errChatHasNoWorkspaceAgent style. - F7: Rename isAgentDisconnected to isAgentUnreachable. - F8: Remove dialTimedOut variable (superseded by WithTimeoutCause). - F9: Add comment explaining why defaultDialTimeout is 30s. - F10: Use defaultDialTimeout constant in tests. - F11: Replace TDD process labels with behavioral descriptions. - F12: Add dialTimeout to disconnected-agent test Server structs.

coder-agents-review

All 13 R1 findings addressed cleanly. The two P1s (data race, stale cache false-positive) got structural fixes: isAgentUnreachable extracted as a pure function eliminates the race, PK re-fetch on cache hit eliminates the stale-timestamp false positive. context.WithTimeoutCause replaced the fragile capture-before-cancel pattern. Mafu-san confirmed every claim in the PR description matches the delivered code. Mafuuu traced all four modes (cache hit, cache miss, dial timeout, parent cancel) and found no contract violations. Pariston confirmed the diagnosis chain holds.

Four P3s and two nits new this round. The highest-severity new finding is a TOCTOU window on the cache-hit path where concurrent tool goroutines can return a released connection. Blast radius is one failed tool call with an opaque transport error instead of the sentinel; self-corrects on retry. Two test coverage gaps: the cache-hit happy path and the DB error fallthrough branch.

CI: test-go-pg failed. Worth checking if it's related to this change or a flaky base.

"Shall I show you the interleaving?" -- Hisoka, on the TOCTOU race

🤖 This review was automatically generated with Coder Agents.

- DEREM-17: Merge two lock sections on cache-hit path into one, eliminating TOCTOU where concurrent tool goroutines could return a released connection. - DEREM-18: Add test for cache-hit with healthy agent. - DEREM-19: Add test for cache-hit DB error fallthrough. - AMREM-13: Pass clock.Now() to isAgentUnreachable instead of hardcoding dbtime.Now(), enabling deterministic testing. - AMREM-14: Fix stale doc comment. - DEREM-20: Fix misleading comment about dial timeout on cache-hit path.

coder-agents-review

Status: APPROVED

All 20 findings from R1 and R2 addressed. Netero and 6 panel reviewers found no new issues.

Three rounds refined this change into a solid fix. R1 caught a data race (8 reviewers converged) and a stale-cache false positive; both got structural fixes. R2 caught a TOCTOU window and two test gaps; all addressed. R3 is clean.

Meruem confirmed the concurrency model is sound: lock held only to snapshot, DB call outside lock, clearCachedWorkspaceState internally acquires its own lock. Pariston tried to build a case against the change and could not. Hisoka pulled every thread and found nothing. Razor verified every claim. Bisky called the test suite "the genuine article."

CI green. No open findings. No contested or deferred items.

"This one held up." -- Hisoka

🤖 This review was automatically generated with Coder Agents.

johnstcn

Nice fix! Some comments below.

- AMREM-15: Log a warning on DB error when re-fetching the agent row on the cache-hit path. - AMREM-16: Add comment explaining why agentID is captured under the same lock as currentConn (TOCTOU prevention). - AMREM-17: Refactor six status-check getWorkspaceConn tests into a single table-driven TestGetWorkspaceConn_StatusCheck.

github-actions Bot assigned mafredri Apr 14, 2026

mafredri commented Apr 15, 2026

View reviewed changes

mafredri added 2 commits April 21, 2026 12:10

mafredri force-pushed the mathias/agent-disconnect-hang branch from 64611a0 to 5de390a Compare April 21, 2026 12:26

coder-agents-review Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd.go

Comment thread coderd/x/chatd/chatd_internal_test.go Outdated

Comment thread coderd/x/chatd/chatd.go Outdated

Comment thread coderd/x/chatd/chatd.go Outdated

Comment thread coderd/x/chatd/chatd.go

Comment thread coderd/x/chatd/chatd.go Outdated

coder-agents-review Bot reviewed Apr 21, 2026

View reviewed changes

mafredri marked this pull request as ready for review April 21, 2026 17:49

mafredri requested review from ethanndickson and johnstcn April 21, 2026 17:49

johnstcn reviewed Apr 21, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatd.go

Comment thread coderd/x/chatd/chatd.go

Comment thread coderd/x/chatd/chatd_internal_test.go Outdated

Comment thread coderd/x/chatd/chatd.go

johnstcn approved these changes Apr 22, 2026

View reviewed changes

mafredri force-pushed the mathias/agent-disconnect-hang branch from 32582d1 to 6595d1d Compare April 22, 2026 11:59

mafredri enabled auto-merge (squash) April 22, 2026 12:05

mafredri merged commit 78d9a22 into main Apr 22, 2026
26 checks passed

mafredri deleted the mathias/agent-disconnect-hang branch April 22, 2026 12:10

github-actions Bot locked and limited conversation to collaborators Apr 22, 2026

Conversation

mafredri commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

johnstcn left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

mafredri commented Apr 14, 2026 •

edited

Loading