Skip to content

fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn#24336

Merged
mafredri merged 4 commits into
mainfrom
mathias/agent-disconnect-hang
Apr 22, 2026
Merged

fix(coderd/x/chatd): detect disconnected agents in getWorkspaceConn#24336
mafredri merged 4 commits into
mainfrom
mathias/agent-disconnect-hang

Conversation

@mafredri
Copy link
Copy Markdown
Member

@mafredri mafredri commented Apr 14, 2026

Problem

When a workspace agent disconnects but the workspace remains "running,"
getWorkspaceConn either hangs indefinitely on AwaitReachable (cache miss)
or returns a dead cached connection whose transport errors the LLM
misinterprets as transient (cache hit). Both paths leave the chat unusable
until stale recovery kicks in (~5 minutes) or the user clicks interrupt.

Closes https://linear.app/coder/issue/CODAGT-149

Fix

Add two defenses in getWorkspaceConn:

  1. Status check on every call (both cache-hit and cache-miss paths). On
    cache miss, the freshly fetched agent row is checked via isAgentUnreachable
    (timestamp arithmetic, zero cost). On cache hit, the agent row is re-fetched
    by PK to get fresh LastConnectedAt before checking. disconnected and
    timeout agents return a sentinel immediately; connecting agents proceed
    to dial normally. Uses the server's quartz.Clock for deterministic testing.

  2. Dial timeout (30s, matching 8 other server-side AgentConn callers).
    Wraps dialWithLazyValidation in a context.WithTimeoutCause. Parent
    context cancellation propagates unchanged so the chatloop can still detect
    ErrInterrupted.

Both sentinels instruct the LLM that the agent is unreachable and the workspace
may need restarting from the dashboard. Neither suggests start_workspace
(which cannot recover a disconnected agent on a running workspace).

Implementation plan and decision log

Plan: docs/plans/agent-disconnect-hang.md
Research: docs/plans/agent-disconnect-hang-research.md

Key decisions:

  • D1: Approaches A + B (status check + dial timeout).
  • D2: Only disconnected and timeout trigger the error; connecting proceeds.
  • D3: 30s dial timeout, injectable via Server.dialTimeout field for tests.
  • D4: Two distinct error messages (disconnected vs timeout).
  • D5: start_workspace behavior out of scope.

R1 amendments:

  • Cache-hit re-fetches agent row (PK lookup) to avoid stale-timestamp false positives.
  • isAgentUnreachable is a pure function to eliminate data race on concurrent tool dispatch.
  • context.WithTimeoutCause replaces manual capture-before-cancel pattern.

R2 amendments:

  • Merged two lock sections on cache-hit into one to eliminate TOCTOU.
  • isAgentUnreachable accepts time.Time and uses server clock for testability.
  • Added tests for cache-hit happy path and DB error fallthrough.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

Copy link
Copy Markdown
Member Author

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-scoped fix for a real problem (indefinite hang on disconnected agents). The dial-timeout implementation is well done: the capture-before-cancel pattern, the parent-cancel propagation guard, and the regression test for misclassification are all carefully reasoned. The test suite covers the full decision matrix (cache hit, cache miss, connecting, timeout, parent cancel, non-timeout error). Killua confirmed the performance profile is negligible. Mafu-san verified every claim in the PR description against the codebase.

Two P1s, four P3s, six nits, two notes.

Both P1s are on the cache-hit status check at line 576. (1) c.agent and c.agentLoaded are read outside c.mu while executeTools dispatches parallel goroutines, creating a data race on the multi-word WorkspaceAgent struct. Eight reviewers independently flagged this. (2) The cached agent row's LastConnectedAt freezes at fetch time. After ~30s (the default agentInactiveDisconnectTimeout), Status() computes Now() - cachedLastConnectedAt > 30s and classifies any healthy agent as disconnected, tearing down a live connection and telling the user to restart a working workspace.

"Delete || status.Status == database.WorkspaceAgentStatusTimeout from isAgentDisconnected() and the full suite still passes. The branch is untested dead weight to CI." -- Bisky

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd_internal_test.go Outdated
Comment thread coderd/x/chatd/chatd_internal_test.go
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd_internal_test.go Outdated
Comment thread coderd/x/chatd/chatd_internal_test.go Outdated
Comment thread coderd/x/chatd/chatd_internal_test.go
Add agent status check and dial timeout to getWorkspaceConn to
prevent tool calls from hanging when a workspace agent disconnects.

Two changes:

1. Status check: call agent.Status() on every getWorkspaceConn
   call (both cache-hit and cache-miss paths). When the agent is
   disconnected or timed out, return a sentinel error immediately
   instead of attempting to dial a dead agent. The status check
   uses timestamp arithmetic on the cached agent row (zero cost,
   no DB re-fetch). Connecting agents are allowed through.

2. Dial timeout: wrap dialWithLazyValidation in a 30s timeout
   (matching the convention used by 8 other server-side AgentConn
   callers). When the timeout fires and the parent context is
   still alive, convert to a dial timeout sentinel. Parent context
   cancellation propagates unchanged so the chatloop can detect
   ErrInterrupted.

Both sentinels tell the LLM to inform the user that the workspace
agent is disconnected and may need to be restarted from the Coder
dashboard. Neither suggests start_workspace (which cannot recover
a disconnected agent on a running workspace).
- F1: Make isAgentUnreachable a pure function, read agent ID
  under lock on cache-hit path to eliminate data race.
- F2: Re-fetch agent row on cache hit for fresh LastConnectedAt,
  preventing false-positive disconnect detection on healthy agents.
- F3: Add test for WorkspaceAgentStatusTimeout branch.
- F4: Strengthen DialTimeoutParentCanceled with positive assertion.
- F5: Use context.WithTimeoutCause to eliminate manual
  capture-before-cancel ordering dependency.
- F6: Simplify error messages to match errChatHasNoWorkspaceAgent
  style.
- F7: Rename isAgentDisconnected to isAgentUnreachable.
- F8: Remove dialTimedOut variable (superseded by WithTimeoutCause).
- F9: Add comment explaining why defaultDialTimeout is 30s.
- F10: Use defaultDialTimeout constant in tests.
- F11: Replace TDD process labels with behavioral descriptions.
- F12: Add dialTimeout to disconnected-agent test Server structs.
@mafredri mafredri force-pushed the mathias/agent-disconnect-hang branch from 64611a0 to 5de390a Compare April 21, 2026 12:26
Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 13 R1 findings addressed cleanly. The two P1s (data race, stale cache false-positive) got structural fixes: isAgentUnreachable extracted as a pure function eliminates the race, PK re-fetch on cache hit eliminates the stale-timestamp false positive. context.WithTimeoutCause replaced the fragile capture-before-cancel pattern. Mafu-san confirmed every claim in the PR description matches the delivered code. Mafuuu traced all four modes (cache hit, cache miss, dial timeout, parent cancel) and found no contract violations. Pariston confirmed the diagnosis chain holds.

Four P3s and two nits new this round. The highest-severity new finding is a TOCTOU window on the cache-hit path where concurrent tool goroutines can return a released connection. Blast radius is one failed tool call with an opaque transport error instead of the sentinel; self-corrects on retry. Two test coverage gaps: the cache-hit happy path and the DB error fallthrough branch.

CI: test-go-pg failed. Worth checking if it's related to this change or a flaky base.

"Shall I show you the interleaving?" -- Hisoka, on the TOCTOU race

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd_internal_test.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go Outdated
Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd.go Outdated
- DEREM-17: Merge two lock sections on cache-hit path into one,
  eliminating TOCTOU where concurrent tool goroutines could return
  a released connection.
- DEREM-18: Add test for cache-hit with healthy agent.
- DEREM-19: Add test for cache-hit DB error fallthrough.
- AMREM-13: Pass clock.Now() to isAgentUnreachable instead of
  hardcoding dbtime.Now(), enabling deterministic testing.
- AMREM-14: Fix stale doc comment.
- DEREM-20: Fix misleading comment about dial timeout on cache-hit
  path.
Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status: APPROVED

All 20 findings from R1 and R2 addressed. Netero and 6 panel reviewers found no new issues.

Three rounds refined this change into a solid fix. R1 caught a data race (8 reviewers converged) and a stale-cache false positive; both got structural fixes. R2 caught a TOCTOU window and two test gaps; all addressed. R3 is clean.

Meruem confirmed the concurrency model is sound: lock held only to snapshot, DB call outside lock, clearCachedWorkspaceState internally acquires its own lock. Pariston tried to build a case against the change and could not. Hisoka pulled every thread and found nothing. Razor verified every claim. Bisky called the test suite "the genuine article."

CI green. No open findings. No contested or deferred items.

"This one held up." -- Hisoka

🤖 This review was automatically generated with Coder Agents.

@mafredri mafredri marked this pull request as ready for review April 21, 2026 17:49
Copy link
Copy Markdown
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix! Some comments below.

Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd.go
Comment thread coderd/x/chatd/chatd_internal_test.go Outdated
Comment thread coderd/x/chatd/chatd.go
- AMREM-15: Log a warning on DB error when re-fetching the
  agent row on the cache-hit path.
- AMREM-16: Add comment explaining why agentID is captured
  under the same lock as currentConn (TOCTOU prevention).
- AMREM-17: Refactor six status-check getWorkspaceConn tests
  into a single table-driven TestGetWorkspaceConn_StatusCheck.
@mafredri mafredri force-pushed the mathias/agent-disconnect-hang branch from 32582d1 to 6595d1d Compare April 22, 2026 11:59
@mafredri mafredri enabled auto-merge (squash) April 22, 2026 12:05
@mafredri mafredri merged commit 78d9a22 into main Apr 22, 2026
26 checks passed
@mafredri mafredri deleted the mathias/agent-disconnect-hang branch April 22, 2026 12:10
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 22, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants