Skip to content

fix(coderd/x/chatd): remove cache-miss check blocking agent recovery#24634

Merged
mafredri merged 3 commits into
mainfrom
mafredri/fix-stale-agent-liveness
Apr 22, 2026
Merged

fix(coderd/x/chatd): remove cache-miss check blocking agent recovery#24634
mafredri merged 3 commits into
mainfrom
mafredri/fix-stale-agent-liveness

Conversation

@mafredri
Copy link
Copy Markdown
Member

PR #24336 added an isAgentUnreachable check on the cache-miss path of getWorkspaceConn. When a workspace is rebuilt, the chat's stored agent ID points to a disconnected agent from the old build. The check rejects it before dialWithLazyValidation can discover the new agent, causing an infinite loop of "agent is disconnected" errors.

Remove the cache-miss check. The cache-hit check (re-fetches agent row for fresh heartbeat) remains. The dial timeout and dialWithLazyValidation already bound the cache-miss failure path.

Known tradeoff: genuinely stopped workspaces take ~5s (workspaceDialValidationDelay) to error on cache-miss instead of immediately. This is the pre-#24336 behavior.

Closes CODAGT-248

Implementation context

Approach: Remove the 7-line cache-miss isAgentUnreachable block (Approach A from research). The cache-miss path already had dialWithLazyValidation for stale binding recovery and a 30s dial timeout as absolute bound.

Test changes:

  • Added TestGetWorkspaceConn_StaleAgentRecovery: regression test that exercises the full recovery path (stale agent dial fails, validation discovers new agent, switch + persist binding). Red on pre-fix code, green after.
  • Added TimedOutAgentCacheHit: covers the timeout branch of the cache-hit status check (coverage gap found during review).
  • Removed DisconnectedAgentCacheMiss, TimedOutAgentCacheMiss, ConnectingAgentProceeds (tested removed behavior).
  • Simplified TestGetWorkspaceConn_StatusCheck to cache-hit-only cases.

Why ~5s not 30s for stopped workspaces: When the workspace is genuinely stopped, the dial to the old agent fails fast. Validation fires at workspaceDialValidationDelay (5s), finds no agents, and errChatHasNoWorkspaceAgent short-circuits dialWithLazyValidation without waiting for the full 30s dial timeout.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

Copy link
Copy Markdown
Member Author

@codex review
/coder-agents-review

🤖 This comment was automatically generated with Coder Agents.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. 👍

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, well-scoped fix. The 7-line removal is proportional to the problem, the regression test is genuine (confirmed red on base SHA), and TestGetWorkspaceConn_StaleAgentRecovery exercises the full stale-binding recovery chain. TimedOutAgentCacheHit fills a pre-existing coverage gap. No mechanical issues (Netero clean). 1 P3, 2 Nits.

"I tried to build a case against this change and could not." - Pariston

PS. Commit subject is 84 characters (12 over the 72-char convention). Consider trimming: fix(coderd/x/chatd): remove cache-miss check blocking agent recovery (69 chars). Nit [DEREM-4]

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd_internal_test.go
Comment thread coderd/x/chatd/chatd_internal_test.go Outdated
@mafredri mafredri changed the title fix(coderd/x/chatd): remove cache-miss status check that blocks stale agent recovery fix(coderd/x/chatd): remove cache-miss check blocking agent recovery Apr 22, 2026
Copy link
Copy Markdown
Member Author

In reply to AMREM-1 (Nit, coderd/x/chatd/chatd.go:1): #24634 (review)

Shortened PR title to fix(coderd/x/chatd): remove cache-miss check blocking agent recovery (69 chars).

🤖 Posted using /amend-review skill via Coder Agents.

Copy link
Copy Markdown
Member Author

@codex review
/coder-agents-review

🤖 This comment was automatically generated with Coder Agents.

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEREM-1, DEREM-2, DEREM-4 all addressed cleanly. TestGetWorkspaceConn_SameBuildAgentCrash is well-constructed, the comment stutter is gone, and the title is within bounds. One new P3 from this round.

"If that assignment were removed, the test still passes. On the next cache-hit call, agentID := c.agent.ID would resolve to the old (disconnected) agent ID, the status check would find it unreachable, clear the cache, and force a full re-dial." - Bisky

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd_internal_test.go
… agent recovery

The cache-miss isAgentUnreachable check added in #24336 runs before
dialWithLazyValidation, preventing the existing switch mechanism from
discovering the new agent after a workspace rebuild. The chat's stale
agent binding is never repaired, causing an infinite loop of
'agent is disconnected' errors.

Remove the cache-miss check. The cache-hit check remains (it verifies
the agent behind an established connection). The dial timeout and
dialWithLazyValidation already bound the cache-miss failure path.

Closes CODAGT-248
…utter

Address review feedback:
- Add TestGetWorkspaceConn_SameBuildAgentCrash covering the path where
  a crashed agent is still in the latest build. Verifies the wrapped
  dial error propagates (not errChatAgentDisconnected) and cache state.
- Fix 'validation validation' comment stutter.
…ecovery

Assert that after successful agent switch, the cache holds the new
agent ID, agentLoaded is true, and the connection is cached. Without
this, a removed assignment would silently cause an extra error cycle
on the next cache-hit call.
@mafredri mafredri force-pushed the mafredri/fix-stale-agent-liveness branch from b6785d8 to 1c8b8c9 Compare April 22, 2026 17:33
@mafredri mafredri requested a review from johnstcn April 22, 2026 17:37
@mafredri mafredri marked this pull request as ready for review April 22, 2026 17:37
Copy link
Copy Markdown
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤞

@mafredri mafredri merged commit 1ace519 into main Apr 22, 2026
31 checks passed
@mafredri mafredri deleted the mafredri/fix-stale-agent-liveness branch April 22, 2026 18:49
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 22, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants