fix(coderd/x/chatd): remove cache-miss check blocking agent recovery#24634
Conversation
|
@codex review
|
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Clean, well-scoped fix. The 7-line removal is proportional to the problem, the regression test is genuine (confirmed red on base SHA), and TestGetWorkspaceConn_StaleAgentRecovery exercises the full stale-binding recovery chain. TimedOutAgentCacheHit fills a pre-existing coverage gap. No mechanical issues (Netero clean). 1 P3, 2 Nits.
"I tried to build a case against this change and could not." - Pariston
PS. Commit subject is 84 characters (12 over the 72-char convention). Consider trimming: fix(coderd/x/chatd): remove cache-miss check blocking agent recovery (69 chars). Nit [DEREM-4]
🤖 This review was automatically generated with Coder Agents.
|
In reply to AMREM-1 (Nit, Shortened PR title to
|
|
@codex review
|
|
Codex Review: Didn't find any major issues. Hooray! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
DEREM-1, DEREM-2, DEREM-4 all addressed cleanly. TestGetWorkspaceConn_SameBuildAgentCrash is well-constructed, the comment stutter is gone, and the title is within bounds. One new P3 from this round.
"If that assignment were removed, the test still passes. On the next cache-hit call,
agentID := c.agent.IDwould resolve to the old (disconnected) agent ID, the status check would find it unreachable, clear the cache, and force a full re-dial." - Bisky
🤖 This review was automatically generated with Coder Agents.
… agent recovery The cache-miss isAgentUnreachable check added in #24336 runs before dialWithLazyValidation, preventing the existing switch mechanism from discovering the new agent after a workspace rebuild. The chat's stale agent binding is never repaired, causing an infinite loop of 'agent is disconnected' errors. Remove the cache-miss check. The cache-hit check remains (it verifies the agent behind an established connection). The dial timeout and dialWithLazyValidation already bound the cache-miss failure path. Closes CODAGT-248
…utter Address review feedback: - Add TestGetWorkspaceConn_SameBuildAgentCrash covering the path where a crashed agent is still in the latest build. Verifies the wrapped dial error propagates (not errChatAgentDisconnected) and cache state. - Fix 'validation validation' comment stutter.
…ecovery Assert that after successful agent switch, the cache holds the new agent ID, agentLoaded is true, and the connection is cached. Without this, a removed assignment would silently cause an extra error cycle on the next cache-hit call.
b6785d8 to
1c8b8c9
Compare
PR #24336 added an
isAgentUnreachablecheck on the cache-miss path ofgetWorkspaceConn. When a workspace is rebuilt, the chat's stored agent ID points to a disconnected agent from the old build. The check rejects it beforedialWithLazyValidationcan discover the new agent, causing an infinite loop of "agent is disconnected" errors.Remove the cache-miss check. The cache-hit check (re-fetches agent row for fresh heartbeat) remains. The dial timeout and
dialWithLazyValidationalready bound the cache-miss failure path.Known tradeoff: genuinely stopped workspaces take ~5s (
workspaceDialValidationDelay) to error on cache-miss instead of immediately. This is the pre-#24336 behavior.Closes CODAGT-248
Implementation context
Approach: Remove the 7-line cache-miss
isAgentUnreachableblock (Approach A from research). The cache-miss path already haddialWithLazyValidationfor stale binding recovery and a 30s dial timeout as absolute bound.Test changes:
TestGetWorkspaceConn_StaleAgentRecovery: regression test that exercises the full recovery path (stale agent dial fails, validation discovers new agent, switch + persist binding). Red on pre-fix code, green after.TimedOutAgentCacheHit: covers the timeout branch of the cache-hit status check (coverage gap found during review).DisconnectedAgentCacheMiss,TimedOutAgentCacheMiss,ConnectingAgentProceeds(tested removed behavior).TestGetWorkspaceConn_StatusCheckto cache-hit-only cases.Why ~5s not 30s for stopped workspaces: When the workspace is genuinely stopped, the dial to the old agent fails fast. Validation fires at
workspaceDialValidationDelay(5s), finds no agents, anderrChatHasNoWorkspaceAgentshort-circuitsdialWithLazyValidationwithout waiting for the full 30s dial timeout.