fix: recover chatd from poisoned chain anchor on retry by johnstcn · Pull Request #25097 · coder/coder

johnstcn · 2026-05-08T21:39:50Z

When OpenAI's Responses API returns Previous response with id ... not found for a chained turn, classify it as a ChainBroken retry, clear previous_response_id, exit chain mode, reload full history, and let chatretry retry. Self-heals chats whose anchor was poisoned before #25074 stopped truncated streams from being persisted as a successful turn with a stored response id.

The new state is exposed via the existing coderd_chatd_stream_retries_total counter as a chain_broken="true"|"false" label. Aggregating queries (sum, rate over provider/model/kind) keep working without changes; raw-series matchers without aggregation will now see two series per (provider, model, kind) where they previously saw one. The metric is internal-only so the blast radius should be small, but if you have dashboards that index by exact label matchers without aggregation they will need an extra sum or an explicit chain_broken selector.

Investigation notes

Reproduced on dev.coder.com chat b2a15dd9-...: the anchor assistant message (id 2393729, 2026-05-05 15:53:18Z) was an empty stub, one empty reasoning part and usage with only context_limit set. Classic truncated-Responses-stream-as-success: the response.id got persisted as provider_response_id but OpenAI never finalized the response, so subsequent chained turns 404 against it forever. chaterror.Classify did not recognize that 404, so the chat was stuck on every follow-up turn until the row was manually patched.

#25074 prevents new poisoning. This change recovers chats that were already poisoned and protects against any future shape of "chain anchor we have is no longer valid".

cc @hugodutka

🤖 This PR was created with the help of Coder Agents, and needs a human review. 🧑‍💻

When OpenAI's Responses API returns 'Previous response with id ... not found' for a chained turn, classify it as ChainBroken, clear previous_response_id, exit chain mode, reload full history, and let chatretry retry. Self-heals chats whose anchor was poisoned before the truncated-stream fix in #25074.

github-actions · 2026-05-08T21:39:58Z

Docs preview

📖 View docs preview for docs/admin/integrations/prometheus.md

johnstcn · 2026-05-08T22:44:58Z

/coder-agents-review

coder-agents-review

Clean fix for a real production incident. The error classification is precise (two co-occurring substrings, not just "not found"), the recovery ordering is correct (DisableChainMode before ReloadMessages), degradation is graceful on reload failure, and the test suite covers positive, negative, and defensive cases at a 1:4 code-to-test ratio. The metric label design is a good addition.

Three P3 findings, three nits. All are test or documentation quality; no production defects.

"finishingStream() returns FinishReasonStop, which sets stoppedByModel = true and breaks the outer step loop after one step. There is no second step. The test only proves the retry within the same step sees cleared options, which TestRun_ChainBrokenRecovers already covers." (Bisky)

coderd/x/chatd/chatloop/metrics_test.go:571

Nit [CODER-3] This assertion checks provider, model, and kind but omits chain_broken. Since requireCounter does subset matching, it passes silently. The other updated call sites (lines 314-318, line 38) include the new label. Adding "chain_broken": "false" would be consistent and catch a future label-value regression. (Bisky)

🤖

🤖 This review was automatically generated with Coder Agents.

…en recovery - Use slices.Clone instead of make+copy when rebuilding the prompt. - Compare full reloaded history in TestRun_ChainBrokenRecovers, not just length, so a future copy bug surfaces. - Add chain_broken="false" to TestRun_StreamRetry_RecordsMetric. - Document that recoverFromChainBroken bypasses PrepareMessages, that this is safe today because PrepareMessages already ran for the step and ReloadMessages handles the snapshot, and to re-evaluate if either callback gains responsibility the other does not duplicate. - Drop the redundant opts.ProviderOptions sync in recoverFromChainBroken: the existing post-step chain-exit path in Run already clears it. Pass opts by value instead of by pointer. - Rename TestRun_ChainBrokenPersistsAcrossSteps to TestRun_ChainBrokenComposesWithPostStepChainExit to match what it actually proves; force a real second step via a tool-call response so the test exercises the cross-step path.

Copilot

Pull request overview

This PR adds a self-healing path for OpenAI Responses “chain mode” when the stored previous_response_id becomes invalid (“Previous response with id … not found”), allowing chatd to recover poisoned chain anchors via retry while also exposing the new condition in internal metrics.

Changes:

Detect OpenAI chain-anchor 404s in chaterror.Classify via a new internal ChainBroken flag (keeps user-visible ChatErrorKind stable).
On retry when ChainBroken is set, clear previous_response_id, disable chain mode, and reload full message history before retrying.
Add a chain_broken label to coderd_chatd_stream_retries_total and update docs + generated metrics output accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
scripts/metricsdocgen/generated_metrics	Updates generated metric series to include the new `chain_broken` label.
docs/admin/integrations/prometheus.md	Documents the additional `chain_broken` label on `coderd_chatd_stream_retries_total`.
coderd/x/chatd/chatloop/metrics.go	Adds `chain_broken` label to stream retry counter and records it from classification.
coderd/x/chatd/chatloop/metrics_test.go	Updates metrics tests to account for the new `chain_broken` label.
coderd/x/chatd/chatloop/chatloop.go	Implements best-effort chain-broken recovery during retry (clear anchor, exit chain mode, reload history).
coderd/x/chatd/chatloop/chain_broken_test.go	Adds focused tests for chain-broken retry recovery behavior and edge cases.
coderd/x/chatd/chaterror/classify.go	Adds `ChainBroken` to classification and a matcher for the OpenAI “previous response … not found” error.
coderd/x/chatd/chaterror/classify_test.go	Adds test coverage ensuring chain-broken detection, retryability, and wrap/round-trip behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

mafredri · 2026-05-11T11:10:09Z

+//     ReloadMessages refreshes the same state (instruction injection,
+//     advisor snapshot). Revisit if either callback diverges.
+//   - Bypasses the Anthropic prompt-cache and sanitize passes. Safe
+//     while chain mode is OpenAI-only; extend if chain mode does.


These caveats are buried kind of deep, I'm sure an LLM can discover these but a human may not. Also, what happens if ReloadMessages is set but not DisableChainMode?

Refactored to make it more explicit.

mafredri · 2026-05-11T11:14:48Z

 				classified = classified.WithProvider(provider)
 				opts.Metrics.RecordStreamRetry(provider, modelName, classified)
+				if classified.ChainBroken {
+					recoverFromChainBroken(ctx, opts, &call, &messages)


Earlier we copied messages into prepared, here we ignore changes to prepared (e.g. sanitize) and overwrite call.Prompt with messages that have not been. Also, I'd recommend returning messages rather than mutating as this gets hard to reason about.

Edit: I don't know which behavior is correct, but if this is the right behavior, a comment may be warranted explaining the reasoning.

Refactored to instead return an updated slice instead of mutating.

johnstcn · 2026-05-11T16:42:59Z

Smoke-tested ✅

github-actions Bot assigned johnstcn May 8, 2026

coder-agents-review Bot reviewed May 8, 2026

View reviewed changes

johnstcn added 3 commits May 9, 2026 07:29

cleanup

fce9ebf

Merge branch 'main' into cian/fix-chatd-chain-broken-recovery

53327f3

johnstcn marked this pull request as ready for review May 11, 2026 10:05

Copilot AI review requested due to automatic review settings May 11, 2026 10:05

Copilot started reviewing on behalf of johnstcn May 11, 2026 10:06 View session

Copilot AI reviewed May 11, 2026

View reviewed changes

Comment thread coderd/x/chatd/chaterror/classify.go Outdated

Comment thread coderd/x/chatd/chatloop/chain_broken_test.go Outdated

address nit

d57f0bf

johnstcn requested review from ethanndickson, hugodutka and ibetitsmike May 11, 2026 10:34

ethanndickson approved these changes May 11, 2026

View reviewed changes

Comment thread coderd/x/chatd/chatloop/chain_broken_test.go Outdated

unlintification

806c45e

mafredri reviewed May 11, 2026

View reviewed changes

johnstcn force-pushed the cian/fix-chatd-chain-broken-recovery branch from 379bd1f to d175a6f Compare May 11, 2026 14:49

coder deleted a comment from github-actions Bot May 11, 2026

address feedback

1d4d74d

johnstcn force-pushed the cian/fix-chatd-chain-broken-recovery branch from d175a6f to 1d4d74d Compare May 11, 2026 14:51

coder deleted a comment from github-actions Bot May 11, 2026

johnstcn merged commit e8508b2 into main May 11, 2026
47 of 49 checks passed

johnstcn deleted the cian/fix-chatd-chain-broken-recovery branch May 11, 2026 16:43

github-actions Bot locked and limited conversation to collaborators May 11, 2026

Conversation

johnstcn commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 8, 2026

Docs preview

Uh oh!

johnstcn commented May 8, 2026

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri May 11, 2026

Choose a reason for hiding this comment

Uh oh!

johnstcn May 11, 2026

Choose a reason for hiding this comment

Uh oh!

mafredri May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

johnstcn May 11, 2026

Choose a reason for hiding this comment

Uh oh!

johnstcn commented May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

johnstcn commented May 8, 2026 •

edited

Loading

mafredri May 11, 2026 •

edited

Loading