fix: recover chatd from poisoned chain anchor on retry#25097
Conversation
When OpenAI's Responses API returns 'Previous response with id ... not found' for a chained turn, classify it as ChainBroken, clear previous_response_id, exit chain mode, reload full history, and let chatretry retry. Self-heals chats whose anchor was poisoned before the truncated-stream fix in #25074.
Docs preview📖 View docs preview for |
|
/coder-agents-review |
There was a problem hiding this comment.
Clean fix for a real production incident. The error classification is precise (two co-occurring substrings, not just "not found"), the recovery ordering is correct (DisableChainMode before ReloadMessages), degradation is graceful on reload failure, and the test suite covers positive, negative, and defensive cases at a 1:4 code-to-test ratio. The metric label design is a good addition.
Three P3 findings, three nits. All are test or documentation quality; no production defects.
"finishingStream() returns FinishReasonStop, which sets stoppedByModel = true and breaks the outer step loop after one step. There is no second step. The test only proves the retry within the same step sees cleared options, which TestRun_ChainBrokenRecovers already covers." (Bisky)
coderd/x/chatd/chatloop/metrics_test.go:571
Nit [CODER-3] This assertion checks provider, model, and kind but omits chain_broken. Since requireCounter does subset matching, it passes silently. The other updated call sites (lines 314-318, line 38) include the new label. Adding "chain_broken": "false" would be consistent and catch a future label-value regression. (Bisky)
🤖
🤖 This review was automatically generated with Coder Agents.
…en recovery - Use slices.Clone instead of make+copy when rebuilding the prompt. - Compare full reloaded history in TestRun_ChainBrokenRecovers, not just length, so a future copy bug surfaces. - Add chain_broken="false" to TestRun_StreamRetry_RecordsMetric. - Document that recoverFromChainBroken bypasses PrepareMessages, that this is safe today because PrepareMessages already ran for the step and ReloadMessages handles the snapshot, and to re-evaluate if either callback gains responsibility the other does not duplicate. - Drop the redundant opts.ProviderOptions sync in recoverFromChainBroken: the existing post-step chain-exit path in Run already clears it. Pass opts by value instead of by pointer. - Rename TestRun_ChainBrokenPersistsAcrossSteps to TestRun_ChainBrokenComposesWithPostStepChainExit to match what it actually proves; force a real second step via a tool-call response so the test exercises the cross-step path.
There was a problem hiding this comment.
Pull request overview
This PR adds a self-healing path for OpenAI Responses “chain mode” when the stored previous_response_id becomes invalid (“Previous response with id … not found”), allowing chatd to recover poisoned chain anchors via retry while also exposing the new condition in internal metrics.
Changes:
- Detect OpenAI chain-anchor 404s in
chaterror.Classifyvia a new internalChainBrokenflag (keeps user-visibleChatErrorKindstable). - On retry when
ChainBrokenis set, clearprevious_response_id, disable chain mode, and reload full message history before retrying. - Add a
chain_brokenlabel tocoderd_chatd_stream_retries_totaland update docs + generated metrics output accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/metricsdocgen/generated_metrics | Updates generated metric series to include the new chain_broken label. |
| docs/admin/integrations/prometheus.md | Documents the additional chain_broken label on coderd_chatd_stream_retries_total. |
| coderd/x/chatd/chatloop/metrics.go | Adds chain_broken label to stream retry counter and records it from classification. |
| coderd/x/chatd/chatloop/metrics_test.go | Updates metrics tests to account for the new chain_broken label. |
| coderd/x/chatd/chatloop/chatloop.go | Implements best-effort chain-broken recovery during retry (clear anchor, exit chain mode, reload history). |
| coderd/x/chatd/chatloop/chain_broken_test.go | Adds focused tests for chain-broken retry recovery behavior and edge cases. |
| coderd/x/chatd/chaterror/classify.go | Adds ChainBroken to classification and a matcher for the OpenAI “previous response … not found” error. |
| coderd/x/chatd/chaterror/classify_test.go | Adds test coverage ensuring chain-broken detection, retryability, and wrap/round-trip behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // ReloadMessages refreshes the same state (instruction injection, | ||
| // advisor snapshot). Revisit if either callback diverges. | ||
| // - Bypasses the Anthropic prompt-cache and sanitize passes. Safe | ||
| // while chain mode is OpenAI-only; extend if chain mode does. |
There was a problem hiding this comment.
These caveats are buried kind of deep, I'm sure an LLM can discover these but a human may not. Also, what happens if ReloadMessages is set but not DisableChainMode?
There was a problem hiding this comment.
Refactored to make it more explicit.
| classified = classified.WithProvider(provider) | ||
| opts.Metrics.RecordStreamRetry(provider, modelName, classified) | ||
| if classified.ChainBroken { | ||
| recoverFromChainBroken(ctx, opts, &call, &messages) |
There was a problem hiding this comment.
Earlier we copied messages into prepared, here we ignore changes to prepared (e.g. sanitize) and overwrite call.Prompt with messages that have not been. Also, I'd recommend returning messages rather than mutating as this gets hard to reason about.
Edit: I don't know which behavior is correct, but if this is the right behavior, a comment may be warranted explaining the reasoning.
There was a problem hiding this comment.
Refactored to instead return an updated slice instead of mutating.
379bd1f to
d175a6f
Compare
d175a6f to
1d4d74d
Compare
|
Smoke-tested ✅ |
When OpenAI's Responses API returns
Previous response with id ... not foundfor a chained turn, classify it as aChainBrokenretry, clearprevious_response_id, exit chain mode, reload full history, and letchatretryretry. Self-heals chats whose anchor was poisoned before #25074 stopped truncated streams from being persisted as a successful turn with a stored response id.The new state is exposed via the existing
coderd_chatd_stream_retries_totalcounter as achain_broken="true"|"false"label. Aggregating queries (sum,rateoverprovider/model/kind) keep working without changes; raw-series matchers without aggregation will now see two series per(provider, model, kind)where they previously saw one. The metric is internal-only so the blast radius should be small, but if you have dashboards that index by exact label matchers without aggregation they will need an extrasumor an explicitchain_brokenselector.Investigation notes
Reproduced on
dev.coder.comchatb2a15dd9-...: the anchor assistant message (id2393729, 2026-05-05 15:53:18Z) was an empty stub, one emptyreasoningpart and usage with onlycontext_limitset. Classic truncated-Responses-stream-as-success: theresponse.idgot persisted asprovider_response_idbut OpenAI never finalized the response, so subsequent chained turns 404 against it forever.chaterror.Classifydid not recognize that 404, so the chat was stuck on every follow-up turn until the row was manually patched.#25074 prevents new poisoning. This change recovers chats that were already poisoned and protects against any future shape of "chain anchor we have is no longer valid".
cc @hugodutka