Skip to content

fix: recover chatd from poisoned chain anchor on retry#25097

Merged
johnstcn merged 7 commits into
mainfrom
cian/fix-chatd-chain-broken-recovery
May 11, 2026
Merged

fix: recover chatd from poisoned chain anchor on retry#25097
johnstcn merged 7 commits into
mainfrom
cian/fix-chatd-chain-broken-recovery

Conversation

@johnstcn
Copy link
Copy Markdown
Member

@johnstcn johnstcn commented May 8, 2026

When OpenAI's Responses API returns Previous response with id ... not found for a chained turn, classify it as a ChainBroken retry, clear previous_response_id, exit chain mode, reload full history, and let chatretry retry. Self-heals chats whose anchor was poisoned before #25074 stopped truncated streams from being persisted as a successful turn with a stored response id.

The new state is exposed via the existing coderd_chatd_stream_retries_total counter as a chain_broken="true"|"false" label. Aggregating queries (sum, rate over provider/model/kind) keep working without changes; raw-series matchers without aggregation will now see two series per (provider, model, kind) where they previously saw one. The metric is internal-only so the blast radius should be small, but if you have dashboards that index by exact label matchers without aggregation they will need an extra sum or an explicit chain_broken selector.

Investigation notes

Reproduced on dev.coder.com chat b2a15dd9-...: the anchor assistant message (id 2393729, 2026-05-05 15:53:18Z) was an empty stub, one empty reasoning part and usage with only context_limit set. Classic truncated-Responses-stream-as-success: the response.id got persisted as provider_response_id but OpenAI never finalized the response, so subsequent chained turns 404 against it forever. chaterror.Classify did not recognize that 404, so the chat was stuck on every follow-up turn until the row was manually patched.

#25074 prevents new poisoning. This change recovers chats that were already poisoned and protects against any future shape of "chain anchor we have is no longer valid".

cc @hugodutka

🤖 This PR was created with the help of Coder Agents, and needs a human review. 🧑‍💻

When OpenAI's Responses API returns 'Previous response with id ... not found' for a chained turn, classify it as ChainBroken, clear previous_response_id, exit chain mode, reload full history, and let chatretry retry. Self-heals chats whose anchor was poisoned before the truncated-stream fix in #25074.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

Docs preview

📖 View docs preview for docs/admin/integrations/prometheus.md

@johnstcn
Copy link
Copy Markdown
Member Author

johnstcn commented May 8, 2026

/coder-agents-review

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean fix for a real production incident. The error classification is precise (two co-occurring substrings, not just "not found"), the recovery ordering is correct (DisableChainMode before ReloadMessages), degradation is graceful on reload failure, and the test suite covers positive, negative, and defensive cases at a 1:4 code-to-test ratio. The metric label design is a good addition.

Three P3 findings, three nits. All are test or documentation quality; no production defects.

"finishingStream() returns FinishReasonStop, which sets stoppedByModel = true and breaks the outer step loop after one step. There is no second step. The test only proves the retry within the same step sees cleared options, which TestRun_ChainBrokenRecovers already covers." (Bisky)


coderd/x/chatd/chatloop/metrics_test.go:571

Nit [CODER-3] This assertion checks provider, model, and kind but omits chain_broken. Since requireCounter does subset matching, it passes silently. The other updated call sites (lines 314-318, line 38) include the new label. Adding "chain_broken": "false" would be consistent and catch a future label-value regression. (Bisky)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatloop/chain_broken_test.go Outdated
Comment thread coderd/x/chatd/chatloop/chain_broken_test.go Outdated
Comment thread coderd/x/chatd/chatloop/metrics.go
Comment thread coderd/x/chatd/chatloop/chatloop.go Outdated
Comment thread coderd/x/chatd/chatloop/chatloop.go Outdated
johnstcn added 3 commits May 9, 2026 07:29
…en recovery

- Use slices.Clone instead of make+copy when rebuilding the prompt.
- Compare full reloaded history in TestRun_ChainBrokenRecovers, not just length, so a future copy bug surfaces.
- Add chain_broken="false" to TestRun_StreamRetry_RecordsMetric.
- Document that recoverFromChainBroken bypasses PrepareMessages, that this is safe today because PrepareMessages already ran for the step and ReloadMessages handles the snapshot, and to re-evaluate if either callback gains responsibility the other does not duplicate.
- Drop the redundant opts.ProviderOptions sync in recoverFromChainBroken: the existing post-step chain-exit path in Run already clears it. Pass opts by value instead of by pointer.
- Rename TestRun_ChainBrokenPersistsAcrossSteps to TestRun_ChainBrokenComposesWithPostStepChainExit to match what it actually proves; force a real second step via a tool-call response so the test exercises the cross-step path.
@johnstcn johnstcn marked this pull request as ready for review May 11, 2026 10:05
Copilot AI review requested due to automatic review settings May 11, 2026 10:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a self-healing path for OpenAI Responses “chain mode” when the stored previous_response_id becomes invalid (“Previous response with id … not found”), allowing chatd to recover poisoned chain anchors via retry while also exposing the new condition in internal metrics.

Changes:

  • Detect OpenAI chain-anchor 404s in chaterror.Classify via a new internal ChainBroken flag (keeps user-visible ChatErrorKind stable).
  • On retry when ChainBroken is set, clear previous_response_id, disable chain mode, and reload full message history before retrying.
  • Add a chain_broken label to coderd_chatd_stream_retries_total and update docs + generated metrics output accordingly.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
scripts/metricsdocgen/generated_metrics Updates generated metric series to include the new chain_broken label.
docs/admin/integrations/prometheus.md Documents the additional chain_broken label on coderd_chatd_stream_retries_total.
coderd/x/chatd/chatloop/metrics.go Adds chain_broken label to stream retry counter and records it from classification.
coderd/x/chatd/chatloop/metrics_test.go Updates metrics tests to account for the new chain_broken label.
coderd/x/chatd/chatloop/chatloop.go Implements best-effort chain-broken recovery during retry (clear anchor, exit chain mode, reload history).
coderd/x/chatd/chatloop/chain_broken_test.go Adds focused tests for chain-broken retry recovery behavior and edge cases.
coderd/x/chatd/chaterror/classify.go Adds ChainBroken to classification and a matcher for the OpenAI “previous response … not found” error.
coderd/x/chatd/chaterror/classify_test.go Adds test coverage ensuring chain-broken detection, retryability, and wrap/round-trip behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread coderd/x/chatd/chaterror/classify.go Outdated
Comment thread coderd/x/chatd/chatloop/chain_broken_test.go Outdated
Comment thread coderd/x/chatd/chatloop/chain_broken_test.go Outdated
Comment thread coderd/x/chatd/chatloop/chatloop.go Outdated
// ReloadMessages refreshes the same state (instruction injection,
// advisor snapshot). Revisit if either callback diverges.
// - Bypasses the Anthropic prompt-cache and sanitize passes. Safe
// while chain mode is OpenAI-only; extend if chain mode does.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These caveats are buried kind of deep, I'm sure an LLM can discover these but a human may not. Also, what happens if ReloadMessages is set but not DisableChainMode?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to make it more explicit.

Comment thread coderd/x/chatd/chatloop/chatloop.go Outdated
classified = classified.WithProvider(provider)
opts.Metrics.RecordStreamRetry(provider, modelName, classified)
if classified.ChainBroken {
recoverFromChainBroken(ctx, opts, &call, &messages)
Copy link
Copy Markdown
Member

@mafredri mafredri May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier we copied messages into prepared, here we ignore changes to prepared (e.g. sanitize) and overwrite call.Prompt with messages that have not been. Also, I'd recommend returning messages rather than mutating as this gets hard to reason about.

Edit: I don't know which behavior is correct, but if this is the right behavior, a comment may be warranted explaining the reasoning.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to instead return an updated slice instead of mutating.

@johnstcn johnstcn force-pushed the cian/fix-chatd-chain-broken-recovery branch from 379bd1f to d175a6f Compare May 11, 2026 14:49
@coder coder deleted a comment from github-actions Bot May 11, 2026
@johnstcn johnstcn force-pushed the cian/fix-chatd-chain-broken-recovery branch from d175a6f to 1d4d74d Compare May 11, 2026 14:51
@coder coder deleted a comment from github-actions Bot May 11, 2026
@johnstcn
Copy link
Copy Markdown
Member Author

Smoke-tested ✅

@johnstcn johnstcn merged commit e8508b2 into main May 11, 2026
47 of 49 checks passed
@johnstcn johnstcn deleted the cian/fix-chatd-chain-broken-recovery branch May 11, 2026 16:43
@github-actions github-actions Bot locked and limited conversation to collaborators May 11, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants