Skip to content

fix(coderd/x/chatd/chatdebug): record SSE attempts on EOF#24565

Merged
ThomasK33 merged 6 commits into
mainfrom
fix/chatdebug-sse-record-on-eof
Apr 22, 2026
Merged

fix(coderd/x/chatd/chatdebug): record SSE attempts on EOF#24565
ThomasK33 merged 6 commits into
mainfrom
fix/chatdebug-sse-record-on-eof

Conversation

@ThomasK33
Copy link
Copy Markdown
Member

@ThomasK33 ThomasK33 commented Apr 21, 2026

chat_turn debug steps persist with attempts: [] even when the
streaming call to Anthropic completes successfully. Fantasy's
Anthropic SSE adapter iterates the response to EOF via
for stream.Next() and abandons the body without calling Close(),
so RecordingTransport's Close-only recording path never fires and
the attempt is lost. Non-streaming runs (quickgen,
title_generation) go through model.Generate(...) and are
unaffected.

Record on io.EOF for text/event-stream bodies specifically.
Non-SSE responses stay on the Close-only path so JSON integrity,
content-length validation, and inner-Close() error semantics are
preserved. record() is already sync.Once-guarded, so a later
Close() is a no-op for recording.

Copy link
Copy Markdown
Member Author

@codex review

@ThomasK33 ThomasK33 requested a review from mafredri April 21, 2026 12:20
@ThomasK33 ThomasK33 marked this pull request as ready for review April 21, 2026 12:20
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cb37c15fd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread coderd/x/chatd/chatdebug/transport.go
Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-scoped fix: SSE bodies consumed to EOF without Close() now record the attempt eagerly, solving the empty attempts: [] in chat_turn debug steps. The recordOnce guard, content-type gating, and test coverage are all sound. 16 reviewers, all tests pass under -race.

1 P1 (CI blocker), 1 P3, 2 Nits.

The P1 is a missing //nolint:bodyclose on the deliberately-unclosed response. CI will reject the PR as-is.

"The one thread worth pulling is the body leak hiding behind the recording fix." -- Hisoka

Multiple reviewers (Mafu-san, Pariston, Meruem) verified the root cause is universal across all fantasy SSE providers, not just Anthropic as the description states. The code fix is correctly scoped to all text/event-stream bodies, so it already covers them all. Consider updating the description to reflect this.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatdebug/transport_test.go Outdated
Comment thread coderd/x/chatd/chatdebug/transport_test.go
Comment thread coderd/x/chatd/chatdebug/transport.go Outdated
Comment thread coderd/x/chatd/chatdebug/transport.go Outdated
@ThomasK33
Copy link
Copy Markdown
Member Author

@codex review

@ThomasK33 ThomasK33 requested a review from mafredri April 21, 2026 13:03
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3b3f01c83

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread coderd/x/chatd/chatdebug/transport.go Outdated
@ThomasK33
Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four R1 findings addressed cleanly. The author also introduced a provisional/upgrade recording mechanism (responding to the Codex bot's P2), which is a meaningful expansion of the original change.

R2: 1 P3, 2 Nits.

The P3 is a narrow race window in the new recordProvisional mechanism. Takumi, Meruem, and Hisoka all confirmed the interleaving is real but requires concurrent Read+Close from different goroutines, which the SSE adapter never does. Hisoka also confirmed the alternative ordering (flag-before-record) would be strictly worse, producing duplicate entries. Knov agrees the current design is the correct pragmatic choice. Debug recording only; no functional impact.

"The current ordering (append first, flag second) avoids the duplicate at the cost of a window where Close() misses the flag and its error is silently swallowed via a no-op recordOnce.Do." -- Hisoka

The buildAttempt refactor is clean. replaceByNumber is minimal and safe under sink.mu. Tests are genuine: all three new tests fail if the feature is reverted.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatdebug/transport.go
Comment thread coderd/x/chatd/chatdebug/transport.go Outdated
Comment thread coderd/x/chatd/chatdebug/transport_test.go Outdated
@ThomasK33 ThomasK33 force-pushed the fix/chatdebug-sse-record-on-eof branch from 46c686d to fa4e6d7 Compare April 21, 2026 14:38
@ThomasK33
Copy link
Copy Markdown
Member Author

@codex review

@ThomasK33 ThomasK33 requested a review from mafredri April 21, 2026 14:39
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fa4e6d7b14

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread coderd/x/chatd/chatdebug/transport.go
Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three R2 findings addressed. The race fix (DEREM-7) serializes recordProvisional and Close's closeErr branch under r.mu, which is correct. But the fix introduced a new lock-ordering inversion between record() and recordProvisional() that deadlocks under concurrent Read/Close.

R3: 1 P2.

record() enters recordOnce.Do first then acquires r.mu (via buildAttempt). recordProvisional() acquires r.mu first then enters recordOnce.Do. Close's switch block (lines 324-340) calls record() without holding r.mu, so it can race with recordProvisional on the Read goroutine. The fix is to make record() hold r.mu before entering recordOnce.Do, matching recordProvisional's order. Meruem provides the exact fix.

Note: Takumi, Mafuuu, and Kite all analyzed the recordProvisional vs Close closeErr branch interaction and confirmed it's correctly serialized. The deadlock is in a different path: Close's switch block (non-error, sawEOF case) calling r.record() without r.mu.

"This deadlock did not exist before this PR. The pre-existing record() was the only consumer of recordOnce, so no lock-ordering conflict was possible." -- Meruem

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatdebug/transport.go
@ThomasK33 ThomasK33 force-pushed the fix/chatdebug-sse-record-on-eof branch from fa4e6d7 to 6c896e5 Compare April 21, 2026 18:18
@ThomasK33
Copy link
Copy Markdown
Member Author

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Keep them coming!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ThomasK33 ThomasK33 requested review from mafredri and removed request for mafredri April 21, 2026 19:05
Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status: APPROVED

All 11 findings across 4 rounds addressed. CI green. No open findings.

R4 panel (Takumi, Meruem, Mafuuu) verified the deadlock fix (DEREM-11): record() now acquires r.mu before recordOnce.Do, matching recordProvisional()'s order. Lock ordering is consistent (r.mu then s.mu) in all paths. Mafuuu confirmed both race orderings between recordProvisional and Close produce correct results. Regression test TestRecordingBody_SSEConcurrentReadCloseNoDeadlock runs 200 iterations under -race.

The fix evolved from a simple SSE EOF recording (R1) through a provisional/upgrade mechanism (R2), serialization of the provisional flag (R3), and consistent lock ordering (R4). Each round addressed the prior findings and each new mechanism was verified by the panel. The final implementation is clean, well-tested, and correctly scoped.

🤖 This review was automatically generated with Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Status: APPROVED

Head SHA unchanged from R4 (6c896e5). No new code, no new substantive comments. Netero and panel (Mafuuu, Meruem, Takumi) all confirm no findings. R4 approval stands.

All 11 findings across 4 rounds addressed. CI green. Lock ordering verified consistent across 5 rounds of review.

🤖 This review was automatically generated with Coder Agents.

Fantasy's Anthropic SSE adapter iterates the response to EOF and
abandons the body without calling Close(), so RecordingTransport's
Close-only recording path never fires for chat_turn streams and the
step's attempts array persists as empty. Record on EOF for
text/event-stream bodies specifically — non-SSE responses keep the
Close-only path so JSON integrity checks, content-length validation,
and inner-Close error paths keep their existing semantics. record()
is already sync.Once-guarded, so a later Close() is a no-op for
recording.

Also add FIXES_TBD.md documenting this fix and the upstream
counterpart in fantasy (defer stream.Close() in the SSE Stream
adapter) that would let us revert to Close-only recording once the
upstream change ships.

Change-Id: I6aed3dcd8e7c63b03d10c85ab132e77e39c43d69
Signed-off-by: Thomas Kosiewski <tk@coder.com>
…cording

Recording SSE attempts eagerly on io.EOF in Read consumed recordOnce
before Close() ran, so a subsequent inner.Close() error was dropped
from the attempt even though Close() still returned it to the caller.
The attempt persisted as completed with an empty error, regressing
the close-error semantics the Close()-only path preserved for other
content types.

Treat the SSE EOF recording as provisional: flag it after the sink
append so a concurrent Close() cannot observe the flag before the
entry exists. When inner.Close() fails, build the final attempt from
that error and overwrite the provisional entry via a new
attemptSink.replaceByNumber helper. Non-provisional record calls and
every non-SSE path are unchanged.

Change-Id: I64945dee913a4e2639a0e9f60652865a847f9c4a
Signed-off-by: Thomas Kosiewski <tk@coder.com>
- Suppress bodyclose lint on SSE EOF-only recording test.
- Restore "why" in Read() non-EOF comment.
- Trim isSSEContentType doc to avoid duplicating call-site rationale.
- Cover zero-byte SSE body (immediate EOF) boundary.

Change-Id: I563caf334d2c9a6eb3856afc96b7c304432934d4
Signed-off-by: Thomas Kosiewski <tk@coder.com>
recordProvisional previously published the completed attempt via
recordOnce before setting recordedProvisional, so a concurrent Close
that returned an error could observe recordedProvisional=false and fall
through to r.record(closeErr), which became a no-op once recordOnce had
already fired. The close error was dropped and the attempt stayed
marked completed.

Hold r.mu across both the flag check/update and the sink publish in
recordProvisional and Close's closeErr branch so the publish and flag
transition commit together.

Change-Id: I6f6cfd6cf6a9c168ac3a387b899fbe11ee65d2e5
Signed-off-by: Thomas Kosiewski <tk@coder.com>
Change-Id: Id7f7e6d8e9135ed18a95ec29752ec5e95375c113
Signed-off-by: Thomas Kosiewski <tk@coder.com>
… deadlock

Change-Id: Ieea34beabc560976ed81a198c057710d2d474a66
Signed-off-by: Thomas Kosiewski <tk@coder.com>
@ThomasK33 ThomasK33 force-pushed the fix/chatdebug-sse-record-on-eof branch from 6c896e5 to efc7c88 Compare April 22, 2026 11:41
@ThomasK33 ThomasK33 dismissed stale reviews from mafredri and coder-agents-review[bot] April 22, 2026 11:42

stale

Copy link
Copy Markdown
Member Author

ThomasK33 commented Apr 22, 2026

Merge activity

  • Apr 22, 1:01 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 22, 1:02 PM UTC: @ThomasK33 merged this pull request with Graphite.

@ThomasK33 ThomasK33 merged commit 26b64fa into main Apr 22, 2026
23 checks passed
@ThomasK33 ThomasK33 deleted the fix/chatdebug-sse-record-on-eof branch April 22, 2026 13:02
johnstcn pushed a commit that referenced this pull request Apr 22, 2026
`chat_turn` debug steps persist with `attempts: []` even when the
streaming call to Anthropic completes successfully. Fantasy's
Anthropic SSE adapter iterates the response to EOF via
`for stream.Next()` and abandons the body without calling `Close()`,
so `RecordingTransport`'s Close-only recording path never fires and
the attempt is lost. Non-streaming runs (`quickgen`,
`title_generation`) go through `model.Generate(...)` and are
unaffected.

Record on `io.EOF` for `text/event-stream` bodies specifically.
Non-SSE responses stay on the Close-only path so JSON integrity,
content-length validation, and inner-`Close()` error semantics are
preserved. `record()` is already `sync.Once`-guarded, so a later
`Close()` is a no-op for recording.
ibetitsmike pushed a commit that referenced this pull request Apr 22, 2026
`chat_turn` debug steps persist with `attempts: []` even when the
streaming call to Anthropic completes successfully. Fantasy's
Anthropic SSE adapter iterates the response to EOF via
`for stream.Next()` and abandons the body without calling `Close()`,
so `RecordingTransport`'s Close-only recording path never fires and
the attempt is lost. Non-streaming runs (`quickgen`,
`title_generation`) go through `model.Generate(...)` and are
unaffected.

Record on `io.EOF` for `text/event-stream` bodies specifically.
Non-SSE responses stay on the Close-only path so JSON integrity,
content-length validation, and inner-`Close()` error semantics are
preserved. `record()` is already `sync.Once`-guarded, so a later
`Close()` is a no-op for recording.
sreya pushed a commit that referenced this pull request Apr 22, 2026
`chat_turn` debug steps persist with `attempts: []` even when the
streaming call to Anthropic completes successfully. Fantasy's
Anthropic SSE adapter iterates the response to EOF via
`for stream.Next()` and abandons the body without calling `Close()`,
so `RecordingTransport`'s Close-only recording path never fires and
the attempt is lost. Non-streaming runs (`quickgen`,
`title_generation`) go through `model.Generate(...)` and are
unaffected.

Record on `io.EOF` for `text/event-stream` bodies specifically.
Non-SSE responses stay on the Close-only path so JSON integrity,
content-length validation, and inner-`Close()` error semantics are
preserved. `record()` is already `sync.Once`-guarded, so a later
`Close()` is a no-op for recording.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants