fix(coderd/x/chatd/chatdebug): record SSE attempts on EOF#24565
Conversation
96246fe to
6cb37c1
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6cb37c15fd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
mafredri
left a comment
There was a problem hiding this comment.
Well-scoped fix: SSE bodies consumed to EOF without Close() now record the attempt eagerly, solving the empty attempts: [] in chat_turn debug steps. The recordOnce guard, content-type gating, and test coverage are all sound. 16 reviewers, all tests pass under -race.
1 P1 (CI blocker), 1 P3, 2 Nits.
The P1 is a missing //nolint:bodyclose on the deliberately-unclosed response. CI will reject the PR as-is.
"The one thread worth pulling is the body leak hiding behind the recording fix." -- Hisoka
Multiple reviewers (Mafu-san, Pariston, Meruem) verified the root cause is universal across all fantasy SSE providers, not just Anthropic as the description states. The code fix is correctly scoped to all text/event-stream bodies, so it already covers them all. Consider updating the description to reflect this.
🤖 This review was automatically generated with Coder Agents.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3b3f01c83
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
|
Codex Review: Didn't find any major issues. Chef's kiss. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
mafredri
left a comment
There was a problem hiding this comment.
All four R1 findings addressed cleanly. The author also introduced a provisional/upgrade recording mechanism (responding to the Codex bot's P2), which is a meaningful expansion of the original change.
R2: 1 P3, 2 Nits.
The P3 is a narrow race window in the new recordProvisional mechanism. Takumi, Meruem, and Hisoka all confirmed the interleaving is real but requires concurrent Read+Close from different goroutines, which the SSE adapter never does. Hisoka also confirmed the alternative ordering (flag-before-record) would be strictly worse, producing duplicate entries. Knov agrees the current design is the correct pragmatic choice. Debug recording only; no functional impact.
"The current ordering (append first, flag second) avoids the duplicate at the cost of a window where Close() misses the flag and its error is silently swallowed via a no-op recordOnce.Do." -- Hisoka
The buildAttempt refactor is clean. replaceByNumber is minimal and safe under sink.mu. Tests are genuine: all three new tests fail if the feature is reverted.
🤖 This review was automatically generated with Coder Agents.
46c686d to
fa4e6d7
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fa4e6d7b14
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
All three R2 findings addressed. The race fix (DEREM-7) serializes recordProvisional and Close's closeErr branch under r.mu, which is correct. But the fix introduced a new lock-ordering inversion between record() and recordProvisional() that deadlocks under concurrent Read/Close.
R3: 1 P2.
record() enters recordOnce.Do first then acquires r.mu (via buildAttempt). recordProvisional() acquires r.mu first then enters recordOnce.Do. Close's switch block (lines 324-340) calls record() without holding r.mu, so it can race with recordProvisional on the Read goroutine. The fix is to make record() hold r.mu before entering recordOnce.Do, matching recordProvisional's order. Meruem provides the exact fix.
Note: Takumi, Mafuuu, and Kite all analyzed the recordProvisional vs Close closeErr branch interaction and confirmed it's correctly serialized. The deadlock is in a different path: Close's switch block (non-error, sawEOF case) calling r.record() without r.mu.
"This deadlock did not exist before this PR. The pre-existing record() was the only consumer of recordOnce, so no lock-ordering conflict was possible." -- Meruem
🤖 This review was automatically generated with Coder Agents.
fa4e6d7 to
6c896e5
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. Keep them coming! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
There was a problem hiding this comment.
Status: APPROVED
All 11 findings across 4 rounds addressed. CI green. No open findings.
R4 panel (Takumi, Meruem, Mafuuu) verified the deadlock fix (DEREM-11): record() now acquires r.mu before recordOnce.Do, matching recordProvisional()'s order. Lock ordering is consistent (r.mu then s.mu) in all paths. Mafuuu confirmed both race orderings between recordProvisional and Close produce correct results. Regression test TestRecordingBody_SSEConcurrentReadCloseNoDeadlock runs 200 iterations under -race.
The fix evolved from a simple SSE EOF recording (R1) through a provisional/upgrade mechanism (R2), serialization of the provisional flag (R3), and consistent lock ordering (R4). Each round addressed the prior findings and each new mechanism was verified by the panel. The final implementation is clean, well-tested, and correctly scoped.
🤖 This review was automatically generated with Coder Agents.
There was a problem hiding this comment.
Status: APPROVED
Head SHA unchanged from R4 (6c896e5). No new code, no new substantive comments. Netero and panel (Mafuuu, Meruem, Takumi) all confirm no findings. R4 approval stands.
All 11 findings across 4 rounds addressed. CI green. Lock ordering verified consistent across 5 rounds of review.
🤖 This review was automatically generated with Coder Agents.
Fantasy's Anthropic SSE adapter iterates the response to EOF and abandons the body without calling Close(), so RecordingTransport's Close-only recording path never fires for chat_turn streams and the step's attempts array persists as empty. Record on EOF for text/event-stream bodies specifically — non-SSE responses keep the Close-only path so JSON integrity checks, content-length validation, and inner-Close error paths keep their existing semantics. record() is already sync.Once-guarded, so a later Close() is a no-op for recording. Also add FIXES_TBD.md documenting this fix and the upstream counterpart in fantasy (defer stream.Close() in the SSE Stream adapter) that would let us revert to Close-only recording once the upstream change ships. Change-Id: I6aed3dcd8e7c63b03d10c85ab132e77e39c43d69 Signed-off-by: Thomas Kosiewski <tk@coder.com>
…cording Recording SSE attempts eagerly on io.EOF in Read consumed recordOnce before Close() ran, so a subsequent inner.Close() error was dropped from the attempt even though Close() still returned it to the caller. The attempt persisted as completed with an empty error, regressing the close-error semantics the Close()-only path preserved for other content types. Treat the SSE EOF recording as provisional: flag it after the sink append so a concurrent Close() cannot observe the flag before the entry exists. When inner.Close() fails, build the final attempt from that error and overwrite the provisional entry via a new attemptSink.replaceByNumber helper. Non-provisional record calls and every non-SSE path are unchanged. Change-Id: I64945dee913a4e2639a0e9f60652865a847f9c4a Signed-off-by: Thomas Kosiewski <tk@coder.com>
- Suppress bodyclose lint on SSE EOF-only recording test. - Restore "why" in Read() non-EOF comment. - Trim isSSEContentType doc to avoid duplicating call-site rationale. - Cover zero-byte SSE body (immediate EOF) boundary. Change-Id: I563caf334d2c9a6eb3856afc96b7c304432934d4 Signed-off-by: Thomas Kosiewski <tk@coder.com>
recordProvisional previously published the completed attempt via recordOnce before setting recordedProvisional, so a concurrent Close that returned an error could observe recordedProvisional=false and fall through to r.record(closeErr), which became a no-op once recordOnce had already fired. The close error was dropped and the attempt stayed marked completed. Hold r.mu across both the flag check/update and the sink publish in recordProvisional and Close's closeErr branch so the publish and flag transition commit together. Change-Id: I6f6cfd6cf6a9c168ac3a387b899fbe11ee65d2e5 Signed-off-by: Thomas Kosiewski <tk@coder.com>
Change-Id: Id7f7e6d8e9135ed18a95ec29752ec5e95375c113 Signed-off-by: Thomas Kosiewski <tk@coder.com>
… deadlock Change-Id: Ieea34beabc560976ed81a198c057710d2d474a66 Signed-off-by: Thomas Kosiewski <tk@coder.com>
6c896e5 to
efc7c88
Compare
stale
Merge activity
|
`chat_turn` debug steps persist with `attempts: []` even when the streaming call to Anthropic completes successfully. Fantasy's Anthropic SSE adapter iterates the response to EOF via `for stream.Next()` and abandons the body without calling `Close()`, so `RecordingTransport`'s Close-only recording path never fires and the attempt is lost. Non-streaming runs (`quickgen`, `title_generation`) go through `model.Generate(...)` and are unaffected. Record on `io.EOF` for `text/event-stream` bodies specifically. Non-SSE responses stay on the Close-only path so JSON integrity, content-length validation, and inner-`Close()` error semantics are preserved. `record()` is already `sync.Once`-guarded, so a later `Close()` is a no-op for recording.
`chat_turn` debug steps persist with `attempts: []` even when the streaming call to Anthropic completes successfully. Fantasy's Anthropic SSE adapter iterates the response to EOF via `for stream.Next()` and abandons the body without calling `Close()`, so `RecordingTransport`'s Close-only recording path never fires and the attempt is lost. Non-streaming runs (`quickgen`, `title_generation`) go through `model.Generate(...)` and are unaffected. Record on `io.EOF` for `text/event-stream` bodies specifically. Non-SSE responses stay on the Close-only path so JSON integrity, content-length validation, and inner-`Close()` error semantics are preserved. `record()` is already `sync.Once`-guarded, so a later `Close()` is a no-op for recording.
`chat_turn` debug steps persist with `attempts: []` even when the streaming call to Anthropic completes successfully. Fantasy's Anthropic SSE adapter iterates the response to EOF via `for stream.Next()` and abandons the body without calling `Close()`, so `RecordingTransport`'s Close-only recording path never fires and the attempt is lost. Non-streaming runs (`quickgen`, `title_generation`) go through `model.Generate(...)` and are unaffected. Record on `io.EOF` for `text/event-stream` bodies specifically. Non-SSE responses stay on the Close-only path so JSON integrity, content-length validation, and inner-`Close()` error semantics are preserved. `record()` is already `sync.Once`-guarded, so a later `Close()` is a no-op for recording.

chat_turndebug steps persist withattempts: []even when thestreaming call to Anthropic completes successfully. Fantasy's
Anthropic SSE adapter iterates the response to EOF via
for stream.Next()and abandons the body without callingClose(),so
RecordingTransport's Close-only recording path never fires andthe attempt is lost. Non-streaming runs (
quickgen,title_generation) go throughmodel.Generate(...)and areunaffected.
Record on
io.EOFfortext/event-streambodies specifically.Non-SSE responses stay on the Close-only path so JSON integrity,
content-length validation, and inner-
Close()error semantics arepreserved.
record()is alreadysync.Once-guarded, so a laterClose()is a no-op for recording.