Skip to content

feat: add chatd debug endpoint#26551

Draft
johnstcn wants to merge 1 commit into
mainfrom
cj/debug-chat-snapshot-endpoint
Draft

feat: add chatd debug endpoint#26551
johnstcn wants to merge 1 commit into
mainfrom
cj/debug-chat-snapshot-endpoint

Conversation

@johnstcn

@johnstcn johnstcn commented Jun 19, 2026

Copy link
Copy Markdown
Member

Adds /api/experimental/chats/:chat/debug/snapshot endpoint

@johnstcn

Copy link
Copy Markdown
Member Author

/coder-agents-review

@coder-agents-review

coder-agents-review Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Chat: Review posted | View chat
Requested: 2026-06-19 18:54 UTC by @johnstcn
Spend: $54.29 / $100.00

Review history
  • R1 (2026-06-19): 21 reviewers, 6 Nit, 1 Note, 1 P2, 10 P3, 1 P4, COMMENT. Review

deep-review v0.8.0 | Round 1 | a12b051..0ca3f50

Last posted: Round 1, 19 findings (1 P2, 10 P3, 1 P4, 6 Nit, 1 Note), COMMENT. Review

Finding inventory

Findings

# Sev Status Location Summary Round Reviewer Posted
CRF-1 P3 Open chatd.go:3278 Detached doc comment: // Config configures a chat processor. now attaches to WorkerID() instead of type Config struct R1 Netero P3, Kite Nit Yes
CRF-2 Nit Open exp_chats.go:7942 time.Since() bypasses quartz clock; should use api.Clock.Since() R1 Netero Yes
CRF-3 P2 Open chatd.go:3292 SnapshotRuntime reads chatWorker.manager without lock, racing with Close() which sets it to nil under w.mu R1 Hisoka P2, Komugi P3 Yes
CRF-4 P3 Open enterprise/coderd/coderd.go:228 Debug proxy has no loop protection unlike the stream parts relay R1 Hisoka P3, Kite P4, Meruem P3, Ryosuke P3 Yes
CRF-5 P3 Open enterprise/coderd/coderd.go:236 Proxy constructs URL from r.URL.Path, dropping query string R1 Kite P3, Meruem P3, Ryosuke P3, Razor P3 Yes
CRF-6 P3 Open enterprise/coderd/coderd.go:254 Proxy hardcodes Content-Type: application/json instead of forwarding upstream's R1 Knov P3, Mafuuu P3 Yes
CRF-7 P3 Open exp_chats_debug_test.go:191 Heartbeat test claims to verify staleness but never exercises is_stale=true path R1 Bisky Yes
CRF-8 P3 Open exp_chats.go:8025 Episode selection discards information; returns one best episode instead of full list R1 Zoro P3, Pariston Note Yes
CRF-9 P3 Open exp_chats.go:7951 Heartbeat lookup failure produces null heartbeat with no error indication R1 Chopper P3, Leorio Note Yes
CRF-10 P3 Open exp_chats.go:7890 chatDebugRunnerInfo is a field-for-field copy of RunnerSnapshot R1 Robin Yes
CRF-11 P3 Open exp_chats.go:7907 EXPERIMENTAL comment missing colon and not in standard format R1 Leorio Yes
CRF-12 P3 Open exp_chats.go:7981 ClassifyExecutionState called without transactional consistency its doc requires R1 Mafuuu P3, Razor Note Yes
CRF-13 P4 Open enterprise/coderd/coderd.go:256 Unbounded io.Copy from upstream replica response R1 Knov Yes
CRF-14 Nit Open exp_chats.go:7880 ByRole map uses int while Total and Deleted use int64 R1 Kite Nit, Mafuuu Nit, Zoro Nit Yes
CRF-15 Nit Open exp_chats.go:8030 ep := ep loop variable shadow is dead since Go 1.22 R1 Ging-Go Yes
CRF-16 Nit Open exp_chats.go:8017 r loop variable shadows the handler's r *http.Request parameter R1 Gon Yes
CRF-17 Nit Open exp_chats_debug_test.go:167 Discarded error from chatprompt.MarshalParts R1 Bisky Yes
CRF-18 Nit Open exp_chats.go:7873 Hardcoded (30s) in IsStale comment will go stale if constant changes R1 Gon Yes
CRF-19 Note Open queries/chats.sql:3182 Role counts hardcoded against current 4-role enum; new role won't get a bucket R1 Knuckle Note, Melody Note Yes

Round log

Round 1

Panel. Netero first pass clean (P3 and below). 1 P2, 10 P3, 1 P4, 5 Nit, 1 Note. 21 reviewers (19 trigger-matched + 2 wildcards). Reviewed against a12b051..0ca3f50.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer Focus
Bisky tests
Chopper ops/errors
Churn-guard change verification
Ging language modernization
Gon naming
Hisoka edge cases
Killua perf
Kite change integrity
Knov contracts
Knuckle SQL
Komugi flake/determinism
Kurapika security
Law decomposition
Leorio docs
Luffy product
Mafu-san process
Mafuuu contracts
Melody dispatch/pairing
Meruem structural
Nami frontend
Netero mechanical checks
Pariston premise testing
Pen-botter product gaps
Razor verification
Robin duplication
Ryosuke Go arch
Takumi concurrency
Zoro shape

🤖 Managed by Coder Agents.

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, proportional feature: a read-only debug endpoint that aggregates DB + runtime state for troubleshooting stuck chats, with multi-replica proxy forwarding. Test coverage is strong (~1:1 ratio), the auth chain is correct, and the concurrency primitives (atomic.Pointer for activeTaskKind, lock-protected InspectChat) are well-chosen. The enterprise proxy follows the established stream-parts relay pattern.

Severity count: 1 P2, 10 P3, 1 P4, 5 Nit, 1 Note.

The P2 is a data race in SnapshotRuntime reading chatWorker.manager without the lock that Close() uses to nil it. The fix is a small wrapper method on chatWorker.

The proxy has three pattern gaps relative to the sibling stream-parts relay: no loop protection header, query string silently dropped, Content-Type hardcoded instead of forwarded. All are cheap one-line fixes.

Process note: the commit message is one line for 800 lines of code. A body explaining the purpose (developer troubleshooting without DB access), the design (combined DB + runtime snapshot), and the multi-replica proxy constraint would help the next reader who git-blames into this.

"I tried to build a case against this and couldn't. The problem is correctly understood, the solution is proportional, and the fix is at the right causal level." Pariston

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/x/chatd/chatd.go
}
// chatWorker.manager is only populated after Start(). Guard for tests that
// run with ChatWorkerDisabled=true.
if p.chatWorker != nil && p.chatWorker.manager != nil {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 [CRF-3] SnapshotRuntime reads p.chatWorker.manager without synchronization. Close() sets w.manager = nil under w.mu (worker.go:144). This is a data race under the Go memory model.

The handler goroutine can call SnapshotRuntime while Close() runs on another goroutine during shutdown. The race detector will flag it if a shutdown test ever overlaps with a handler call.

Fix: add an InspectChat method on chatWorker that reads manager under the lock before delegating:

func (w *chatWorker) InspectChat(chatID uuid.UUID) []RunnerSnapshot {
    w.mu.Lock()
    manager := w.manager
    w.mu.Unlock()
    if manager == nil {
        return nil
    }
    return manager.InspectChat(chatID)
}

Then call p.chatWorker.InspectChat(chatID) here.

(Hisoka P2, Komugi P3)

🤖

// Forward debug snapshot requests to the owning replica by resolving its
// address and proxying the full HTTP request. The owning pod handles it
// locally and returns its own DB + runtime response.
options.Options.ChatDebugProxy = func(rw http.ResponseWriter, r *http.Request, replicaID uuid.UUID) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-4] No forwarding-loop protection. The sibling stream-parts relay sets RelaySourceHeader to prevent a proxied request from bouncing back. This proxy has no equivalent guard.

If ownership transfers between the two DB reads (originator reads WorkerID = B, target re-reads WorkerID = A), the request bounces until context timeout. The window is narrow (ownership must flip during network transit), but the fix is one header check, matching the existing relay convention.

Fix: set a header like X-Coder-Debug-Forwarded: 1 on the outgoing request. On inbound, if the header is present, serve locally regardless of ownership.

(Hisoka P3, Kite P4, Meruem P3, Ryosuke P3)

🤖

})
return
}
req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, address+r.URL.Path, nil)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-5] address + r.URL.Path drops query parameters. The endpoint has none today, but the comment on exp_chats.go:7917 says "forward the entire request there and return its response verbatim," and it doesn't.

Fix: address + r.URL.RequestURI()

(Kite P3, Meruem P3, Ryosuke P3, Razor P3)

🤖

return
}
defer resp.Body.Close()
rw.Header().Set("Content-Type", "application/json")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-6] Hardcodes Content-Type: application/json instead of forwarding the upstream's header. If an intermediary between replicas returns a non-JSON error (HTML 502, text/plain), the client receives it labeled as JSON.

Fix: rw.Header().Set("Content-Type", resp.Header.Get("Content-Type"))

(Knov P3, Mafuuu P3)

🤖

require.EqualValues(t, 1, byRole["tool"])
}

// TestGetChatDebugSnapshot_Heartbeat verifies heartbeat presence and staleness.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-7] Test title says "verifies heartbeat presence and staleness" but only exercises is_stale == false. The heartbeat is always fresh because it was just upserted. The IsStale: age > float64(chatstate.HeartbeatStaleSeconds) branch at exp_chats.go:7944 is never reached.

To test the stale path: after upserting the heartbeat, advance the quartz clock past HeartbeatStaleSeconds (requires wiring api.Clock through the handler, which is currently blocked by CRF-2), or insert a heartbeat with a backdated timestamp and verify is_stale == true.

(Bisky)

🤖


// Commit 2 assistant + 1 tool message.
chatdCtx := dbauthz.AsChatd(ctx) //nolint:gocritic
content, _ := chatprompt.MarshalParts([]codersdk.ChatMessagePart{codersdk.ChatMessageText("hi")})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-17] content, _ := chatprompt.MarshalParts(...) swallows a potential serialization error. If MarshalParts ever fails, the test proceeds with nil content and the assertions could pass for the wrong reason. Use require.NoError like the other helpers in this file do.

(Bisky)

🤖

Comment thread coderd/exp_chats.go
RunnerID uuid.UUID `json:"runner_id"`
HeartbeatAt time.Time `json:"heartbeat_at"`
AgeSeconds float64 `json:"age_seconds"`
// IsStale is true when age exceeds chatstate.HeartbeatStaleSeconds (30s).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-18] (30s) hardcodes the value of HeartbeatStaleSeconds. If the constant changes, this comment goes stale silently. Drop the parenthetical: // IsStale is true when age exceeds chatstate.HeartbeatStaleSeconds.

(Gon)

🤖

Comment thread coderd/x/chatd/chatd.go
@@ -3277,6 +3276,25 @@ func BuildSingleUserChatMessageInsertParams(
}

// Config configures a chat processor.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-1] The pre-existing // Config configures a chat processor. doc comment is now contiguous with the new WorkerID() method. Go godoc attaches it to WorkerID(), whose first sentence reads "Config configures a chat processor." Meanwhile type Config struct (now further down) has no doc comment.

Fix: move // Config configures a chat processor. to directly above type Config struct.

(Netero P3, Kite Nit)

🤖

Comment thread coderd/exp_chats.go
})
switch {
case hbErr == nil:
age := time.Since(row.HeartbeatAt).Seconds()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-2] time.Since(row.HeartbeatAt) bypasses quartz clock. The API struct carries Clock quartz.Clock, and this file already uses api.Clock.Now() at line 6695. Should be api.Clock.Since(row.HeartbeatAt).Seconds().

(Netero)

🤖

COUNT(*)::bigint AS total,
COUNT(*) FILTER (WHERE deleted)::bigint AS deleted,
COUNT(*) FILTER (WHERE NOT deleted AND role = 'system')::bigint AS system_count,
COUNT(*) FILTER (WHERE NOT deleted AND role = 'user')::bigint AS user_count,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note [CRF-19] The FILTER clauses hardcode the four current roles (system, user, assistant, tool). The Go handler mirrors this with a fixed four-key map. If a fifth role is added to the chat_message_role enum, total will include it but no by_role bucket will, so the debug output quietly stops summing correctly. For a debug-only endpoint the cost is low, but the loan is worth tracking.

(Knuckle, Melody)

🤖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant