feat: add chatd debug endpoint#26551
Conversation
|
/coder-agents-review |
|
Chat: Review posted | View chat Review history
deep-review v0.8.0 | Round 1 | Last posted: Round 1, 19 findings (1 P2, 10 P3, 1 P4, 6 Nit, 1 Note), COMMENT. Review Finding inventoryFindings
Round logRound 1Panel. Netero first pass clean (P3 and below). 1 P2, 10 P3, 1 P4, 5 Nit, 1 Note. 21 reviewers (19 trigger-matched + 2 wildcards). Reviewed against a12b051..0ca3f50. About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
There was a problem hiding this comment.
Clean, proportional feature: a read-only debug endpoint that aggregates DB + runtime state for troubleshooting stuck chats, with multi-replica proxy forwarding. Test coverage is strong (~1:1 ratio), the auth chain is correct, and the concurrency primitives (atomic.Pointer for activeTaskKind, lock-protected InspectChat) are well-chosen. The enterprise proxy follows the established stream-parts relay pattern.
Severity count: 1 P2, 10 P3, 1 P4, 5 Nit, 1 Note.
The P2 is a data race in SnapshotRuntime reading chatWorker.manager without the lock that Close() uses to nil it. The fix is a small wrapper method on chatWorker.
The proxy has three pattern gaps relative to the sibling stream-parts relay: no loop protection header, query string silently dropped, Content-Type hardcoded instead of forwarded. All are cheap one-line fixes.
Process note: the commit message is one line for 800 lines of code. A body explaining the purpose (developer troubleshooting without DB access), the design (combined DB + runtime snapshot), and the multi-replica proxy constraint would help the next reader who git-blames into this.
"I tried to build a case against this and couldn't. The problem is correctly understood, the solution is proportional, and the fix is at the right causal level." Pariston
🤖 This review was automatically generated with Coder Agents.
| } | ||
| // chatWorker.manager is only populated after Start(). Guard for tests that | ||
| // run with ChatWorkerDisabled=true. | ||
| if p.chatWorker != nil && p.chatWorker.manager != nil { |
There was a problem hiding this comment.
P2 [CRF-3] SnapshotRuntime reads p.chatWorker.manager without synchronization. Close() sets w.manager = nil under w.mu (worker.go:144). This is a data race under the Go memory model.
The handler goroutine can call SnapshotRuntime while Close() runs on another goroutine during shutdown. The race detector will flag it if a shutdown test ever overlaps with a handler call.
Fix: add an InspectChat method on chatWorker that reads manager under the lock before delegating:
func (w *chatWorker) InspectChat(chatID uuid.UUID) []RunnerSnapshot {
w.mu.Lock()
manager := w.manager
w.mu.Unlock()
if manager == nil {
return nil
}
return manager.InspectChat(chatID)
}Then call p.chatWorker.InspectChat(chatID) here.
(Hisoka P2, Komugi P3)
🤖
| // Forward debug snapshot requests to the owning replica by resolving its | ||
| // address and proxying the full HTTP request. The owning pod handles it | ||
| // locally and returns its own DB + runtime response. | ||
| options.Options.ChatDebugProxy = func(rw http.ResponseWriter, r *http.Request, replicaID uuid.UUID) { |
There was a problem hiding this comment.
P3 [CRF-4] No forwarding-loop protection. The sibling stream-parts relay sets RelaySourceHeader to prevent a proxied request from bouncing back. This proxy has no equivalent guard.
If ownership transfers between the two DB reads (originator reads WorkerID = B, target re-reads WorkerID = A), the request bounces until context timeout. The window is narrow (ownership must flip during network transit), but the fix is one header check, matching the existing relay convention.
Fix: set a header like X-Coder-Debug-Forwarded: 1 on the outgoing request. On inbound, if the header is present, serve locally regardless of ownership.
(Hisoka P3, Kite P4, Meruem P3, Ryosuke P3)
🤖
| }) | ||
| return | ||
| } | ||
| req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, address+r.URL.Path, nil) |
There was a problem hiding this comment.
P3 [CRF-5] address + r.URL.Path drops query parameters. The endpoint has none today, but the comment on exp_chats.go:7917 says "forward the entire request there and return its response verbatim," and it doesn't.
Fix: address + r.URL.RequestURI()
(Kite P3, Meruem P3, Ryosuke P3, Razor P3)
🤖
| return | ||
| } | ||
| defer resp.Body.Close() | ||
| rw.Header().Set("Content-Type", "application/json") |
There was a problem hiding this comment.
P3 [CRF-6] Hardcodes Content-Type: application/json instead of forwarding the upstream's header. If an intermediary between replicas returns a non-JSON error (HTML 502, text/plain), the client receives it labeled as JSON.
Fix: rw.Header().Set("Content-Type", resp.Header.Get("Content-Type"))
(Knov P3, Mafuuu P3)
🤖
| require.EqualValues(t, 1, byRole["tool"]) | ||
| } | ||
|
|
||
| // TestGetChatDebugSnapshot_Heartbeat verifies heartbeat presence and staleness. |
There was a problem hiding this comment.
P3 [CRF-7] Test title says "verifies heartbeat presence and staleness" but only exercises is_stale == false. The heartbeat is always fresh because it was just upserted. The IsStale: age > float64(chatstate.HeartbeatStaleSeconds) branch at exp_chats.go:7944 is never reached.
To test the stale path: after upserting the heartbeat, advance the quartz clock past HeartbeatStaleSeconds (requires wiring api.Clock through the handler, which is currently blocked by CRF-2), or insert a heartbeat with a backdated timestamp and verify is_stale == true.
(Bisky)
🤖
|
|
||
| // Commit 2 assistant + 1 tool message. | ||
| chatdCtx := dbauthz.AsChatd(ctx) //nolint:gocritic | ||
| content, _ := chatprompt.MarshalParts([]codersdk.ChatMessagePart{codersdk.ChatMessageText("hi")}) |
There was a problem hiding this comment.
Nit [CRF-17] content, _ := chatprompt.MarshalParts(...) swallows a potential serialization error. If MarshalParts ever fails, the test proceeds with nil content and the assertions could pass for the wrong reason. Use require.NoError like the other helpers in this file do.
(Bisky)
🤖
| RunnerID uuid.UUID `json:"runner_id"` | ||
| HeartbeatAt time.Time `json:"heartbeat_at"` | ||
| AgeSeconds float64 `json:"age_seconds"` | ||
| // IsStale is true when age exceeds chatstate.HeartbeatStaleSeconds (30s). |
There was a problem hiding this comment.
Nit [CRF-18] (30s) hardcodes the value of HeartbeatStaleSeconds. If the constant changes, this comment goes stale silently. Drop the parenthetical: // IsStale is true when age exceeds chatstate.HeartbeatStaleSeconds.
(Gon)
🤖
| @@ -3277,6 +3276,25 @@ func BuildSingleUserChatMessageInsertParams( | |||
| } | |||
|
|
|||
| // Config configures a chat processor. | |||
There was a problem hiding this comment.
P3 [CRF-1] The pre-existing // Config configures a chat processor. doc comment is now contiguous with the new WorkerID() method. Go godoc attaches it to WorkerID(), whose first sentence reads "Config configures a chat processor." Meanwhile type Config struct (now further down) has no doc comment.
Fix: move // Config configures a chat processor. to directly above type Config struct.
(Netero P3, Kite Nit)
🤖
| }) | ||
| switch { | ||
| case hbErr == nil: | ||
| age := time.Since(row.HeartbeatAt).Seconds() |
There was a problem hiding this comment.
Nit [CRF-2] time.Since(row.HeartbeatAt) bypasses quartz clock. The API struct carries Clock quartz.Clock, and this file already uses api.Clock.Now() at line 6695. Should be api.Clock.Since(row.HeartbeatAt).Seconds().
(Netero)
🤖
| COUNT(*)::bigint AS total, | ||
| COUNT(*) FILTER (WHERE deleted)::bigint AS deleted, | ||
| COUNT(*) FILTER (WHERE NOT deleted AND role = 'system')::bigint AS system_count, | ||
| COUNT(*) FILTER (WHERE NOT deleted AND role = 'user')::bigint AS user_count, |
There was a problem hiding this comment.
Note [CRF-19] The FILTER clauses hardcode the four current roles (system, user, assistant, tool). The Go handler mirrors this with a fixed four-key map. If a fifth role is added to the chat_message_role enum, total will include it but no by_role bucket will, so the debug output quietly stops summing correctly. For a debug-only endpoint the cost is low, but the loan is worth tracking.
(Knuckle, Melody)
🤖
Adds
/api/experimental/chats/:chat/debug/snapshotendpoint