Skip to content

feat(coderd/x/chatd): capture generic error debug logs by default#26557

Draft
johnstcn wants to merge 1 commit into
mainfrom
feat/chatd-capture-error-debug-logs
Draft

feat(coderd/x/chatd): capture generic error debug logs by default#26557
johnstcn wants to merge 1 commit into
mainfrom
feat/chatd-capture-error-debug-logs

Conversation

@johnstcn

Copy link
Copy Markdown
Member

What

Capture chat debug logs for unexpected errors even when debug logging is off, and surface the Debug tab whenever a chat has any captured runs.

Two behaviors:

  1. Errors-only default capture. When full debug logging is disabled, a chat turn that fails with an unclassified error still records a minimal debug run. Capture is gated to:

    • Outcome gate: terminal StatusError only (interruptions, cancellations, and step-level timeouts are skipped).
    • Kind gate: chaterror.Classify(err).Kind == ChatErrorKindGeneric only (auth, config, usage_limit, rate_limit, overloaded, timeout, etc. are skipped as known operational conditions).

    The run is lazily materialized on the first qualifying error via a context-threaded ensurer, so error-free chats write nothing. Captured data is minimal: the run plus one failing step with the normalized error payload only (no RecordHTTP, request/response/usage/attempts).

    Full debug logging behavior is unchanged (eager run/step creation, full payloads, all outcomes).

  2. Debug tab visibility. The tab now shows when debugLoggingEnabled || hasDebugRuns, reusing the existing chatDebugRuns query (shared key dedupes with the panel). No useEffect added.

Testing

  • New integration test: a generic 400 with logging off fails the turn fast and produces a minimal chat_turn error run+step.
  • New chatdebug unit tests covering generic-capture, non-generic skip, success skip, cancellation skip, no-ensurer skip, and at-most-once run creation.
  • 3 new Storybook play stories asserting Debug tab visibility (with runs / with logging / hidden when neither).
  • go test ./coderd/x/chatd/..., go vet, golangci-lint, frontend tsc + biome + circular-deps + react-compiler + knip, and make pre-commit all green.

Deferred (documented TODO(chatd-debug))

Kind-accurate errors-only capture for internal runs (KindQuickgen / KindTitleGeneration / KindCompaction) is left as a smaller follow-up. The user-visible signal is already satisfied: a failing turn (including a compaction failure, which propagates to the chat_turn error ensurer) surfaces the Debug tab. See the TODO(chatd-debug) notes in quickgen.go and chatloop/compaction.go.

Implementation plan
# Plan: Always capture chatd error debug logs; show Debug tab when logs exist

## Confirmed decisions
- Default capture trigger (logging OFF): persist a debug log only for errors that
  classify as generic (unclassified/unexpected). All other kinds are skipped.
- Outcome gate: additionally gated to genuine terminal StatusError outcomes, so
  anything resolving to StatusInterrupted (cancellation, step-level timeouts,
  incomplete streams, no-outcome chat turns) is never force-persisted.
  Net rule: persist iff StatusError AND Classify(err).Kind == generic.
- Captured detail when OFF: error-only (minimal). Run + failing step with
  status=error and the normalized error payload. No RecordHTTP, request/response,
  usage, attempts, or extra metadata.
- Tab visibility: reuse the existing debug runs query, always enabled, and show the
  tab when the runs list is non-empty OR logging is enabled.

## Design
- Two recording levels: Errors (default, always active when a debug Service
  exists) and Full (active when Service.IsEnabled). RecorderOptions.FullRecording
  selects between them.
- The model is always wrapped when a Service exists; the run is lazily created on
  the first qualifying error via a shared context error-run ensurer.
- Full level preserves eager run/step creation and full payload capture exactly.

## Scope boundary
Logging-OFF trigger = StatusError AND generic. Keeps the Debug tab's "logs exist"
signal meaning "something unexpected went wrong," not "interrupted" or "a known
operational condition occurred."

The full plan also scoped kind-accurate internal-run capture (quickgen/compaction/title) for the errors-only level. That was intentionally deferred to keep this change focused on the top-level chat_turn path plus the frontend; see the TODO(chatd-debug) notes.


Generated with the assistance of Coder Agents on behalf of @johnstcn.

Persist a minimal chat_turn debug run+step when a turn fails with an
unclassified (generic) terminal error, even when debug logging is off.
Show the Debug tab whenever a chat has captured runs.
@datadog-coder

datadog-coder Bot commented Jun 19, 2026

Copy link
Copy Markdown

Pipelines

⚠️ Warnings

🚦 2 Pipeline jobs failed

AI Code Review | AI Code Review   View in Datadog   GitHub Actions

contrib | title   View in Datadog   GitHub Actions

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 8c866f6 | Docs | Give us feedback!

@coder-tasks coder-tasks Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

Reviewed errors-only default capture for chatd debug logs and the frontend Debug tab visibility change. The design is well-structured: lazy run materialization via errorRunEnsurer, clean gating on StatusError && ChatErrorKindGeneric, and the sync.Once dedup in the ensurer is correct. One bug found.

Found 1 issue.

}

stepNum := nextStepNumber(rc.RunID)
step, err := h.svc.CreateStep(ctx, CreateStepParams{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 IMPORTANT: captureDeferredError passes the raw ctx to CreateStep and UpdateStep, but finish (the full-recording path) wraps it with stepFinalizeContext(ctx) before DB calls:

// finish does this:
updateCtx, cancel := stepFinalizeContext(ctx)
defer cancel()
h.svc.UpdateStep(updateCtx, ...)

stepFinalizeContext calls context.WithoutCancel(ctx) with a timeout, so finalization survives parent context cancellation. Without this, if the HTTP request context is canceled right after the model returns a generic error (e.g., client drops the connection at the same moment the provider sends a 400), both CreateStep and UpdateStep will fail with a canceled context error and the capture is silently skipped.

This is particularly likely in the stream path where captureDeferredError is called inside the iterator closure: by the time the stream error propagates and finalize runs, the request context may already be in the process of cancellation.

Suggested change
step, err := h.svc.CreateStep(ctx, CreateStepParams{
stepCtx, cancel := stepFinalizeContext(ctx)
defer cancel()
stepNum := nextStepNumber(rc.RunID)
step, err := h.svc.CreateStep(stepCtx, CreateStepParams{

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant