feat: surface model content-filter refusals as a blocked chat error #26294

Draft
jscottmiller wants to merge 5 commits into main from scott/chat-content-filter-message

Conversation

@jscottmiller

@jscottmiller jscottmiller commented Jun 11, 2026

Contributor

Summary

When a provider blocks a turn with a content-filter stop reason and produces no
content, the chat turn previously ended silently: the user saw the "Thinking"
spinner and then nothing. This adds detection and surfacing so the turn ends as
a terminal content_filter error with the provider's explanation, which the UI
renders as a clear "Response blocked" message.

Observed with Anthropic's real-time safety classifiers (stop_reason: "refusal",
empty content), e.g. the Fable model on security-policy-violating prompts.

Changes

  • codersdk: add ChatErrorKindContentFilter (+ AllChatErrorKinds, generated TS).
  • chatloop: when a step finishes with FinishReasonContentFilter and no
    content, return a classified ErrContentFiltered instead of breaking
    silently, surfacing the Anthropic refusal category/explanation.
  • chaterror: default user-facing message for the new kind.
  • site: render "Response blocked" as the error title.
  • go.mod: point the fantasy fork at the refusal mapping change.

Dependency

Depends on coder/fantasy#41. The go.mod replace currently points at that PR's
branch commit (805f9dae1e20). Repoint it to the merged fantasy commit before
this merges.

Testing

  • go test ./coderd/x/chatd/chatloop/ ./coderd/x/chatd/chaterror/ ./codersdk/
    (new contentfilter_internal_test.go).
  • End-to-end against a local develop.sh instance routed through the
    dev.coder.com AI gateway: a Fable refusal now ends the turn as
    status=error, last_error.kind=content_filter with Anthropic's explanation
    and detail=cyber. A normal text refusal (e.g. Opus 4.8 declining in prose)
    is unaffected, since content is non-empty.

Decision log

  • Reproduced the silent turn and confirmed via the chat debug-run API that Fable
    returns HTTP 200 with stop_reason: "refusal" + stop_details and empty
    content; the fantasy SDK normalized it to FinishReasonUnknown and dropped
    the details, and chatloop's empty-content branch discarded it.
  • Chose to model the block as ChatStatusError + a new ChatErrorKind to reuse
    the existing classified-error flow and frontend rendering, rather than a new
    terminal status or message-part type.
  • Chose the precise path (carry Anthropic's explanation/category through
    ProviderMetadata) over a generic static message; the generic unknown
    signal is ambiguous and risks false positives on benign empty completions.
  • Reused FinishReasonContentFilter for refusal rather than adding a new
    finish reason.
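
A rough sketch of the normalization described in the decision log, assuming a much simpler shape than the real fantasy SDK. The refusalDetails struct, its field names, the metadata keys, and mapStopReason are all hypothetical; the only details taken from this PR are the "refusal" stop reason, the reuse of FinishReasonContentFilter, and the idea of carrying the category and explanation through provider metadata.

```go
package main

import "fmt"

// FinishReason values are assumed for this sketch.
type FinishReason string

const (
	FinishReasonStop          FinishReason = "stop"
	FinishReasonContentFilter FinishReason = "content-filter"
	FinishReasonUnknown       FinishReason = "unknown"
)

// refusalDetails is a hypothetical stand-in for the provider's stop details.
type refusalDetails struct {
	Category    string
	Explanation string
}

// mapStopReason normalizes a provider stop reason into a finish reason.
// For "refusal" it also returns metadata so the explanation is not dropped,
// which is the gap the decision log describes.
func mapStopReason(stopReason string, details *refusalDetails) (FinishReason, map[string]string) {
	switch stopReason {
	case "end_turn":
		return FinishReasonStop, nil
	case "refusal":
		meta := map[string]string{}
		if details != nil {
			meta["category"] = details.Category
			meta["explanation"] = details.Explanation
		}
		return FinishReasonContentFilter, meta
	default:
		// Unrecognized reasons stay ambiguous rather than being
		// treated as a block.
		return FinishReasonUnknown, nil
	}
}

func main() {
	// The explanation string is a placeholder, not real provider output.
	reason, meta := mapStopReason("refusal", &refusalDetails{
		Category:    "cyber",
		Explanation: "placeholder explanation",
	})
	fmt.Println(reason, meta["category"])
}
```

Keeping unknown stop reasons distinct from the content-filter reason reflects the stated concern about false positives on benign empty completions.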

Coder Agents generated.

When a provider blocks a turn with a content-filter stop reason and no
content (e.g. Anthropic's "refusal"), the turn previously ended silently
("Thinking" then nothing). Detect this in the chat loop and end the turn
as a terminal content_filter error carrying the provider's explanation,
so the UI shows a clear "Response blocked" message.

- codersdk: add ChatErrorKindContentFilter
- chatloop: return a classified ErrContentFiltered on empty
  content-filtered turns, surfacing Anthropic refusal category/explanation
- chaterror: default message for the new kind
- site: render "Response blocked" title; regenerate types
- go.mod: point fantasy fork at the refusal mapping change

Coder Agents generated.
Run make gen to add the content_filter enum value to swagger and the
generated API reference docs.

Coder Agents generated.
@github-actions

Docs preview

📖 View docs preview for docs/reference/api/chats.md

Contributor Author

Two known limitations of the empty-content check, noted during review:

  1. Blocked steps drop token usage. The turn ends before PersistStep, and usage is stored as columns on inserted assistant message rows, so a content-filter turn with no content has nowhere to record the refused step's input tokens (Anthropic bills input tokens on refusals). This matches every other terminal-error path today (stream errors, auth failures mid-turn). If usage accounting for failed turns matters, persisting usage-only steps needs its own design; follow-up issue candidate.

  2. Partial content + content-filter finish still ends silently. The new check only fires when the step produced no content. A mid-stream classifier block after partial output persists the partial text and ends as a normal stop with no error. Surfacing it would mean returning the classified error after PersistStep succeeds, which would also change UX for OpenAI/Google content-filter truncations that currently end silently; that is a product decision, so it is intentionally out of scope here.


Coder Agents generated (on behalf of @jscottmiller).

@jscottmiller

Contributor Author

/coder-agents-review

@coder-agents-review

coder-agents-review Bot commented Jun 11, 2026

Contributor

Chat: Review in progress | View chat
Requested: 2026-06-11 21:37 UTC by @jscottmiller
Spend: $33.81 / $100.00

deep-review v0.7.1 | Round 2 | a4c867f..3343fac

Last posted: Round 2, 11 findings (1 P2, 4 P3, 6 Nit), COMMENT. Review

Finding inventory

Findings

| # | Sev | Status | Location | Summary | Round | Reviewer | Posted |
|---|---|---|---|---|---|---|---|
| CRF-1 | P3 | Author fixed (3343fac) | contentfilter_internal_test.go:41 | WithoutMetadataUsesDefault only asserts non-empty Message, tautological for content_filter branch | R1 | Netero, Bisky P3 | Yes |
| CRF-2 | P3 | Author fixed (3343fac) | contentfilter_internal_test.go:29 | WithRefusalMetadata does not assert classified.Detail | R1 | Netero, Bisky P3 | Yes |
| CRF-3 | Nit | Author fixed (3343fac) | chaterror/message.go:67 | TestTerminalMessage table test missing content_filter case | R1 | Netero | Yes |
| CRF-4 | P2 | Author fixed (3343fac) | contentfilter_internal_test.go:13 | No integration test exercises the content-filter branch in Run | R1 | Bisky P2, Hisoka P3, Mafuuu P3, Chopper P3, Meruem P3 | Yes |
| CRF-5 | P3 | Author contested; panel closed R2 (2/3 accept) | chatloop.go:57 | Comment verbosity pattern: doc and inline comments restate what code shows | R1 | Gon P2 (downgraded by orchestrator) | Yes |
| CRF-6 | Nit | Author contested; panel closed R2 (3/3 accept) | chaterror/message.go:116 | retryMessage has no content_filter case | R1 | Bisky, Mafuuu, Chopper, Leorio | Yes |
| CRF-7 | Nit | Author fixed (3343fac) | chatStatusHelpers.ts:49 | No Storybook story for content_filter error state | R1 | Nami | Yes |
| CRF-8 | Nit | Author fixed (3343fac) | chatloop.go:72 | contentFilterError omits explicit Retryable: false | R1 | Knov | Yes |
| CRF-9 | Nit | Author fixed (3343fac) | contentfilter_internal_test.go:30 | Test errorf says "want explanation" instead of expected value | R1 | Gon | Yes |
| CRF-10 | P3 | Open | contentfilter_internal_test.go:57 | No test for complement boundary: non-empty content with ContentFilter should persist normally | R2 | Bisky P3 | Yes |
| CRF-11 | Nit | Open | contentfilter_internal_test.go:54 | Test doc comment restates what function name and body show | R2 | Gon P2 (downgraded by orchestrator) | Yes |

Contested and acknowledged

CRF-5 (P3, chatloop.go:57) - Comment verbosity

  • Finding: Doc and inline comments on new code (ErrContentFiltered, contentFilterError, inline at line 601) restate what the code already shows. Suggested trimming to one-line comments.
  • Author defense: Comments explain why the behavior matters, per repo comment guidance.
  • Panel closure (R2, 2/3): Mafuuu traced each comment to the repo rule ("Describe the behaviour of the code, not the reasoning the agent used to produce the change") and confirmed all three follow it. Razor verified all three describe behavioral contracts or design intent, not code mechanics. Gon narrowed the re-raise to line 68 specifically but was overridden by the majority.

CRF-6 (Nit, chaterror/message.go:116) - retryMessage missing content_filter case

  • Finding: retryMessage has no ChatErrorKindContentFilter case, unlike every other kind in terminalMessage.
  • Author defense: The case is unreachable because content_filter errors are never retryable. usage_limit already breaks the switch symmetry.
  • Panel closure (R2, 3/3): Bisky verified the unreachability and usage_limit precedent. Mafuuu traced the 4-step call chain from contentFilterError through chatretry.Retry confirming the path cannot execute. Razor independently verified the same chain with step-by-step evidence.

Round log

Round 1

Panel. 1 P2, 2 P3, 4 Nit, 2 Note. Reviewed against a4c867f..2514264. Panel: Bisky, Hisoka, Mafu-san, Mafuuu, Pariston, Chopper, Ging-Go, Ging-TS, Gon, Leorio, Kite, Nami, Meruem (wildcard), Knov (wildcard). Ging-Go, Ging-TS, Mafu-san, Pariston, Kite: no findings.

Round 2

Panel. CRF-1 through CRF-4, CRF-7 through CRF-9 verified fixed. CRF-5 closed by panel (2/3). CRF-6 closed by panel (3/3). 1 new P3 (CRF-10). 1 Nit dropped (follows file convention). Reviewed against a4c867f..3343fac. Panel: Bisky, Mafuuu, Gon, Razor (wildcard).

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

| Reviewer | Focus |
|---|---|
| Bisky | tests |
| Chopper | ops/errors |
| Churn-guard | change verification |
| Ging | language modernization |
| Gon | naming |
| Hisoka | edge cases |
| Killua | perf |
| Kite | change integrity |
| Knov | contracts |
| Knuckle | SQL |
| Kurapika | security |
| Law | decomposition |
| Leorio | docs |
| Luffy | product |
| Mafu-san | process |
| Mafuuu | contracts |
| Melody | dispatch/pairing |
| Meruem | structural |
| Nami | frontend |
| Netero | mechanical checks |
| Pariston | premise testing |
| Pen-botter | product gaps |
| Razor | verification |
| Robin | duplication |
| Ryosuke | Go arch |
| Takumi | concurrency |
| Zoro | shape |

🤖 Managed by Coder Agents.

@coder-agents-review coder-agents-review Bot left a comment

Contributor

Clean, well-scoped change that solves the right problem at the right layer. The error classification pipeline, the WithClassification/Classify round-trip, and the retry-loop exclusion all hold up under inspection. The sanitizer cannot create false positives (tool blocks require FinishReasonToolCalls, mutually exclusive with FinishReasonContentFilter), and the double classification through chatd.go is harmless because WithProvider short-circuits when the provider matches and Message is non-empty. Dependency management is explicit.

Pariston: "I tried to build a case against this change and could not find a material premise failure. The problem is correctly understood, the solution is proportional, and the fix is at the right causal level."

1 P2, 2 P3, 4 Nit. The P2 is an integration test gap: the only codepath that converts a silent empty turn into a visible error has no test through Run. The test infrastructure already supports this pattern (TestRun_* with chattest.FakeModel), and 50+ sibling tests exercise analogous paths. Five reviewers independently flagged it.

Notes (not posted inline): Refusal metadata extraction is Anthropic-only; Google and OpenAI content filters degrade to a generic message with no category or explanation. Partial content with a content-filter finish is silently persisted as normal, which the PR description acknowledges as intentional.


coderd/x/chatd/chaterror/message.go:116

Nit [CRF-6] retryMessage has no ChatErrorKindContentFilter case.

terminalMessage handles all eleven kinds including content_filter. retryMessage handles ten, with content_filter falling through to default ("returned an unexpected error"). Unreachable today because Retryable is false (zero value), but every sibling kind that has a terminalMessage case also has a retryMessage case (except usage_limit, which is a pre-existing gap). Adding a case keeps the two switches symmetric.

(Bisky, Mafuuu, Chopper, Leorio)

🤖

🤖 This review was automatically generated with Coder Agents.

@linear-code

linear-code Bot commented Jun 11, 2026

CODAGT-611

Adds a Run-level integration test for the empty-turn content-filter
branch, exact-string and Detail assertions in the unit tests, terminal
message table coverage, an explicit Retryable: false, and a Storybook
story for the Response blocked error state.

Contributor Author

Addressed in 3343fac: CRF-4 (Run-level test TestRun_ContentFilterEmptyTurn, also asserts no step persists), CRF-1/CRF-2/CRF-9 (exact-string, Detail, and failure-output assertions), CRF-3 (terminal message table entries), CRF-7 (TerminalContentFilterError story, 16/16 passing), CRF-8 (explicit Retryable: false).

Intentionally skipped: CRF-5 (the comments explain why the behavior matters, per repo comment guidance) and CRF-6 (retryMessage case is unreachable; the error never enters the retry path, and usage_limit already breaks the switch symmetry).


Coder Agents generated (on behalf of @jscottmiller).

Contributor Author

/coder-agents-review

Contributor Author

Correction to the note above: CRF-5 was subsequently addressed in 234d2ad (doc comment trims per the suggested wording). Only CRF-6 remains intentionally skipped.


Coder Agents generated (on behalf of @jscottmiller).

@coder-agents-review coder-agents-review Bot left a comment

Contributor

All seven R1 fixes verified. The integration test (TestRun_ContentFilterEmptyTurn) properly exercises the Run branch end-to-end, confirms the error sentinel and classification fields, and checks that no step is persisted. Test assertions are now exact-string. The Storybook story has 5 assertions covering heading, message, detail, and terminal behavior. Good turnaround.

Contested findings disposed:

CRF-5 (comment verbosity): Panel closed, 2/3 accept. Mafuuu and Razor independently verified the comments describe behavioral contracts per repo guidance. Gon narrowed the re-raise to contentFilterError (line 68) but was overridden by the panel majority.

CRF-6 (retryMessage missing case): Panel closed, 3/3 accept. All three reviewers independently traced the 4-step call chain confirming the path is unreachable, and confirmed usage_limit establishes the same precedent.

Mafuuu: "Contract fidelity checks performed: lifecycle, semantic honesty, adversarial path. All aligned."

1 new P3. The complement boundary (non-empty content with a ContentFilter finish reason should persist normally) is explicitly called out in the PR description as intentional behavior but has no test.

🤖 This review was automatically generated with Coder Agents.

// TestRun_ContentFilterEmptyTurn exercises the branch in Run that converts
// a content-filter finish with no content into a terminal classified error
// instead of a silent empty turn. Nothing is persisted for the blocked step.
func TestRun_ContentFilterEmptyTurn(t *testing.T) {

Contributor

P3 [CRF-10] No test for the complement boundary: non-empty content with FinishReasonContentFilter should persist normally.

The fork at chatloop.go:601-607 is the core design decision: empty content + ContentFilter = error; non-empty content + ContentFilter = normal persist. Only the error half is tested. The complement is the behavior that protects prose refusals from being misclassified as blocks, explicitly called out in the PR description: "A normal text refusal is unaffected, since content is non-empty."

Sketch: copy TestRun_ContentFilterEmptyTurn, add a TextPart before the finish part, assert err == nil and the step is persisted.

(Bisky)

🤖

})
}

// TestRun_ContentFilterEmptyTurn exercises the branch in Run that converts

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-11] Test doc comment restates what the function name and body show.

The name TestRun_ContentFilterEmptyTurn already says what's tested. The body's assertions carry the invariants. The convention in chatloop_run_internal_test.go is to omit doc comments on test functions unless they carry a non-obvious trap. Consider deleting the comment.

(Gon)

🤖
