fix(opencode): consume Model.prefill + runtime-probe llama.cpp templates #27916
Conversation
Anthropic-style providers accept (and rely on) an assistant message as
the last turn in a conversation ("response continuation" / "prefill"
for tool-use continuation). Thinking-on-by-default templates, by contrast,
reject it outright — llama.cpp returns HTTP 400 "Assistant response
prefill is incompatible with enable_thinking" on Qwen3-family templates,
and vLLM/TGI have equivalent behaviour for DeepSeek-R1, GLM-4.6 thinking,
Kimi-K2-Thinking, etc.
A first-class `prefill: boolean` on Model lets every host (opencode,
mastra, others) consult one canonical source of truth instead of
guessing from npm package + reasoning flag.
- packages/core/src/models.ts: add optional prefill field on Model
with a per-family list of templates known to reject prefill
(Qwen3 hybrid/3.5/3.6/Thinking-2507/VL, QwQ, DeepSeek-R1/R1-0528/V4,
GLM-4.6/4.7-thinking, Kimi-K2-Thinking, MiniMax-M2).
- packages/opencode/src/config/provider.ts: mirror the field on the
user-facing config schema with an annotation describing when to set
it (and what the auto-default is for openai-compatible+reasoning).
Default (undefined) is treated as `true` to keep all existing models
unaffected. Consumer-side logic lives in a follow-up PR.
Sister-PR to a sst/models.dev data PR that will populate prefill: false
on the affected per-model entries.
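For orientation, a minimal sketch of what the optional field could look like on a Zod-backed model schema; the surrounding fields and schema shape here are assumptions for illustration, not the actual packages/core/src/models.ts layout:

```ts
import { z } from "zod"

// Hypothetical schema shape; only the `prefill` field mirrors this PR's change.
export const Model = z.object({
  id: z.string(),
  reasoning: z.boolean().optional(),
  // New optional capability: does the serving template accept a trailing
  // assistant message ("prefill")? Undefined is treated as true by consumers.
  prefill: z
    .boolean()
    .optional()
    .describe(
      "Set to false for thinking-on templates (Qwen3 family, QwQ, DeepSeek-R1, " +
        "GLM thinking, Kimi-K2-Thinking, MiniMax-M2) that reject assistant prefill.",
    ),
})
export type Model = z.infer<typeof Model>
```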
Closes the remaining ~25% of trailing-assistant 400s on llama.cpp /
vLLM / TGI that an empty-content filter alone cannot reach. The
MAX_STEPS prefill in session/prompt.ts is non-empty by design (it
delivers a user-visible "wrap up" instruction), so it survives the
empty filter and trips the same template-incompat 400.
Three coordinated changes:
1. ProviderTransform.canAcceptTrailingAssistant(model) — new helper (sketched after this list).
Three-layer precedence:
(a) explicit model.capabilities.prefill wins (from models.dev or
user config),
(b) auto-inference: @ai-sdk/openai-compatible + reasoning:true
→ false (covers every known 2025-2026 thinking family even
before models.dev ships explicit values),
(c) default true (backwards compatible — Anthropic, Bedrock,
OpenAI, Google etc. unchanged).
2. session/prompt.ts MAX_STEPS routing now consults the helper:
role:"assistant" for prefill-capable providers, role:"user" for the
rest. Thinking stays enabled in the request body — only the role of
the synthetic wrap-up message changes from `assistant` to `user`,
so the model still thinks and writes its summary normally.
3. CapabilityProbe (sketched after the config example below) — runtime detection for self-hosted openai-compatible
servers. llama.cpp's `<root>/props` endpoint exposes the active
chat template; templates that branch on `enable_thinking` are exactly
the ones that reject prefill. The probe runs once per base URL
(cached), fail-silent (vLLM/TGI/mistral.rs have no /props and fall
through to the auto-inference path), short-timeout (1.5s).
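As a rough illustration of (1) and (2), a sketch of the precedence logic and the role routing; the model shape and function placement are simplified assumptions rather than the exact ProviderTransform / session/prompt.ts code:

```ts
// Simplified model shape for illustration; the real type lives in opencode's core.
interface ModelLike {
  npm?: string // provider SDK package, e.g. "@ai-sdk/openai-compatible"
  reasoning?: boolean
  capabilities?: { prefill?: boolean }
}

export function canAcceptTrailingAssistant(model: ModelLike): boolean {
  // (a) explicit value from models.dev or user config always wins
  if (model.capabilities?.prefill !== undefined) return model.capabilities.prefill
  // (b) auto-inference: openai-compatible + reasoning => assume prefill is rejected
  if (model.npm === "@ai-sdk/openai-compatible" && model.reasoning) return false
  // (c) default: prefill allowed (Anthropic, Bedrock, OpenAI, Google unchanged)
  return true
}

// (2) MAX_STEPS routing: only the role of the synthetic wrap-up message changes;
// its text and the request's thinking settings stay the same.
export function maxStepsRole(model: ModelLike): "assistant" | "user" {
  return canAcceptTrailingAssistant(model) ? "assistant" : "user"
}
```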
User can always override per-model via opencode.json:
```json
{
  "provider": {
    "my-llamacpp": {
      "models": {
        "qwen3.5-coder": { "reasoning": true, "prefill": false }
      }
    }
  }
}
```
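A minimal sketch of the runtime probe from (3); the function and cache names are illustrative rather than the actual CapabilityProbe API, while the /props lookup, /v1-suffix normalisation, per-URL cache, 1.5s timeout and fail-silent behaviour follow the description above:

```ts
// Illustrative probe; names are assumptions, behaviour follows the commit message.
const probeCache = new Map<string, boolean | undefined>()

/** false = template rejects prefill, true = accepts, undefined = inconclusive. */
export async function probePrefillSupport(baseURL: string): Promise<boolean | undefined> {
  if (!baseURL) return undefined
  const root = baseURL.replace(/\/v1\/?$/, "") // normalise a trailing "/v1"
  if (probeCache.has(root)) return probeCache.get(root)

  let result: boolean | undefined
  try {
    const res = await fetch(`${root}/props`, { signal: AbortSignal.timeout(1500) })
    if (res.ok) {
      const props = (await res.json()) as { chat_template?: string }
      if (typeof props.chat_template === "string") {
        // Templates that branch on enable_thinking are the ones that 400 on prefill.
        result = !props.chat_template.includes("enable_thinking")
      }
    }
  } catch {
    // Fail-silent: vLLM/TGI/mistral.rs have no /props; fall back to auto-inference.
  }
  probeCache.set(root, result)
  return result
}
```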
Affected behaviour:
- Anthropic, Bedrock, OpenAI, Google — unchanged (prefill stays
available).
- Thinking-on local models (Qwen3, DeepSeek-R1, GLM-thinking,
Kimi-K2-Thinking, MiniMax-M2): MAX_STEPS arrives as a user message.
Same instruction, same wrap-up behaviour, no template rejection.
Tests:
- transform.test.ts: 8-case canAcceptTrailingAssistant matrix
(explicit-overrides-everything, auto-inference for openai-compatible
+ reasoning class, unchanged defaults for Anthropic/OpenAI/Google/
Bedrock representatives).
- capability-probe.test.ts: 11 cases for the runtime probe
(enable_thinking detection, /v1-suffix normalisation, 404 fallback,
network-error fallback, empty baseURL, per-URL cache).
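An illustrative flavour of that matrix, written against the helper sketched earlier; the import path and exact cases are assumptions, not the real transform.test.ts:

```ts
import { describe, expect, test } from "bun:test"
// Hypothetical import path for the helper sketched above.
import { canAcceptTrailingAssistant } from "../src/provider/transform"

describe("canAcceptTrailingAssistant", () => {
  test("explicit prefill overrides auto-inference", () => {
    expect(
      canAcceptTrailingAssistant({
        npm: "@ai-sdk/openai-compatible",
        reasoning: true,
        capabilities: { prefill: true },
      }),
    ).toBe(true)
  })

  test("openai-compatible + reasoning infers prefill rejection", () => {
    expect(
      canAcceptTrailingAssistant({ npm: "@ai-sdk/openai-compatible", reasoning: true }),
    ).toBe(false)
  })

  test("defaults to true for providers like Anthropic", () => {
    expect(canAcceptTrailingAssistant({ npm: "@ai-sdk/anthropic" })).toBe(true)
  })
})
```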
Real-world benchmark against an echomodus-sized Spring Boot project
on llama.cpp + Qwen3.5-9B with --reasoning on:
- Without this PR: 2.0 prefill-400s per run (3/3 runs).
- With this PR + reasoning:true in user config: 0 errors (3/3).
- With this PR + auto-probe (no user config): 0 errors (3/3).
Common misunderstanding: prefill:false does NOT disable thinking.
Thinking stays on for the whole request — only the role of the synthetic
MAX_STEPS message changes from `assistant` to `user`. The model then
thinks (with thinking enabled) and writes its wrap-up normally.
Builds on the Model.prefill capability introduced in the previous
commit. Sister-PR-1 (filter empty assistant content for
@ai-sdk/openai-compatible) handles the orthogonal empty-trailing case;
this PR handles the non-empty trailing case.
Hey! Your PR title doesn't follow our naming convention. Please update it to start with one of the approved prefixes. See CONTRIBUTING.md for details.
The following comment was made by an LLM; it may be inaccurate: Based on the search results, I found related PRs that are part of the same feature work.
Related PRs (Not Duplicates)
Note: PR #27916 (the current PR) is explicitly stacked on PR #27915 and represents the next logical piece in the same feature chain. These are coordinated changes, not duplicates.
Thanks for updating your PR! It now meets our contributing guidelines. 👍
Thanks for your contribution! This PR doesn't have a linked issue. All PRs must reference an existing issue. Please link one; see CONTRIBUTING.md for details.
Issue for this PR
Closes #27920
Stacked on #27915 for the `Model.prefill` capability. Sister-PR #27914 handles the orthogonal empty-trailing case via the empty-content filter.

Type of change
What does this PR do?
Closes the remaining ~25% of trailing-assistant 400s on llama.cpp / vLLM / TGI that #27914 cannot reach. The `MAX_STEPS` prefill in `session/prompt.ts` is non-empty by design (it delivers a user-visible "wrap up" instruction), so it survives the empty-content filter and trips the same template-incompat 400.

Three coordinated pieces:
1. `ProviderTransform.canAcceptTrailingAssistant(model)` — new helper, three-layer precedence:
   - explicit `model.capabilities.prefill` (from models.dev or user config) wins,
   - auto-inference: `@ai-sdk/openai-compatible` + `reasoning: true` → `false`; covers every known 2025-2026 thinking family even before models.dev ships explicit values,
   - default `true` (backwards compatible).
2. MAX_STEPS routing in `session/prompt.ts` — `role: "assistant"` for prefill-capable providers, `role: "user"` for the rest. Thinking stays enabled in the request body — only the role of the synthetic wrap-up message changes, so the model still thinks and writes its summary normally.
3. `CapabilityProbe` — runtime detection for self-hosted openai-compatible servers. llama.cpp's `<root>/props` endpoint exposes the active chat template; templates that branch on `enable_thinking` are exactly the ones that reject prefill at runtime. The probe runs once per base URL (cached), fail-silent (vLLM/TGI/mistral.rs have no `/props` and fall through to the auto-inference path), short-timeout (1.5s).

Affected behaviour: Anthropic, Bedrock, OpenAI and Google are unchanged (prefill stays available); thinking-on local models (Qwen3, DeepSeek-R1, GLM-thinking, Kimi-K2-Thinking, MiniMax-M2) receive MAX_STEPS as a user message instead, with the same instruction and wrap-up behaviour and no template rejection.
Common misunderstanding: `prefill: false` does not disable thinking — only the role of the synthetic MAX_STEPS message changes from `assistant` to `user`. The model thinks and writes its wrap-up normally.

User can override per-model via `opencode.json`:
```json
{
  "provider": {
    "my-llamacpp": {
      "models": {
        "qwen3.5-coder": { "reasoning": true, "prefill": false }
      }
    }
  }
}
```

Related upstream: ggml-org/llama.cpp#20861, ggml-org/llama.cpp#21889, mastra-ai/mastra#15234.
How did you verify your code works?
- `bun test test/provider/transform.test.ts test/provider/capability-probe.test.ts`: 243 pass, 0 fail.
- `bun run typecheck` clean.
- Real-world benchmark against a Spring Boot project on llama.cpp + Qwen3.5-9B with `--reasoning` on, agent forced into MAX_STEPS via `steps: 3`, 3 runs per variant:
  - without this PR: 2.0 prefill-400s per run (3/3 runs),
  - with this PR + `reasoning: true` in user config: 0 errors (3/3),
  - with this PR + auto-probe (no user config): 0 errors (3/3).

Tests:
- `transform.test.ts`: 8-case `canAcceptTrailingAssistant` matrix (explicit-overrides-everything, auto-inference for the openai-compatible + reasoning class, unchanged defaults for Anthropic/OpenAI/Google/Bedrock representatives).
- `capability-probe.test.ts`: 11 cases (`enable_thinking` detection, `/v1`-suffix normalisation, 404 fallback, network-error fallback, empty baseURL, per-URL cache, `supports_preserve_reasoning` secondary signal).

Screenshots / recordings
N/A — backend change.
Checklist