feat: update speculative decoding flags for llama.cpp v1.5+ #956
HelloItMeMort wants to merge 2 commits into
….cpp

Remove deprecated --draft, --draft-n, --draft-max, --draft-min, --draft-n-min flags and replace with the new --spec-draft-* and --spec-ngram-* flags introduced in recent llama.cpp versions.

Changes:
- Remove deprecated --draft, --draft-n, --draft-max, --draft-min, --draft-n-min
- Add --spec-draft-* flags for draft model CPU control (threads, affinity, priority, polling)
- Add --spec-draft-* flags for draft model GPU/device control (override-tensor, cpu-moe)
- Add --spec-draft-n-max, --spec-draft-n-min for draft token counts
- Add --spec-draft-p-split, --spec-draft-p-min for draft probability thresholds
- Add --spec-draft-device/-devd, --spec-draft-ngl/-ngld for draft GPU offloading
- Add --spec-draft-type-k/-ctkd, --spec-draft-type-v/-ctvd for draft KV cache types
- Add --spec-draft-model/-md for specifying a separate draft model
- Add --spec-type for selecting speculative decoding method
- Add --spec-ngram-* flags for all ngram-based speculative decoding variants

Based on llama.cpp server docs: https://github.com/ggml-org/llama.cpp/tree/master/tools/server
Code Review
Review: Speculative Decoding Flags Update
Summary
This pull request expands the llama.cpp runtime flags allowlist to support a comprehensive set of new speculative decoding and ngram-related flags. While the expansion is necessary, there are critical backward-compatibility issues and potential runtime failures with the backend implementation.
Critical
- pkg/inference/runtime_flags_allowlist.go:131-134: The legacy flag --draft-max was incorrectly replaced with --draft-n-max, and --draft-min was omitted, breaking backward compatibility. Additionally, the backend in llamacpp.go still hardcodes --draft-max and --draft-p-min, which will fail on newer llama.cpp versions. Update the allowlist and dynamically select flags in the backend.
- pkg/inference/runtime_flags_allowlist_test.go:165-167: Update the test cases to use the corrected backward-compatible flags (--draft-max and --draft-min).
Hey - I've left some high level feedback:
- There are now a lot of legacy/new alias pairs (e.g. --spec-draft-n-max and --draft-n-max); consider adding a brief comment above this block clarifying which entries are kept only for backward compatibility so future cleanups don’t accidentally remove still-needed aliases.
- The speculative flag names are duplicated between runtime_flags_allowlist.go and runtime_flags_allowlist_test.go; factoring them into a shared slice or helper used by both would reduce the chance of these lists drifting out of sync when llama.cpp flags change again.
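Both points above could be addressed with a single annotated list, roughly as in the following sketch. It does not reflect the actual layout of runtime_flags_allowlist.go: the names `specFlag`, `speculativeFlags`, `allowedSpeculativeFlags`, and `legacySpeculativeFlags` are hypothetical, and only a handful of the flags from this PR are listed.

```go
package main

// specFlag pairs a flag spelling with a marker for entries that are retained
// only for backward compatibility. The type is hypothetical.
type specFlag struct {
	Name   string
	Legacy bool
}

// speculativeFlags is an illustrative single source of truth that both the
// allowlist and its test could read, so the two cannot drift apart.
var speculativeFlags = []specFlag{
	{Name: "--spec-draft-n-max"},
	{Name: "--spec-draft-n-min"},
	{Name: "--spec-draft-p-min"},
	{Name: "--spec-type"},

	// Kept only for backward compatibility with older llama.cpp builds.
	// Do not remove these during cleanup without checking existing callers.
	{Name: "--draft-max", Legacy: true},
	{Name: "--draft-min", Legacy: true},
}

// allowedSpeculativeFlags builds the lookup set an allowlist check would use.
func allowedSpeculativeFlags() map[string]bool {
	allowed := make(map[string]bool, len(speculativeFlags))
	for _, f := range speculativeFlags {
		allowed[f.Name] = true
	}
	return allowed
}

// legacySpeculativeFlags lists the backward-compatibility aliases in
// declaration order, which a test could assert on explicitly.
func legacySpeculativeFlags() []string {
	var legacy []string
	for _, f := range speculativeFlags {
		if f.Legacy {
			legacy = append(legacy, f.Name)
		}
	}
	return legacy
}
```

With this shape the test file would range over `speculativeFlags` instead of repeating the names, and the `Legacy` marker records which aliases exist only for compatibility.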
Restore --draft-max and --draft-min as backward-compatible aliases alongside the new --spec-draft-n-max and --spec-draft-p-min names.
My motivation for this merge is so I can use MTP models and pass the necessary --spec-type draft-mtp flag to
Summary
Update the llama.cpp speculative decoding flag allowlist to match the latest llama.cpp server. The --draft, --draft-n, and --draft-n-min flags were deprecated in llama.cpp and replaced with the new --spec-draft-* flag naming convention.
Changes
Added (new spec-draft-* flags)
All new flags are safe: they accept only numeric values, booleans, or enum strings. No file paths are involved.
- --spec-draft-cpu-mask, -Cd, --cpu-mask-draft, --spec-draft-cpu-range, -Crd, --cpu-range-draft, --cpu-strict-draft, --prio-draft, --poll-draft
- --cpu-mask-batch-draft, --cpu-strict-batch-draft, --prio-batch-draft, --poll-batch-draft
- --spec-draft-threads, -td, --threads-draft, --spec-draft-threads-batch, -tbd, --threads-batch-draft
- --spec-draft-device, -devd, --device-draft, --spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft
- --spec-draft-n-max, --spec-draft-n-min, --spec-draft-p-split, --spec-draft-p-min, --spec-draft-backend-sampling, --spec-type
- --spec-draft-type-k, -ctkd, --cache-type-k-draft, --spec-draft-type-v, -ctvd, --cache-type-v-draft
- --spec-draft-cpu-moe, -cmoed, --cpu-moe-draft, --spec-draft-n-cpu-moe, --spec-draft-ncmoe, -ncmoed, --n-cpu-moe-draft
- --spec-draft-override-tensor, -otd, --override-tensor-draft
- --spec-ngram-mod-n-min, --spec-ngram-mod-n-max, --spec-ngram-mod-n-match
- --spec-ngram-simple-size-n, --spec-ngram-simple-size-m, --spec-ngram-simple-min-hits
- --spec-ngram-map-k-size-n, --spec-ngram-map-k-size-m, --spec-ngram-map-k-min-hits
- --spec-ngram-map-k4v-size-n, --spec-ngram-map-k4v-size-m, --spec-ngram-map-k4v-min-hits
Test updates
Updated the speculative category in runtime_flags_allowlist_test.go to cover the representative flags.
Reference
llama.cpp server flags documentation:
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md
See the Speculative decoding section in the server README for the full list of --spec-draft-* and --spec-type flags.