feat: update speculative decoding flags for llama.cpp v1.5+ #956

Open
HelloItMeMort wants to merge 2 commits into docker:main from HelloItMeMort:speculative-decoding-flags

@HelloItMeMort commented Jun 5, 2026

Summary

Update the llama.cpp speculative decoding flag allowlist to match the latest llama.cpp server. The --draft, --draft-n, and --draft-n-min flags were deprecated in llama.cpp and replaced with the new --spec-draft-* flag naming convention.

Changes

Added (new spec-draft-* flags)

All new flags are safe: they accept only numeric values, booleans, or enum strings. No file paths are involved.

| Category | Flags |
| --- | --- |
| CPU control | --spec-draft-cpu-mask, -Cd, --cpu-mask-draft, --spec-draft-cpu-range, -Crd, --cpu-range-draft, --cpu-strict-draft, --prio-draft, --poll-draft |
| Batch CPU | --cpu-mask-batch-draft, --cpu-strict-batch-draft, --prio-batch-draft, --poll-batch-draft |
| Threading | --spec-draft-threads, -td, --threads-draft, --spec-draft-threads-batch, -tbd, --threads-batch-draft |
| GPU/device | --spec-draft-device, -devd, --device-draft, --spec-draft-ngl, -ngld, --gpu-layers-draft, --n-gpu-layers-draft |
| Speculation params | --spec-draft-n-max, --spec-draft-n-min, --spec-draft-p-split, --spec-draft-p-min, --spec-draft-backend-sampling, --spec-type |
| Cache types | --spec-draft-type-k, -ctkd, --cache-type-k-draft, --spec-draft-type-v, -ctvd, --cache-type-v-draft |
| MoE | --spec-draft-cpu-moe, -cmoed, --cpu-moe-draft, --spec-draft-n-cpu-moe, --spec-draft-ncmoe, -ncmoed, --n-cpu-moe-draft |
| Override tensor | --spec-draft-override-tensor, -otd, --override-tensor-draft |
| Ngram mod | --spec-ngram-mod-n-min, --spec-ngram-mod-n-max, --spec-ngram-mod-n-match |
| Ngram simple | --spec-ngram-simple-size-n, --spec-ngram-simple-size-m, --spec-ngram-simple-min-hits |
| Ngram map-k | --spec-ngram-map-k-size-n, --spec-ngram-map-k-size-m, --spec-ngram-map-k-min-hits |
| Ngram map-k4v | --spec-ngram-map-k4v-size-n, --spec-ngram-map-k4v-size-m, --spec-ngram-map-k4v-min-hits |
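
For readers unfamiliar with how an allowlist of this kind is typically consulted, here is a minimal Go sketch. It is not the code in runtime_flags_allowlist.go: the names `speculativeFlags` and `isAllowedFlag` are hypothetical, the map holds only a handful of the flags listed above, and the handling of the `--flag=value` form is an assumption about how arguments might arrive.

```go
package main

import (
	"fmt"
	"strings"
)

// speculativeFlags is a hypothetical subset of the allowlist. The flag names
// are copied from the PR description; the map itself is illustrative only.
var speculativeFlags = map[string]struct{}{
	"--spec-type":        {},
	"--spec-draft-n-max": {},
	"--spec-draft-n-min": {},
	"--spec-draft-p-min": {},
	"--spec-draft-ngl":   {},
	"-ngld":              {},
}

// isAllowedFlag reports whether arg names a flag present in the sketch map.
// It cuts at the first "=" so that "--flag=value" is matched by flag name
// (assumed form; the real allowlist may parse arguments differently).
func isAllowedFlag(arg string) bool {
	name, _, _ := strings.Cut(arg, "=")
	_, ok := speculativeFlags[name]
	return ok
}

func main() {
	for _, arg := range []string{"--spec-type", "--spec-draft-n-max=16", "--not-a-flag"} {
		fmt.Printf("%s allowed=%v\n", arg, isAllowedFlag(arg))
	}
}
```

A set keyed by flag name keeps lookups constant-time and makes it cheap to list long and short spellings side by side.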

Test updates

Updated the speculative category in runtime_flags_allowlist_test.go to cover representative flags from the new set.

Reference

llama.cpp server flags documentation:
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

See the Speculative decoding section in the server README for the full list of --spec-draft-* and --spec-type flags.

….cpp

Remove deprecated --draft, --draft-n, --draft-max, --draft-min, --draft-n-min
flags and replace with the new --spec-draft-* and --spec-ngram-* flags
introduced in recent llama.cpp versions.

Changes:
- Remove deprecated --draft, --draft-n, --draft-max, --draft-min, --draft-n-min
- Add --spec-draft-* flags for draft model CPU control (threads, affinity, priority, polling)
- Add --spec-draft-* flags for draft model GPU/device control (override-tensor, cpu-moe)
- Add --spec-draft-n-max, --spec-draft-n-min for draft token counts
- Add --spec-draft-p-split, --spec-draft-p-min for draft probability thresholds
- Add --spec-draft-device/-devd, --spec-draft-ngl/-ngld for draft GPU offloading
- Add --spec-draft-type-k/-ctkd, --spec-draft-type-v/-ctvd for draft KV cache types
- Add --spec-draft-model/-md for specifying a separate draft model
- Add --spec-type for selecting speculative decoding method
- Add --spec-ngram-* flags for all ngram-based speculative decoding variants

Based on llama.cpp server docs:
https://github.com/ggml-org/llama.cpp/tree/master/tools/server

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

Review: Speculative Decoding Flags Update

Summary

This pull request expands the llama.cpp runtime flags allowlist to support a comprehensive set of new speculative decoding and ngram-related flags. While the expansion is necessary, there are critical backward-compatibility issues and potential runtime failures with the backend implementation.

Critical

  • pkg/inference/runtime_flags_allowlist.go:131-134: The legacy flag --draft-max was incorrectly replaced with --draft-n-max, and --draft-min was omitted, breaking backward compatibility. Additionally, the backend in llamacpp.go still hardcodes --draft-max and --draft-p-min, which will fail on newer llama.cpp versions. Update the allowlist and dynamically select flags in the backend.
  • pkg/inference/runtime_flags_allowlist_test.go:165-167: Update the test cases to use the corrected backward-compatible flags (--draft-max and --draft-min).
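
The "dynamically select flags in the backend" suggestion could look roughly like the Go sketch below. Everything here is an assumption made for illustration: `draftArgs` and its `useSpecNames` switch are hypothetical, the sketch does not show how the backend would detect which llama.cpp version it is talking to, and the legacy-to-new name pairing is inferred from this thread rather than taken from llamacpp.go.

```go
package main

import (
	"fmt"
	"strconv"
)

// draftArgs builds the draft-related arguments for the server command line.
// useSpecNames is a hypothetical switch: true selects the --spec-draft-*
// names, false keeps the legacy names the review says are hardcoded today.
func draftArgs(useSpecNames bool, nMax int, pMin float64) []string {
	maxFlag, pMinFlag := "--draft-max", "--draft-p-min"
	if useSpecNames {
		maxFlag, pMinFlag = "--spec-draft-n-max", "--spec-draft-p-min"
	}
	return []string{
		maxFlag, strconv.Itoa(nMax),
		pMinFlag, strconv.FormatFloat(pMin, 'f', -1, 64),
	}
}

func main() {
	fmt.Println(draftArgs(false, 16, 0.8)) // legacy names
	fmt.Println(draftArgs(true, 16, 0.8))  // --spec-draft-* names
}
```

Keeping the name choice in one function means the rest of the backend never needs to know which naming scheme is in use.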

Comment thread pkg/inference/runtime_flags_allowlist.go Outdated
Comment thread pkg/inference/runtime_flags_allowlist_test.go Outdated

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've left some high level feedback:

  • There are now a lot of legacy/new alias pairs (e.g. --spec-draft-n-max and --draft-n-max); consider adding a brief comment above this block clarifying which entries are kept only for backward compatibility so future cleanups don’t accidentally remove still-needed aliases.
  • The speculative flag names are duplicated between runtime_flags_allowlist.go and runtime_flags_allowlist_test.go; factoring them into a shared slice or helper used by both would reduce the chance of these lists drifting out of sync when llama.cpp flags change again.
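
Both points could be addressed with a single shared list that records which spellings are legacy, as in this Go sketch. The type `flagAlias`, the variable `speculativeAliases`, and the helper `allNames` are hypothetical names, not identifiers from the repository; the one alias pair shown is the example quoted in the feedback above.

```go
package main

import "fmt"

// flagAlias pairs a current flag name with the legacy spelling that is kept
// only for backward compatibility. Legacy is empty when there is none.
type flagAlias struct {
	Name   string
	Legacy string
}

// speculativeAliases is a hypothetical shared list that both the allowlist
// and its test could range over, so the two cannot drift apart.
var speculativeAliases = []flagAlias{
	{Name: "--spec-draft-n-max", Legacy: "--draft-n-max"},
	{Name: "--spec-type"},
}

// allNames flattens the list into every accepted spelling, with each current
// name followed by its legacy alias when one exists.
func allNames(aliases []flagAlias) []string {
	var out []string
	for _, a := range aliases {
		out = append(out, a.Name)
		if a.Legacy != "" {
			out = append(out, a.Legacy)
		}
	}
	return out
}

func main() {
	fmt.Println(allNames(speculativeAliases))
}
```

Because the legacy spelling lives in its own field, a later cleanup can drop backward-compatible names by clearing `Legacy` without touching the current flags.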

Restore --draft-max and --draft-min as
backward-compatible aliases alongside the new --spec-draft-n-max and
--spec-draft-p-min names.

@HelloItMeMort (Author)

My motivation for this PR is to use MTP models, which requires passing the --spec-type draft-mtp flag to docker model config.
