feat: update speculative decoding flags for llama.cpp v1.5+ by HelloItMeMort · Pull Request #956 · docker/model-runner

HelloItMeMort · 2026-06-05T05:04:24Z

Summary

Update the llama.cpp speculative decoding flag allowlist to match the latest llama.cpp server. The --draft, --draft-n, and --draft-n-min flags were deprecated in llama.cpp and replaced with the new --spec-draft-* flag naming convention.

Changes

Added (new spec-draft-* flags)

All new flags are safe: they accept only numeric values, booleans, or enum strings. No file paths are involved.

| Category | Flags |
| --- | --- |
| CPU control | `--spec-draft-cpu-mask`, `-Cd`, `--cpu-mask-draft`, `--spec-draft-cpu-range`, `-Crd`, `--cpu-range-draft`, `--cpu-strict-draft`, `--prio-draft`, `--poll-draft` |
| Batch CPU | `--cpu-mask-batch-draft`, `--cpu-strict-batch-draft`, `--prio-batch-draft`, `--poll-batch-draft` |
| Threading | `--spec-draft-threads`, `-td`, `--threads-draft`, `--spec-draft-threads-batch`, `-tbd`, `--threads-batch-draft` |
| GPU/device | `--spec-draft-device`, `-devd`, `--device-draft`, `--spec-draft-ngl`, `-ngld`, `--gpu-layers-draft`, `--n-gpu-layers-draft` |
| Speculation params | `--spec-draft-n-max`, `--spec-draft-n-min`, `--spec-draft-p-split`, `--spec-draft-p-min`, `--spec-draft-backend-sampling`, `--spec-type` |
| Cache types | `--spec-draft-type-k`, `-ctkd`, `--cache-type-k-draft`, `--spec-draft-type-v`, `-ctvd`, `--cache-type-v-draft` |
| MoE | `--spec-draft-cpu-moe`, `-cmoed`, `--cpu-moe-draft`, `--spec-draft-n-cpu-moe`, `--spec-draft-ncmoe`, `-ncmoed`, `--n-cpu-moe-draft` |
| Override tensor | `--spec-draft-override-tensor`, `-otd`, `--override-tensor-draft` |
| Ngram mod | `--spec-ngram-mod-n-min`, `--spec-ngram-mod-n-max`, `--spec-ngram-mod-n-match` |
| Ngram simple | `--spec-ngram-simple-size-n`, `--spec-ngram-simple-size-m`, `--spec-ngram-simple-min-hits` |
| Ngram map-k | `--spec-ngram-map-k-size-n`, `--spec-ngram-map-k-size-m`, `--spec-ngram-map-k-min-hits` |
| Ngram map-k4v | `--spec-ngram-map-k4v-size-n`, `--spec-ngram-map-k4v-size-m`, `--spec-ngram-map-k4v-min-hits` |

Test updates

Updated the speculative category in runtime_flags_allowlist_test.go to cover a representative set of the new flags.

Reference

llama.cpp server flags documentation:
https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md

See the Speculative decoding section in the server README for the full list of --spec-draft-* and --spec-type flags.

….cpp

Remove deprecated --draft, --draft-n, --draft-max, --draft-min, --draft-n-min flags and replace with the new --spec-draft-* and --spec-ngram-* flags introduced in recent llama.cpp versions.

Changes:
- Remove deprecated --draft, --draft-n, --draft-max, --draft-min, --draft-n-min
- Add --spec-draft-* flags for draft model CPU control (threads, affinity, priority, polling)
- Add --spec-draft-* flags for draft model GPU/device control (override-tensor, cpu-moe)
- Add --spec-draft-n-max, --spec-draft-n-min for draft token counts
- Add --spec-draft-p-split, --spec-draft-p-min for draft probability thresholds
- Add --spec-draft-device/-devd, --spec-draft-ngl/-ngld for draft GPU offloading
- Add --spec-draft-type-k/-ctkd, --spec-draft-type-v/-ctvd for draft KV cache types
- Add --spec-draft-model/-md for specifying a separate draft model
- Add --spec-type for selecting speculative decoding method
- Add --spec-ngram-* flags for all ngram-based speculative decoding variants

Based on llama.cpp server docs: https://github.com/ggml-org/llama.cpp/tree/master/tools/server

gemini-code-assist

Code Review

Review: Speculative Decoding Flags Update

Summary

This pull request expands the llama.cpp runtime flags allowlist to support a comprehensive set of new speculative decoding and ngram-related flags. While the expansion is necessary, there are critical backward-compatibility issues and potential runtime failures with the backend implementation.

Critical

pkg/inference/runtime_flags_allowlist.go:131-134: The legacy flag --draft-max was incorrectly replaced with --draft-n-max, and --draft-min was omitted, breaking backward compatibility. Additionally, the backend in llamacpp.go still hardcodes --draft-max and --draft-p-min, which will fail on newer llama.cpp versions. Update the allowlist and dynamically select flags in the backend.
pkg/inference/runtime_flags_allowlist_test.go:165-167: Update the test cases to use the corrected backward-compatible flags (--draft-max and --draft-min).

sourcery-ai

Hey - I've left some high level feedback:

There are now a lot of legacy/new alias pairs (e.g. --spec-draft-n-max and --draft-n-max); consider adding a brief comment above this block clarifying which entries are kept only for backward compatibility so future cleanups don’t accidentally remove still-needed aliases.
The speculative flag names are duplicated between runtime_flags_allowlist.go and runtime_flags_allowlist_test.go; factoring them into a shared slice or helper used by both would reduce the chance of these lists drifting out of sync when llama.cpp flags change again.

HelloItMeMort · 2026-06-05T17:05:42Z

My motivation for this PR is to use MTP models, which requires passing the `--spec-type draft-mtp` flag through docker model config.

feat: restore legacy backward-compatible speculative decoding flags

60e199a

Restore --draft-max and --draft-min as backward-compatible aliases alongside the new --spec-draft-n-max and --spec-draft-p-min names.

HelloItMeMort wants to merge 2 commits into docker:main from HelloItMeMort:speculative-decoding-flags
