Add GGUF fit-target control and wire to llama-server --fit-target #4882
aiSynergy37 wants to merge 5 commits into unslothai:main
Conversation
Hey team,

Tried chat-loading a local GGUF file for unsloth gemma-4, which was found automatically under the finetunes list in Models.

- First load: full gemma-3-31b context length by default + bf16 KV cache, with CPU offload. Chat works fine.
- After tweaking KV to q8_0 and reducing the context size to 100k, the reload command in the terminal automatically falls back to 8096 context and q8_0. It is not respecting the context set from the UI, but rather falling back to a safe KV/context combination that fits in VRAM. (When cranking the context above that it displays a warning about spilling to RAM, but even clicking Apply doesn't force the setting chosen in the UI.)

I think there will be dependencies between applying this PR and fixing that issue.
Expose `fit_target` from Chat settings through the API and GGUF load/status responses, and pass `--fit-target` when `--fit` is active. Includes backend regression coverage.

Fixes unslothai#4857
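The llama-server wiring described above can be sketched roughly like this (a minimal illustration; the helper name and the assumption that `fit_target` arrives as an optional integer are mine, not the PR's actual code):

```python
from typing import List, Optional

def llama_server_fit_args(fit_enabled: bool, fit_target: Optional[int]) -> List[str]:
    """Extra CLI args for llama-server: --fit-target is only passed
    when --fit is active, matching the behavior described in the PR."""
    if not fit_enabled or fit_target is None:
        return []  # Auto preset / fit disabled: let llama-server decide
    return ["--fit-target", str(fit_target)]
```

Keeping the flag conditional on `--fit` avoids handing llama-server an option that has no effect on its own.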
Force-pushed from e11db97 to 18ad8a7

for more information, see https://pre-commit.ci
Follow-up fix pushed in 18ad8a7 to address the context fallback concern:

- When users request an explicit context and also set `fit_target`, we now keep the requested context and use `--fit` instead of silently capping the context down to a GPU-only safe value.
- Added runtime regression test `test_explicit_fit_target_keeps_requested_context` in `studio/backend/tests/test_native_context_length.py` to lock in this behavior.

This makes the new Fit Target control effective in tight VRAM scenarios where an explicit context previously got reduced before launch.
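The decision in the first bullet might look roughly like this (an illustrative sketch; `resolve_context` and `gpu_safe_ctx` are hypothetical names, not the backend's actual API):

```python
from typing import Optional, Tuple

def resolve_context(requested_ctx: Optional[int],
                    fit_target: Optional[int],
                    gpu_safe_ctx: int) -> Tuple[int, bool]:
    """Return (context_to_launch_with, use_fit_flag)."""
    if requested_ctx is None:
        return gpu_safe_ctx, False                  # nothing explicit: safe default
    if fit_target is not None:
        return requested_ctx, True                  # keep the request, lean on --fit
    return min(requested_ctx, gpu_safe_ctx), False  # previous capping behavior
```

The point is that an explicit context plus a fit target no longer triggers the silent cap; fitting is delegated to llama-server instead.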
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fded3c9b42
```
chat_template_override: chatTemplateOverride,
cache_type_kv: kvCacheDtype,
speculative_type: speculativeType,
fit_target: fitTarget,
```
Preserve fit_target across failed model-switch rollback
Adding fit_target to the primary load request here introduces a new rollback gap: if loading the new model fails after unloading the previous GGUF model, the catch-path reloads the previous model without fit_target, so the restored model silently comes back with different VRAM/offload behavior. This only shows up when a non-default fit target was active and a subsequent load fails, but in that case users lose their runtime tuning unexpectedly.
Addressed review note: preserve `fit_target` across failed model-switch rollback.

Update in commit 38ff2ca captures `previousLoadedFitTarget` before unload and passes it during the rollback `loadModel(...)`, so a failed switch restores the previous GGUF model with the same fit-target tuning.
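A rough sketch of the rollback shape (a Python stand-in for illustration only; the runtime object and its method names are hypothetical, and the real change lives in the frontend's `loadModel` path):

```python
def switch_model(runtime, new_model: str, new_fit_target):
    """Switch GGUF models, restoring the previous model's fit-target
    tuning if the new load fails."""
    previous_model = runtime.loaded_model
    previous_fit_target = runtime.loaded_fit_target  # capture BEFORE unload
    runtime.unload()
    try:
        runtime.load(new_model, fit_target=new_fit_target)
    except Exception:
        if previous_model is not None:
            # Roll back with the original fit target instead of the default,
            # so VRAM/offload behavior is unchanged after a failed switch.
            runtime.load(previous_model, fit_target=previous_fit_target)
        raise
```

Capturing the fit target before the unload is what closes the gap the review flagged: after unload, the runtime no longer knows the prior tuning.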
Summary

- Add `fit_target` to the inference load API request/response and status response for GGUF
- Pass `fit_target` through the backend GGUF load path into `LlamaCppBackend.load_model(...)`
- When `--fit on` is used, pass `--fit-target <value>` to llama-server
- New `Fit Target` control in Chat Settings (GGUF) with presets: Auto / 64 / 128 / 256 / 512
- Surface the active fit target in `/api/inference/status`

Why

Issue #4857 asks for a Studio UI control to tune llama-server `--fit-target` for tight VRAM cases where the default fit behavior leaves too much GPU memory unused.
Validation

- `pytest studio/backend/tests/test_native_context_length.py -v` (38 passed)
- `python -m compileall studio/backend/core/inference/llama_cpp.py studio/backend/models/inference.py studio/backend/routes/inference.py`
- `tsc` is unavailable (frontend deps/tools not installed)

Fixes #4857