
Add GGUF fit-target control and wire to llama-server --fit-target#4882

Open
aiSynergy37 wants to merge 5 commits into unslothai:main from aiSynergy37:feat/gguf-fit-target-ui-4857

Conversation

@aiSynergy37

Summary

  • add optional fit_target to inference load API request/response and status response for GGUF
  • wire fit_target through backend GGUF load path into LlamaCppBackend.load_model(...)
  • when --fit on is used, pass --fit-target <value> to llama-server
  • expose Fit Target in Chat Settings (GGUF) with presets: Auto / 64 / 128 / 256 / 512
  • persist loaded fit-target in runtime store and restore it from /api/inference/status
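As a rough illustration of the wiring described above, appending `--fit-target` to the llama-server command only when a value is set could look like the sketch below. This is not the repo's actual code; `build_server_args` and its signature are assumptions for illustration.

```python
from typing import Optional


def build_server_args(model_path: str, fit_target: Optional[int] = None) -> list[str]:
    """Build llama-server CLI args, appending --fit-target only when a value is set."""
    args = ["llama-server", "--model", model_path]
    if fit_target is not None:
        # --fit on enables the fit behavior; --fit-target tunes it (per issue #4857).
        args += ["--fit", "on", "--fit-target", str(fit_target)]
    return args


# "Auto" preset maps to fit_target=None, so no --fit-target flag is emitted.
print(build_server_args("gemma.gguf", 256))
```

The UI presets (Auto / 64 / 128 / 256 / 512) would then reduce to either `None` or an integer passed through the load request.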

Why

Issue #4857 asks for a Studio UI control to tune llama-server --fit-target for tight VRAM cases where default fit behavior leaves too much GPU memory unused.

Validation

  • pytest studio/backend/tests/test_native_context_length.py -v (38 passed)
  • python -m compileall studio/backend/core/inference/llama_cpp.py studio/backend/models/inference.py studio/backend/routes/inference.py
  • Frontend typecheck could not run in this environment because tsc is unavailable (frontend deps/tools not installed).

Fixes #4857


@xyehya

xyehya commented Apr 7, 2026

Hey team,
I was testing a fresh installation of the latest release to check whether the context-size UI control has been fixed, but the behavior is inconsistent.

I tried chat-loading a local GGUF file for unsloth gemma-4, which Studio found automatically under the finetunes list in Models.

First load --> full gemma-3-31b context length by default + bf16 KV --> CPU offload --> chat works fine.

Tweaked KV to q8_0 and reduced context size to 100k --> the reload command in the terminal automatically falls back to 8096 context and q8_0 --> it's not respecting the context set from the UI, but instead falling back to a safe KV/context combination that fits in VRAM. (When cranking the context above that, it shows a warning about spilling to RAM; even so, clicking Apply doesn't force the UI-set value.)

I think fixing that issue and landing this PR are interdependent.

Expose fit_target from Chat settings through API and GGUF load/status responses, and pass --fit-target when --fit is active. Includes backend regression coverage.

Fixes unslothai#4857
@aiSynergy37 aiSynergy37 force-pushed the feat/gguf-fit-target-ui-4857 branch from e11db97 to 18ad8a7 Compare April 9, 2026 22:09
@aiSynergy37
Author

Follow-up fix pushed in 18ad8a7 to address the context fallback concern:

  • When users request an explicit context and also set fit_target, we now keep the requested context and use --fit instead of silently capping context down to a GPU-only safe value.
  • Added runtime regression test test_explicit_fit_target_keeps_requested_context in studio/backend/tests/test_native_context_length.py to lock this behavior.

This makes the new Fit Target control effective in tight VRAM scenarios where an explicit context previously got reduced before launch.
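The decision rule above can be sketched as follows; `resolve_context` and its parameters are hypothetical names for illustration, not the backend's actual API:

```python
from typing import Optional


def resolve_context(requested_ctx: Optional[int],
                    gpu_safe_ctx: int,
                    fit_target: Optional[int]) -> tuple[int, bool]:
    """Return (context_length, use_fit) for the llama-server launch."""
    if requested_ctx is not None and fit_target is not None:
        # Explicit context + explicit fit target: honor the requested context
        # and let --fit / --fit-target handle the VRAM budget.
        return requested_ctx, True
    if requested_ctx is not None:
        # No fit target: fall back to the GPU-only safe value if needed
        # (the old behavior the reporter observed, e.g. 100k capped to ~8k).
        return min(requested_ctx, gpu_safe_ctx), False
    return gpu_safe_ctx, False
```

Under this rule, the scenario from the bug report (100k requested, ~8k GPU-safe) keeps 100k when a fit target is set, instead of being silently capped.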


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fded3c9b42


chat_template_override: chatTemplateOverride,
cache_type_kv: kvCacheDtype,
speculative_type: speculativeType,
fit_target: fitTarget,


P2: Preserve fit_target across failed model-switch rollback

Adding fit_target to the primary load request here introduces a new rollback gap: if loading the new model fails after unloading the previous GGUF model, the catch-path reloads the previous model without fit_target, so the restored model silently comes back with different VRAM/offload behavior. This only shows up when a non-default fit target was active and a subsequent load fails, but in that case users lose their runtime tuning unexpectedly.


@aiSynergy37
Author

Addressed review note: preserve fit_target across failed model-switch rollback.

The update in commit 38ff2ca captures previousLoadedFitTarget before unload and passes it during the rollback loadModel(...), so a failed switch restores the previous GGUF model with the same fit-target tuning.
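A minimal sketch of this rollback pattern, with assumed names (`ModelRuntime`, `load`, `switch`) rather than the repo's actual classes:

```python
class ModelRuntime:
    """Illustrative runtime that remembers the loaded model and its fit target."""

    def __init__(self):
        self.loaded_model = None
        self.loaded_fit_target = None

    def load(self, model, fit_target=None):
        # Stand-in for the real GGUF load; "bad.gguf" simulates a failed load.
        if model == "bad.gguf":
            raise RuntimeError("load failed")
        self.loaded_model, self.loaded_fit_target = model, fit_target

    def switch(self, new_model, fit_target=None):
        prev_model = self.loaded_model
        prev_fit_target = self.loaded_fit_target  # captured before unload
        self.loaded_model = self.loaded_fit_target = None  # unload previous model
        try:
            self.load(new_model, fit_target)
        except Exception:
            if prev_model is not None:
                # Roll back with the same fit-target tuning, not defaults.
                self.load(prev_model, prev_fit_target)
            raise
```

Without the captured `prev_fit_target`, the rollback path would reload the previous model with default fit behavior, which is exactly the gap the review flagged.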



Development

Successfully merging this pull request may close these issues.

[Feature] Add Unsloth Studio UI value to tune the llama-server --fit-target flag for squeezed extra performance.
