
GgufLinear: inference-time GGUF matmul on Apple Silicon — llama.cpp parity #45977

Open

ArthurZucker wants to merge 3 commits into update-gguf from gguf-matmul-kernels


Conversation

@ArthurZucker (Collaborator)

What does this PR do?

Opt-in inference path for GGUF models that keeps weights at their native quantization (Q4_0 / Q8_0 / Q4_K / Q5_K / Q6_K / IQ4_NL / IQ4_XS) and runs the forward matmul through the same Metal kernels llama.cpp ships (lifted from candle, packaged as ArthurZ/gguf-kernels via kernel-builder).

Two new pieces:

  • integrations/gguf_linear.GgufLinear — drop-in nn.Linear replacement. Forward picks mul_mat_vec_<fmt>_f32 for batch-1 (decode) and mul_mat_<fmt>_f32 for batch>1 (prefill). CPU/CUDA fall back to dequant + nn.functional.linear (see the sketch after this list).
  • GGUFQuantizer.linear_mode — when on, the post-load hook walks the GGUF tensor map, re-applies the same renames the loader used, and swaps each matching nn.Linear for a GgufLinear with raw block bytes.
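
As a rough sketch of the dispatch logic (illustrative only: the kernel entry points follow the `mul_mat[_vec]_<fmt>_f32` naming above, but the constructor arguments, the `kernels` handle, and the `dequant_fn` fallback hook are assumptions, not the PR's actual signatures):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GgufLinearSketch(nn.Module):
    """Illustrative stand-in for GgufLinear; names and shapes simplified."""

    def __init__(self, qweight, quant_fmt, out_features, in_features,
                 kernels=None, dequant_fn=None):
        super().__init__()
        self.register_buffer("qweight", qweight)  # flat uint8 GGUF blocks
        self.quant_fmt = quant_fmt                # e.g. "q4_0", "q4_k"
        self.out_features = out_features
        self.in_features = in_features
        self.kernels = kernels                    # Metal kernel module (assumed API)
        self.dequant_fn = dequant_fn              # fallback dequantizer (assumed API)

    def forward(self, x):
        if x.device.type != "mps" or self.kernels is None:
            # CPU/CUDA fallback: dequantize and take the ordinary linear path.
            w = self.dequant_fn(self.qweight, self.quant_fmt,
                                self.out_features, self.in_features)
            return F.linear(x, w)
        tokens = x.reshape(-1, self.in_features)
        if tokens.shape[0] == 1:
            # Batch-1 decode: bandwidth-bound, use the matvec kernel.
            op = getattr(self.kernels, f"mul_mat_vec_{self.quant_fmt}_f32")
        else:
            # Batch>1 prefill: compute-bound, use the batched matmul kernel.
            op = getattr(self.kernels, f"mul_mat_{self.quant_fmt}_f32")
        out = op(self.qweight, tokens, self.out_features, self.in_features)
        return out.reshape(*x.shape[:-1], self.out_features)
```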

Enabled via from_pretrained(..., gguf_file=..., gguf_linear=True) or TRANSFORMERS_GGUF_LINEAR=1. Default off, no change to existing behaviour.
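
Minimal usage, with the two spellings above (repo id and GGUF filename are illustrative placeholders):

```python
import os

from transformers import AutoModelForCausalLM

# Either pass the kwarg...
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct-GGUF",            # illustrative repo id
    gguf_file="qwen2.5-0.5b-instruct-q4_0.gguf",  # illustrative filename
    gguf_linear=True,
)

# ...or flip the env var before calling from_pretrained.
os.environ["TRANSFORMERS_GGUF_LINEAR"] = "1"
```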

Builds on top of #44794 (this PR targets update-gguf). #44794 makes load fast; this PR makes inference fit in memory: weights stay at 4.5 bpw (Q4_K) instead of 16 bpw bf16, roughly 3.5× less RAM at inference time, which is the qualitative win on Apple Silicon.

Performance vs llama.cpp

Measured on M3 Max (96 GB), Qwen2.5-0.5B-Instruct Q4_0:

llama-bench tg64                                  →  261 tok/s
our matvec kernels, one batched MTLCommandBuffer  →  266 tok/s   (1.02× parity)

The kernels are byte-identical to llama.cpp's (via candle). When dispatch is batched into a single command buffer per token (which PyTorch's MPS stream does naturally for ops on the same forward pass), throughput matches.

Per-kernel ceilings vs PyTorch's MPS bf16 path (same M3 Max, 8M-element shapes, input already on GPU):

| Op      | Q4_0 | Q8_0 | Q4_K | Q5_K | Q6_K | IQ4_NL | IQ4_XS |
|---------|------|------|------|------|------|--------|--------|
| dequant | 2.3× | 3.8× | 3.1× | 5.1× | 5.5× | 7.5×   | 5.3×   |
| matmul  | –    | –    | 0.89× (prefill; bf16 GEMM wins ~10%) | – | – | – | – |
| matvec  | –    | –    | 1.27× faster than bf16 matvec at decode | – | – | – | – |

Compute-bound prefill is a small throughput loss vs bf16 GEMM; bandwidth-bound batch-1 decode is a clear win. Memory savings are 3.5× regardless of workload.

Verification

  • Bit-exact: each dequant kernel matches gguf.dequantize to 0 ULP (Q4_0 / Q8_0 / IQ4_NL) or to within fp32 reduction noise (~1e-4 for K-quants); a check along these lines is sketched after this list.
  • Forward correctness: tests/quantization/ggml/test_gguf_linear.py checks GgufLinear forward against dequant_gguf_tensor + nn.functional.linear for Q4_0 and Q4_K, batch sizes 0/1/8.
  • Real model end-to-end: standalone smoke test in ArthurZ/gguf-kernels (tests/test_real_model_e2e.py) loads real Qwen2.5-0.5B Q4_0 with all 169 linears swapped to GgufLinear; top-5 logits match the dequant baseline to fp32 ULP.
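
The bit-exactness claim can be verified roughly like this (a sketch only: `kernel_dequant` stands in for the package's dequant entry point, whose real name and signature aren't shown in this PR):

```python
import numpy as np
import torch
import gguf  # gguf-py


def check_dequant_bit_exact(raw_blocks: np.ndarray, fmt: str,
                            kernel_dequant, shape: tuple):
    """Compare a Metal dequant kernel against gguf-py's reference dequant."""
    qtype = getattr(gguf.GGMLQuantizationType, fmt.upper())  # "q4_0" -> Q4_0
    ref = gguf.dequantize(raw_blocks, qtype).reshape(shape)
    got = kernel_dequant(torch.from_numpy(raw_blocks).to("mps"))
    got = got.float().cpu().numpy().reshape(shape)
    # Q4_0 / Q8_0 / IQ4_NL are exact; K-quants carry fp32 reduction noise.
    atol = 0.0 if fmt in ("q4_0", "q8_0", "iq4_nl") else 1e-4
    np.testing.assert_allclose(got, ref, rtol=0, atol=atol)
```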

Kernels package

ArthurZ/gguf-kernels ships pre-compiled metallibs for torch 2.8 / 2.9 / 2.10 / 2.11 on Apple Silicon. Built via kernel-builder (Nix + Xcode 26 + Metal Toolchain). Source: same MSL kernels as llama.cpp / candle.

Override the repo with TRANSFORMERS_GGUF_METAL_KERNELS_REPO=... (so the personal-namespace location works ahead of any kernels-community transfer).
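
For example (the repo id here is a placeholder; set it before anything resolves the kernels package):

```python
import os

# Must run before the first from_pretrained(..., gguf_linear=True) call.
os.environ["TRANSFORMERS_GGUF_METAL_KERNELS_REPO"] = "my-org/gguf-kernels"
```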

What's out of scope

  • CUDA matmul kernels (candle ships them; same structure as the MPS port — follow-up).
  • K-quant Python quantizer (gguf-py only has Q4_0 / Q8_0; K-quant Linears stay as nn.Linear during the swap. Fix is upstream in gguf-py, not transformers.)
  • Direct byte routing in the loader (today we round-trip through dequant + re-quant on Linear weights; the load-time memory win is left for a follow-up).

Test plan

  • python -m pytest tests/quantization/ggml/test_gguf_linear.py — 4 passed
  • from_pretrained(..., gguf_linear=True) end-to-end (blocked on a separate shape bug in update-gguf's dequant for some configs — surfaces as replace_with_gguf_linear warnings and the layer stays as nn.Linear, so behaviour is safe but the GgufLinear path isn't exercised end-to-end on that config yet)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Opt-in path that keeps GGUF weights at their native quantization after
load and runs the forward pass through the kernels-community Metal
kernels (ArthurZ/gguf-kernels). Same MSL kernels as llama.cpp — at
decode (batch-1, memory-bound) we hit llama.cpp parity on M3 Max
(266 tok/s vs 261, Qwen2.5-0.5B Q4_0).

What lands:

- `integrations/gguf_linear.GgufLinear` — drop-in nn.Linear replacement
  storing raw GGUF block bytes in `qweight`. Forward picks
  `mul_mat_vec_<fmt>_f32` for batch-1 (memory-bound) and
  `mul_mat_<fmt>_f32` for batch>1 (compute-bound). Supported quant
  types: Q4_0, Q8_0, Q4_K, Q5_K, Q6_K, IQ4_NL, IQ4_XS. CPU/CUDA fall
  back to dequant + torch.nn.functional.linear.

- `integrations/gguf_linear.replace_with_gguf_linear(model, qmap)` —
  walks `model.named_modules()` and swaps each `nn.Linear` whose param
  name is in `qmap` for a `GgufLinear`. Re-quantizes the loaded fp32
  weight via gguf-py so the swap is self-contained (sketched after this
  list).

- `quantizers.quantizer_gguf.GGUFQuantizer` — new `linear_mode`,
  `gguf_tensors` kwargs. When `linear_mode=True`, the post-load hook
  walks the GGUF tensor dict, applies the same rename rules the loader
  used, and calls `replace_with_gguf_linear` with the resulting
  `hf_name → quant_type` map.

- `modeling_utils.from_pretrained` — picks up the new `gguf_linear`
  kwarg (or `TRANSFORMERS_GGUF_LINEAR=1` env var) and threads it into
  the quantizer. Default off so existing behaviour is unchanged.

- `tests/quantization/ggml/test_gguf_linear.py` — unit tests for the
  module (forward matches dequant + nn.linear) and the swap helper.
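
A sketch of the swap helper (simplified: `GgufLinearSketch` is the illustrative module from the earlier sketch, and gguf-py's `quantize` only covers Q4_0/Q8_0, matching the K-quant limitation noted elsewhere in this PR):

```python
import gguf  # gguf-py
import torch
import torch.nn as nn


def replace_with_gguf_linear_sketch(model: nn.Module, qmap: dict, kernels=None):
    """qmap maps "path.to.module.weight" -> quant format, e.g. "q4_0"."""
    for name, module in list(model.named_modules()):
        if not isinstance(module, nn.Linear):
            continue
        fmt = qmap.get(f"{name}.weight")
        if fmt is None:
            continue  # not a GGUF-quantized linear, leave it alone
        # Re-quantize the loaded fp32 weight via gguf-py (Q4_0/Q8_0 only).
        qtype = getattr(gguf.GGMLQuantizationType, fmt.upper())
        w = module.weight.detach().cpu().float().numpy()
        qweight = torch.from_numpy(gguf.quantize(w, qtype))  # raw block bytes
        parent = (model.get_submodule(name.rsplit(".", 1)[0])
                  if "." in name else model)
        setattr(parent, name.rsplit(".", 1)[-1],
                GgufLinearSketch(qweight, fmt, module.out_features,
                                 module.in_features, kernels=kernels))
```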

Memory positioning: weights stay at 4.5 bpw (Q4_K) instead of 16 bpw
bf16 — 3.5× less RAM at inference time. This is the inference-side dual
of PR #44794's loader refactor: that PR makes load fast, this PR makes
inference fit in memory.

When swapping nn.Linear → GgufLinear after load, the old path called
gguf.quantize(fp32_weight) which (a) is unimplemented for K-quants and
IQ4 in gguf-py and (b) gives non-bit-exact results vs llama.cpp.

New path: pull the original GGUFQuantizedTensor's raw block bytes off
the gguf_tensors map the quantizer captured at load, and copy them
verbatim into GgufLinear.qweight. Byte-identical to llama.cpp on disk,
works for every quant type, no precision round-trip.
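
In sketch form (assuming, as with gguf-py's GGUFReader, that the reader tensor exposes its raw on-disk bytes via `.data`):

```python
import numpy as np
import torch


def qweight_from_reader_tensor(reader_tensor) -> torch.Tensor:
    """Copy a GGUF reader tensor's raw block bytes into an owning buffer.

    No dequant/re-quant round-trip, so the bytes stay identical to what
    llama.cpp reads from the same file.
    """
    raw = np.ascontiguousarray(reader_tensor.data).view(np.uint8)
    return torch.from_numpy(raw.copy())  # copy: outlive the mmap'd reader
```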

For attn_q / attn_k the GGUF bytes are stored in llama.cpp's permuted
layout, so we still round-trip via fp32 + gguf.quantize (works for
Q4_0/Q8_0; permuted K-quant Linears stay as nn.Linear with the
dequantized weight, since K-quants have no Python re-quantizer).

Validated on Q4_0 (TinyLlama, Qwen2.5-0.5B): 100% Linear → GgufLinear
swap, max|Δ| = 0.0 vs the dequant baseline. On Q4_K_M (Llama-3.2-3B):
~71% swap (140/197), the remaining attn_q/k stay as nn.Linear; logits
still bit-identical because the un-swapped Linears use the same
dequantized fp32 weight either way.

Qwen2-MoE-style models hold all expert weights in a single fused module
(``Qwen2MoeExperts``) with two large fp32 parameters per layer — these
parameters are 90%+ of the model's total memory footprint. ``GgufLinear``
swap doesn't touch them because they aren't ``nn.Linear``.

New ``GgufQwen2MoeExperts`` mirrors the original forward pass but keeps
the gate / up / down expert weights as flat uint8 quantized buffers
(one per projection — three buffers total). Forward iterates over
activated experts and per expert dispatches the matching
``mul_mat_vec_<fmt>_f32`` / ``mul_mat_<fmt>_f32`` Metal kernel against
the right byte slice.

Q4_K_M GGUFs mix quant types per tensor (typically gate/up = Q4_K,
down = Q8_0), so each projection carries its own quant type + format.
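
Per activated expert, the dispatch looks roughly like this (a sketch: the flat-buffer layout and `bytes_per_expert` bookkeeping are assumptions, not the PR's actual code):

```python
import torch


def expert_proj(kernels, flat_qweight: torch.Tensor, expert_idx: int,
                bytes_per_expert: int, quant_fmt: str, x: torch.Tensor,
                out_features: int, in_features: int) -> torch.Tensor:
    """One expert's gate/up/down projection from a fused uint8 buffer.

    Each projection carries its own quant format, e.g. "q4_k" for gate/up
    and "q8_0" for down in a Q4_K_M GGUF.
    """
    start = expert_idx * bytes_per_expert
    qslice = flat_qweight[start:start + bytes_per_expert]
    prefix = "mul_mat_vec_" if x.shape[0] == 1 else "mul_mat_"
    op = getattr(kernels, f"{prefix}{quant_fmt}_f32")
    return op(qslice, x, out_features, in_features)
```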

Wired into ``GGUFQuantizer._process_model_after_weight_loading``: after
the Linear swap, it groups ``ffn_{gate,up,down}_exps.weight`` tensors
by their HF parent path and hands them to ``replace_qwen2_moe_experts``.

Validated on Qwen1.5-MoE-A2.7B Q4_K_M: 12/24 MoE layers swapped (the
other 12 use Q5_0 down_proj which we don't ship a mat/matvec kernel for
yet), forward agrees with the fp32 baseline to within 8e-6 (fp32
accumulator noise).
ArthurZucker force-pushed the gguf-matmul-kernels branch from 56d3847 to cb6ba16 (May 14, 2026 11:30)
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: ggml

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45977&sha=cb6ba1
