Add Nunchaku Lite single-file quantization by rootonchair · Pull Request #14100 · huggingface/diffusers

rootonchair · 2026-07-01T17:08:21Z

What does this PR do?

Adds Nunchaku Lite single-file checkpoint loading for Diffusers models.

This introduces NunchakuLiteQuantizationConfig and a new Nunchaku Lite quantizer that can patch supported nn.Linear modules into runtime SVDQ/AWQ linear layers before strict checkpoint loading. The loader reads safetensors metadata during from_single_file so Nunchaku Lite checkpoints can use their embedded runtime manifest to decide which modules to replace.
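For context on the metadata-reading step described above: a safetensors file begins with an 8-byte little-endian header length followed by a JSON header, and free-form string metadata lives under the `__metadata__` key. A minimal standard-library sketch of reading it is shown here. The `nunchaku_lite_manifest` key and both function names are hypothetical, chosen only for illustration; they are not necessarily what the PR uses.

```python
import json
import struct


def read_safetensors_metadata(path):
    """Return the string-to-string metadata dict stored in a safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Tensor entries sit alongside an optional "__metadata__" entry.
    return header.get("__metadata__", {})


def read_manifest(path, key="nunchaku_lite_manifest"):
    """Decode a JSON manifest stored as a metadata string (hypothetical key name)."""
    raw = read_safetensors_metadata(path).get(key)
    return json.loads(raw) if raw is not None else None
```

Because only the header is read, a check like this does not need to load any tensor data.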

Deprecated API

import torch
from huggingface_hub import hf_hub_download

from diffusers import (
    ErnieImagePipeline,
    ErnieImageTransformer2DModel,
    NunchakuLiteQuantizationConfig,
)

checkpoint = hf_hub_download(
    repo_id="rootonchair/ERNIE-Image-Turbo-nunchaku-lite",
    filename="svdq-int4_r32-ernie-image-turbo-zero-svdq-fix-bias.safetensors",
)

transformer = ErnieImageTransformer2DModel.from_single_file(
    checkpoint,
    config="baidu/ERNIE-Image-Turbo",
    subfolder="transformer",
    quantization_config=NunchakuLiteQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

pipe.to("cuda")

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=512,
    width=512,
    num_inference_steps=8,
    guidance_scale=1.0,
    generator=torch.Generator(device="cuda").manual_seed(1234),
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite.png")

New API for `from_single_file` use

import torch
from huggingface_hub import hf_hub_download

from diffusers import (
    ErnieImagePipeline,
    ErnieImageTransformer2DModel,
    NunchakuLiteQuantizationConfig,
)


dtype = torch.bfloat16
device = "cuda"

checkpoint = hf_hub_download(
    repo_id="rootonchair/ERNIE-Image-Turbo-nunchaku-lite",
    filename="svdq-int4_r32-ernie-image-turbo-zero-svdq-fix-bias.safetensors",
)

svdq_targets = []
for name in [
    "self_attention.to_q",
    "self_attention.to_k",
    "self_attention.to_v",
    "self_attention.to_out.0",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.linear_fc2",
]:
    svdq_targets.extend([f"layers.{i}.{name}" for i in range(36)])

quantization_config = NunchakuLiteQuantizationConfig(
    compute_dtype=dtype,
    svdq_w4a4={
        "precision": "int4",
        "group_size": 64,
        "rank": 32,
        "targets": svdq_targets,
    },
    awq_w4a16={
        "precision": "int4",
        "group_size": 64,
        "targets": [
            "text_proj",
            "time_embedding.linear_1",
            "time_embedding.linear_2",
            "adaLN_modulation.1",
            "final_norm.linear",
            "final_linear",
        ],
    },
)

transformer = ErnieImageTransformer2DModel.from_single_file(
    checkpoint,
    config="baidu/ERNIE-Image-Turbo",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    transformer=transformer,
    torch_dtype=dtype,
)

pipe.to(device)

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=512,
    width=512,
    num_inference_steps=8,
    guidance_scale=1.0,
    generator=torch.Generator(device=device).manual_seed(1234),
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite.png")
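The SVDQ target list in the example above is a cross product of seven projection names and 36 layers. A small sketch of factoring that into a helper follows. `build_layer_targets` is a hypothetical name, not part of Diffusers, and the layer count of 36 is taken from the example rather than read from the model config.

```python
def build_layer_targets(module_names, num_layers, prefix="layers"):
    """Expand per-layer module names into fully qualified target paths."""
    return [
        f"{prefix}.{i}.{name}"
        for name in module_names  # outer loop: same ordering as the example above
        for i in range(num_layers)
    ]


svdq_module_names = [
    "self_attention.to_q",
    "self_attention.to_k",
    "self_attention.to_v",
    "self_attention.to_out.0",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.linear_fc2",
]
svdq_targets = build_layer_targets(svdq_module_names, num_layers=36)
print(len(svdq_targets))  # 7 names x 36 layers = 252
```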

Fixes # (issue)

Before submitting

Did you use an AI agent (Claude Code, Codex, Cursor, etc.) to help with this PR? If so:
- Did you read the Coding with AI agents guide?
- Did you self-review the diff against .ai/review-rules.md?
Did you read the contributor guideline?
Did you read our philosophy doc? (important for complex PRs)
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?
Are you the author (or part of the team) of the model/pipeline (only applicable for model/pipeline related PRs)?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

sayakpaul

Thanks for getting started! Just did a first pass and left high-level reviews.

sayakpaul · 2026-07-02T04:15:28Z

+    def __init__(self, compute_dtype: "torch.dtype" | None = None):
+        self.quant_method = QuantizationMethod.NUNCHAKU_LITE
+        self.compute_dtype = compute_dtype
+        self.pre_quantized = True


Can we also guide the readers on how to obtain the checkpoints?

Also, can we ensure torch.compile compatibility?

The kernels are compatible with torch.compile, as are SVDQLinear and AWQLinear. I will add a test to make sure that compatibility still holds once they are integrated into Diffusers.

> Can we also guide the readers on how to obtain the checkpoints?

I'm a little confused here. Could you provide more context?

> I'm a little confused here. Could you help provide more context

How are the example checkpoints obtained? I think we're only dealing with pre-quantized checkpoints in this PR?

Yes, we are only dealing with pre-quantized checkpoints here. Perhaps we can leave a comment saying that the checkpoints are quantized with diffuse-compressor and then run through the Diffusers format converter?

I think it'd be better off in the docs?

sayakpaul · 2026-07-02T04:16:04Z

@@ -0,0 +1,161 @@
+import json


For tests, WDYT of adding a mixin to https://github.com/huggingface/diffusers/blob/main/tests/models/testing_utils/quantization.py and then extending a popular model like Flux to use that mixin?

Yes, let's do it that way

rootonchair · 2026-07-02T07:04:43Z

I just ran some benchmarks on an RTX PRO 6000. Here is the visual comparison between the bf16 and nvfp4 checkpoints for ERNIE-Image-Turbo:

[Image comparison: BF16 | Nunchaku NVFP4]

| Case | Full mean | Denoise mean | Denoise peak alloc | Full peak alloc | Speedup |
| --- | --- | --- | --- | --- | --- |
| original | 3.003s | 2.862s | 29.429GB | 31.081GB | 1.0x |
| nunchaku_lite NVFP4 | 2.271s | 2.127s | 18.926GB | 20.578GB | 1.35x |
| nunchaku_lite NVFP4 + compile | 1.675s | 1.525s | 18.672GB | 20.578GB | 1.8x |
| nunchaku_lite NVFP4 + bnb text encoder | 2.285s | 2.132s | 14.317GB | 15.969GB | 1.35x |
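As a quick check on the end-to-end table above, the ratios can be recomputed from the reported timings and peak allocations. This is only arithmetic on the posted numbers; the dictionary layout and variable names are illustrative.

```python
# Reported numbers from the table: (full mean s, denoise mean s, denoise peak GB)
results = {
    "original": (3.003, 2.862, 29.429),
    "nunchaku_lite NVFP4": (2.271, 2.127, 18.926),
    "nunchaku_lite NVFP4 + compile": (1.675, 1.525, 18.672),
    "nunchaku_lite NVFP4 + bnb text encoder": (2.285, 2.132, 14.317),
}

base_full, base_denoise, base_peak = results["original"]
summary = {}
for case, (full, denoise, peak) in results.items():
    summary[case] = {
        "full_speedup": base_full / full,
        "denoise_speedup": base_denoise / denoise,
        "denoise_peak_saved": 1 - peak / base_peak,
    }
    s = summary[case]
    print(
        f"{case}: full {s['full_speedup']:.2f}x, "
        f"denoise {s['denoise_speedup']:.2f}x, "
        f"denoise peak memory -{s['denoise_peak_saved']:.1%}"
    )
```

The denoise ratios come out close to the Speedup column, which suggests that column was computed from denoise time rather than full pipeline time. That is an inference from the numbers, not something stated in the thread.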

By replacing the standard linear layers with Nunchaku linear layers, we cut the latency of these large-shape linear operations by more than 2x:

| Target | Op | Rows | Shape | Normal ms | Nunchaku ms | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| `layers.0.self_attention.to_q` | `svdq_w4a4` | 4096 | 4096 -> 4096 | 0.3660 | 0.1563 | 2.34x |
| `layers.0.mlp.gate_proj` | `svdq_w4a4` | 4096 | 4096 -> 12288 | 1.0646 | 0.4272 | 2.49x |
| `layers.0.mlp.linear_fc2` | `svdq_w4a4` | 4096 | 12288 -> 4096 | 1.0269 | 0.4596 | 2.23x |

One note here: AWQ only helps for the adaLN layer, so other modules such as the time embedding or the final linear layer can stay in bf16.

| Target | Op | Rows | Shape | Normal ms | Nunchaku ms | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| `text_proj` | `awq_w4a16` | 1 | 3072 -> 4096 | 0.0111 | 0.0201 | 0.55x |
| `time_embedding.linear_1` | `awq_w4a16` | 1 | 4096 -> 4096 | 0.0129 | 0.0251 | 0.52x |
| `time_embedding.linear_2` | `awq_w4a16` | 1 | 4096 -> 4096 | 0.0129 | 0.0247 | 0.52x |
| `adaLN_modulation.1` | `awq_w4a16` | 1 | 4096 -> 24576 | 0.1389 | 0.0243 | 5.71x |
| `final_norm.linear` | `awq_w4a16` | 1 | 4096 -> 8192 | 0.0220 | 0.0245 | 0.90x |
| `final_linear` | `awq_w4a16` | 4114 | 4096 -> 128 | 0.0248 | 0.0635 | 0.39x |
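The note about AWQ can be turned into a simple selection rule: keep an AWQ target only when the microbenchmark shows a speedup. The sketch below applies that rule to the posted timings. The threshold of 1.0 and the helper name `select_awq_targets` are assumptions for illustration, not part of the PR.

```python
# (target, normal ms, nunchaku ms) copied from the AWQ table
awq_timings = [
    ("text_proj", 0.0111, 0.0201),
    ("time_embedding.linear_1", 0.0129, 0.0251),
    ("time_embedding.linear_2", 0.0129, 0.0247),
    ("adaLN_modulation.1", 0.1389, 0.0243),
    ("final_norm.linear", 0.0220, 0.0245),
    ("final_linear", 0.0248, 0.0635),
]


def select_awq_targets(timings, min_speedup=1.0):
    """Return targets whose quantized kernel is faster than the bf16 baseline."""
    return [
        name
        for name, normal_ms, quant_ms in timings
        if normal_ms / quant_ms > min_speedup
    ]


awq_targets = select_awq_targets(awq_timings)
print(awq_targets)  # ['adaLN_modulation.1']
```

Latency is not the only consideration: quantizing the slower layers would still shrink their weights, so a rule like this trades a little memory for speed.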

rootonchair · 2026-07-02T10:20:55Z

I have just implemented native loading, so a converted repo can now be loaded directly with from_pretrained:

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "rootonchair/ERNIE-Image-Turbo-nunchaku-lite-int4",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=1024,
    width=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite-int4.png")

The quantization config now changes to:

"quantization_config": {
    "awq_w4a16": {
      "group_size": 64,
      "precision": "int4",
      "targets": [
        "text_proj",
        "time_embedding.linear_1",
        "time_embedding.linear_2",
        "adaLN_modulation.1",
        "final_norm.linear",
        "final_linear"
      ]
    },
    "compute_dtype": "bfloat16",
    "quant_method": "nunchaku_lite",
    "svdq_w4a4": {
      "group_size": 16,
      "precision": "fp4",
      "rank": 32,
      "targets": [
        "layers.0.self_attention.to_q",
        "layers.1.self_attention.to_q",
        "layers.2.self_attention.to_q",
        "layers.3.self_attention.to_q",
         ...
      ]
    }
}

If we agree to use this schema, I will remove the old metadata/from_single_file approach
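To make the proposed schema concrete, here is a sketch that flattens such a `quantization_config` into a lookup from module path to runtime op and rejects a module that is listed more than once. The function name and the exact-match semantics are assumptions; the PR may match targets differently (for example by suffix).

```python
def build_target_lookup(quantization_config):
    """Map each target module path to the op section that lists it."""
    lookup = {}
    for op in ("svdq_w4a4", "awq_w4a16"):
        section = quantization_config.get(op) or {}
        for target in section.get("targets", []):
            if target in lookup:
                raise ValueError(f"{target!r} is listed more than once")
            lookup[target] = op
    return lookup


# Trimmed-down config with one target per op, for illustration only.
example_config = {
    "quant_method": "nunchaku_lite",
    "compute_dtype": "bfloat16",
    "svdq_w4a4": {
        "precision": "fp4",
        "group_size": 16,
        "rank": 32,
        "targets": ["layers.0.self_attention.to_q"],
    },
    "awq_w4a16": {
        "precision": "int4",
        "group_size": 64,
        "targets": ["adaLN_modulation.1"],
    },
}
print(build_target_lookup(example_config))
```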

sayakpaul

Looking good. I think we can remove all metadata related code?

sayakpaul · 2026-07-02T13:30:08Z

+    def __init__(self, compute_dtype: "torch.dtype" | None = None):
+        self.quant_method = QuantizationMethod.NUNCHAKU_LITE
+        self.compute_dtype = compute_dtype
+        self.pre_quantized = True


I'm a little confused here. Could you help provide more context

How are the example checkpoints obtained? I think we're only dealing with pre-quantized checkpoints in this PR?

sayakpaul

No rush but let us know once you would like another round of review.

HuggingFaceDocBuilderDev · 2026-07-03T02:24:22Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

rootonchair · 2026-07-03T14:53:23Z

@sayakpaul I think this is ready for the next review

sergereview

🤗 Serge says:

This adds a new nunchaku_lite quantization backend (config, quantizer, runtime linear layers backed by the HF kernels package, docs, and tests). The overall structure follows existing quantizer integrations, but there are a few blocking issues.

Security

_get_ops() in src/diffusers/quantizers/nunchaku/utils.py calls get_kernel(..., trust_remote_code=True) against a personal user repo (rootonchair/nunchaku-lite-kernels). This silently enables remote code execution for anyone who loads a Nunchaku Lite checkpoint, with no user opt-in. No other kernel usage in the repo does this (gguf and attention_dispatch call get_kernel without trust_remote_code). This needs to be removed or made an explicit user decision, and the kernel repo should ideally live under an org namespace with pinned revisions.

Description vs. diff mismatch

The PR description claims "The loader reads safetensors metadata during from_single_file so Nunchaku Lite checkpoints can use their embedded runtime manifest" — but there are no changes to single_file_model.py or any loader in this diff. Correspondingly, the metadata parameter of NunchakuLiteQuantizer._process_model_before_weight_loading is never populated by any call site and is dead code.

Correctness

- NunchakuLiteTesterMixin._test_quantized_layers is tautological: it sets expected_quantized_layers = num_quantized_layers and then asserts they're equal, so the count check can never fail. It should compare against the number of targets in the quantization config (like the base mixin compares against linear-layer count).
- The docs and the NunchakuLiteQuantizationConfig docstring repeatedly reference model.json, but Diffusers model configs (and the test in this PR) use config.json. As written, users following the doc will put the quantization config in a file Diffusers never reads.

Style

- nunchaku_quantizer.py has trailing whitespace (line 64); make style was apparently not run.
- New source files are missing the Apache license header used across src/diffusers.
- NunchakuLiteQuantizationConfig is appended to _import_structure via a standalone statement instead of being listed in the dict literal like every other unconditional export.
- import itertools is buried inside check_strict_state_dict_match; move it to module top level.
- SVDQW4A4Linear re-validates precision/group_size that NunchakuLiteQuantizationConfig.post_init already enforces. Per repo guidelines, drop the duplicated defensive checks (the forward path hard-codes group sizes 16/64 anyway, so the group_size argument is effectively unused at runtime).

serge v0.1.0 · model: claude-fable-5 · 10 LLM turns · 22 tool calls · 390.5s · 437374 in / 30106 out tokens

sergereview · 2026-07-03T16:08:46Z

+    if _ops is None:
+        from kernels import get_kernel
+
+        _ops = get_kernel(_HF_KERNEL_REPO, version=_HF_KERNEL_VERSION, trust_remote_code=True).ops


Security: trust_remote_code=True silently enables execution of arbitrary remote code from a personal user repo whenever a Nunchaku Lite checkpoint is loaded — the user never opts in. Neither of the existing get_kernel call sites in this repo (quantizers/gguf/utils.py, models/attention_dispatch.py) passes trust_remote_code. Please drop it (publish the kernel as a standard prebuilt kernels repo that doesn't require remote code), or at minimum surface this as an explicit user-facing opt-in. Hosting under a personal namespace (rootonchair/...) rather than an org also makes this a supply-chain risk for everyone using the backend.
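One way to surface this as an explicit user decision is to gate the kernel load behind an opt-in flag that defaults to off. The sketch below is not the Diffusers or `kernels` API: `load_kernel_ops`, its `fetch` callback, and the `DIFFUSERS_NUNCHAKU_TRUST_REMOTE_CODE` environment variable are hypothetical names used only to illustrate the control flow.

```python
import os


def load_kernel_ops(fetch, repo_id, revision, trust_remote_code=None):
    """Call `fetch` only after the user has explicitly opted in.

    `fetch` stands in for whatever actually downloads the kernel; it is a
    placeholder so this sketch does not depend on any real library call.
    """
    if trust_remote_code is None:
        # Hypothetical environment variable; unset means "not trusted".
        trust_remote_code = (
            os.environ.get("DIFFUSERS_NUNCHAKU_TRUST_REMOTE_CODE") == "1"
        )
    if not trust_remote_code:
        raise ValueError(
            f"Loading kernels from {repo_id!r} executes remote code. "
            "Pass trust_remote_code=True to opt in."
        )
    # A pinned revision keeps the loaded code from changing underneath the user.
    return fetch(repo_id, revision)
```

With a gate like this, loading a checkpoint fails loudly by default instead of silently running code from the kernel repo.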

sergereview · 2026-07-03T16:08:46Z

+        self,
+        model: "ModelMixin",
+        state_dict: dict[str, Any] | None = None,
+        metadata: dict[str, str] | None = None,


metadata is never passed by any caller — neither modeling_utils.py nor single_file_model.py forwards safetensors metadata to preprocess_model. The PR description claims the loader "reads safetensors metadata during from_single_file", but no loader changes exist in this diff. Per the repo guidelines (no unused parameters "for API consistency"), remove this parameter.

sergereview · 2026-07-03T16:08:46Z

+                self._verify_if_layer_quantized(name, module, config_kwargs)
+                num_quantized_layers += 1
+
+        expected_quantized_layers = num_quantized_layers


This assertion is tautological: expected_quantized_layers is set to num_quantized_layers, so the num_quantized_layers == expected_quantized_layers check below can never fail, and num_fp32_modules is always 0. The only effective check left is num_quantized_layers > 0. Compute the expected count from the config instead, e.g. the total number of entries in svdq_w4a4["targets"] + awq_w4a16["targets"], so the test actually verifies all targets were replaced.
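A sketch of the suggested check, assuming each entry in `targets` names exactly one module. The helper names are illustrative and this is not code from the PR.

```python
def expected_quantized_layers(config_kwargs):
    """Count the modules the config asks to replace."""
    return sum(
        len((config_kwargs.get(op) or {}).get("targets", []))
        for op in ("svdq_w4a4", "awq_w4a16")
    )


def check_all_targets_replaced(quantized_module_names, config_kwargs):
    """Fail if the number of replaced modules differs from the target count."""
    expected = expected_quantized_layers(config_kwargs)
    actual = len(quantized_module_names)
    assert actual == expected, (
        f"expected {expected} quantized layers, found {actual}"
    )
```

Unlike the tautological version, this fails when a configured target is silently left as a regular linear layer.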

sergereview · 2026-07-03T16:08:46Z

+    The exported state dict must match the target Diffusers model architecture exactly. Checkpoints quantized with
+    fused QKV projections won't load into a model config that expects separate Q, K, and V projection modules.
+
+    Example compact `model.json` config:


Same as the doc page: Diffusers reads the model config from config.json, not model.json. Please fix the filename here and in the "quantization_config stored in model.json" sentence above so users don't package their checkpoints incorrectly.
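For illustration, a sketch of placing the quantization settings where Diffusers reads them, in the model folder's `config.json`. The helper name and the folder layout are assumptions; only the `config.json` filename and the `quantization_config` key come from the discussion.

```python
import json
from pathlib import Path


def write_quantization_config(model_dir, quantization_config):
    """Merge `quantization_config` into <model_dir>/config.json and return the path."""
    config_path = Path(model_dir) / "config.json"
    # Preserve any existing model config keys; start empty if the file is absent.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config["quantization_config"] = quantization_config
    config_path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return config_path
```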

sergereview · 2026-07-03T16:08:46Z

    ],
 }

+_import_structure["quantizers.quantization_config"].append("NunchakuLiteQuantizationConfig")


Since this export is unconditional, add "NunchakuLiteQuantizationConfig" directly to the "quantizers.quantization_config" list in the _import_structure dict literal above instead of appending via a standalone statement — that matches how every other unconditional export is declared. Note also that NunchakuLiteQuantizationConfig.__init__ references torch unconditionally, while all other quantization configs exported here are gated behind is_torch_available(); consider whether this one needs the same guard.

sergereview · 2026-07-03T16:08:46Z

+        if device is None:
+            device = torch.device("cpu")
+
+        if precision not in {"int4", "nvfp4"}:


These precision/group_size checks duplicate validation that NunchakuLiteQuantizationConfig.post_init already enforces (and post_init is stricter: it pins group_size to 16 for fp4 / 64 for int4, while this accepts any positive value that forward then ignores — the activation-scale layout hard-codes 16/64). Per the repo's no-defensive-code guideline, drop the re-validation here and rely on the config.
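A sketch of keeping this validation in one place, on the config. The dataclass below is a simplified stand-in, not `NunchakuLiteQuantizationConfig` itself. The precision names follow the config example posted earlier in the thread (`fp4`, `int4`), and the pinned group sizes follow the comment above.

```python
from dataclasses import dataclass

# Group size required for each precision, per the review comment.
_REQUIRED_GROUP_SIZE = {"fp4": 16, "int4": 64}


@dataclass
class SVDQSettings:
    """Hypothetical stand-in for one op section of the quantization config."""

    precision: str
    group_size: int
    rank: int = 32

    def __post_init__(self):
        if self.precision not in _REQUIRED_GROUP_SIZE:
            raise ValueError(f"Unsupported precision: {self.precision!r}")
        required = _REQUIRED_GROUP_SIZE[self.precision]
        if self.group_size != required:
            raise ValueError(
                f"precision={self.precision!r} requires group_size={required}, "
                f"got {self.group_size}"
            )
```

With the config as the single source of truth, the runtime layer can take already-validated settings and skip its own checks.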

Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>

Add Nunchaku Lite single-file quantization

7f4a3a0

github-actions Bot added size/L PR with diff > 200 LOC quantization tests single-file and removed size/L PR with diff > 200 LOC labels Jul 1, 2026

rootonchair marked this pull request as draft July 1, 2026 17:13

sayakpaul reviewed Jul 2, 2026

Support config-backed Nunchaku Lite loading

1a66ac9

github-actions Bot added the size/L PR with diff > 200 LOC label Jul 2, 2026

sayakpaul reviewed Jul 2, 2026

Comment thread src/diffusers/quantizers/quantization_config.py

rootonchair added 3 commits July 2, 2026 15:53

Remove Nunchaku runtime manifest metadata loading

36b4fc4

Simplify Nunchaku compact config loading

db05e0b

Add Nunchaku Lite quantization tests

5d4822a

sayakpaul reviewed Jul 3, 2026

Document Nunchaku Lite checkpoint loading

c630a88

github-actions Bot added documentation Improvements or additions to documentation and removed single-file labels Jul 3, 2026

Refine Nunchaku Lite quantization docs

ab246af

sayakpaul requested review from SunMarc and asomoza July 3, 2026 15:03

sergereview Bot requested changes Jul 3, 2026

rootonchair and others added 3 commits July 3, 2026 16:11

Remove unused Nunchaku smooth factor original weights

57323cd

Update docs/source/en/quantization/nunchaku.md

5fef94f

Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>

Update src/diffusers/quantizers/nunchaku/utils.py

fb33430

Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>

rootonchair and others added 2 commits July 3, 2026 23:14

Update src/diffusers/quantizers/nunchaku/nunchaku_quantizer.py

8e14847

Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>

Address Nunchaku Lite review feedback

f5c0179

github-actions Bot added the utils label Jul 3, 2026
