Add Nunchaku Lite single-file quantization#14100

Draft
rootonchair wants to merge 12 commits into
huggingface:mainfrom
rootonchair:feature/nunchaku-lite-single-file

Conversation

@rootonchair rootonchair commented Jul 1, 2026

Contributor

What does this PR do?

Adds Nunchaku Lite single-file checkpoint loading for Diffusers models.

This introduces NunchakuLiteQuantizationConfig and a new Nunchaku Lite quantizer that can patch supported nn.Linear modules into runtime SVDQ/AWQ linear layers before strict checkpoint loading. The loader reads safetensors metadata during from_single_file so Nunchaku Lite checkpoints can use their embedded runtime manifest to decide which modules to replace.
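As a rough illustration of the metadata step described above, the sketch below reads the free-form `__metadata__` block from a safetensors header using only the standard library. The file layout (an 8-byte little-endian header length followed by a JSON header) is the published safetensors format; the metadata key `nunchaku_lite_manifest` and the manifest shape are assumptions made for this example, not names confirmed by the PR.

```python
import json
import os
import struct
import tempfile


def read_safetensors_metadata(path):
    """Return the free-form `__metadata__` dict from a safetensors header."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length,
        # followed by that many bytes of UTF-8 encoded JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})


def write_demo_file(path, metadata):
    """Demo helper: write a header-only file in the same layout (no tensors)."""
    header = json.dumps({"__metadata__": metadata}).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header)))
        f.write(header)


# Hypothetical manifest; metadata values must be strings, so it is stored as JSON text.
manifest = {"svdq_w4a4": {"targets": ["layers.0.self_attention.to_q"]}}

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    write_demo_file(demo_path, {"nunchaku_lite_manifest": json.dumps(manifest)})
    metadata = read_safetensors_metadata(demo_path)

loaded_manifest = json.loads(metadata["nunchaku_lite_manifest"])
print(loaded_manifest["svdq_w4a4"]["targets"])
```

Reading only the header keeps this step cheap, since no tensor data has to be loaded before deciding which modules to replace.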

Deprecated API

import torch
from huggingface_hub import hf_hub_download

from diffusers import (
    ErnieImagePipeline,
    ErnieImageTransformer2DModel,
    NunchakuLiteQuantizationConfig,
)

checkpoint = hf_hub_download(
    repo_id="rootonchair/ERNIE-Image-Turbo-nunchaku-lite",
    filename="svdq-int4_r32-ernie-image-turbo-zero-svdq-fix-bias.safetensors",
)

transformer = ErnieImageTransformer2DModel.from_single_file(
    checkpoint,
    config="baidu/ERNIE-Image-Turbo",
    subfolder="transformer",
    quantization_config=NunchakuLiteQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

pipe.to("cuda")

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=512,
    width=512,
    num_inference_steps=8,
    guidance_scale=1.0,
    generator=torch.Generator(device="cuda").manual_seed(1234),
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite.png")

New API for from_single_file use

import torch
from huggingface_hub import hf_hub_download

from diffusers import (
    ErnieImagePipeline,
    ErnieImageTransformer2DModel,
    NunchakuLiteQuantizationConfig,
)


dtype = torch.bfloat16
device = "cuda"

checkpoint = hf_hub_download(
    repo_id="rootonchair/ERNIE-Image-Turbo-nunchaku-lite",
    filename="svdq-int4_r32-ernie-image-turbo-zero-svdq-fix-bias.safetensors",
)

svdq_targets = []
for name in [
    "self_attention.to_q",
    "self_attention.to_k",
    "self_attention.to_v",
    "self_attention.to_out.0",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.linear_fc2",
]:
    svdq_targets.extend([f"layers.{i}.{name}" for i in range(36)])

quantization_config = NunchakuLiteQuantizationConfig(
    compute_dtype=dtype,
    svdq_w4a4={
        "precision": "int4",
        "group_size": 64,
        "rank": 32,
        "targets": svdq_targets,
    },
    awq_w4a16={
        "precision": "int4",
        "group_size": 64,
        "targets": [
            "text_proj",
            "time_embedding.linear_1",
            "time_embedding.linear_2",
            "adaLN_modulation.1",
            "final_norm.linear",
            "final_linear",
        ],
    },
)

transformer = ErnieImageTransformer2DModel.from_single_file(
    checkpoint,
    config="baidu/ERNIE-Image-Turbo",
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=dtype,
)

pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    transformer=transformer,
    torch_dtype=dtype,
)

pipe.to(device)

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=512,
    width=512,
    num_inference_steps=8,
    guidance_scale=1.0,
    generator=torch.Generator(device=device).manual_seed(1234),
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite.png")
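The target list in the snippet above is built by expanding seven per-layer suffixes across 36 layers. A small helper makes that expansion explicit and easy to check; `expand_targets` is a hypothetical name used only for this sketch, and the layer count of 36 is taken from the snippet rather than from the model config.

```python
def expand_targets(suffixes, num_layers):
    """Expand per-layer module suffixes into fully qualified target names."""
    # Same ordering as the loop in the PR description: suffix outer, layer inner.
    return [f"layers.{i}.{suffix}" for suffix in suffixes for i in range(num_layers)]


svdq_suffixes = [
    "self_attention.to_q",
    "self_attention.to_k",
    "self_attention.to_v",
    "self_attention.to_out.0",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.linear_fc2",
]

svdq_targets = expand_targets(svdq_suffixes, num_layers=36)
print(len(svdq_targets))  # 7 suffixes * 36 layers
```

Counting the expanded names is a quick sanity check before passing them to the quantization config.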

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@github-actions github-actions Bot added size/L PR with diff > 200 LOC quantization tests single-file and removed size/L PR with diff > 200 LOC labels Jul 1, 2026
@rootonchair rootonchair marked this pull request as draft July 1, 2026 17:13

@sayakpaul sayakpaul left a comment

Member

Thanks for getting started! Just did a first pass and left high-level reviews.

Comment thread src/diffusers/quantizers/nunchaku/nunchaku_quantizer.py Outdated
Comment thread src/diffusers/quantizers/nunchaku/nunchaku_quantizer.py Outdated
def __init__(self, compute_dtype: "torch.dtype" | None = None):
    self.quant_method = QuantizationMethod.NUNCHAKU_LITE
    self.compute_dtype = compute_dtype
    self.pre_quantized = True

Member

Can we also guide the readers on how to obtain the checkpoints?

Also, can we ensure torch.compile compatibility?

Contributor Author

The kernels are compatible with torch.compile, as are SVDQLinear and AWQLinear. I will add a test to ensure that this compatibility is preserved once they are integrated into Diffusers.

> Can we also guide the readers on how to obtain the checkpoints?

I'm a little confused here. Could you help provide more context?

Member

> I'm a little confused here. Could you help provide more context

How are the example checkpoints obtained? I think we're only dealing with pre-quantized checkpoints in this PR?

Contributor Author

Yes, we are only dealing with pre-quantized checkpoints here. Perhaps we can leave a comment saying that the checkpoints are quantized with diffuse-compressor and then run through the Diffusers format converter?

Member

I think it'd be better off in the docs?

@@ -0,0 +1,161 @@
import json

Member

For tests, WDYT of adding a mixin to https://github.com/huggingface/diffusers/blob/main/tests/models/testing_utils/quantization.py and then extending a popular model like Flux to use that mixin?

Contributor Author

Yes, let's do it that way

rootonchair commented Jul 2, 2026

Contributor Author

I just ran some benchmarks on an RTX PRO 6000. Here is a visual comparison between the BF16 and NVFP4 checkpoints for ERNIE-Image-Turbo:

| BF16 | Nunchaku NVFP4 |
| --- | --- |
| (image) | (image) |

| Case | Full mean | Denoise mean | Denoise peak alloc | Full peak alloc | Speedup |
| --- | --- | --- | --- | --- | --- |
| original | 3.003s | 2.862s | 29.429GB | 31.081GB | 1.0x |
| nunchaku_lite NVFP4 | 2.271s | 2.127s | 18.926GB | 20.578GB | 1.35x |
| nunchaku_lite NVFP4 + compile | 1.675s | 1.525s | 18.672GB | 20.578GB | 1.8x |
| nunchaku_lite NVFP4 + bnb text encoder | 2.285s | 2.132s | 14.317GB | 15.969GB | 1.35x |
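
The comment does not say whether the Speedup column is derived from the full or the denoise timings, so the sketch below simply recomputes both ratios from the reported means. The timings are copied from the table; the helper name `speedups` is illustrative only.

```python
# Mean timings in seconds, copied from the benchmark table.
timings = {
    "original": {"full": 3.003, "denoise": 2.862},
    "nunchaku_lite NVFP4": {"full": 2.271, "denoise": 2.127},
    "nunchaku_lite NVFP4 + compile": {"full": 1.675, "denoise": 1.525},
    "nunchaku_lite NVFP4 + bnb text encoder": {"full": 2.285, "denoise": 2.132},
}


def speedups(timings, baseline="original"):
    """Return baseline_time / case_time for every case and phase, rounded to 2 places."""
    base = timings[baseline]
    return {
        case: {phase: round(base[phase] / seconds, 2) for phase, seconds in phases.items()}
        for case, phases in timings.items()
    }


ratios = speedups(timings)
for case, ratio in ratios.items():
    print(f"{case}: full {ratio['full']}x, denoise {ratio['denoise']}x")
```

The same pattern applies to the peak-allocation columns if a memory ratio is wanted instead of a latency ratio.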

By swapping in the Nunchaku linear layers, we cut the latency of these linear operations by more than 2x for large shapes:

| Target | Op | Rows | Shape | Normal ms | Nunchaku ms | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| layers.0.self_attention.to_q | svdq_w4a4 | 4096 | 4096 -> 4096 | 0.3660 | 0.1563 | 2.34x |
| layers.0.mlp.gate_proj | svdq_w4a4 | 4096 | 4096 -> 12288 | 1.0646 | 0.4272 | 2.49x |
| layers.0.mlp.linear_fc2 | svdq_w4a4 | 4096 | 12288 -> 4096 | 1.0269 | 0.4596 | 2.23x |

One note: AWQ only benefits the adaLN layer, so other modules such as the time embedding or the final linear can stay in BF16:

| Target | Op | Rows | Shape | Normal ms | Nunchaku ms | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| text_proj | awq_w4a16 | 1 | 3072 -> 4096 | 0.0111 | 0.0201 | 0.55x |
| time_embedding.linear_1 | awq_w4a16 | 1 | 4096 -> 4096 | 0.0129 | 0.0251 | 0.52x |
| time_embedding.linear_2 | awq_w4a16 | 1 | 4096 -> 4096 | 0.0129 | 0.0247 | 0.52x |
| adaLN_modulation.1 | awq_w4a16 | 1 | 4096 -> 24576 | 0.1389 | 0.0243 | 5.71x |
| final_norm.linear | awq_w4a16 | 1 | 4096 -> 8192 | 0.0220 | 0.0245 | 0.90x |
| final_linear | awq_w4a16 | 4114 | 4096 -> 128 | 0.0248 | 0.0635 | 0.39x |
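
One way to act on these measurements is to keep only the targets whose quantized kernel is actually faster than the BF16 baseline. The sketch below applies that filter to the numbers from the table; the helper `worth_quantizing` and the 1.0 threshold are assumptions for illustration, not part of the PR.

```python
# (target, normal_ms, nunchaku_ms) rows copied from the AWQ table.
awq_results = [
    ("text_proj", 0.0111, 0.0201),
    ("time_embedding.linear_1", 0.0129, 0.0251),
    ("time_embedding.linear_2", 0.0129, 0.0247),
    ("adaLN_modulation.1", 0.1389, 0.0243),
    ("final_norm.linear", 0.0220, 0.0245),
    ("final_linear", 0.0248, 0.0635),
]


def worth_quantizing(results, min_speedup=1.0):
    """Keep targets whose measured speedup (normal / nunchaku) exceeds the threshold."""
    return [
        name
        for name, normal_ms, nunchaku_ms in results
        if normal_ms / nunchaku_ms > min_speedup
    ]


awq_targets = worth_quantizing(awq_results)
print(awq_targets)
```

With these measurements only the adaLN modulation layer passes the filter, which matches the note above.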

@github-actions github-actions Bot added the size/L PR with diff > 200 LOC label Jul 2, 2026
@rootonchair

Contributor Author

I have just implemented native loading, so a converted repo can now be loaded directly with from_pretrained:

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "rootonchair/ERNIE-Image-Turbo-nunchaku-lite-int4",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
    height=1024,
    width=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]

image.save("ernie-image-turbo-nunchaku-lite-int4.png")

The quantization config now changes to:

"quantization_config": {
    "awq_w4a16": {
      "group_size": 64,
      "precision": "int4",
      "targets": [
        "text_proj",
        "time_embedding.linear_1",
        "time_embedding.linear_2",
        "adaLN_modulation.1",
        "final_norm.linear",
        "final_linear"
      ]
    },
    "compute_dtype": "bfloat16",
    "quant_method": "nunchaku_lite",
    "svdq_w4a4": {
      "group_size": 16,
      "precision": "fp4",
      "rank": 32,
      "targets": [
        "layers.0.self_attention.to_q",
        "layers.1.self_attention.to_q",
        "layers.2.self_attention.to_q",
        "layers.3.self_attention.to_q",
         ...
      ]
    }
}

If we agree to use this schema, I will remove the old metadata/from_single_file approach

@sayakpaul sayakpaul left a comment

Member

Looking good. I think we can remove all metadata-related code?

Comment thread src/diffusers/quantizers/nunchaku/nunchaku_quantizer.py
Comment thread src/diffusers/quantizers/quantization_config.py
Comment thread src/diffusers/quantizers/quantization_config.py

@sayakpaul sayakpaul left a comment

Member

No rush but let us know once you would like another round of review.

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@github-actions github-actions Bot added documentation Improvements or additions to documentation and removed single-file labels Jul 3, 2026
@rootonchair

Contributor Author

@sayakpaul I think this is ready for the next review

@sayakpaul sayakpaul requested review from SunMarc and asomoza July 3, 2026 15:03

@sergereview sergereview Bot left a comment

Contributor

🤗 Serge says:

This adds a new nunchaku_lite quantization backend (config, quantizer, runtime linear layers backed by the HF kernels package, docs, and tests). The overall structure follows existing quantizer integrations, but there are a few blocking issues.

Security

  • _get_ops() in src/diffusers/quantizers/nunchaku/utils.py calls get_kernel(..., trust_remote_code=True) against a personal user repo (rootonchair/nunchaku-lite-kernels). This silently enables remote code execution for anyone who loads a Nunchaku Lite checkpoint, with no user opt-in. No other kernel usage in the repo does this (gguf and attention_dispatch call get_kernel without trust_remote_code). This needs to be removed or made an explicit user decision, and the kernel repo should ideally live under an org namespace with pinned revisions.

Description vs. diff mismatch

  • The PR description claims "The loader reads safetensors metadata during from_single_file so Nunchaku Lite checkpoints can use their embedded runtime manifest" — but there are no changes to single_file_model.py or any loader in this diff. Correspondingly, the metadata parameter of NunchakuLiteQuantizer._process_model_before_weight_loading is never populated by any call site and is dead code.

Correctness

  • NunchakuLiteTesterMixin._test_quantized_layers is tautological: it sets expected_quantized_layers = num_quantized_layers and then asserts they're equal, so the count check can never fail. It should compare against the number of targets in the quantization config (like the base mixin compares against linear-layer count).
  • The docs and the NunchakuLiteQuantizationConfig docstring repeatedly reference model.json, but Diffusers model configs (and the test in this PR) use config.json. As written, users following the doc will put the quantization config in a file Diffusers never reads.

Style

  • nunchaku_quantizer.py has trailing whitespace (line 64) — make style was apparently not run.
  • New source files are missing the Apache license header used across src/diffusers.
  • NunchakuLiteQuantizationConfig is appended to _import_structure via a standalone statement instead of being listed in the dict literal like every other unconditional export.
  • import itertools is buried inside check_strict_state_dict_match; move it to module top level.
  • SVDQW4A4Linear re-validates precision/group_size that NunchakuLiteQuantizationConfig.post_init already enforces — per repo guidelines, drop the duplicated defensive checks (the forward path hard-codes group sizes 16/64 anyway, so the group_size argument is effectively unused at runtime).

serge v0.1.0 · model: claude-fable-5 · 10 LLM turns · 22 tool calls · 390.5s · 437374 in / 30106 out tokens

if _ops is None:
    from kernels import get_kernel

    _ops = get_kernel(_HF_KERNEL_REPO, version=_HF_KERNEL_VERSION, trust_remote_code=True).ops

Contributor

Security: trust_remote_code=True silently enables execution of arbitrary remote code from a personal user repo whenever a Nunchaku Lite checkpoint is loaded — the user never opts in. Neither of the existing get_kernel call sites in this repo (quantizers/gguf/utils.py, models/attention_dispatch.py) passes trust_remote_code. Please drop it (publish the kernel as a standard prebuilt kernels repo that doesn't require remote code), or at minimum surface this as an explicit user-facing opt-in. Hosting under a personal namespace (rootonchair/...) rather than an org also makes this a supply-chain risk for everyone using the backend.
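
A minimal sketch of the "explicit opt-in" option mentioned above, assuming the loader is wrapped in a helper that refuses to proceed unless the caller passes a flag. Everything here is hypothetical: `load_kernel_ops`, `RemoteCodeNotAllowed`, and the stand-in `fetch_kernel` callable are illustration names, and the real `kernels` package is deliberately not called so that no signature is assumed.

```python
class RemoteCodeNotAllowed(RuntimeError):
    """Raised when a kernel needs remote code but the user has not opted in."""


def load_kernel_ops(fetch_kernel, repo_id, *, requires_remote_code, trust_remote_code=False):
    """Fetch kernel ops, refusing remote code unless the caller opted in explicitly."""
    if requires_remote_code and not trust_remote_code:
        raise RemoteCodeNotAllowed(
            f"Loading kernels from {repo_id!r} executes remote code; "
            "pass trust_remote_code=True to allow it."
        )
    return fetch_kernel(repo_id)


def fake_fetch(repo_id):
    """Stand-in for the real kernel download, used only for this demo."""
    return f"ops from {repo_id}"


try:
    load_kernel_ops(fake_fetch, "rootonchair/nunchaku-lite-kernels", requires_remote_code=True)
    blocked = False
except RemoteCodeNotAllowed:
    blocked = True

ops = load_kernel_ops(
    fake_fetch,
    "rootonchair/nunchaku-lite-kernels",
    requires_remote_code=True,
    trust_remote_code=True,
)
print(blocked, ops)
```

Defaulting the flag to False means a user who only loads a checkpoint never runs remote code by accident; publishing a prebuilt kernel that needs no remote code would remove the flag entirely.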

self,
model: "ModelMixin",
state_dict: dict[str, Any] | None = None,
metadata: dict[str, str] | None = None,

Contributor

metadata is never passed by any caller — neither modeling_utils.py nor single_file_model.py forwards safetensors metadata to preprocess_model. The PR description claims the loader "reads safetensors metadata during from_single_file", but no loader changes exist in this diff. Per the repo guidelines (no unused parameters "for API consistency"), remove this parameter.

Comment thread src/diffusers/quantizers/nunchaku/nunchaku_quantizer.py Outdated
self._verify_if_layer_quantized(name, module, config_kwargs)
num_quantized_layers += 1

expected_quantized_layers = num_quantized_layers

Contributor

This assertion is tautological: expected_quantized_layers is set to num_quantized_layers, so the num_quantized_layers == expected_quantized_layers check below can never fail, and num_fp32_modules is always 0. The only effective check left is num_quantized_layers > 0. Compute the expected count from the config instead, e.g. the total number of entries in svdq_w4a4["targets"] + awq_w4a16["targets"], so the test actually verifies all targets were replaced.
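
The suggested fix can be sketched as follows: derive the expected count from the config's target lists rather than from the count that was just measured. The config is treated as a plain dict here, and `expected_quantized_layer_count` is a hypothetical helper name; the real mixin may expose the config differently.

```python
def expected_quantized_layer_count(quantization_config):
    """Total number of targets listed under svdq_w4a4 and awq_w4a16."""
    return sum(
        len(quantization_config.get(section, {}).get("targets", []))
        for section in ("svdq_w4a4", "awq_w4a16")
    )


config_kwargs = {
    "svdq_w4a4": {"targets": ["layers.0.self_attention.to_q", "layers.0.mlp.gate_proj"]},
    "awq_w4a16": {"targets": ["adaLN_modulation.1"]},
}

# Pretend the model walk found only two replaced layers: one target was missed.
num_quantized_layers = 2
expected_quantized_layers = expected_quantized_layer_count(config_kwargs)

# Unlike the tautological version, this comparison can actually fail.
all_targets_replaced = num_quantized_layers == expected_quantized_layers
print(expected_quantized_layers, all_targets_replaced)
```

Because the expected value comes from an independent source, a target that silently fails to be replaced now shows up as a mismatch.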

Comment thread docs/source/en/quantization/nunchaku.md Outdated
The exported state dict must match the target Diffusers model architecture exactly. Checkpoints quantized with
fused QKV projections won't load into a model config that expects separate Q, K, and V projection modules.

Example compact `model.json` config:

Contributor

Same as the doc page: Diffusers reads the model config from config.json, not model.json. Please fix the filename here and in the "quantization_config stored in model.json" sentence above so users don't package their checkpoints incorrectly.
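
To make the packaging point concrete, here is a small sketch that writes a `quantization_config` block into a model folder's `config.json`. The helper `write_model_config` and the trimmed-down config contents are assumptions for illustration; a real checkpoint would carry the full model config.

```python
import json
import os
import tempfile


def write_model_config(folder, base_config, quantization_config):
    """Write config.json with the quantization config nested under its own key."""
    config = dict(base_config)
    config["quantization_config"] = quantization_config
    path = os.path.join(folder, "config.json")  # not model.json
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return path


quantization_config = {
    "quant_method": "nunchaku_lite",
    "compute_dtype": "bfloat16",
    "awq_w4a16": {"precision": "int4", "group_size": 64, "targets": ["adaLN_modulation.1"]},
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = write_model_config(
        tmp, {"_class_name": "ErnieImageTransformer2DModel"}, quantization_config
    )
    written_name = os.path.basename(config_path)
    with open(config_path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(written_name, reloaded["quantization_config"]["quant_method"])
```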

Comment thread src/diffusers/__init__.py Outdated
],
}

_import_structure["quantizers.quantization_config"].append("NunchakuLiteQuantizationConfig")

Contributor

Since this export is unconditional, add "NunchakuLiteQuantizationConfig" directly to the "quantizers.quantization_config" list in the _import_structure dict literal above instead of appending via a standalone statement — that matches how every other unconditional export is declared. Note also that NunchakuLiteQuantizationConfig.__init__ references torch unconditionally, while all other quantization configs exported here are gated behind is_torch_available(); consider whether this one needs the same guard.

Comment thread src/diffusers/quantizers/nunchaku/utils.py Outdated
if device is None:
    device = torch.device("cpu")

if precision not in {"int4", "nvfp4"}:

Contributor

These precision/group_size checks duplicate validation that NunchakuLiteQuantizationConfig.post_init already enforces (and post_init is stricter: it pins group_size to 16 for fp4 / 64 for int4, while this accepts any positive value that forward then ignores — the activation-scale layout hard-codes 16/64). Per the repo's no-defensive-code guideline, drop the re-validation here and rely on the config.
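
The rule described in this comment (group size pinned to 16 for fp4 and 64 for int4) can be expressed once, at config level, roughly as below. This is a sketch of the idea and not the PR's `post_init`; the function name, the dict-based sections, and the precision spellings `fp4` and `int4` are assumptions taken from the comment and the JSON example.

```python
# Group size each precision is pinned to, as described in the review comment.
EXPECTED_GROUP_SIZE = {"fp4": 16, "int4": 64}


def validate_quant_section(name, section):
    """Check one config section (e.g. svdq_w4a4) against the pinned group sizes."""
    precision = section["precision"]
    if precision not in EXPECTED_GROUP_SIZE:
        raise ValueError(f"{name}: unsupported precision {precision!r}")
    expected = EXPECTED_GROUP_SIZE[precision]
    if section["group_size"] != expected:
        raise ValueError(
            f"{name}: group_size must be {expected} for {precision}, "
            f"got {section['group_size']}"
        )


validate_quant_section("svdq_w4a4", {"precision": "fp4", "group_size": 16})
validate_quant_section("awq_w4a16", {"precision": "int4", "group_size": 64})

try:
    validate_quant_section("svdq_w4a4", {"precision": "fp4", "group_size": 64})
    rejected = False
except ValueError:
    rejected = True

print(rejected)
```

Validating once in the config means the runtime layers can assume well-formed values and skip their own checks.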

rootonchair and others added 3 commits July 3, 2026 16:11
Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>
Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>
rootonchair and others added 2 commits July 3, 2026 23:14
Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>
@github-actions github-actions Bot added the utils label Jul 3, 2026
Labels

documentation Improvements or additions to documentation quantization size/L PR with diff > 200 LOC tests utils

3 participants