Add Nunchaku Lite single-file quantization #14100
sayakpaul
left a comment
Thanks for getting started! Just did a first pass and left high-level reviews.
| def __init__(self, compute_dtype: "torch.dtype" | None = None): | ||
| self.quant_method = QuantizationMethod.NUNCHAKU_LITE | ||
| self.compute_dtype = compute_dtype | ||
| self.pre_quantized = True |
Can we also guide the readers on how to obtain the checkpoints?
Also, can we ensure torch.compile compatibility?
The kernels are compatible with torch.compile, as are SVDQLinear and AWQLinear. I will add a test to make sure that compatibility is preserved once they are integrated into diffusers.
Can we also guide the readers on how to obtain the checkpoints?
I'm a little confused here. Could you help provide more context
I'm a little confused here. Could you help provide more context
How are the example checkpoints obtained? I think we're only dealing with pre-quantized checkpoints in this PR?
Yes, we are only dealing with pre-quantized checkpoints here. Perhaps we can leave a comment saying that the checkpoints are quantized with diffuse-compressor and then run through the Diffusers format converter?
I think it'd be better off in the docs?
| @@ -0,0 +1,161 @@ | |||
| import json | |||
For tests, WDYT of adding a mixin to https://github.com/huggingface/diffusers/blob/main/tests/models/testing_utils/quantization.py and then extending a popular model like Flux to use that mixin?
Yes, let's do it that way
|
I have just implemented the native loading feature, so the model can now be loaded with:
import torch
from diffusers import ErnieImagePipeline
pipe = ErnieImagePipeline.from_pretrained(
"rootonchair/ERNIE-Image-Turbo-nunchaku-lite-int4",
torch_dtype=torch.bfloat16,
).to("cuda")
image = pipe(
prompt="A modern red armchair in a quiet studio, soft window light, realistic product photography",
height=1024,
width=1024,
num_inference_steps=8,
guidance_scale=1.0,
use_pe=False,
).images[0]
image.save("ernie-image-turbo-nunchaku-lite-int4.png")
The quantization config now changes to:
If we agree to use this schema, I will remove the old metadata/from_single_file approach. |
sayakpaul
left a comment
Looking good. I think we can remove all metadata related code?
sayakpaul
left a comment
No rush but let us know once you would like another round of review.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
|
@sayakpaul I think this is ready for the next review |
🤗 Serge says:
This adds a new nunchaku_lite quantization backend (config, quantizer, runtime linear layers backed by the HF kernels package, docs, and tests). The overall structure follows existing quantizer integrations, but there are a few blocking issues.
Security
- _get_ops() in src/diffusers/quantizers/nunchaku/utils.py calls get_kernel(..., trust_remote_code=True) against a personal user repo (rootonchair/nunchaku-lite-kernels). This silently enables remote code execution for anyone who loads a Nunchaku Lite checkpoint, with no user opt-in. No other kernel usage in the repo does this (gguf and attention_dispatch call get_kernel without trust_remote_code). This needs to be removed or made an explicit user decision, and the kernel repo should ideally live under an org namespace with pinned revisions.
Description vs. diff mismatch
- The PR description claims "The loader reads safetensors metadata during from_single_file so Nunchaku Lite checkpoints can use their embedded runtime manifest", but there are no changes to single_file_model.py or any loader in this diff. Correspondingly, the metadata parameter of NunchakuLiteQuantizer._process_model_before_weight_loading is never populated by any call site and is dead code.
Correctness
- NunchakuLiteTesterMixin._test_quantized_layers is tautological: it sets expected_quantized_layers = num_quantized_layers and then asserts they're equal, so the count check can never fail. It should compare against the number of targets in the quantization config (like the base mixin compares against linear-layer count).
- The docs and the NunchakuLiteQuantizationConfig docstring repeatedly reference model.json, but Diffusers model configs (and the test in this PR) use config.json. As written, users following the doc will put the quantization config in a file Diffusers never reads.
Style
- nunchaku_quantizer.py has trailing whitespace (line 64); make style was apparently not run.
- New source files are missing the Apache license header used across src/diffusers.
- NunchakuLiteQuantizationConfig is appended to _import_structure via a standalone statement instead of being listed in the dict literal like every other unconditional export.
- import itertools is buried inside check_strict_state_dict_match; move it to module top level.
- SVDQW4A4Linear re-validates precision/group_size that NunchakuLiteQuantizationConfig.post_init already enforces; per repo guidelines, drop the duplicated defensive checks (the forward path hard-codes group sizes 16/64 anyway, so the group_size argument is effectively unused at runtime).
serge v0.1.0 · model: claude-fable-5 · 10 LLM turns · 22 tool calls · 390.5s · 437374 in / 30106 out tokens
| if _ops is None: | ||
| from kernels import get_kernel | ||
|
|
||
| _ops = get_kernel(_HF_KERNEL_REPO, version=_HF_KERNEL_VERSION, trust_remote_code=True).ops |
Security: trust_remote_code=True silently enables execution of arbitrary remote code from a personal user repo whenever a Nunchaku Lite checkpoint is loaded — the user never opts in. Neither of the existing get_kernel call sites in this repo (quantizers/gguf/utils.py, models/attention_dispatch.py) passes trust_remote_code. Please drop it (publish the kernel as a standard prebuilt kernels repo that doesn't require remote code), or at minimum surface this as an explicit user-facing opt-in. Hosting under a personal namespace (rootonchair/...) rather than an org also makes this a supply-chain risk for everyone using the backend.
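A minimal sketch of what an explicit opt-in could look like. All names here (`load_kernel_ops`, `needs_remote_code`, `allow_remote_code`, and the injected `fetch` callable) are hypothetical and are not part of Diffusers or the kernels package; the real loader is passed in so the sketch makes no claim about its signature.

```python
from typing import Any, Callable


def load_kernel_ops(
    fetch: Callable[..., Any],
    repo_id: str,
    *,
    needs_remote_code: bool,
    allow_remote_code: bool = False,
) -> Any:
    """Fetch kernel ops, requiring explicit consent before enabling remote code.

    `fetch` stands in for the real kernel loader (hypothetical injection point).
    """
    if needs_remote_code and not allow_remote_code:
        # Fail loudly instead of silently trusting the remote repo.
        raise PermissionError(
            f"{repo_id} requires remote code execution; "
            "pass allow_remote_code=True to opt in."
        )
    if needs_remote_code:
        # The flag is forwarded only after the caller has opted in.
        return fetch(repo_id, trust_remote_code=True)
    return fetch(repo_id)
```

With this shape the default path never requests remote code, and a checkpoint that needs it fails with an actionable error rather than executing code the user did not approve.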
| self, | ||
| model: "ModelMixin", | ||
| state_dict: dict[str, Any] | None = None, | ||
| metadata: dict[str, str] | None = None, |
metadata is never passed by any caller — neither modeling_utils.py nor single_file_model.py forwards safetensors metadata to preprocess_model. The PR description claims the loader "reads safetensors metadata during from_single_file", but no loader changes exist in this diff. Per the repo guidelines (no unused parameters "for API consistency"), remove this parameter.
| self._verify_if_layer_quantized(name, module, config_kwargs) | ||
| num_quantized_layers += 1 | ||
|
|
||
| expected_quantized_layers = num_quantized_layers |
This assertion is tautological: expected_quantized_layers is set to num_quantized_layers, so the num_quantized_layers == expected_quantized_layers check below can never fail, and num_fp32_modules is always 0. The only effective check left is num_quantized_layers > 0. Compute the expected count from the config instead, e.g. the total number of entries in svdq_w4a4["targets"] + awq_w4a16["targets"], so the test actually verifies all targets were replaced.
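The suggested fix could be sketched roughly as follows. The helper name `expected_quantized_layer_count` and the module names in the usage note are hypothetical; the only assumption taken from the review is that the config carries `svdq_w4a4` and `awq_w4a16` sections, each with a `targets` list.

```python
def expected_quantized_layer_count(quantization_config: dict) -> int:
    """Count the modules the config says should be replaced.

    Assumes each backend section holds a "targets" list of module names;
    a missing section contributes zero.
    """
    sections = ("svdq_w4a4", "awq_w4a16")
    return sum(
        len(quantization_config.get(section, {}).get("targets", []))
        for section in sections
    )
```

A test would then compare the number of replaced layers it observed against this value, for example `assert num_quantized_layers == expected_quantized_layer_count(config)`, so a target that was silently skipped makes the test fail.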
| The exported state dict must match the target Diffusers model architecture exactly. Checkpoints quantized with | ||
| fused QKV projections won't load into a model config that expects separate Q, K, and V projection modules. | ||
|
|
||
| Example compact `model.json` config: |
Same as the doc page: Diffusers reads the model config from config.json, not model.json. Please fix the filename here and in the "quantization_config stored in model.json" sentence above so users don't package their checkpoints incorrectly.
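To illustrate the packaging point, here is a small sketch that writes the quantization config into `config.json`. The helper `write_model_config`, the placeholder class name, and the target module names are hypothetical, and the exact schema of the Nunchaku Lite config is an assumption based only on the keys mentioned in this thread.

```python
import json
import tempfile
from pathlib import Path

# Assumed shape; module names are placeholders, not real layer names.
EXAMPLE_QUANT_CONFIG = {
    "quant_method": "nunchaku_lite",
    "svdq_w4a4": {"targets": ["blocks.0.attn.to_q"]},
    "awq_w4a16": {"targets": ["blocks.0.ff.proj"]},
}


def write_model_config(model_dir: Path, base_config: dict, quant_config: dict) -> Path:
    """Write the model config, including quantization_config, to config.json."""
    config_path = model_dir / "config.json"  # not model.json
    merged = {**base_config, "quantization_config": quant_config}
    config_path.write_text(json.dumps(merged, indent=2))
    return config_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = write_model_config(
            Path(tmp), {"_class_name": "ExampleModel"}, EXAMPLE_QUANT_CONFIG
        )
        print(path.name)
```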
| ], | ||
| } | ||
|
|
||
| _import_structure["quantizers.quantization_config"].append("NunchakuLiteQuantizationConfig") |
Since this export is unconditional, add "NunchakuLiteQuantizationConfig" directly to the "quantizers.quantization_config" list in the _import_structure dict literal above instead of appending via a standalone statement — that matches how every other unconditional export is declared. Note also that NunchakuLiteQuantizationConfig.__init__ references torch unconditionally, while all other quantization configs exported here are gated behind is_torch_available(); consider whether this one needs the same guard.
| if device is None: | ||
| device = torch.device("cpu") | ||
|
|
||
| if precision not in {"int4", "nvfp4"}: |
These precision/group_size checks duplicate validation that NunchakuLiteQuantizationConfig.post_init already enforces (and post_init is stricter: it pins group_size to 16 for fp4 / 64 for int4, while this accepts any positive value that forward then ignores — the activation-scale layout hard-codes 16/64). Per the repo's no-defensive-code guideline, drop the re-validation here and rely on the config.
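As a rough illustration of validating once in the config and trusting it downstream: `ExampleQuantConfig` and `ExampleLinear` are hypothetical stand-ins, not the classes in this PR, and the precision-to-group-size pairing (64 for int4, 16 for nvfp4) is taken from the comment above.

```python
from dataclasses import dataclass

# Group size pinned per precision, as described in the review comment.
_GROUP_SIZE_BY_PRECISION = {"int4": 64, "nvfp4": 16}


@dataclass
class ExampleQuantConfig:
    """Hypothetical config: the single place where values are validated."""

    precision: str = "int4"
    group_size: int = 64

    def __post_init__(self) -> None:
        if self.precision not in _GROUP_SIZE_BY_PRECISION:
            raise ValueError(f"Unsupported precision: {self.precision!r}")
        expected = _GROUP_SIZE_BY_PRECISION[self.precision]
        if self.group_size != expected:
            raise ValueError(
                f"group_size must be {expected} for {self.precision}, "
                f"got {self.group_size}"
            )


class ExampleLinear:
    """Hypothetical runtime layer: relies on the already-validated config."""

    def __init__(self, config: ExampleQuantConfig) -> None:
        # No re-validation here; the config guarantees a consistent pair.
        self.precision = config.precision
        self.group_size = config.group_size
```

Keeping the check in one place means the layer cannot accept a group size that the forward path would then ignore.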
Co-authored-by: sergereview[bot] <283583894+sergereview[bot]@users.noreply.github.com>
What does this PR do?
Adds Nunchaku Lite single-file checkpoint loading for Diffusers models.
This introduces NunchakuLiteQuantizationConfig and a new Nunchaku Lite quantizer that can patch supported nn.Linear modules into runtime SVDQ/AWQ linear layers before strict checkpoint loading. The loader reads safetensors metadata during from_single_file so Nunchaku Lite checkpoints can use their embedded runtime manifest to decide which modules to replace.
Deprecated API
New API for from_single_file use
Fixes # (issue)