Tags: NVIDIA/Model-Optimizer
fix: PTQ 1GPU, export PP divisibility, hidden states conversations key (#1293)

## Summary
- **megatron_lm_ptq.yaml**: Qwen3-8B PTQ on a single GPU for L40 clusters (TP=1, all tasks)
- **quantize.sh**: Auto-find the largest PP dividing the model's `num_hidden_layers` for the export step. Qwen3-8B has 36 layers, which is not divisible by 8, causing an `AssertionError` on 8-GPU nodes
- **compute_hidden_states_trtllm.py**: Use `messages` with a `conversations` fallback, matching the HF version. Fixes `KeyError: 'conversations'` when data uses the OpenAI `messages` format

## Test plan
- [x] Qwen3-8B PTQ runs on a single L40 GPU
- [x] Export PP auto-selects a valid divisor (36 layers → PP=6 on 8 GPUs, PP=4 on 4 GPUs, PP=1 on 1 GPU)
- [x] EAGLE3 offline pipeline reads data with a `messages` field

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit
* **New Features**
  * Dataset input handling now supports multiple field formats for enhanced compatibility.
* **Bug Fixes**
  * Optimized GPU resource allocation during model quantization with improved pipeline parallelism computation.
  * Updated quantization configuration for more efficient resource utilization.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
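The two scripted fixes above are small enough to sketch. Below is an illustrative Python version of the PP-divisor search and the dataset-field fallback; the function names are hypothetical (the real logic lives in `quantize.sh` and `compute_hidden_states_trtllm.py`):

```python
def largest_valid_pp(num_hidden_layers: int, num_gpus: int) -> int:
    """Largest pipeline-parallel size <= num_gpus that evenly divides the layer count."""
    for pp in range(num_gpus, 0, -1):
        if num_hidden_layers % pp == 0:
            return pp
    return 1  # unreachable: pp=1 always divides


def get_turns(sample: dict) -> list:
    """Prefer the OpenAI-style 'messages' key, fall back to legacy 'conversations'."""
    return sample.get("messages") or sample["conversations"]
```

With 36 layers this picks PP=6 on an 8-GPU node (7 and 8 do not divide 36), matching the test plan above.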
[2/3] Implicit GEMM NVFP4 (#1227)

### What does this PR do?
**Type of change:** New feature

- Add a Conv3D implicit GEMM kernel with BF16 WMMA tensor cores and fused NVFP4 activation quantization for video diffusion VAE layers
- Integrate into `_QuantConv3d` via `QuantModuleRegistry` — automatically dispatched when NVFP4 quantization is applied to `nn.Conv3d`
- Move the kernel from `experimental/conv/` to `modelopt/torch/kernels/conv/`; move tests to `tests/gpu/torch/quantization/kernels/`

### Testing
- Added test cases to measure the difference between cuDNN and our CUDA implicit GEMM kernel
- Added an NVFP4 fake quantization test using CUDA code

### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

## Summary by CodeRabbit
* **New Features**
  * Per-backbone quantization/export in a single run with per-backbone checkpoints and backbone-aware quant filters
  * Configurable NVFP4 block-size via CLI/config; improved NVFP4 Conv3D inference path and Wan 2.2 quantization support
* **Bug Fixes**
  * Video-model calibration now respects extra params and forces video decoding during calibration
* **Documentation**
  * Added comprehensive Conv3D implicit-GEMM kernel documentation; removed experimental Conv3D prototype docs/benchmark
* **Tests**
  * New Wan 2.2 quantization/export tests and expanded Conv3D/FP4 kernel test coverage

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
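For readers unfamiliar with NVFP4 fake quantization, the idea the new CUDA test exercises can be sketched in plain Python. This is a deliberately simplified model, assuming only the E2M1 (FP4) magnitude grid and a per-block scale chosen so the block maximum maps to 6.0; the real kernel additionally encodes the block scale in FP8 and fuses this into the GEMM:

```python
# E2M1 (FP4) representable magnitudes; NVFP4 pairs these with a per-block scale.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_nvfp4_block(block):
    """Fake-quantize one block: scale so amax maps to 6.0, snap to the E2M1 grid, rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)  # all-zero block is already representable
    scale = amax / 6.0
    out = []
    for v in block:
        m = abs(v) / scale
        q = min(FP4_VALUES, key=lambda g: abs(g - m))  # nearest representable magnitude
        out.append(q * scale if v >= 0 else -q * scale)
    return out
```

The quantize-then-dequantize round trip is what "fake" quantization means: values stay in the original dtype but are snapped to what the low-precision format could represent.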
Update LICENSE and SPDX-License-Identifier as per OSRB guidance (#1244)

## Summary by CodeRabbit
* **Documentation**
  * Updated contribution guidelines with expanded license compliance instructions and SPDX identifier guidance.
  * Extended LICENSE file with a new "Third-Party Software Notices" section documenting Apache 2.0, MIT, and BSD 3-Clause licensed components.
* **Chores**
  * Updated SPDX license identifiers across multiple files to reflect dual and triple licensing (Apache 2.0 with MIT and/or BSD 3-Clause).

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
fix: pass include_buffers=True to init_empty_weights for Gemma-4 support (#1169)

### What does this PR do?
**Type of change:** Bug fix

Pass `include_buffers=True` to `init_empty_weights()` when computing the device map for HuggingFace models. Gemma-4 registers its model parameters as buffers rather than parameters, so without this flag they are not accounted for during device map computation, causing incorrect placement or OOM errors.

### Usage
```shell
# No API change — the fix is internal to get_model() in example_utils.py.
# Simply run HF PTQ with a Gemma-4 model as usual:
python hf_ptq.py --pyt_ckpt_path google/gemma-4-... --quantize ...
```

### Testing
Manually tested with Gemma-4 model loading via `hf_ptq.py`.

### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information
Gemma-4 uses buffers instead of parameters for some model weights, requiring `include_buffers=True` for correct device map estimation with `accelerate`'s `init_empty_weights`.
## Summary by CodeRabbit
* **Bug Fixes**
  * Improved model initialization in quantization examples by ensuring buffers are properly included during temporary model construction, resulting in more accurate device mapping inference for model optimization.

Signed-off-by: James Shen <yueshen@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
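Why skipping buffers breaks device-map estimation can be shown with a toy, stdlib-only model. This is not the `accelerate` API, and the module names and element counts are invented for illustration; the point is only that memory accounting restricted to parameters undercounts a model whose weights live in buffers:

```python
# Toy stand-in for a module whose weights are split between parameters and
# buffers (as the PR reports Gemma-4 does). Values are element counts, made up.
class ToyModule:
    def __init__(self):
        self.parameters = {"proj.weight": 4096 * 4096}
        self.buffers = {"embed.weight": 151936 * 4096, "rope.inv_freq": 64}

def bytes_needed(module, include_buffers, dtype_size=2):
    """Estimate memory (bf16 by default), optionally counting buffers too."""
    total = sum(module.parameters.values())
    if include_buffers:
        total += sum(module.buffers.values())
    return total * dtype_size
```

With `include_buffers=False` the estimate misses almost all of this toy model's weights, so a device-map planner would pack far too much onto one GPU, which is the OOM failure mode the fix addresses.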
Bug fix: 6012573 (#1131)

### What does this PR do?
**Type of change:** Bug fix

## Summary by CodeRabbit
* **Chores**
  * Standardized the configuration key for model precision.
  * Model loading now defaults to bfloat16 precision instead of float32, aligning configs and runtime behavior.

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Add Python 3.13 support (#1048)

Fixes #217

## Summary
- Bump `requires-python` from `>=3.10,<3.13` to `>=3.10,<3.14` to formally include Python 3.13
- Add explicit Python 3.10–3.13 PyPI classifiers for better discoverability
- Add `py313` to tox CPU unit test and partial-install environment matrices
- Add Python 3.10–3.13 to the `multi-py` CI matrix in `unit_tests.yml`

## Background
Python 3.13 was previously excluded by the `<3.13` upper bound. Testing in a related repo with `--ignore-requires-python` confirmed that the library installs and runs correctly under Python 3.13. This PR lifts the restriction and wires up CI to verify it going forward.

## Test plan
- [ ] CI `multi-py` job passes on `py313-torch210-tf_latest-unit`
- [ ] `tox -e py313-torch210-tf_latest-unit` passes locally (requires Python 3.13 installed)
- [ ] `tox -e py313-partial-unit-torch` passes locally
- [ ] No regressions on existing Python 3.10/3.11/3.12 matrix jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit
* **Chores**
  * Extended Python support: minimum remains 3.10; added official support up through 3.13 (upper bound advanced accordingly).
* **Tests**
  * CI and test matrix expanded to include experimental Python 3.13 coverage.
* **Documentation**
  * Installation docs and changelog updated to reflect Python 3.13 support.

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Signed-off-by: Ivan Basov <5455484+ivanbasov@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
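The packaging side of the change can be pictured as the following `pyproject.toml` fragment. This is a sketch of the fields the summary describes, not the repository's exact file:

```toml
[project]
# Was ">=3.10,<3.13"; the bump admits Python 3.13.
requires-python = ">=3.10,<3.14"
classifiers = [
  "Programming Language :: Python :: 3.10",
  "Programming Language :: Python :: 3.11",
  "Programming Language :: Python :: 3.12",
  "Programming Language :: Python :: 3.13",
]
```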
ModelOpt Framework, Recipe Lib, converting subset of existing recipes 1/N (#1000)

### What does this PR do?
1. Start a new config system using yaml/yml files.
2. Add a new top-level package, `modelopt_recipes`. Making it top-level makes clear that the `modelopt` package holds the code while this new package holds recipes.
3. Implement some of the existing quantization recipes using the new config system as model-agnostic general recipes, but not actually in use; these recipes sit inside `modelopt_recipes/general/ptq/...`.
4. Make sure the configs from the new config system match the existing configs.
5. Extend the hf_ptq script to enable recipe-based PTQ.
6. Test hf_ptq using both built-in and external config files.

### Usage
```bash
python examples/llm_ptq/hf_ptq.py \
  --model Qwen/Qwen3-8B \
  --recipe general/ptq/fp8_default-fp8_kv \
  ...
```

### Testing
```bash
python examples/llm_ptq/hf_ptq.py \
  --model Qwen/Qwen3-8B \
  --recipe general/ptq/fp8_default-fp8_kv \
  --export_path=fp8_default-fp8_kv \
  --calib_size=16 \
  --batch_size=0 \
  --trust_remote_code \
  --export_fmt=hf
```

## Summary by CodeRabbit
* **New Features**
  * Recipe-driven PTQ workflows via YAML recipes and a new recipe loader; CLI gains a `--recipe` option and `--pyt_ckpt_path` is renamed to `--model`.
  * Many new PTQ recipe and config presets (FP8, INT4/INT8, NVFP4, MXFPx, KV-cache variants) and improved runtime config loading/merging.
* **Documentation**
  * Added READMEs describing recipe/config layout.
* **Tests**
  * New unit tests covering config loading, inheritance and recipe loading.
* **Chores**
  * Added YAML/OmegaConf runtime support and packaging of recipe YAMLs.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
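To make the recipe idea concrete, a general PTQ recipe under `modelopt_recipes/general/ptq/` might look like the YAML below. All field names here are hypothetical placeholders; the actual schema is defined by the recipe package itself:

```yaml
# Hypothetical recipe file: modelopt_recipes/general/ptq/fp8_default-fp8_kv.yaml
# Illustrative keys only -- consult the package READMEs for the real schema.
name: fp8_default-fp8_kv
quant:
  weight: fp8
  activation: fp8
  kv_cache: fp8
calib:
  dataset: cnn_dailymail
  num_samples: 512
```

The point of the model-agnostic layout is that the same recipe can be applied to any HF checkpoint via `--recipe general/ptq/fp8_default-fp8_kv`, as in the usage example above.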
[OMNIML-2663] Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes (#852)

## What does this PR do?
**Type of change:** New feature

**Overview:**
- Updated the FP8 quant exporter to replace modelopt custom QDQ nodes with native ONNX QDQ nodes
- Updated `get_onnx_bytes_and_metadata` to make `convert_float_to_float16()` the default instead of autocast
- Created util functions to fix the graph structure after conversion

## Testing
```
python torch_quant_to_onnx.py --quantize_mode=fp8 \
    --onnx_save_path=<model_path> \
    --calibration_data_size 64 \
    --batch_size 128

python evaluate.py --onnx_path=<model_path> \
    --model_name=vit_base_patch16_224 \
    --results_path=./results.txt \
    --batch_size 128
```

Results before replacement:
```
The top1 accuracy of the model is 85.06%
The top5 accuracy of the model is 97.558%
Inference latency of the model is 5.27963 ms
```
After replacement:
```
The top1 accuracy of the model is 85.054%
The top5 accuracy of the model is 97.542%
Inference latency of the model is 5.74771 ms
```

## Before your PR is "*Ready for review*"
- **Is this change backward compatible?**: No - replaced modelopt QDQ nodes with native ONNX QDQ nodes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Summary by CodeRabbit
* **New Features**
  * ONNX utilities to remove redundant Casts, fold Constant→Cast patterns, and convert targeted Casts to FP16.
* **Improvements**
  * FP8 QDQ nodes now converted to native ONNX Quantize/Dequantize nodes for improved compatibility.
  * Export pipeline streamlined: consistent FP16 handling, unified weight quantization, cast cleanup ordering, and added logging for better traceability.
* **Tests**
  * Unit tests updated to use the new ONNX utilities.
* **Changelog**
  * Entry added noting FP8 QDQ → native ONNX QDQ conversion.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
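The native QDQ pair the exporter now emits (ONNX `QuantizeLinear` followed by `DequantizeLinear`) has simple round-trip semantics. The sketch below models them in plain Python using int8 as a stand-in quantized type; the PR's actual path targets FP8 E4M3, which has a non-integer value grid and a ±448 range, so this is an analogy for the node semantics, not the FP8 arithmetic:

```python
def qdq_int8(x, scale, zero_point=0):
    """Model a QuantizeLinear -> DequantizeLinear pair with an int8 stand-in:
    quantize: q = saturate(round(x / scale) + zero_point)
    dequantize: y = (q - zero_point) * scale
    """
    q = round(x / scale) + zero_point
    q = max(-128, min(127, q))  # saturate to the int8 range
    return (q - zero_point) * scale
```

In-range values survive the round trip up to rounding error, while out-of-range values saturate, which is why accuracy before and after the node replacement stays nearly identical in the results above.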