Tags: NVIDIA/Model-Optimizer
fix: PTQ 1GPU, export PP divisibility, hidden states conversations key (#1293)

## Summary
- **megatron_lm_ptq.yaml**: Qwen3-8B PTQ on a single GPU for L40 clusters (TP=1, all tasks)
- **quantize.sh**: Auto-find the largest PP dividing the model's `num_hidden_layers` for the export step. Qwen3-8B has 36 layers, which is not divisible by 8, causing an `AssertionError` on 8-GPU nodes
- **compute_hidden_states_trtllm.py**: Use `messages` with a `conversations` fallback, matching the HF version. Fixes `KeyError: 'conversations'` when data uses the OpenAI `messages` format

## Test plan
- [x] Qwen3-8B PTQ runs on a single L40 GPU
- [x] Export PP auto-selects a valid divisor (36 layers → PP=6 on 8 GPUs, PP=4 on 4 GPUs, PP=1 on 1 GPU)
- [x] EAGLE3 offline pipeline reads data with a `messages` field

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit
* **New Features**
  * Dataset input handling now supports multiple field formats for enhanced compatibility.
* **Bug Fixes**
  * Optimized GPU resource allocation during model quantization with improved pipeline parallelism computation.
  * Updated quantization configuration for more efficient resource utilization.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
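The two scripted fixes above are small enough to sketch. Below is an illustrative Python version of the PP-divisor search and the dataset-field fallback; the function names are hypothetical (the real logic lives in `quantize.sh` and `compute_hidden_states_trtllm.py`):

```python
def largest_valid_pp(num_hidden_layers: int, num_gpus: int) -> int:
    """Largest pipeline-parallel size <= num_gpus that evenly divides the layer count."""
    for pp in range(num_gpus, 0, -1):
        if num_hidden_layers % pp == 0:
            return pp
    return 1  # unreachable: pp=1 always divides


def get_turns(sample: dict) -> list:
    """Prefer the OpenAI-style 'messages' key, fall back to legacy 'conversations'."""
    return sample.get("messages") or sample["conversations"]
```

With 36 layers this picks PP=6 on an 8-GPU node (7 and 8 do not divide 36), matching the test plan above.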
[2/3] Implicit GEMM NVFP4 (#1227)

### What does this PR do?
**Type of change:** New feature

- Add a Conv3D implicit GEMM kernel with BF16 WMMA tensor cores and fused NVFP4 activation quantization for video diffusion VAE layers
- Integrate into `_QuantConv3d` via `QuantModuleRegistry` — automatically dispatched when NVFP4 quantization is applied to `nn.Conv3d`
- Move the kernel from `experimental/conv/` to `modelopt/torch/kernels/conv/`; move tests to `tests/gpu/torch/quantization/kernels/`

### Testing
- Added test cases to measure the difference between cuDNN and our CUDA implicit GEMM kernel
- Added an NVFP4 fake quantization test using CUDA code

### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

## Summary by CodeRabbit
* **New Features**
  * Per-backbone quantization/export in a single run with per-backbone checkpoints and backbone-aware quant filters
  * Configurable NVFP4 block-size via CLI/config; improved NVFP4 Conv3D inference path and Wan 2.2 quantization support
* **Bug Fixes**
  * Video-model calibration now respects extra params and forces video decoding during calibration
* **Documentation**
  * Added comprehensive Conv3D implicit-GEMM kernel documentation; removed experimental Conv3D prototype docs/benchmark
* **Tests**
  * New Wan 2.2 quantization/export tests and expanded Conv3D/FP4 kernel test coverage

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
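For readers unfamiliar with NVFP4 fake quantization, the idea the new CUDA test exercises can be sketched in plain Python. This is a deliberately simplified model, assuming only the E2M1 (FP4) magnitude grid and a per-block scale chosen so the block maximum maps to 6.0; the real kernel additionally encodes the block scale in FP8 and fuses this into the GEMM:

```python
# E2M1 (FP4) representable magnitudes; NVFP4 pairs these with a per-block scale.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def fake_quant_nvfp4_block(block):
    """Fake-quantize one block: scale so amax maps to 6.0, snap to the E2M1 grid, rescale."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)  # all-zero block is already representable
    scale = amax / 6.0
    out = []
    for v in block:
        m = abs(v) / scale
        q = min(FP4_VALUES, key=lambda g: abs(g - m))  # nearest representable magnitude
        out.append(q * scale if v >= 0 else -q * scale)
    return out
```

The quantize-then-dequantize round trip is what "fake" quantization means: values stay in the original dtype but are snapped to what the low-precision format could represent.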
Update LICENSE and SPDX-License-Identifier as per OSRB guidance (#1244)

## Summary by CodeRabbit
* **Documentation**
  * Updated contribution guidelines with expanded license compliance instructions and SPDX identifier guidance.
  * Extended LICENSE file with a new "Third-Party Software Notices" section documenting Apache 2.0, MIT, and BSD 3-Clause licensed components.
* **Chores**
  * Updated SPDX license identifiers across multiple files to reflect dual and triple licensing (Apache 2.0 with MIT and/or BSD 3-Clause).

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
fix: pass include_buffers=True to init_empty_weights for Gemma-4 support (#1169)

### What does this PR do?
**Type of change:** Bug fix

Pass `include_buffers=True` to `init_empty_weights()` when computing the device map for HuggingFace models. Gemma-4 registers its model parameters as buffers rather than parameters, so without this flag they are not accounted for during device map computation, causing incorrect placement or OOM errors.

### Usage
```shell
# No API change — the fix is internal to get_model() in example_utils.py.
# Simply run HF PTQ with a Gemma-4 model as usual:
python hf_ptq.py --pyt_ckpt_path google/gemma-4-... --quantize ...
```

### Testing
Manually tested with Gemma-4 model loading via `hf_ptq.py`.

### Before your PR is "*Ready for review*"
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information
Gemma-4 uses buffers instead of parameters for some model weights, requiring `include_buffers=True` for correct device map estimation with `accelerate`'s `init_empty_weights`.
## Summary by CodeRabbit
* **Bug Fixes**
  * Improved model initialization in quantization examples by ensuring buffers are properly included during temporary model construction, resulting in more accurate device mapping inference for model optimization.

Signed-off-by: James Shen <yueshen@nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
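Why skipping buffers breaks device-map estimation can be shown with a toy, stdlib-only model. This is not the `accelerate` API, and the module names and element counts are invented for illustration; the point is only that memory accounting restricted to parameters undercounts a model whose weights live in buffers:

```python
# Toy stand-in for a module whose weights are split between parameters and
# buffers (as the PR reports Gemma-4 does). Values are element counts, made up.
class ToyModule:
    def __init__(self):
        self.parameters = {"proj.weight": 4096 * 4096}
        self.buffers = {"embed.weight": 151936 * 4096, "rope.inv_freq": 64}

def bytes_needed(module, include_buffers, dtype_size=2):
    """Estimate memory (bf16 by default), optionally counting buffers too."""
    total = sum(module.parameters.values())
    if include_buffers:
        total += sum(module.buffers.values())
    return total * dtype_size
```

With `include_buffers=False` the estimate misses almost all of this toy model's weights, so a device-map planner would pack far too much onto one GPU, which is the OOM failure mode the fix addresses.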
Bug fix: 6012573 (#1131)

### What does this PR do?
**Type of change:** Bug fix

## Summary by CodeRabbit
* **Chores**
  * Standardized the configuration key for model precision.
  * Model loading now defaults to bfloat16 precision instead of float32, aligning configs and runtime behavior.

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Add Python 3.13 support (#1048)

Fixes #217

## Summary
- Bump `requires-python` from `>=3.10,<3.13` to `>=3.10,<3.14` to formally include Python 3.13
- Add explicit Python 3.10–3.13 PyPI classifiers for better discoverability
- Add `py313` to tox CPU unit test and partial-install environment matrices
- Add Python 3.10–3.13 to the `multi-py` CI matrix in `unit_tests.yml`

## Background
Python 3.13 was previously excluded by the `<3.13` upper bound. Testing in a related repo with `--ignore-requires-python` confirmed that the library installs and runs correctly under Python 3.13. This PR lifts the restriction and wires up CI to verify it going forward.

## Test plan
- [ ] CI `multi-py` job passes on `py313-torch210-tf_latest-unit`
- [ ] `tox -e py313-torch210-tf_latest-unit` passes locally (requires Python 3.13 installed)
- [ ] `tox -e py313-partial-unit-torch` passes locally
- [ ] No regressions on existing Python 3.10/3.11/3.12 matrix jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit
* **Chores**
  * Extended Python support: minimum remains 3.10; added official support up through 3.13 (upper bound advanced accordingly).
* **Tests**
  * CI and test matrix expanded to include experimental Python 3.13 coverage.
* **Documentation**
  * Installation docs and changelog updated to reflect Python 3.13 support.

Signed-off-by: Ivan Basov <ibasov@nvidia.com>
Signed-off-by: Ivan Basov <5455484+ivanbasov@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
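The packaging side of the change can be pictured as the following `pyproject.toml` fragment. This is a sketch of the fields the summary describes, not the repository's exact file:

```toml
[project]
# Was ">=3.10,<3.13"; the bump admits Python 3.13.
requires-python = ">=3.10,<3.14"
classifiers = [
  "Programming Language :: Python :: 3.10",
  "Programming Language :: Python :: 3.11",
  "Programming Language :: Python :: 3.12",
  "Programming Language :: Python :: 3.13",
]
```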
ModelOpt Framework, Recipe Lib, converting subset of existing recipes 1/N (#1000)

### What does this PR do?
1. Start a new config system using yaml/yml files.
2. Add a new top-level package, `modelopt_recipes`. Making it top-level makes clear that the `modelopt` package holds the code while this new package holds recipes.
3. Implement some of the existing quantization recipes using the new config system as model-agnostic general recipes, but not actually in use; these recipes sit inside `modelopt_recipes/general/ptq/...`.
4. Make sure the configs from the new config system match the existing configs.
5. Extend the hf_ptq script to enable recipe-based PTQ.
6. Test hf_ptq using both built-in and external config files.

### Usage
```bash
python examples/llm_ptq/hf_ptq.py \
  --model Qwen/Qwen3-8B \
  --recipe general/ptq/fp8_default-fp8_kv \
  ...
```

### Testing
```bash
python examples/llm_ptq/hf_ptq.py \
  --model Qwen/Qwen3-8B \
  --recipe general/ptq/fp8_default-fp8_kv \
  --export_path=fp8_default-fp8_kv \
  --calib_size=16 \
  --batch_size=0 \
  --trust_remote_code \
  --export_fmt=hf
```

## Summary by CodeRabbit
* **New Features**
  * Recipe-driven PTQ workflows via YAML recipes and a new recipe loader; CLI gains a `--recipe` option and `--pyt_ckpt_path` is renamed to `--model`.
  * Many new PTQ recipe and config presets (FP8, INT4/INT8, NVFP4, MXFPx, KV-cache variants) and improved runtime config loading/merging.
* **Documentation**
  * Added READMEs describing recipe/config layout.
* **Tests**
  * New unit tests covering config loading, inheritance and recipe loading.
* **Chores**
  * Added YAML/OmegaConf runtime support and packaging of recipe YAMLs.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
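To make the recipe idea concrete, a general PTQ recipe under `modelopt_recipes/general/ptq/` might look like the YAML below. All field names here are hypothetical placeholders; the actual schema is defined by the recipe package itself:

```yaml
# Hypothetical recipe file: modelopt_recipes/general/ptq/fp8_default-fp8_kv.yaml
# Illustrative keys only -- consult the package READMEs for the real schema.
name: fp8_default-fp8_kv
quant:
  weight: fp8
  activation: fp8
  kv_cache: fp8
calib:
  dataset: cnn_dailymail
  num_samples: 512
```

The point of the model-agnostic layout is that the same recipe can be applied to any HF checkpoint via `--recipe general/ptq/fp8_default-fp8_kv`, as in the usage example above.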
[OMNIML-2663] Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes (#852)

## What does this PR do?
**Type of change:** New feature

**Overview:**
- Updated the FP8 quant exporter to replace modelopt custom QDQ nodes with native ONNX QDQ nodes
- Updated `get_onnx_bytes_and_metadata` to make `convert_float_to_float16()` the default instead of autocast
- Created util functions to fix the graph structure after conversion

## Testing
```
python torch_quant_to_onnx.py --quantize_mode=fp8 \
    --onnx_save_path=<model_path> \
    --calibration_data_size 64 \
    --batch_size 128

python evaluate.py --onnx_path=<model_path> \
    --model_name=vit_base_patch16_224 \
    --results_path=./results.txt \
    --batch_size 128
```

Results before replacement:
```
The top1 accuracy of the model is 85.06%
The top5 accuracy of the model is 97.558%
Inference latency of the model is 5.27963 ms
```
After replacement:
```
The top1 accuracy of the model is 85.054%
The top5 accuracy of the model is 97.542%
Inference latency of the model is 5.74771 ms
```

## Before your PR is "*Ready for review*"
- **Is this change backward compatible?**: No - replaced modelopt QDQ nodes with native ONNX QDQ nodes
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Summary by CodeRabbit
* **New Features**
  * ONNX utilities to remove redundant Casts, fold Constant→Cast patterns, and convert targeted Casts to FP16.
* **Improvements**
  * FP8 QDQ nodes now converted to native ONNX Quantize/Dequantize nodes for improved compatibility.
  * Export pipeline streamlined: consistent FP16 handling, unified weight quantization, cast cleanup ordering, and added logging for better traceability.
* **Tests**
  * Unit tests updated to use the new ONNX utilities.
* **Changelog**
  * Entry added noting FP8 QDQ → native ONNX QDQ conversion.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
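The native QDQ pair the exporter now emits (ONNX `QuantizeLinear` followed by `DequantizeLinear`) has simple round-trip semantics. The sketch below models them in plain Python using int8 as a stand-in quantized type; the PR's actual path targets FP8 E4M3, which has a non-integer value grid and a ±448 range, so this is an analogy for the node semantics, not the FP8 arithmetic:

```python
def qdq_int8(x, scale, zero_point=0):
    """Model a QuantizeLinear -> DequantizeLinear pair with an int8 stand-in:
    quantize: q = saturate(round(x / scale) + zero_point)
    dequantize: y = (q - zero_point) * scale
    """
    q = round(x / scale) + zero_point
    q = max(-128, min(127, q))  # saturate to the int8 range
    return (q - zero_point) * scale
```

In-range values survive the round trip up to rounding error, while out-of-range values saturate, which is why accuracy before and after the node replacement stays nearly identical in the results above.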