
Commit 291cfa3

Merge branch 'main' into jennifchen/cp_amax_sync

2 parents 6761109 + bc54694


58 files changed: +3885, −419 lines

.github/CODEOWNERS

Lines changed: 2 additions & 0 deletions

@@ -22,6 +22,7 @@ modelopt/torch/distill @NVIDIA/modelopt-torch-distill-codeowners
 modelopt/torch/export @NVIDIA/modelopt-torch-export-codeowners
 modelopt/torch/nas @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/opt @NVIDIA/modelopt-torch-opt-codeowners
+modelopt/torch/peft @NVIDIA/modelopt-torch-peft-codeowners
 modelopt/torch/prune @NVIDIA/modelopt-torch-nas-prune-codeowners
 modelopt/torch/quantization @NVIDIA/modelopt-torch-quantization-codeowners
 modelopt/torch/sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
@@ -50,4 +51,5 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
 /examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners
 /examples/vlm_ptq @NVIDIA/modelopt-examples-vlm-codeowners
+/examples/vllm_serve @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/windows @NVIDIA/modelopt-windows-codeowners

.gitlab/tests.yml

Lines changed: 1 addition & 9 deletions

@@ -54,20 +54,12 @@ example-torch:
   timeout: 30m
   parallel:
     matrix:
-      - EXAMPLE: [llm_distill, llm_sparsity, speculative_decoding]
+      - EXAMPLE: [llm_distill, llm_qat, llm_sparsity, speculative_decoding]
   script:
     - pip install ".[hf,dev-test]"
     - find examples/$EXAMPLE -name "requirements.txt" | while read req_file; do pip install -r "$req_file" || exit 1; done
     - pytest -s tests/examples/$EXAMPLE

-# TODO: Fix llm_qat test hang in GitLab CI
-example-failing:
-  extends: example-torch
-  allow_failure: true
-  parallel:
-    matrix:
-      - EXAMPLE: [llm_qat]
-
 example-trtllm:
   extends: example-torch
   timeout: 60m

CHANGELOG.rst

Lines changed: 2 additions & 0 deletions

@@ -9,6 +9,8 @@ Model Optimizer Changelog (Linux)
 **New Features**

 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
+- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.

 0.37 (2025-09-xx)
 ^^^^^^^^^^^^^^^^^
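The new ``modelopt.torch.peft`` entry point named in the changelog follows the same convert-in-place style as the other ModelOpt modes. A minimal sketch of how it might be called, assuming an already-built Megatron-Core GPT model; the contents of ``LORA_CFG`` below are illustrative placeholders, not taken from this commit:

    import modelopt.torch.peft as mtpe

    # Hypothetical LoRA config: which modules get adapters and at what rank.
    # The real schema is defined in modelopt.torch.peft; this is only a sketch.
    LORA_CFG = {
        "adapter_type": "lora",
        "adapter_cfg": {"*": {"rank": 32}},
    }

    # `model` is assumed to be an MCore GPT model prepared for training.
    # update_model() injects the LoRA adapters and records them in the ModelOpt state.
    model = mtpe.update_model(model, LORA_CFG)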

README.md

Lines changed: 1 addition & 0 deletions

@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)

docs/source/deployment/1_tensorrt_llm.rst

Lines changed: 5 additions & 2 deletions

@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================

+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::

-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.

+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.

 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
 Convert to TensorRT-LLM
 =======================

-Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
+Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
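Since the deprecation notice above steers users toward the unified HF export path, here is a minimal sketch of that flow. It assumes ``export_hf_checkpoint`` from ``modelopt.torch.export`` and uses an illustrative model name and quantization config; none of this is part of the diff itself:

    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint
    from transformers import AutoModelForCausalLM

    # Illustrative model; any HF causal LM supported by ModelOpt works the same way.
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").cuda()

    # Quantize with a built-in config (calibration forward loop omitted for brevity).
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=None)

    # Write a unified Hugging Face-style checkpoint that TensorRT-LLM, vLLM,
    # or SGLang can consume, instead of the legacy TensorRT-LLM checkpoint.
    export_hf_checkpoint(model, export_dir="exported_model")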

docs/source/guides/7_nas.rst

Lines changed: 9 additions & 0 deletions

@@ -635,3 +635,12 @@ The difference between NAS and pruning is summarized below.
   increased training time.
 - May provide similar performance to NAS in particular applications, however, usually exhibits
   worse performance due to the limited search space and training time.
+
+
+[Advanced] Adding a new NAS/Prune Algorithm
+===========================================
+
+* Please refer to this `template <https://github.com/NVIDIA/TensorRT-Model-Optimizer/compare/template/new-nas-mode>`_
+  for adding a new NAS algorithm.
+* Please refer to `mcore_minitron.py <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/modelopt/torch/prune/plugins/mcore_minitron.py>`_
+  for an actual example of adding Minitron Pruning algorithm.

examples/llm_distill/README.md

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ This section focuses on demonstrating how to apply Model Optimizer to perform kn
 | Distillation with NeMo | Learn how to distill your models with NeMo Framework | \[[Link](#knowledge-distillation-kd-for-nvidia-nemo-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
 | Distillation with Huggingface | Learn how to distill your models with Hugging Face | \[[Link](#knowledge-distillation-kd-for-huggingface-models)\] | \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\] |
 | Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
+| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo | \[[Link](../nemo_run/prune_distill/README.md)\] | |

 </div>

examples/llm_distill/main.py

Lines changed: 32 additions & 5 deletions

@@ -13,10 +13,11 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-import logging
 import os
 from dataclasses import dataclass

+os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
+
 import datasets
 import torch
 import torch.distributed
@@ -29,10 +30,7 @@
 import modelopt.torch.opt as mto
 from modelopt.torch.distill.plugins.huggingface import KDTrainer, LMLogitsLoss

-os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
-
-logger = get_logger(__name__)
-logging.basicConfig(level=logging.INFO)
+logger = get_logger(__name__, log_level="INFO")


 @dataclass
@@ -69,6 +67,29 @@ class KDSFTTrainer(SFTTrainer, KDTrainer):
     pass


+def _save_model_fsdp_compat(
+    self,
+    output_dir: str | None = None,
+    _internal_call: bool = False,
+    *args,
+    **kwargs,
+):
+    output_dir = output_dir or self.args.output_dir
+    model = self.accelerator.unwrap_model(self.model)
+    if not _internal_call and self.is_fsdp_enabled:
+        state_dict = self.accelerator.get_state_dict(self.model)
+        if self.accelerator.is_main_process:
+            model.save_pretrained(
+                output_dir,
+                is_main_process=self.accelerator.is_main_process,
+                save_function=self.accelerator.save,
+                state_dict=state_dict,
+            )
+            self.processing_class.save_pretrained(output_dir)
+    else:
+        super(SFTTrainer, self).save_model(output_dir, _internal_call, *args, **kwargs)
+
+
 def train():
     parser = transformers.HfArgumentParser((ModelArguments, TrainingArguments))
     model_args, training_args = parser.parse_args_into_dataclasses()
@@ -77,6 +98,9 @@ def train():
     # modelopt state will be saved automatically to "modelopt_state.pth"
     mto.enable_huggingface_checkpointing()

+    # HACK: Fix FSDP2-incompatible save_model() function for SFTTrainer
+    SFTTrainer.save_model = _save_model_fsdp_compat
+
     # Set total batch size across all ranks to equal 64
     total_batch_size = 64
     num_accum_steps = total_batch_size / (
@@ -91,19 +115,22 @@ def train():
         f"Using {int(num_accum_steps)} grad accumulation steps for effective batchsize of {total_batch_size}."
     )

+    # Dataset
     logger.info("Loading dataset...")
     dset = datasets.load_dataset("Open-Orca/OpenOrca", split="train")
     dset_splits = dset.train_test_split(train_size=25600, test_size=1700, seed=420)
     dset_train, dset_eval = dset_splits["train"], dset_splits["test"]
     logger.info("Dataset loaded.")

+    # Tokenizer
     logger.info("Loading tokenizer...")
     model_path = model_args.teacher_name_or_path or model_args.student_name_or_path
     tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
     tokenizer.pad_token = tokenizer.eos_token
     tokenizer.padding_side = "right"
     logger.info("Tokenizer loaded.")

+    # Model
     if model_args.single_model:
         logger.info("Loading single model only...")
         model = transformers.AutoModelForCausalLM.from_pretrained(
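One detail worth noting in the main.py change above: ``_save_model_fsdp_compat`` is a plain module-level function that is later assigned to ``SFTTrainer.save_model``, so Python's normal attribute lookup binds it as a method and ``self`` becomes the trainer instance. A tiny self-contained sketch of the same monkey-patching pattern (the ``Trainer`` class and method here are stand-ins, not the real TRL classes):

    class Trainer:
        def save_model(self, output_dir: str) -> None:
            print(f"original save to {output_dir}")


    def _patched_save_model(self, output_dir: str) -> None:
        # Custom behavior runs instead of the original; a real patch can still
        # delegate back to the original implementation, as the diff above does
        # via super(SFTTrainer, self).save_model(...).
        print(f"patched save to {output_dir}")


    # Patching at class level means every existing and future instance uses it.
    Trainer.save_model = _patched_save_model
    Trainer().save_model("out")  # prints: patched save to out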
(file name hidden in this commit view)

Lines changed: 1 addition & 0 deletions

@@ -1,2 +1,3 @@
 pyarrow
+transformers<5.0
 trl>=0.23.0

examples/llm_eval/gen_model_answer.py

Lines changed: 0 additions & 8 deletions

@@ -180,14 +180,6 @@ def get_model_answers(
     # Model Optimizer modification
     tokenizer = get_tokenizer(model_path, trust_remote_code=args.trust_remote_code)
     if checkpoint_dir:
-        # get model type
-        last_part = os.path.basename(checkpoint_dir)
-        model_type = last_part.split("_")[0]
-        # Some models require to set pad_token and eos_token based on external config (e.g., qwen)
-        if model_type == "qwen":
-            tokenizer.pad_token = tokenizer.convert_ids_to_tokens(151643)
-            tokenizer.eos_token = tokenizer.convert_ids_to_tokens(151643)
-
         assert LLM is not None, "tensorrt_llm APIs could not be imported."
         model = LLM(checkpoint_dir, tokenizer=tokenizer)
     elif not nim_model:
