CHANGELOG.rst (2 additions, 0 deletions)
@@ -9,6 +9,8 @@ Model Optimizer Changelog (Linux)
 **New Features**

 - Add flag ``op_types_to_exclude_fp16`` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating ``'fp32'`` precision in ``trt_plugins_precision``.
+- Add LoRA mode support for MCore in a new peft submodule: ``modelopt.torch.peft.update_model(model, LORA_CFG)``.
+- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
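The two API-facing entries above can be illustrated with a rough sketch. Only the names the changelog itself mentions (`op_types_to_exclude_fp16`, `modelopt.torch.peft.update_model`, `LORA_CFG`) are taken from this diff; everything else below is an assumption about call shape, not the documented API.

```python
# Hypothetical usage of the new ONNX quantization flag. Other quantize()
# arguments are omitted, and keyword names should be checked against the
# ONNX quantization docs for this release.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",               # placeholder input model path
    quantize_mode="int8",
    op_types_to_exclude_fp16=["Resize"],  # new flag: keep these op types out of FP16/BF16 conversion
)
```

And the new LoRA mode for Megatron-Core models, again with an assumed config layout:

```python
# The changelog gives the call shape modelopt.torch.peft.update_model(model, LORA_CFG);
# the contents of LORA_CFG here are illustrative only.
import modelopt.torch.peft as mtpf

LORA_CFG = {"adapter_cfg": {"*": {"rank": 32}}}  # assumed config structure
# mtpf.update_model(mcore_model, LORA_CFG)       # mcore_model: an existing Megatron-Core model
```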
README.md (1 addition, 0 deletions)
@@ -26,6 +26,7 @@ Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-

 ## Latest News

+- [2025/10/07] [Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)
 - [2025/09/17] [An Introduction to Speculative Decoding for Reducing Latency in AI Inference](https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/)
 - [2025/09/11] [How Quantization Aware Training Enables Low-Precision Accuracy Recovery](https://developer.nvidia.com/blog/how-quantization-aware-training-enables-low-precision-accuracy-recovery/)
 - [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
docs/source/deployment/1_tensorrt_llm.rst (5 additions, 2 deletions)
@@ -2,12 +2,15 @@
 TensorRT-LLM
 ==========================

+**Deprecation Notice**: The export_tensorrt_llm_checkpoint API will be deprecated in future releases. Users are encouraged to transition to the :doc:`unified HF export API <3_unified_hf>`, which provides enhanced functionality and flexibility for exporting models to multiple inference frameworks including TensorRT-LLM, vLLM, and SGLang.
+
 .. note::

-    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md>`_
+    Please read the `TensorRT-LLM checkpoint workflow <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/checkpoint.md>`_
     first before going through this section.


+
 ModelOpt toolkit supports automatic conversion of ModelOpt exported LLM to the TensorRT-LLM checkpoint and the engines for accelerated inferencing.

 This conversion is achieved by:
@@ -144,4 +147,4 @@ If the :meth:`export_tensorrt_llm_checkpoint <modelopt.torch.export.model_config
144
147
Convert to TensorRT-LLM
145
148
=======================
146
149
147
-
Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
150
+
Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/architecture/workflow.md#build-apis>`_ to build and deploy the quantized LLM.
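The deprecation notice added to this file points users at the unified HF export API. A minimal sketch of that path follows, assuming `export_hf_checkpoint` from `modelopt.torch.export` is the unified entry point and that the model has already been quantized; argument names are best-effort assumptions, not taken from this diff.

```python
# Sketch of the unified HF export path suggested by the deprecation notice.
# Check the unified HF export docs (3_unified_hf) for the authoritative API.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

# Load a Hugging Face model (example model id; calibration loop omitted for brevity).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)  # hypothetical calibration step

# Export a unified HF-style checkpoint that TensorRT-LLM, vLLM, or SGLang can consume.
export_hf_checkpoint(model, export_dir="exported_model")
```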
examples/llm_distill/README.md (1 addition, 0 deletions)
@@ -16,6 +16,7 @@ This section focuses on demonstrating how to apply Model Optimizer to perform kn
 | Distillation with NeMo | Learn how to distill your models with NeMo Framework |\[[Link](#knowledge-distillation-kd-for-nvidia-nemo-models)\]|\[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\]|
 | Distillation with Huggingface | Learn how to distill your models with Hugging Face |\[[Link](#knowledge-distillation-kd-for-huggingface-models)\]|\[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/4_distillation.html)\]|
 | Resources | Extra links to relevant resources |\[[Link](#resources)\]||
+| NeMo Prune + Distill Simplified Flow | Example script demonstrating end-to-end pruning plus distillation in NeMo |\[[Link](../nemo_run/prune_distill/README.md)\]||