On-the-fly quantization, if I understand correctly, refers to dynamic quantization, right?

We do support dynamic quantization, which does not require any calibration. The quantize.py script is just an example—you can modify it as needed to suit your requirements.

Here’s an example of how to pre-quantize the model:

import modelopt.torch.quantization as mtq

# Setup the model
pipe = ...

# The quantization algorithm requires calibration data. Below we show a rough example of how to
# set up a calibration data loader with the desired calib_size
data_loader = get_dataloader(num_samples=calib_size)


# Define the forward_loop function with the model as input. The data loader should be wrapped
# inside the function.
def forward_loop(model):
    pipe.transformer = model
    for batch in data_loader:
        model(batch)


# Quantize the model and perform calibration (PTQ)
pipe.transformer = mtq.quantize(pipe.transformer, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

After the model is quantized, it needs to be deployed, rather than runing the model directly in torch because we are just simulating the quantization at torch level, to get meaningful speedup the model has to be deployed, in the mean time, the torch ckpt can be shared between different users and users can restore the quantized ckpt using our API something like mto.restore(pipe.transformer, ckpt_path). In this case, we can use TensorRT via ONNX or PyTorch-TRT directly (though I haven’t tested Torch-TensorRT yet, but I plan to in a few days seems like torch.compile supports trt as backend). In addition to TensorRT, we also support SGLang, vLLM, and other similar frameworks for LLMs.

TRT-MO is highly extensible and can easily support various quantization algorithms. It includes FP8 per-tensor and per-channel quantization, INT8 SmoothQuant, INT4 AWQ, and even more advanced methods like W4A8, FP4, and SVDQuant, either already supported or coming soon.

Don't hesitate to contact us if you have any further questions

[Quantization] Support TRT as a backend #11032

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions