
[Quantization] Support TRT as a backend #11032

Open
sayakpaul opened this issue Mar 11, 2025 · 9 comments

@sayakpaul
Member

Nice improvements here in https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/diffusers/quantization.

Would it make sense to support this?

Cc: @SunMarc @DN6

@sayakpaul
Member Author

@kevalmorabia97 would you maybe like to share some pointers? I think having TensorRT as an official quantization backend for diffusers would be really nice.

@SunMarc
Member

SunMarc commented Mar 13, 2025

I think it could be a nice option to have. I see that it will require exporting the model to ONNX format, but that shouldn't be a big issue. https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/diffusers/quantization/diffusion_trt.py
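
For illustration, the export step is essentially a torch.onnx.export call on the quantized module. Below is a minimal, self-contained sketch with a tiny stand-in module; the real script exports the pipeline's quantized transformer/UNet with model-specific inputs and dynamic axes, so the names and shapes here are just illustrative.

import torch

# Tiny stand-in for the pipeline's transformer, so the snippet runs on its own.
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.nn.functional.silu(self.proj(x))

model = TinyBlock().eval()
dummy_input = torch.randn(1, 16, 64)

# Export to ONNX; the input/output names and dynamic batch axis are illustrative.
torch.onnx.export(
    model,
    (dummy_input,),
    "tiny_block.onnx",
    input_names=["hidden_states"],
    output_names=["out"],
    dynamic_axes={"hidden_states": {0: "batch"}},
    opset_version=17,
)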

@kevalmorabia97

@jingyu-ml can you look into this?

@jingyu-ml

jingyu-ml commented Mar 13, 2025

Supporting Diffusers as an official runtime shouldn’t be a major issue but will require some engineering effort. We can also enable deployment through either torch-onnx-tensorrt or torch-tensorrt, both of which can be integrated. In this example, only torch-onnx-tensorrt is demonstrated.

Here are some pointers: quantize.py does the calibration and exports the model to ONNX. Then the user can compile the ONNX model into a TensorRT engine and load the engine into the inference pipeline, just like https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/diffusers/quantization/diffusion_trt.py does.
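
For reference, the ONNX-to-engine compilation step can be done with trtexec or with the TensorRT Python API. A rough sketch of the latter, assuming TensorRT 10.x (where networks are explicit-batch by default) and placeholder file names:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network()
parser = trt.OnnxParser(network, logger)

# Parse the ONNX model exported by quantize.py (file name is a placeholder).
with open("transformer.quant.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Build and serialize the engine with default builder settings.
config = builder.create_builder_config()
serialized_engine = builder.build_serialized_network(network, config)

with open("transformer.plan", "wb") as f:
    f.write(serialized_engine)

The resulting engine file can then be loaded by the inference pipeline, as in the diffusion_trt.py example linked above.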

Additionally, our tool supports (or will support) more quantization algorithms, such as SVDQuant, to maximize speedup for users with good quality.

@sayakpaul
Member Author

Good to meet you here!

SVDQuant has been on our radar for a long time now but we haven't been able to find a way to integrate it that respects our quantization design.

We support both kinds of quantization -- on the fly and loading a pre-quantized checkpoint. For on-the-fly quantization, we expect users to pass a quantization config (example).

For the four quantization backends we support (bitsandbytes, torchao, gguf, quanto), both options are available, except for gguf, which only supports loading pre-quantized checkpoints.
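
(For context, "on the fly" on the diffusers side looks roughly like the sketch below, shown here with the existing bitsandbytes backend; the model id and settings are just illustrative.)

import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# Quantize while loading by passing a quantization config to from_pretrained.
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)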

From what I understand, TRT requires a pre-quantized checkpoint by design, I think? Or can quantize.py be repurposed to perform quantization (according to a user-supplied quant config) on the fly?

@DN6 would love to hear your thoughts too.

@jingyu-ml

jingyu-ml commented Mar 14, 2025

On-the-fly quantization, if I understand correctly, refers to dynamic quantization, right?

We do support dynamic quantization, which does not require any calibration. The quantize.py script is just an example—you can modify it as needed to suit your requirements.

Here’s an example of how to pre-quantize the model:

import modelopt.torch.quantization as mtq

# Setup the model
pipe = ...

# The quantization algorithm requires calibration data. Below we show a rough example of how to
# set up a calibration data loader with the desired calib_size
data_loader = get_dataloader(num_samples=calib_size)


# Define the forward_loop function with the model as input. The data loader should be wrapped
# inside the function.
def forward_loop(model):
    pipe.transformer = model
    for batch in data_loader:
        model(batch)


# Quantize the model and perform calibration (PTQ)
pipe.transformer = mtq.quantize(pipe.transformer, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

After the model is quantized, it needs to be deployed rather than run directly in torch, because we are only simulating the quantization at the torch level; to get a meaningful speedup the model has to be deployed. In the meantime, the torch checkpoint can be shared between different users, and users can restore the quantized checkpoint using our API, something like mto.restore(pipe.transformer, ckpt_path). In this case, we can use TensorRT via ONNX or PyTorch-TRT directly (though I haven't tested Torch-TensorRT yet, I plan to in a few days; it seems torch.compile supports TRT as a backend). In addition to TensorRT, we also support SGLang, vLLM, and other similar frameworks for LLMs.
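
A rough sketch of that save/restore flow, continuing from the pipe in the snippet above (the checkpoint path is a placeholder, and the torch.compile line is the untested Torch-TensorRT path mentioned above):

import torch
import modelopt.torch.opt as mto

# Save the quantized state after mtq.quantize(...) + calibration so other users
# can restore it without re-calibrating.
mto.save(pipe.transformer, "transformer.quant.pt")

# In another session, restore the quantized state onto a freshly loaded transformer.
pipe.transformer = mto.restore(pipe.transformer, "transformer.quant.pt")

# Optional, untested: compile with the Torch-TensorRT backend for torch.compile.
# import torch_tensorrt  # registers the "tensorrt" backend
# pipe.transformer = torch.compile(pipe.transformer, backend="tensorrt")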

TRT-MO is highly extensible and can easily support various quantization algorithms. It includes FP8 per-tensor and per-channel quantization, INT8 SmoothQuant, INT4 AWQ, and even more advanced methods like W4A8, FP4, and SVDQuant, either already supported or coming soon.

Don't hesitate to contact us if you have any further questions.

@sayakpaul
Member Author

Off the top of my head, it's not clear to me how best to leverage the options you've described to enable the smoothest possible integration.

I guess on the diffusers side, we are a bit too occupied to start/lead the integration. So, I will keep this open for contributions.

@SyntaxDiffusion

I for one think this would be a great option; perhaps it could utilize the already-converted ONNX variants from BFL... https://huggingface.co/black-forest-labs/FLUX.1-dev-onnx

@jingyu-ml

Thank you! Our team will follow up with you soon to discuss the plan in more detail.
