[Quantization] Support TRT as a backend #11032
@kevalmorabia97 would you maybe like to share some pointers? I think having TensorRT as an official quantization backend for diffusers could be a nice addition.
I think that it could be a nice option to have. I see that it will require exporting the model to ONNX format, but that shouldn't be a big issue. https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/diffusers/quantization/diffusion_trt.py
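For context, the export step mentioned above could look roughly like the sketch below using plain torch.onnx.export. This is an illustration only, not the flow used in the linked example; the model id, input shapes, and opset version are assumptions.

```python
import torch
from diffusers import UNet2DConditionModel

# Placeholder model id; any diffusers UNet/transformer works the same way.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet", torch_dtype=torch.float16
).eval().to("cuda")


class UNetWrapper(torch.nn.Module):
    """Return a plain tensor so the ONNX exporter doesn't see a ModelOutput dict."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states):
        return self.unet(sample, timestep, encoder_hidden_states, return_dict=False)[0]


# Dummy inputs matching the UNet's forward signature (shapes are assumptions).
sample = torch.randn(2, 4, 96, 96, dtype=torch.float16, device="cuda")
timestep = torch.tensor([999], dtype=torch.float16, device="cuda")
encoder_hidden_states = torch.randn(2, 77, 1024, dtype=torch.float16, device="cuda")

torch.onnx.export(
    UNetWrapper(unet),
    (sample, timestep, encoder_hidden_states),
    "unet.onnx",
    input_names=["sample", "timestep", "encoder_hidden_states"],
    output_names=["out_sample"],
    opset_version=17,
)
```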
@jingyu-ml can you look into this?
Supporting Diffusers as an official runtime shouldn't be a major issue, but it will require some engineering effort. We can also enable deployment through either torch-onnx-tensorrt or torch-tensorrt, both of which can be integrated; in this example, only torch-onnx-tensorrt is demonstrated. Here are some pointers: quantize.py does the calibration and exports the model to ONNX. The user can then compile the ONNX model into a TensorRT engine and load that engine into their inference pipeline, just like https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/diffusers/quantization/diffusion_trt.py does. Additionally, our tool supports (or will support) more quantization algorithms, such as SVDQuant, to maximize speedup for users while preserving quality.
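For illustration, the "compile the ONNX model into a TensorRT engine" step could look roughly like the sketch below with the TensorRT Python API. This is an assumption-based sketch, not the exact flow from diffusion_trt.py; file names and builder flags are placeholders.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)

# On TensorRT 10+ explicit batch is the default and this flag is deprecated.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

# Parse the (quantized) ONNX model produced earlier.
with open("unet_quantized.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # keep unquantized layers in FP16

# Build and serialize the engine so the inference pipeline can load it later.
serialized_engine = builder.build_serialized_network(network, config)
with open("unet.plan", "wb") as f:
    f.write(serialized_engine)
```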
Good to meet you here! SVDQuant has been on our radar for a long time now, but we haven't been able to find a way to integrate it that respects our quantization design. We support both kinds of quantization: on-the-fly and loading a pre-quantized checkpoint. For on-the-fly quantization, we expect users to pass a quantization config (example). For the four quantization backends we support (bitsandbytes, torchao, gguf, quanto), we offer both on-the-fly quantization and loading a pre-quantized checkpoint, except for gguf. From what I understand, TRT requires a pre-quantized checkpoint by design, is that right? Or can it also quantize on the fly? @DN6 would love to hear your thoughts too.
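For reference, the "pass a quantization config" flow mentioned above looks roughly like this with diffusers' existing bitsandbytes backend (the model id is just an example; this is the current on-the-fly path, not a TRT integration):

```python
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

# On-the-fly 4-bit quantization at load time via a quantization config.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
```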
On-the-fly quantization, if I understand correctly, refers to dynamic quantization, right? We do support dynamic quantization, which does not require any calibration. Here's an example of how to pre-quantize the model:
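The original snippet is not preserved in this extract; the following is a hedged sketch of what PTQ calibration with TensorRT Model Optimizer typically looks like. The pipeline, prompts, and config choice are assumptions for illustration.

```python
import torch
from diffusers import StableDiffusionPipeline
import modelopt.torch.quantization as mtq

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

calib_prompts = ["a photo of an astronaut riding a horse on mars"] * 8


def forward_loop(unet):
    # Run a few denoising passes so the inserted quantizers can collect
    # calibration statistics (the UNet inside `pipe` is the module being quantized).
    for prompt in calib_prompts:
        pipe(prompt, num_inference_steps=4)


# FP8 per-tensor quantization; other configs (e.g. INT8 SmoothQuant) also exist.
quantized_unet = mtq.quantize(pipe.unet, mtq.FP8_DEFAULT_CFG, forward_loop)
```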
After the model is quantized, it needs to be deployed rather than run directly in torch, because we are only simulating quantization at the torch level; to get a meaningful speedup the model has to be deployed. In the meantime, the torch checkpoint can be shared between different users, and users can restore the quantized checkpoint using our API. TRT-MO is highly extensible and can easily support various quantization algorithms. It includes FP8 per-tensor and per-channel quantization, INT8 SmoothQuant, INT4 AWQ, and even more advanced methods like W4A8, FP4, and SVDQuant, either already supported or coming soon. Don't hesitate to contact us if you have any further questions.
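The comment above elides the exact API name; saving and restoring ModelOpt quantization state is typically done through modelopt.torch.opt, roughly as sketched below. Paths are placeholders; quantized_unet is the module from the calibration sketch above, and fresh_unet is assumed to be a newly instantiated, unquantized copy of the same architecture.

```python
import modelopt.torch.opt as mto

# On the machine that ran calibration: save the architecture and quantizer state
# alongside the weights so the checkpoint can be shared.
mto.save(quantized_unet, "unet_quantized_modelopt.pth")

# Elsewhere: rebuild the original (unquantized) module, then restore the
# quantized state onto it before export/deployment.
restored_unet = mto.restore(fresh_unet, "unet_quantized_modelopt.pth")
```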
Off the top of my head, it's unclear how to best leverage the options you're describing for the smoothest possible integration. On the diffusers side, we are a bit too occupied to start/lead the integration, so I will keep this open for contributions.
I for one think this would be a great option; perhaps it could utilize the already-converted ONNX variants from BFL: https://huggingface.co/black-forest-labs/FLUX.1-dev-onnx
Thank you! Our team will follow up with you soon to discuss the plan in more detail.
Nice improvements here in https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/diffusers/quantization.
Would it make sense to support this?
Cc: @SunMarc @DN6