❓ [Question] How do you properly deploy a quantized model with tensorrt #3267
Comments
@narendasan: Did you follow this tutorial? https://pytorch.org/TensorRT/tutorials/_rendered_examples/dynamo/vgg16_ptq.html
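For reference, the flow in that tutorial is roughly the one sketched below: INT8 PTQ with nvidia-modelopt followed by compilation through Torch-TensorRT's dynamo frontend. This is a hedged sketch from memory, not the exact tutorial code; the calibration dataloader and the input shape are placeholders, and the exact import path of export_torch_mode should be checked against the tutorial.

```python
import torch
import torch_tensorrt
import modelopt.torch.quantization as mtq
from modelopt.torch.quantization.utils import export_torch_mode  # assumed import path

# Placeholder calibration loop -- `calib_dataloader` is assumed to exist.
def calibrate_loop(model):
    for data, _ in calib_dataloader:
        model(data.cuda())

# Insert Q/DQ nodes and calibrate with modelopt (INT8 PTQ).
mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=calibrate_loop)

# Export the quantized model and compile it with Torch-TensorRT.
example_input = torch.randn(1, 11, 64, 64).cuda()
with torch.no_grad(), export_torch_mode():
    exp_program = torch.export.export(model, (example_input,))
    trt_model = torch_tensorrt.dynamo.compile(
        exp_program,
        inputs=[example_input],
        enabled_precisions={torch.int8},
    )
```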
@lanluo-nvidia or @peri044 can you provide additional guidance here?
@Urania880519 in terms of dynamic shape support in torch_tensorrt, …
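Dynamic shapes in Torch-TensorRT are usually expressed with torch_tensorrt.Input and a min/opt/max range. The sketch below is a generic example; the shape ranges are made up for a (N, 11, H, W) input and are not taken from this thread.

```python
import torch
import torch_tensorrt

# Assumed shape ranges for a (N, 11, H, W) input -- adjust to the real workload.
inputs = [
    torch_tensorrt.Input(
        min_shape=(1, 11, 64, 64),
        opt_shape=(4, 11, 128, 128),
        max_shape=(8, 11, 256, 256),
        dtype=torch.float32,
    )
]

# Compile through the dynamo frontend with the dynamic input specification.
trt_mod = torch_tensorrt.compile(model, ir="dynamo", inputs=inputs)
```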
❓ Question
I have a PTQ model and a QAT model trained with the official PyTorch API following the quantization tutorial, and I wish to deploy them on TensorRT for inference. The model is metaformer-like, using convolution layers as the token mixer. One part of the quantized model looks like this:
What you have already tried
I have tried different ways to make things work:
model_trt = torch2trt(
    model_fp32,
    [torch.randn(1, 11, 64, 64).to('cuda')],
    max_batch_size=batch_size,
    fp16_mode=False,
    int8_mode=True,
    calibrator=trainLoader,
    input_shapes=[(None, 11, None, None)],
)
I also tried torch.compile with the TensorRT backend. Here's the code I used:

trt_gm = torch.compile(model, dynamic=True, backend="tensorrt")
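When going through torch.compile, Torch-TensorRT also accepts backend settings through the options dict; the sketch below is a hedged example (the option names mirror torch_tensorrt.compile() arguments and should be checked against the torch_tensorrt docs for the installed version).

```python
import torch
import torch_tensorrt  # importing registers the "tensorrt"/"torch_tensorrt" backend

compiled = torch.compile(
    model,
    dynamic=True,
    backend="torch_tensorrt",
    options={
        # Assumed options -- these mirror torch_tensorrt.compile() arguments.
        "enabled_precisions": {torch.float16},
        "truncate_long_and_double": True,
        "min_block_size": 1,
    },
)

# The first call triggers compilation for the given input shape.
out = compiled(torch.randn(1, 11, 64, 64, device="cuda"))
```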
The ONNX model runs weirdly slowly with ONNX Runtime. Furthermore, the calculated loss is extremely high. Here's an example:
I tried to visualize the quantized ONNX model with Netron because converting the quantized ONNX model to a TRT engine always raises an error.
This is the problematic part of the graph
The rightmost DequantizeLinear node is causing the problem. I checked its x input and found that it's an int32 constant array, while x_scale is a float32 constant array. The output of this node turns out to be the bias passed into the Conv layer.
There must be something wrong in the behavior of the conversion. When quantizing with the PyTorch API, only activations and weights were observed by the defined observers, so I was expecting only the leftmost and the middle DequantizeLinear nodes, while the bias should be stored in fp32 and passed directly into the Conv layer. Using onnx-simplifier does not get rid of the node. Because of this incompatibility when converting the quantized torch model to ONNX, I'm not able to further convert the model into a TRT engine. I've considered using the ONNX API for quantization instead, but the performance drop from the unquantized original torch model to the ONNX model is quite concerning.
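To confirm that reading of the graph programmatically, something along these lines lists every DequantizeLinear node whose output feeds the bias (third) input of a Conv node; the file name is a placeholder.

```python
import onnx

model = onnx.load("quantized_model.onnx")  # placeholder path
graph = model.graph

# Map each tensor name to the node that produces it.
producer = {out: node for node in graph.node for out in node.output}

for node in graph.node:
    # Conv inputs are (X, W, B); the optional bias is input index 2.
    if node.op_type == "Conv" and len(node.input) >= 3:
        bias_src = producer.get(node.input[2])
        if bias_src is not None and bias_src.op_type == "DequantizeLinear":
            print(f"Conv '{node.name}' gets its bias from DequantizeLinear '{bias_src.name}'")
```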
The conversion code looks like this:
torch.onnx.export(
    quantized_model,
    dummy_input,
    args.onnx_export_path,
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
    export_params=True,
    keep_initializers_as_inputs=False,
    dynamic_axes={
        'input': {0: 'batch_size', 2: 'h', 3: 'w'},
        'output': {0: 'batch_size', 2: 'h', 3: 'w'},
    },
)
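For the ONNX-to-TensorRT step itself, the usual routes are trtexec or the TensorRT Python API with an explicit optimization profile for the dynamic axes. Below is a hedged sketch written against the TRT 8.x-style Python API; the file names and shape ranges are placeholders.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network (required for Q/DQ networks and dynamic shapes on TRT 8.x).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("quantized_model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)  # explicit (Q/DQ) quantization

# Optimization profile covering the dynamic batch / height / width axes.
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 11, 64, 64), (4, 11, 128, 128), (8, 11, 256, 256))
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(engine_bytes)
```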
Environment
How you installed PyTorch (conda, pip, libtorch, source): conda
Additional context
Personally, I think the torch.compile() API is the most promising way for me to successfully convert the quantized model, since there's no performance drop. Does anyone have relevant experience with handling quantized models?