Describe your question
I am quantizing the QuartzNet5x5 model that I use for ASR. I followed the following example: My code is as follows:
The code runs without errors. I am getting lots of lines saying:
Then I get the result from the model. The result is correct; however, average inference time slowed from 0.29 seconds to 0.33 seconds. Am I missing something here?
Environment overview (please complete the following information)
Environment details
If an NVIDIA docker image is used you don't need to specify these.
Additional context
Tested on CPU. Tried setting num_threads to 1 and 2.
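For timings this close (0.29 s vs. 0.33 s), how the average is measured matters: a few warm-up runs should be discarded and many iterations averaged. The sketch below is a generic, hypothetical benchmarking helper (the `fake_infer` stand-in is not part of NeMo; the real call would be something like the model's transcribe/forward method):

```python
import time
import statistics

def benchmark(fn, *args, warmup=3, runs=20):
    """Average wall-clock latency of fn(*args), discarding warm-up runs."""
    for _ in range(warmup):
        fn(*args)  # warm caches, thread pools, JIT, etc.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.mean(times)

# Hypothetical stand-in for the model's inference call; replace with the
# actual ASR inference (e.g. the model's transcribe method in NeMo).
def fake_infer(n):
    return sum(i * i for i in range(n))

avg = benchmark(fake_infer, 10_000)
print(f"average latency: {avg:.6f} s")
```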
Replies: 1 comment
Hi, Ulucsahin

The speech_to_text_quant_infer.py script is meant for evaluating the numerical properties and accuracy of the quantized network, not speed/throughput. It is expected to be slower because it runs emulated quantization. The script can generate an ONNX graph with quantization nodes. To deploy the quantized model for optimal performance, you will need TensorRT 8.0 (refer to this GTC talk for more detail) to compile the model to int8 I/O and compute.

speech_to_text_quant_infer_trt.py is the script to run the quantized model with TensorRT and get optimal performance. Note that the quantization features are designed for GPU and TensorRT only. To run the quantized model on CPU, it will need a diff…
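To see why emulated quantization adds overhead rather than removing it, here is a minimal, generic sketch of "fake" quantization (not NeMo's actual implementation): each value is quantized to an int8 grid and immediately dequantized, so all arithmetic still runs in floating point, with extra round/clamp/rescale work on top.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize-dequantize a single value: simulates int8 precision in float."""
    q = round(x / scale)           # map onto the integer grid
    q = max(qmin, min(qmax, q))    # clamp to the int8 range
    return q * scale               # dequantize back to float

scale = 0.05
print(fake_quantize(1.234, scale))   # snapped to the nearest multiple of scale
print(fake_quantize(100.0, scale))   # saturates at qmax * scale
```

Only a backend that compiles the quantized graph to real int8 kernels (TensorRT on GPU, as the reply notes) turns these nodes into a speedup.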