
Quantizing QuartzNet5x5 Reduces Inference Speed #2108

Answered by skyw
ulucsahin asked this question in Q&A

Hi, Ulucsahin

The speech_to_text_quant_infer.py script is meant for evaluating the numerical properties and accuracy of the quantized network, not its speed or throughput. It is expected to be slower because it runs emulated quantization: quantize/dequantize operations are inserted on top of the original floating-point compute, so nothing actually runs in int8.
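
As a rough illustration of why emulated quantization adds cost without speedup, here is a minimal fake-quantization sketch in numpy (not the NeMo implementation; per-tensor symmetric int8 is assumed for simplicity). Each tensor is rounded to the int8 grid and immediately mapped back to float, so the int8 rounding error is reproduced while all arithmetic stays in fp32:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    """Emulate symmetric signed quantization in floating point.

    The tensor is scaled to the integer grid, rounded, clipped, and
    immediately dequantized. This reproduces quantization error for
    accuracy evaluation, but adds work on top of the fp32 compute,
    which is why emulated runs are slower than the fp32 baseline.
    """
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    scale = np.abs(x).max() / qmax          # per-tensor symmetric scale
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return (q * scale).astype(x.dtype)      # back to float: no int8 kernels used

x = np.random.randn(4, 4).astype(np.float32)
xq = fake_quantize(x)                       # same shape, int8-grid values in fp32
```

The round-trip error is bounded by half a quantization step, which is exactly the error the evaluation script is trying to measure.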

The script can also generate an ONNX graph with quantization nodes. To deploy the quantized model for optimal performance, you will need TensorRT 8.0 (refer to this GTC talk for more detail) to compile the model down to int8 I/O and compute. speech_to_text_quant_infer_trt.py is the script that runs the quantized model through TensorRT to get optimal performance. Note that the quantization features are designed for GPU and TensorRT only. To run the quantized model on CPU, it will need a diff…
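
For context on what those "quantization nodes" encode, here is a simplified numpy sketch of the QuantizeLinear/DequantizeLinear operator semantics from the ONNX spec (per-tensor scale, signed int8; the real operators also support per-axis scales and uint8). TensorRT recognizes these Q/DQ pairs in the exported graph and fuses them into true int8 kernels:

```python
import numpy as np

def quantize_linear(x, scale, zero_point=0):
    """ONNX QuantizeLinear (simplified): float -> saturated int8."""
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

def dequantize_linear(q, scale, zero_point=0):
    """ONNX DequantizeLinear (simplified): int8 -> float."""
    return (q.astype(np.int32) - zero_point) * scale

# Round-trip example: error stays within half a quantization step.
x = np.array([0.5, -1.0, 2.0], dtype=np.float32)
scale = np.abs(x).max() / 127
x_roundtrip = dequantize_linear(quantize_linear(x, scale), scale)
```

On CPU-only runtimes these Q/DQ nodes are typically just executed as-is (quantize, then dequantize), which is why the TensorRT path is needed to actually get int8 speedups.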

Answer selected by okuchaiev

This discussion was converted from issue #2051 on April 23, 2021 23:50.