Describe your question
I am quantizing the QuartzNet5x5 model that I use for ASR. I followed the following example: My code is as follows:
The code runs without errors. I am getting lots of lines saying:
Then I get the result from the model. The result is correct; however, average inference time slowed from 0.29 seconds to 0.33 seconds. Am I missing something here?
Environment overview (please complete the following information)
Environment details
If an NVIDIA docker image is used you don't need to specify these.
Additional context
Tested on CPU. Tried setting num_threads to 1 and 2.
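For timings this close (0.29 s vs. 0.33 s), how the average is measured matters: a few warm-up runs should be discarded and many iterations averaged. The sketch below is a generic, hypothetical benchmarking helper (the `fake_infer` stand-in is not part of NeMo; the real call would be something like the model's transcribe/forward method):

```python
import time
import statistics

def benchmark(fn, *args, warmup=3, runs=20):
    """Average wall-clock latency of fn(*args), discarding warm-up runs."""
    for _ in range(warmup):
        fn(*args)  # warm caches, thread pools, JIT, etc.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.mean(times)

# Hypothetical stand-in for the model's inference call; replace with the
# actual ASR inference (e.g. the model's transcribe method in NeMo).
def fake_infer(n):
    return sum(i * i for i in range(n))

avg = benchmark(fake_infer, 10_000)
print(f"average latency: {avg:.6f} s")
```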
Replies: 1 comment
Hi, Ulucsahin

The speech_to_text_quant_infer.py script is meant for evaluating the numerical properties and accuracy of the quantized network, not speed/throughput. It is expected to be slower because it runs emulated quantization. The script can generate an ONNX graph with quantization nodes. To deploy the quantized model for optimal performance, you will need TensorRT 8.0 (refer to this GTC talk for more detail) to compile the model to int8 I/O and compute.

speech_to_text_quant_infer_trt.py is the script to run the quantized model with TensorRT and get optimal performance. Note that the quantization features are designed for GPU and TensorRT only. To run the quantized model on CPU, it will need a diff…
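To see why emulated quantization adds overhead rather than removing it, here is a minimal, generic sketch of "fake" quantization (not NeMo's actual implementation): each value is quantized to an int8 grid and immediately dequantized, so all arithmetic still runs in floating point, with extra round/clamp/rescale work on top.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize-dequantize a single value: simulates int8 precision in float."""
    q = round(x / scale)           # map onto the integer grid
    q = max(qmin, min(qmax, q))    # clamp to the int8 range
    return q * scale               # dequantize back to float

scale = 0.05
print(fake_quantize(1.234, scale))   # snapped to the nearest multiple of scale
print(fake_quantize(100.0, scale))   # saturates at qmax * scale
```

Only a backend that compiles the quantized graph to real int8 kernels (TensorRT on GPU, as the reply notes) turns these nodes into a speedup.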