Quantization is a widely-used model compression technique that can reduce model size while also improving inference and training latency.
The full-precision data is converted to a lower precision with little degradation in model accuracy, and the quantized model achieves higher inference performance by saving memory bandwidth and accelerating computation with low-precision instructions. Intel provides several lower-precision instructions (for example, 8-bit or 16-bit multipliers), and both training and inference can benefit from them.
Refer to the Intel article on lower numerical precision inference and training in deep learning.
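To make the mapping from full precision to low precision concrete, the sketch below implements plain affine (scale and zero-point) quantization of a float array to 8-bit integers with NumPy. The function names and the min/max range choice are illustrative assumptions, not the API of any particular framework.

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Quantize a float array to unsigned integers: q = round(x / scale) + zero_point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Guard against a constant tensor, where max == min.
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    """Map the integers back to floats: x ~= (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = affine_quantize(x)
print(np.abs(x - affine_dequantize(q, scale, zp)).max())  # small reconstruction error
```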
Quantization methods include the following three types:
| Types | Quantization | Dataset Requirements | Framework | Backend |
|---|---|---|---|---|
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |
| | | | ONNX Runtime | QLinearops/QDQ |
| Post-Training Dynamic Quantization | weights | none | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | ONNX Runtime | QIntegerops |
| Quantization-aware Training (QAT) | weights and activations | fine-tuning | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |
Post-Training Static Quantization quantizes a model that has already been trained. It requires an additional calibration pass over a representative dataset; the calibration is used only to determine the quantization parameters of the activations, while the weights are quantized ahead of time.
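The following is a minimal sketch of post-training static quantization using PyTorch's eager-mode API (`torch.ao.quantization`). The toy model, the placeholder calibration data, and the `fbgemm` backend choice are assumptions for illustration, not part of the table above.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # float -> int8 boundary
        self.fc = nn.Linear(16, 4)
        self.relu = nn.ReLU()
        self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

model = TinyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")  # x86 backend
prepared = torch.ao.quantization.prepare(model)           # insert observers

# Calibration: run representative data so observers record activation ranges.
calib_loader = [torch.randn(8, 16) for _ in range(10)]    # placeholder calibration data
with torch.no_grad():
    for batch in calib_loader:
        prepared(batch)

quantized = torch.ao.quantization.convert(prepared)       # replace modules with int8 versions
print(quantized)
```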
Post-Training Dynamic Quantization quantizes the weights ahead of time, while the activations are quantized on the fly during inference: each input value is multiplied by a scaling factor and rounded to the nearest integer, with the activation scale factor determined dynamically from the data range observed at runtime.
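Below is a minimal sketch of post-training dynamic quantization with PyTorch's `torch.ao.quantization.quantize_dynamic`; the toy model and the choice to quantize only `nn.Linear` layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Weights of nn.Linear are quantized to int8 ahead of time;
# activation scales are computed dynamically at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 128))
print(out.shape)
```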
Quantization-aware Training (QAT) simulates quantization during training and typically provides higher accuracy than post-training quantization, but it may require additional hyper-parameter tuning and take more time to deploy.
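The sketch below outlines QAT with PyTorch's eager-mode API; the toy model, the placeholder training loop, and the `fbgemm` backend are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare_qat(model)        # insert fake-quant modules

# Fine-tune with fake quantization in the forward pass (placeholder training loop).
optimizer = torch.optim.SGD(prepared.parameters(), lr=1e-3)
for _ in range(5):
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(prepared(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

quantized = torch.ao.quantization.convert(prepared.eval())  # real int8 modules
```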
For quantization-related examples, please refer to the Quantization examples.