Quantization is a widely-used model compression technique that can reduce model size while also improving inference and training latency.
The full-precision data is converted to a lower precision with little degradation in model accuracy, and the quantized model achieves higher inference performance by saving memory bandwidth and accelerating computation with low-precision instructions. Intel provides several lower-precision instructions (for example, 8-bit or 16-bit multipliers), and both training and inference can benefit from them.
Refer to the Intel article on lower numerical precision inference and training in deep learning.
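To make the mapping from full precision to low precision concrete, the sketch below implements plain affine (scale and zero-point) quantization of a float array to 8-bit integers with NumPy. The function names and the min/max range choice are illustrative assumptions, not the API of any particular framework.

```python
import numpy as np

def affine_quantize(x, num_bits=8):
    """Quantize a float array to unsigned integers: q = round(x / scale) + zero_point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    x_min, x_max = float(x.min()), float(x.max())
    # Guard against a constant tensor, where max == min.
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    """Map the integers back to floats: x ~= (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = affine_quantize(x)
print(np.abs(x - affine_dequantize(q, scale, zp)).max())  # small reconstruction error
```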
Quantization methods include the following three types:
| Types | Quantization | Dataset Requirements | Framework | Backend |
|---|---|---|---|---|
| Post-Training Static Quantization (PTQ) | weights and activations | calibration | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |
| | | | ONNX Runtime | QLinearops/QDQ |
| Post-Training Dynamic Quantization | weights | none | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | ONNX Runtime | QIntegerops |
| Quantization-aware Training (QAT) | weights and activations | fine-tuning | PyTorch | PyTorch Eager/PyTorch FX/IPEX |
| | | | TensorFlow | TensorFlow/Intel TensorFlow |
Post-Training Static Quantization quantizes a model that has already been trained. It requires an additional calibration pass over a representative dataset; the calibration is used only to determine the quantization parameters of the activations, while the weights are quantized ahead of time.
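The following is a minimal sketch of post-training static quantization using PyTorch's eager-mode API (`torch.ao.quantization`). The toy model, the placeholder calibration data, and the `fbgemm` backend choice are assumptions for illustration, not part of the table above.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()      # float -> int8 boundary
        self.fc = nn.Linear(16, 4)
        self.relu = nn.ReLU()
        self.dequant = torch.ao.quantization.DeQuantStub()  # int8 -> float boundary

    def forward(self, x):
        return self.dequant(self.relu(self.fc(self.quant(x))))

model = TinyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")  # x86 backend
prepared = torch.ao.quantization.prepare(model)           # insert observers

# Calibration: run representative data so observers record activation ranges.
calib_loader = [torch.randn(8, 16) for _ in range(10)]    # placeholder calibration data
with torch.no_grad():
    for batch in calib_loader:
        prepared(batch)

quantized = torch.ao.quantization.convert(prepared)       # replace modules with int8 versions
print(quantized)
```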
Post-Training Dynamic Quantization quantizes the weights ahead of time, while the activations are quantized on the fly during inference: each input value is multiplied by a scaling factor and rounded to the nearest integer, with the activation scale factor determined dynamically from the data range observed at runtime.
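Below is a minimal sketch of post-training dynamic quantization with PyTorch's `torch.ao.quantization.quantize_dynamic`; the toy model and the choice to quantize only `nn.Linear` layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()

# Weights of nn.Linear are quantized to int8 ahead of time;
# activation scales are computed dynamically at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 128))
print(out.shape)
```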
Quantization-aware Training (QAT) simulates quantization during training and typically provides higher accuracy than post-training quantization, but it may require additional hyper-parameter tuning and take more time to deploy.
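The sketch below outlines QAT with PyTorch's eager-mode API; the toy model, the placeholder training loop, and the `fbgemm` backend are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyModel().train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare_qat(model)        # insert fake-quant modules

# Fine-tune with fake quantization in the forward pass (placeholder training loop).
optimizer = torch.optim.SGD(prepared.parameters(), lr=1e-3)
for _ in range(5):
    x, y = torch.randn(8, 16), torch.randn(8, 4)
    loss = nn.functional.mse_loss(prepared(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

quantized = torch.ao.quantization.convert(prepared.eval())  # real int8 modules
```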
For quantization-related examples, please refer to the Quantization examples.