To increase quantization support in TVM, it is necessary to support the pre-quantized models, i.e., the models that have been quantized in the framework itself (outside of Relay). In this issue, we are laying down the high-level API design for some of the quantized operators. A large portion of this is coming from the following relevant discussions. Thanks to @jackwish, @FrozenGene and @jnorwood for sharing their experiences with quantization, and also @shoubhik for helping design this RFC.
Covered frameworks for now - TFLite and MxNet
Target network for now - Inception V3 from TFLite. (I will create one for Mxnet)
Target platforms for now - ARM and Intel (will create separate Issue as the project progresses)
List of required operators - quantize, quantized_conv2d, quantized_relu, quantized_pool2d, quantized_fully_connected, quantized_concat, dequantize
It will be good if we can agree on the Relay ops - their inputs/outputs and attributes. The initial proposal for the quantize, quantized_conv2d and dequantize ops is as follows (the other quantized_* operators will be along the same lines as quantized_conv2d).
Op quantize
def quantize(data, scale, zero_point, out_dtype):
    """Quantize takes the scale and zero_point attributes and quantizes the
    FP32 input data to an int8/uint8 tensor.

    Parameters
    ----------
    data: FP32 tensor
        The input tensor in FP32.
    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.
    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.
    out_dtype: String
        The dtype of the output. Can only be int8/uint8.

    Returns
    -------
    quantized_data: int8/uint8 tensor
        The quantized tensor.
    """
Key points to discuss
The scale and zero_point calculations happen outside the Relay graph, i.e., the framework parsers will have to compute the scale and offset if only min and max are provided. There is a reference implementation in TFLite. This can also be thought of as a framework-parser utility where we handle min/max, symmetric/asymmetric quantization, etc., and generate the scale and zero_point the same way the frameworks do; a sketch of such a utility follows.
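A minimal sketch for the asymmetric uint8 case, loosely following the TFLite reference implementation (the function name is hypothetical, and symmetric/int8 handling is omitted):

def get_scale_and_zero_point(min_val, max_val, qmin=0, qmax=255):
    # TFLite requires the real value 0.0 to be exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    if scale == 0.0:
        return 1.0, qmin  # degenerate range: all values equal
    # The zero point is the quantized value that maps back to real 0.0,
    # rounded and nudged into [qmin, qmax].
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point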
Op quantized_conv2d
def quantized_conv2d(quantized_data, quantized_kernel,
                     input_scale, input_zero_point,
                     kernel_scale, kernel_zero_point,
                     output_scale, output_zero_point,
                     out_dtype,
                     # All the old remaining ones from conv2d
                     strides=(1, 1),
                     padding=(0, 0),
                     dilation=(1, 1),
                     groups=1,
                     channels=None,
                     kernel_size=None,
                     data_layout="NCHW",
                     kernel_layout="OIHW",
                     out_layout=""):
    """Quantized conv2d takes quantized input and kernel tensors along with
    their scale and zero_point attributes and computes a quantized int8/uint8
    output. The scale and zero_point calculations happen outside the Relay
    graph, i.e., the framework parsers will have to compute the scale and
    offset if only min and max are provided.

    Parameters
    ----------
    quantized_data: int8/uint8 tensor
        The quantized input tensor in int8/uint8.
    quantized_kernel: int8/uint8 tensor
        The quantized kernel tensor in int8/uint8.
    input_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_data int8 values back to FP32.
    input_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_data distribution.
    kernel_scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the quantized_kernel int8 values back to FP32.
    kernel_zero_point: Int32 zero point (An attribute of the op)
        The zero point of the quantized_kernel distribution.
    output_scale: FP32 scalar (An attribute of the op)
        The output scale is set during the quantization process using
        training/calibration. The float scalar to scale the quantized_output
        int8 values back to FP32.
    output_zero_point: Int32 zero point (An attribute of the op)
        The output zero point is set during the quantization process using
        training/calibration. The zero point of the quantized_output
        distribution.
    out_dtype: String
        The dtype of the quantized_output. Can only be int8/uint8.
        The requantization from int32 to int8/uint8 is a part of the op compute.
    .....
    Other attributes are the same as in conv2d.

    Returns
    -------
    quantized_output: int8/uint8 tensor
        The quantized tensor.
    """
Key points to discuss further
This op has a set of computations that could ideally be pre-computed, but that is difficult because fold-constant only works across Relay ops and not within a Relay op. This has been discussed in more detail on the Discuss forum.
First pre-computable - The core computation has some compute involving the kernel (Term 2 and Term 4 in the above link) that will be part of the TVM compute. This is very hard to avoid. We need a fused compute to get the best performance; the expansion below shows where these terms come from.
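Here is a runnable sketch of that expansion for a single output element (the term numbering here is only indicative; the linked discussion defines the exact numbering):

import numpy as np

# One output element accumulates K products of (activation - z_a) * (weight - z_w).
K = 16
q_a = np.random.randint(0, 256, K).astype(np.int32)     # quantized activations
q_w = np.random.randint(-128, 128, K).astype(np.int32)  # quantized kernel values
z_a, z_w = 128, 3                                        # example zero points

lhs = np.sum((q_a - z_a) * (q_w - z_w))
rhs = (np.sum(q_a * q_w)        # data * kernel: must stay in the fused TVM compute
       - z_w * np.sum(q_a)      # needs the input data, computed at runtime
       - z_a * np.sum(q_w)      # depends only on the constant kernel: pre-computable
       + K * z_a * z_w)         # constant: pre-computable
assert lhs == rhs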
Second pre-computable - The output scale and zero_point are used to calculate an integer multiplier and shift so that all the computations stay in the integer domain. This computation changes for each op (e.g., concat will handle this in a different manner compared to conv). So, this computation is also kept inside the quantized_conv2d op. It could be avoided by changing the API and replacing output_scale with output_multiplier and output_shift. But this seems very specific to TFLite, and one might want to handle the output_scale and output_offset in a different manner; a sketch of the TFLite-style derivation follows below. I am not sure about this part, so please comment.
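A possible derivation of output_multiplier and output_shift from the scales, along the lines of TFLite/gemmlowp (the helper name is illustrative; TVM is free to handle output_scale differently, which is exactly the open question above):

import math

def get_output_multiplier_and_shift(input_scale, kernel_scale, output_scale):
    # The int32 accumulator is in units of input_scale * kernel_scale;
    # requantizing to the output scale multiplies by this real number.
    real_multiplier = (input_scale * kernel_scale) / output_scale
    # Decompose as real_multiplier = frac * 2**shift with 0.5 <= frac < 1,
    # then store frac as a Q31 fixed-point integer multiplier.
    frac, shift = math.frexp(real_multiplier)
    output_multiplier = int(round(frac * (1 << 31)))
    if output_multiplier == (1 << 31):  # rounding pushed frac up to 1.0
        output_multiplier //= 2
        shift += 1
    return output_multiplier, shift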
The op already has the requantization portion accounted for. As far as I understand, the requantization portion is just a clamp for out_dtype. (The handling of output_multiplier and output_shift, as mentioned above, is for the calculation of output quantized tensor and not for requantization).
Op dequantize
Dequantization is required while connecting a quantized operator and an FP32 operator. This might be a temporary stage where we do not have a quantized implementation of the second op. Dequantization might also be required at the end of the network to keep the output of the graph in FP32.
def dequantize(quantized_data, scale, zero_point, out_dtype):
    """Dequantize takes the scale and zero_point attributes and dequantizes
    the int8/uint8 tensor to an FP32 tensor.

    Parameters
    ----------
    quantized_data: int8/uint8 quantized input tensor
        The input tensor in int8/uint8.
    scale: FP32 scalar (An attribute of the op)
        The float scalar to scale the int8 values back to FP32.
    zero_point: Int32 zero point (An attribute of the op)
        The zero point of the distribution.
    out_dtype: String
        The dtype of the output. Can only be float32.

    Returns
    -------
    data: FP32 tensor
        The dequantized tensor.
    """