[Torch, QNN] Support dynamic quantization flow to enable importing quantized transformer models #6782
Conversation
This reverts commit cd90aa7.
(Force-pushed from d59328e to f05fac6.)
Overall looks good. Is this enough to run qBERT? I am surprised that we don't need to work on requantize here. Handling non-const values in requantize might be hard because of the fixed-point multiplication.
```python
# reduce_range became True in v1.6
if is_version_greater_than("1.5.1"):
    qmax = 127
```
What's happening here? Should this be coupled with the dtype? 127 should be for int8, while 255 is for uint8.
Here, they intentionally reduce the possible range of quantized values by half, i.e. from [qmin, qmax] to [qmin/2, qmax/2]. Since PyTorch only uses uint8, this is fine.
It's not clear to me why they do this, but the following PR has some explanation: "reduce_range option restricts the activation tensor to 7 bits instead of 8. This is necessary to enable per channel quant for RNNs and LSTMs" pytorch/pytorch#39041
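To make the effect concrete, here is a minimal sketch (a hypothetical helper for illustration, not the actual frontend code) of how reduce_range halves the quantization range; for PyTorch's uint8 activations it becomes [0, 127], which is why qmax is set to 127 above:

```python
def quant_range(dtype="uint8", reduce_range=False):
    """Return (qmin, qmax) for a quantized dtype.

    Hypothetical helper: with reduce_range=True the range is restricted
    to 7 bits, i.e. [qmin, qmax] -> [qmin // 2, qmax // 2].
    """
    if dtype == "uint8":
        qmin, qmax = 0, 255
    elif dtype == "int8":
        qmin, qmax = -128, 127
    else:
        raise ValueError("unsupported dtype: " + dtype)
    if reduce_range:
        qmin, qmax = qmin // 2, qmax // 2
    return qmin, qmax


assert quant_range("uint8", reduce_range=True) == (0, 127)
```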
Yes, this is enough. The dynamic quantization flow replaces fp32 dense with runtime qparam calculation + int8 dense, leaving everything else fp32. The output of …
I see, what you describe is …
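For readers following along, here is a rough, simplified sketch of what that pattern looks like in the Relay/QNN Python API. It is not the exact code in this PR: `data`, `weight_int8` and `weight_scale` are placeholders, the qparam formula omits clamping, reduce_range and bias handling, and the final dequantize reflects the state after the review below.

```python
from tvm import relay


def dynamic_dense_sketch(data, weight_int8, weight_scale, units):
    # Runtime qparam calculation from the fp32 activation tensor.
    data_min = relay.min(data)
    data_max = relay.max(data)
    # Simplified affine qparams for uint8 activations.
    input_scale = (data_max - data_min) / relay.const(255.0)
    input_zp = relay.cast(
        relay.round(relay.negative(data_min) / input_scale), "int32"
    )

    # Quantize the activations on the fly, then run an int8 dense.
    data_q = relay.qnn.op.quantize(data, input_scale, input_zp, out_dtype="uint8")
    dense = relay.qnn.op.dense(
        data_q,
        weight_int8,
        input_zero_point=input_zp,
        kernel_zero_point=relay.const(0, "int32"),
        input_scale=input_scale,
        kernel_scale=weight_scale,
        units=units,
        out_dtype="int32",
    )
    # Back to fp32; everything surrounding this op stays in float.
    return relay.qnn.op.dequantize(
        dense, input_scale * weight_scale, relay.const(0, "int32")
    )
```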
```python
bias_var = inputs[1][3]
# ...
dequant_scale = input_scale * weight_scale
dense_out = _op.cast(dense, "float32") * dequant_scale
```
It might be better to call the qnn.dequantize operation here.
Why is that better? I did try dequantize, but it didn't matter in terms of accuracy.
First, readability: since we are working with QNN ops, one can easily understand that we are going from the int to the float domain just by seeing the dequantize op.
Second, dequantize becomes the one place where we handle per-tensor and per-channel quantization scales and non-zero zero points. Internally, dequantize will lower to the exact same operations that you currently have in the parser. In the future, when we support per-channel scales for dense, we can just rely on the dequantize op.
Since I know the zero point is zero, I thought this should be a bit more efficient (it doesn't need to subtract the zero point; I'm not sure if qnn dequantize does this optimization).
We should add that to qnn dequantize. It should be easy. (I just checked: I missed that optimization; it exists in other ops like conv2d and dense.)
OK, replaced with dequantize. Thanks for the thought!
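To illustrate the change being discussed, here is a minimal sketch of the two variants, with `dense_int32` and `dequant_scale` standing in for the values in the diff above (an illustration under those assumptions, not the final code in the PR):

```python
from tvm import relay
from tvm.relay import op as _op


def dequantize_dense_output(dense_int32, dequant_scale):
    # Before: manual cast + multiply, only valid when the zero point is 0.
    manual = _op.cast(dense_int32, "float32") * dequant_scale

    # After: qnn.dequantize, which also covers non-zero zero points and,
    # once supported for dense, per-channel scales.
    via_qnn = relay.qnn.op.dequantize(
        dense_int32, dequant_scale, relay.const(0, "int32")
    )
    return manual, via_qnn
```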
Thanks @anijain2305
…antized transformer models (apache#6782)

* add stub and test
* per channel quantize
* calculate qparam correctly
* import qbert working
* support batched qdense
* test batched input
* fix mkl offloading of batch matmul
* reduce range become True in torch 1.6
* fix for 1.6
* Revert "fix mkl offloading of batch matmul" (This reverts commit cd90aa7.)
* fix merge
* fix
* lint fix
* fix black
* more black fix
* fix version check for 1.5.1
* disable assert on v1.4 (strange pytorch issue)
* minor fix
* use dequantize

Co-authored-by: masa <masa@pop-os.localdomain>
This adds support for what PyTorch calls "dynamic quantization", where weights are quantized ahead of time but activations are quantized on the fly at runtime. See more details in:
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#the-three-modes-of-quantization-supported-in-pytorch-starting-version-13
https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
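As a concrete reference, the PyTorch side of this flow is a one-liner. The sketch below (a toy model with placeholder shapes, not code from this PR) produces a TorchScript graph containing quantized::linear_dynamic ops, which is what the new converter handles:

```python
import torch

# A toy fp32 model; any module containing nn.Linear layers works the same way.
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 768),
)

# Dynamic quantization: nn.Linear weights are quantized to int8 ahead of time,
# activation qparams are computed on the fly at runtime.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# Tracing yields a graph with quantized::linear_dynamic ops.
traced = torch.jit.trace(model_int8, torch.randn(1, 768))
```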
TVM doesn't support such a quantization flow at the moment. This flow hits a sweet spot in terms of ease of use and performance, so I think it is worth supporting. Here are the pros/cons compared to static quantization (the one we do support):
Pros:
Cons:
My motivation for introducing this flow is to support quantized models from the transformers library, like BERT and GPT2, where dynamic quantization via PyTorch or ONNXRuntime is the only quantization path they support (from what I understand). See the following blog post and the accompanying notebook by the ONNXRuntime team for inspiration:
https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7
https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb
This PR has the changes required for supporting the dynamic quantization flow via QNN, and a quantized::linear_dynamic op converter in the PyTorch frontend.
I prepared a script to evaluate the accuracy and performance of BERT quantized via dynamic quantization and compiled by TVM. The accuracy is reasonable, but the performance is terrible: even with MKL enabled, TVM int8 is 3-4x slower than PyTorch (I haven't looked into the details). I sense a big opportunity here.
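For completeness, here is a hedged sketch of how such a traced, dynamically quantized module would be imported and compiled with TVM. The input name, shape, and target string are placeholders, and this is not the evaluation script mentioned above:

```python
import torch
import tvm
from tvm import relay

# A dynamically quantized module, traced as in the PyTorch sketch earlier.
model_int8 = torch.quantization.quantize_dynamic(
    torch.nn.Linear(768, 768), {torch.nn.Linear}, dtype=torch.qint8
)
traced = torch.jit.trace(model_int8, torch.randn(1, 768))

# Import into Relay; the input name/shape pair is a placeholder.
mod, params = relay.frontend.from_pytorch(traced, [("input", (1, 768))])

# Compile; the target string is an example (AVX2 CPU), adjust as needed.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm -mcpu=core-avx2", params=params)
```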
please review @anijain2305 @siju-samuel @t-vi @jwfromm