
[Torch, QNN] Support dynamic quantization flow to enable importing quantized transformer models #6782

Merged (19 commits) on Oct 29, 2020

Conversation

@masahi (Member) commented Oct 28, 2020

This adds support for what PyTorch calls "dynamic quantization", where weights are quantized ahead of time but activations are quantized on the fly at runtime. See more details in:
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#the-three-modes-of-quantization-supported-in-pytorch-starting-version-13
https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
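
For reference, the PyTorch side of this flow is essentially a one-liner; a minimal sketch (the toy model here is just illustrative, not from this PR):

import torch

# Any eager-mode fp32 model; nn.Linear layers are the usual targets.
model_fp32 = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU())

# Weights are converted to int8 ahead of time; the activation scale/zero point
# are computed on the fly for each input at runtime.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

out = model_int8(torch.randn(1, 128))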

TVM doesn't support such a quantization flow at the moment. This flow hits a sweet spot in terms of ease of use and performance, so I think it is worth supporting. Here are the pros and cons compared to static quantization (the one we do support):

Pros:

  • The API is trivial and quantization is automatic. There is no need to rewrite the model or do calibration (which is required for the other quantization workflows in PyTorch).
  • Weights are quantized ahead of time, so the model size becomes much smaller. We can also use int8 math.

Cons:

  • Scale and zero point calculation is done at runtime, so there is some overhead compared to the more standard static quantization.

My motivation for introducing this flow is to support quantized models from transformers like BERT and GPT2, where dynamic quantization via PyTorch or ONNXRuntime is the only quantization path they support (from what I understand). See the following blog post and the accompanying notebook by the ONNXRuntime team for inspiration.

https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7
https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb

This PR has the changes required to support the dynamic quantization flow via QNN:

  • Support non-constant qparams in the QNN quantize and dense ops (see the sketch after this list).
  • Add a TorchScript quantized::linear_dynamic op converter in the PyTorch frontend.
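
For the first bullet, the activation qparams can no longer be Python constants folded into the graph; they have to be Relay expressions evaluated at runtime. A rough sketch of the idea (the helper name and the asymmetric-uint8 formula are mine for illustration, not the exact code in this PR):

from tvm import relay
from tvm.relay import op as _op

def dynamic_uint8_qparams(data):
    # data: fp32 Relay expression (the activation tensor).
    # The quantization range must contain zero.
    mn = _op.minimum(_op.min(data), relay.const(0.0))
    mx = _op.maximum(_op.max(data), relay.const(0.0))
    qmin, qmax = 0.0, 255.0
    scale = (mx - mn) / relay.const(qmax - qmin)
    zero_point = _op.cast(_op.round(relay.const(qmin) - mn / scale), "int32")
    return scale, zero_point

# scale and zero_point are Relay expressions, so they can be fed to
# qnn.quantize / qnn.dense even though their values are only known at runtime.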

I prepared a script to evaluate the accuracy and performance of BERT quantized via dynamic quantization and compiled by TVM. The accuracy is reasonable, but the performance is terrible: even with MKL enabled, TVM int8 is 3-4x slower than PyTorch (I haven't looked into the details). I sense a big opportunity here.
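
The script itself is not part of this PR, but the flow it exercises looks roughly like this (model name, sequence length, and target are placeholders, not the exact setup behind the numbers above):

import torch
import tvm
from tvm import relay
from transformers import BertForSequenceClassification

# fp32 BERT from HuggingFace, then PyTorch dynamic quantization of all Linear layers
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", torchscript=True).eval()
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Trace and import into Relay via the converter added in this PR
input_ids = torch.randint(0, 30522, (1, 128))
script_module = torch.jit.trace(qmodel, [input_ids])
mod, params = relay.frontend.from_pytorch(script_module, [("input_ids", list(input_ids.shape))])

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)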

please review @anijain2305 @siju-samuel @t-vi @jwfromm

@masahi masahi changed the title Torch quant linear dynamic [Torch, QNN] Support dynamic quantization flow to enable importing quantized transformer models Oct 28, 2020
@anijain2305 (Contributor) left a comment:

Overall looks good. Is this enough to run qBERT? I am surprised that we don't need to work on requantize here. Handling non-constant values in requantize might be hard because of the fixed-point multiplication.


# reduce_range became True in v1.6
if is_version_greater_than("1.5.1"):
    qmax = 127
Contributor:

What's happening here? Should this be coupled with the dtype? 127 should be for int8, while 255 should be for uint8.

Member Author:

This comes from https://github.com/pytorch/pytorch/blob/d642992877139671466d2a96663abede9e39ad55/aten/src/ATen/native/quantized/cpu/quant_utils.h#L64-L66

Here, they intentionally reduce the possible range of quantized values by half, i.e. from [qmin, qmax] to [qmin/2, qmax/2]. Since PyTorch only uses uint8, this is fine.

It's not clear to me why they do this, but the following PR has some explanation: "reduce_range option restricts the activation tensor to 7 bits instead of 8. This is necessary to enable per channel quant for RNNs and LSTMs" pytorch/pytorch#39041
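
A plain-Python sketch of the effect (paraphrasing the idea in the linked quant_utils.h, not a verbatim port): with reduce_range, the uint8 window [0, 255] shrinks to [0, 127], and the scale and zero point are chosen over that smaller window.

def choose_qparams_uint8(xmin, xmax, reduce_range=True):
    qmin, qmax = 0, 255
    if reduce_range:
        # restrict to 7 bits, as newer PyTorch does for dynamic quantization
        qmin, qmax = qmin // 2, qmax // 2
    # the representable range must include zero
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale))
    return scale, max(qmin, min(qmax, zero_point))

print(choose_qparams_uint8(-1.0, 1.0))         # reduced range: scale ~ 0.0157
print(choose_qparams_uint8(-1.0, 1.0, False))  # full uint8 range: scale ~ 0.0078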

@masahi (Member Author) commented Oct 28, 2020

> Overall looks good. Is this enough to run qBERT? I am surprised that we don't need to work on requantize here.

Yes, this is enough. The dynamic quantization flow replaces fp32 dense with runtime qparam calculation + int8 dense, leaving everything else fp32.

The output of the linear_dynamic op is fp32, so I don't think we need requantize. PyTorch just casts the int32 output to fp32 and multiplies by the input and weight scales, which I followed here. The corresponding implementation is here:
https://github.com/pytorch/FBGEMM/blob/master/include/fbgemm/OutputProcessing-inl.h#L232
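
In NumPy terms, the output stage amounts to something like this (a sketch with both zero points taken as zero to keep it short; shapes and scales are made up):

import numpy as np

x_q = np.random.randint(-128, 128, size=(1, 768), dtype=np.int8)    # quantized activation
w_q = np.random.randint(-128, 128, size=(768, 768), dtype=np.int8)  # quantized weight
s_x, s_w = 0.02, 0.01                                               # activation / weight scales
bias = np.zeros(768, dtype=np.float32)

acc_int32 = x_q.astype(np.int32) @ w_q.astype(np.int32).T           # int8 dense -> int32 accumulator
y_fp32 = acc_int32.astype(np.float32) * (s_x * s_w) + bias          # cast + scale, no requantize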

@anijain2305 (Contributor) commented Oct 28, 2020

> Overall looks good. Is this enough to run qBERT? I am surprised that we don't need to work on requantize here.
>
> Yes, this is enough. The dynamic quantization flow replaces fp32 dense with runtime qparam calculation + int8 dense, leaving everything else fp32.
>
> The output of the linear_dynamic op is fp32, so I don't think we need requantize. PyTorch just casts the int32 output to fp32 and multiplies by the input and weight scales, which I followed here. The corresponding implementation is here:
> https://github.com/pytorch/FBGEMM/blob/master/include/fbgemm/OutputProcessing-inl.h#L232

I see, what you describe is a dequantize operation with in_scale = input_scale * weight_scale and in_zero_point = 0. This means that there will be lots of quantize and dequantize ops in the graph. That explains why requantize never showed up.
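
In Relay terms, the cast-and-multiply currently in the converter is equivalent to something like this (a sketch with placeholder shapes; note that the combined scale is itself an expression computed at runtime):

from tvm import relay

acc = relay.var("acc", shape=(1, 768), dtype="int32")                # int32 output of qnn.dense
input_scale = relay.var("input_scale", shape=(), dtype="float32")    # computed at runtime
weight_scale = relay.const(0.01, "float32")                          # known ahead of time

dequant_scale = input_scale * weight_scale
fp32_out = relay.qnn.op.dequantize(acc, dequant_scale,
                                   input_zero_point=relay.const(0, "int32"))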

# bias taken from the converter's unpacked linear parameters
bias_var = inputs[1][3]

# the zero point is known to be zero, so dequantization is just a cast and a scale multiply
dequant_scale = input_scale * weight_scale
dense_out = _op.cast(dense, "float32") * dequant_scale
Contributor:

It might be better to call the qnn.dequantize operation here.

Member Author:

Why is it better? I did try dequantize, but it didn't matter in terms of accuracy.

Contributor:

First, readability: since we are working with QNN ops, one can easily understand that we are going from the int domain to the float domain just by seeing a dequantize op.

Second, dequantize becomes the one place where we handle per-tensor and per-channel quantization scales and non-zero zero points. Internally, dequantize will lower to exactly the same operations you currently have in the parser. In the future, when we support per-channel scales for dense, we can just rely on the dequantize op.

@masahi (Member Author), Oct 28, 2020:

Since I know the zero point is zero, I thought this would be a bit more efficient (it skips the zero-point term; I'm not sure if qnn.dequantize does this optimization).

Contributor:

We should add that to qnn.dequantize; it should be easy. (I just checked: I had missed that optimization; it exists in other ops like conv2d and dense.)

Member Author:

OK, replaced with dequantize. Thanks for the suggestion!

@masahi masahi merged commit 0c7aae3 into apache:main Oct 29, 2020
@masahi (Member Author) commented Oct 29, 2020

Thanks @anijain2305

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020
[Torch, QNN] Support dynamic quantization flow to enable importing quantized transformer models (apache#6782)

* add stub and test

* per channel quantize

* calculate qparam correctly

* import qbert working

* support batched qdense

* test batched input

* fix mkl offloading of batch matmul

* reduce range become True in torch 1.6

* fix for 1.6

* Revert "fix mkl offloading of batch matmul"

This reverts commit cd90aa7.

* fix merge

* fix

* lint fix

* fix black

* more black fix

* fix version check for 1.5.1

* disable assert on v1.4 (strange pytorch issue)

* minor fix

* use dequantize

Co-authored-by: masa <masa@pop-os.localdomain>
trevor-m pushed the same commit to trevor-m/tvm and to neo-ai/tvm, referencing this pull request, on Dec 4, 2020.