[Torch, QNN] Support dynamic quantization flow to enable importing quantized transformer models #6782
Conversation
This reverts commit cd90aa7.
(Force-pushed from d59328e to f05fac6.)
Overall looks good. Is this enough to run qBERT? I am surprised that we don't need to work on requantize here. Handling non-const values in requantize might be hard because of the fixed-point multiplication.
```python
# reduce_range became True in v1.6
if is_version_greater_than("1.5.1"):
    qmax = 127
```
What's happening here? Should this be coupled with the dtype? 127 should be for int8, while 255 is for uint8.
Here, they intentionally reduce the possible range of quantized values by half, i.e. from [qmin, qmax] to [qmin/2, qmax/2]. Since PyTorch only uses uint8, this is fine.
It's not clear to me why they do this, but the following PR has some explanation: "reduce_range option restricts the activation tensor to 7 bits instead of 8. This is necessary to enable per channel quant for RNNs and LSTMs" pytorch/pytorch#39041
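To make the effect concrete, here is a minimal sketch (a hypothetical helper for illustration, not the actual frontend code) of how reduce_range halves the quantization range; for PyTorch's uint8 activations it becomes [0, 127], which is why qmax is set to 127 above:

```python
def quant_range(dtype="uint8", reduce_range=False):
    """Return (qmin, qmax) for a quantized dtype.

    Hypothetical helper: with reduce_range=True the range is restricted
    to 7 bits, i.e. [qmin, qmax] -> [qmin // 2, qmax // 2].
    """
    if dtype == "uint8":
        qmin, qmax = 0, 255
    elif dtype == "int8":
        qmin, qmax = -128, 127
    else:
        raise ValueError("unsupported dtype: " + dtype)
    if reduce_range:
        qmin, qmax = qmin // 2, qmax // 2
    return qmin, qmax


assert quant_range("uint8", reduce_range=True) == (0, 127)
```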
Yes, this is enough. The dynamic quantization flow replaces fp32 dense with runtime qparam calculation + int8 dense, leaving everything else fp32. The output of …
I see, what you describe is …
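For readers following along, here is a rough, simplified sketch of what that pattern looks like in the Relay/QNN Python API. It is not the exact code in this PR: `data`, `weight_int8` and `weight_scale` are placeholders, the qparam formula omits clamping, reduce_range and bias handling, and the final dequantize reflects the state after the review below.

```python
from tvm import relay


def dynamic_dense_sketch(data, weight_int8, weight_scale, units):
    # Runtime qparam calculation from the fp32 activation tensor.
    data_min = relay.min(data)
    data_max = relay.max(data)
    # Simplified affine qparams for uint8 activations.
    input_scale = (data_max - data_min) / relay.const(255.0)
    input_zp = relay.cast(
        relay.round(relay.negative(data_min) / input_scale), "int32"
    )

    # Quantize the activations on the fly, then run an int8 dense.
    data_q = relay.qnn.op.quantize(data, input_scale, input_zp, out_dtype="uint8")
    dense = relay.qnn.op.dense(
        data_q,
        weight_int8,
        input_zero_point=input_zp,
        kernel_zero_point=relay.const(0, "int32"),
        input_scale=input_scale,
        kernel_scale=weight_scale,
        units=units,
        out_dtype="int32",
    )
    # Back to fp32; everything surrounding this op stays in float.
    return relay.qnn.op.dequantize(
        dense, input_scale * weight_scale, relay.const(0, "int32")
    )
```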
```python
bias_var = inputs[1][3]
# ...
dequant_scale = input_scale * weight_scale
dense_out = _op.cast(dense, "float32") * dequant_scale
```
It might be better to call the qnn.dequantize operation here.
Why is that better? I did try dequantize, but it didn't matter in terms of accuracy.
First, readability: since we are working with QNN ops, one can easily understand that we are going from the int to the float domain just by seeing the dequantize op.
Second, dequantize becomes the one place where we handle per-tensor and per-channel quantization scales and non-zero zero points. Internally, dequantize will lower to the exact same operations that you currently have in the parser. In the future, when we support per-channel scales for dense, we can just rely on the dequantize op.
Since I know the zero point is zero, I thought this should be a bit more efficient (it doesn't need to subtract the zero point; I'm not sure if qnn dequantize does this optimization).
We should add that to qnn dequantize. It should be easy. (I just checked: I missed that optimization; it exists in other ops like conv2d and dense.)
OK, replaced with dequantize. Thanks for the thought!
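To illustrate the change being discussed, here is a minimal sketch of the two variants, with `dense_int32` and `dequant_scale` standing in for the values in the diff above (an illustration under those assumptions, not the final code in the PR):

```python
from tvm import relay
from tvm.relay import op as _op


def dequantize_dense_output(dense_int32, dequant_scale):
    # Before: manual cast + multiply, only valid when the zero point is 0.
    manual = _op.cast(dense_int32, "float32") * dequant_scale

    # After: qnn.dequantize, which also covers non-zero zero points and,
    # once supported for dense, per-channel scales.
    via_qnn = relay.qnn.op.dequantize(
        dense_int32, dequant_scale, relay.const(0, "int32")
    )
    return manual, via_qnn
```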
Thanks @anijain2305
…antized transformer models (apache#6782)

* add stub and test
* per channel quantize
* calculate qparam correctly
* import qbert working
* support batched qdense
* test batched input
* fix mkl offloading of batch matmul
* reduce range become True in torch 1.6
* fix for 1.6
* Revert "fix mkl offloading of batch matmul" (This reverts commit cd90aa7.)
* fix merge
* fix
* lint fix
* fix black
* more black fix
* fix version check for 1.5.1
* disable assert on v1.4 (strange pytorch issue)
* minor fix
* use dequantize

Co-authored-by: masa <masa@pop-os.localdomain>
This adds support for what PyTorch calls "dynamic quantization", where weights are quantized ahead of time but activations are quantized on the fly at runtime. See more details in:
https://pytorch.org/blog/introduction-to-quantization-on-pytorch/#the-three-modes-of-quantization-supported-in-pytorch-starting-version-13
https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html
https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html
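As a concrete reference, the PyTorch side of this flow is a one-liner. The sketch below (a toy model with placeholder shapes, not code from this PR) produces a TorchScript graph containing quantized::linear_dynamic ops, which is what the new converter handles:

```python
import torch

# A toy fp32 model; any module containing nn.Linear layers works the same way.
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 768),
)

# Dynamic quantization: nn.Linear weights are quantized to int8 ahead of time,
# activation qparams are computed on the fly at runtime.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

# Tracing yields a graph with quantized::linear_dynamic ops.
traced = torch.jit.trace(model_int8, torch.randn(1, 768))
```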
TVM doesn't support such a quantization flow at the moment. This flow hits a sweet spot in terms of ease of use and performance, so I think it is worth supporting. Here are the pros/cons compared to static quantization (the one we do support):
Pros:
Cons:
My motivation for introducing this flow is to support quantized models from the transformers library, like BERT and GPT2, where dynamic quantization via PyTorch or ONNXRuntime is the only quantization path they support (from what I understand). See the following blog post and the accompanying notebook by the ONNXRuntime team for inspiration:
https://medium.com/microsoftazure/faster-and-smaller-quantized-nlp-with-hugging-face-and-onnx-runtime-ec5525473bb7
https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb
This PR has the changes required for supporting the dynamic quantization flow via QNN, and a quantized::linear_dynamic op converter in the PyTorch frontend.
I prepared a script to evaluate the accuracy and performance of BERT quantized via dynamic quantization and compiled by TVM. The accuracy is reasonable, but the performance is terrible: even with MKL enabled, TVM int8 is 3-4x slower than PyTorch (I haven't looked into the details). I sense a big opportunity here.
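For completeness, here is a hedged sketch of how such a traced, dynamically quantized module would be imported and compiled with TVM. The input name, shape, and target string are placeholders, and this is not the evaluation script mentioned above:

```python
import torch
import tvm
from tvm import relay

# A dynamically quantized module, traced as in the PyTorch sketch earlier.
model_int8 = torch.quantization.quantize_dynamic(
    torch.nn.Linear(768, 768), {torch.nn.Linear}, dtype=torch.qint8
)
traced = torch.jit.trace(model_int8, torch.randn(1, 768))

# Import into Relay; the input name/shape pair is a placeholder.
mod, params = relay.frontend.from_pytorch(traced, [("input", (1, 768))])

# Compile; the target string is an example (AVX2 CPU), adjust as needed.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm -mcpu=core-avx2", params=params)
```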
please review @anijain2305 @siju-samuel @t-vi @jwfromm