[Relay][QNN] Simulated Quantize and Dequantize #7613

Merged 21 commits into apache:main on Mar 11, 2021

Conversation

@jwfromm (Contributor, Author) commented Mar 9, 2021

This PR adds simulated_quantize and simulated_dequantize to the QNN library in relay. These operators are primarily meant to support the pass-based quantization framework proposed in this Discuss post. However, these new ops can be cleanly broken into their own PR and can be useful for other applications. The obvious benefit of simulated qnn ops is that they mimic real quantization in floating point. The more interesting benefit of this approach is that it allows switching between per-channel and scalar QNN parameters and changing the datatype without recompilation. This has major compute time benefits when doing calibration or eventually trying to do quantization aware training.

I also found that using qnn quantize and dequantize with per-channel parameters and negative axes caused an error; I fixed it and changed a test case to catch regressions going forward.
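For readers new to the idea, here is a minimal NumPy sketch of what "simulated" quantization means here: values are rounded and clamped onto the integer grid of the target dtype, but the result stays in float32. This is only an illustration of the concept under assumed function names and signatures, not the TVM implementation.

```python
import numpy as np

def sim_quantize(data, scale, zero_point, dtype="int8"):
    # Round onto the integer grid and clamp to the dtype's range,
    # but keep the result in float32 (hence "simulated").
    qmin, qmax = np.iinfo(dtype).min, np.iinfo(dtype).max
    q = np.round(data / scale + zero_point)
    return np.clip(q, qmin, qmax).astype("float32")

def sim_dequantize(qdata, scale, zero_point):
    # Map the simulated integer values back to real values, still in float32.
    return ((qdata - zero_point) * scale).astype("float32")

x = np.random.uniform(-1, 1, size=(2, 4)).astype("float32")
roundtrip = sim_dequantize(sim_quantize(x, 0.02, 0), 0.02, 0)
print(np.abs(x - roundtrip).max())  # the rounding error introduced in float
```

Because everything stays in float32, the scales, zero points, and even the target datatype can be swapped at run time without recompiling the graph, which is the main motivation described above.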

@jwfromm (Contributor, Author) commented Mar 9, 2021

@electriclilies @anijain2305 @masahi @mbrookhart can you guys take a look at this PR?

@anijain2305 (Contributor) commented:
cc @ZihengJiang

"int32": SQNN_INT32,
}

SQNN_CODE_TO_DTYPE = {v: k for k, v in SQNN_DTYPE_TO_CODE.items()}
@jwfromm (Contributor, Author) commented Mar 9, 2021:

Note that the use of integer codes to map to datatypes is a hack since relay doesn't currently support string variables. Once it does, this can be simplified. Until then, this allows datatypes to be dynamically changed without recompilation.
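As a rough sketch of how the integer-code workaround plays out (the specific code values below are assumptions for illustration; the real map is in the diff above):

```python
import numpy as np

# Hypothetical dtype codes; see the SQNN_DTYPE_TO_CODE map in the diff for the real ones.
SQNN_DTYPE_TO_CODE = {"disable": 0, "int8": 1, "uint8": 2, "int32": 3}
SQNN_CODE_TO_DTYPE = {v: k for k, v in SQNN_DTYPE_TO_CODE.items()}

# Because the dtype travels through the graph as a runtime value rather than a
# compile-time string attribute, switching the simulated dtype is just feeding
# a different scalar; the compiled kernel stays the same.
dtype_code = np.array(SQNN_DTYPE_TO_CODE["int8"], dtype="int32")
print(SQNN_CODE_TO_DTYPE[int(dtype_code)])  # "int8"
```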

@anijain2305 (Contributor) commented:

@jwfromm Thanks for the contribution. One high-level question: is it possible to have one op, say simulate_q, that has both in_dtype and out_dtype params and can act as simulated_quantize or simulated_dequantize depending on the dtypes? The input/output will always be of dtype fp32, but the in_dtype and out_dtype params can be int/float and tell whether our intent is to simulate quantize or dequantize.

@jwfromm (Contributor, Author) commented Mar 9, 2021

I think we could do that, yes, but I would argue that having a clear analogue to qnn.quantize and qnn.dequantize is conceptually much cleaner. It also has some benefits when moving from simulation to real qnn through pattern matching.

@anijain2305 (Contributor) commented Mar 9, 2021

I see, yeah that should be ok.

Maybe it's not relevant to this PR, but there is a slight catch with requantize. I was hoping that requantize could also be represented using the simulate_q (or simulated_requantize) op, where both in_dtype and out_dtype are integers (while the actual in/out tensors are fp32). The issue with representing requantize as a sequence of quantize-dequantize is that requantize's integer-only computation gives different results than the quantize-dequantize sequence (and the deviation grows further for <8 bits). If there were just one op, we could simulate requantize with hopefully more fidelity (which should help QAT in the future as well).

But I totally understand that it can make the design complicated and perhaps also difficult to pattern match. Please give it a thought and see if this is helpful to the overall design (or this PR in any way). Otherwise, feel free to ignore.
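For intuition, a rough NumPy sketch of the deviation being described (not TVM's requantize; the scales, zero points of 0, and fixed-point precision below are arbitrary assumptions): an integer-only requantize approximates the scale ratio with a fixed-point multiplier and shift, while a float quantize-dequantize round trip uses the exact ratio.

```python
import numpy as np

s_in, s_out = 0.043, 0.017        # assumed input/output scales (zero points = 0)
q_in = np.arange(-128, 128, dtype=np.int64)

# Float path: dequantize with s_in, then quantize with s_out.
float_path = np.round(q_in * (s_in / s_out)).astype(np.int64)

# Integer-only path: approximate s_in / s_out with an 8-bit fixed-point
# multiplier and a rounding right shift.
shift = 8
multiplier = int(round(s_in / s_out * (1 << shift)))
int_path = (q_in * multiplier + (1 << (shift - 1))) >> shift

# The two paths can disagree near rounding boundaries, and the gap tends to
# grow as the fixed-point precision or the target bit width shrinks.
print(np.abs(float_path - int_path).max())
```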

@jwfromm (Contributor, Author) commented Mar 9, 2021

Adding simulated_requantize is definitely an interesting option that we might want to pursue further down the line. I think for now just having simulated quantize and dequantize is the right starting point, though, since they are much simpler to work with. We should be able to analyze how much accuracy is lost due to requantize integer division and decide to add the simulated version at a later time.

@anijain2305 (Contributor) left a review:
LGTM, with minor comments

A scalar tensor representing the scale to use when quantizing to integer datatypes.
When it contains more than a single value, N must match the number of channels in data.

output_zero_point: tvm.te.Tensor, optional
A contributor commented:

Typically, the zero points are scalar; even for asymmetric quantization, they are scalar. This is done mostly for performance reasons. But since these ops are generic, it's better to keep it the way you have it.

# Use an if chain to dynamically return the proper quantization based on the input datatype.
# This allows the op to compile once but apply different quantization approaches
# using a variable datatype input.
def _dispatch_sim_quantize(value):
A contributor commented:
+1, clever trick.
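To make the trick concrete, here is a simplified plain-Python sketch of the dispatch idea (not the actual TE compute in this PR; the code values and helper name are assumptions): because the dtype arrives as an integer code at run time, one compiled kernel can branch to the right clamp range, or pass the data through untouched, without recompiling.

```python
import numpy as np

# Hypothetical dtype codes mirroring the map shown earlier.
SQNN_DISABLE, SQNN_INT8, SQNN_UINT8, SQNN_INT32 = 0, 1, 2, 3
CODE_TO_RANGE = {
    SQNN_INT8: (-128, 127),
    SQNN_UINT8: (0, 255),
    SQNN_INT32: (np.iinfo("int32").min, np.iinfo("int32").max),
}

def dispatch_sim_quantize(data, scale, zero_point, dtype_code):
    # Passthrough branch: quantization is effectively turned off.
    if dtype_code == SQNN_DISABLE:
        return data
    # Otherwise quantize onto the grid implied by the runtime dtype code,
    # staying in float32 as before.
    qmin, qmax = CODE_TO_RANGE[int(dtype_code)]
    q = np.round(data / scale + zero_point)
    return np.clip(q, qmin, qmax).astype("float32")
```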


TVM_REGISTER_NODE_TYPE(SimulatedQuantizeAttrs);

bool SimulatedQuantizeRel(const Array<Type>& types, int num_inputs, const Attrs& attrs,
A contributor commented:
Minor suggestion (feel free to ignore): how about sharing the TypeRel functions for Quantize and Dequantize? They seem to be the same.

@jwfromm (Author) replied:
A solid suggestion, but to do that cleanly I'd have to put both simulated ops in one file, which would break parity with how the regular quantize and dequantize ops are written. I slightly prefer a little code duplication with a clearer structure.

@mbrookhart (Contributor) left a review:
Some minor concerns about floating value defaults. I might be missing something.

The channel axis for quantization. Default value is -1 which corresponds to the last axis.

"""
# Since all simulated outputs are in float32, we can just return the input tensor for fp32.
A contributor commented:
I'm not sure I understand this. Shouldn't we still shift and scale with float inputs?

A contributor commented:
Basically, it acts as a passthrough in case you don't want to (de)quantize?

@jwfromm (Author) replied:
Yeah, exactly: this allows us to turn off quantization/dequantization if we want to.

A contributor commented:
I don't think we should be doing this based on dtype. As you mentioned in the type relation function, we might want to pass in something that isn't float32 and run this against its own dtype. What's wrong with letting the user pass in scale=1, zp=0, dtype=data.dtype if they want a passthrough?


"""
# Since all simulated inputs are in float32, we can just return the input tensor for fp32.
def _compute_fp32(value, *indices):
A contributor commented:
Same as above, shouldn't we still shift and scale?

}

// assign output type
reporter->Assign(types[4], TensorType(data->shape, data->dtype));
A contributor commented:
If the output of dequantize is fp32, doesn't this assume the input is always fp32? What if I had a scenario where I was trying to simulate the quantization of int32->int8 and the dequantization of int8->int32?

@jwfromm (Author) replied:
Maybe I should clarify the docs. There's no need for the inputs and outputs to explicitly be float32; the simulated ops return whatever the input data type is. I think this is good behavior to have since it lets them be inserted into any graph without introducing type issues.

A contributor commented:
@mbrookhart are you talking about the requantize operation?

For your example of quantizing int32 -> int8, is the input a quantized tensor with a scale and zero point, or just a plain int32 tensor?

  • If it is just a plain int32 tensor, should we even quantize it? By definition, quantize (dequantize) always has a float32 input (output).

  • However, if the input is a quantized integer representation, then you are doing a requantize operation (which in this case can be represented by a sequence of simulated_quantize - simulated_dequantize ops).

A contributor commented:
I was thinking just a plain int32 input, not a quantized version. I'm not sure if we'll hit this in real models, but the possibility is always there, and I'd rather not make assumptions about inputs.

}

// assign output type
reporter->Assign(types[4], TensorType(data->shape, data->dtype));
A contributor commented:
Same confusion about input types as above.

@jwfromm (Contributor, Author) commented Mar 9, 2021

@mbrookhart I cleaned up the docs to make the datatype behavior clearer and changed SQNN_FP32 to SQNN_DISABLE to align more closely with its functionality. What do you think?

@mbrookhart (Contributor) left a review:
LGTM

@jwfromm (Contributor, Author) commented Mar 10, 2021

@ZihengJiang I'd love to hear your take before this gets merged.

@jwfromm (Contributor, Author) commented Mar 11, 2021

@masahi it would also be great to hear what you think about this implementation of simulated qnn.

@masahi (Member) commented Mar 11, 2021

@jwfromm Looks great, thanks!

@masahi merged commit e9e014b into apache:main on Mar 11, 2021
@masahi (Member) commented Mar 11, 2021

thanks @jwfromm @anijain2305 @mbrookhart

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
* Add initial implementation of flexible simulated qnn ops.

* Added proper topi testing and fixed qnn axis bug.

* Add injective schedule wrapping.

* Stuck on typerel problem.

* Relay integration fully working.

* Simulated quantize totally finished.

* Change dtype to be a scalar rather than tensor.

* Undo change to quantize.

* formatting.

* Fix attritubes.

* Fix negative axis dequantize bug.

* Add topi simulated dequantize.

* Add simulated_dequantize op to topi and relay.

* Formatting.

* Test negative axis perchannel dequantization.

* Lint formatting.

* Change import order to make lint happy.

* Fix pytest.

* Directly return make call.

* Clarify disable mode for simulated qnn ops and fix typos.

* Line too long oops.

Co-authored-by: Ubuntu <jwfromm@jwfromm-cpu-dev.itxhlkosmouevgkdrmwxfbs5qh.xx.internal.cloudapp.net>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
@jwfromm deleted the simulated_qnn branch April 12, 2023 15:57