[Relay][QNN] Simulated Quantize and Dequantize #7613
Conversation
@electriclilies @anijain2305 @masahi @mbrookhart can you guys take a look at this PR?
cc @ZihengJiang
"int32": SQNN_INT32, | ||
} | ||
|
||
SQNN_CODE_TO_DTYPE = {v: k for k, v in SQNN_DTYPE_TO_CODE.items()} |
Note that the use of integer codes to map to datatypes is a hack since relay doesn't currently support string variables. Once it does, this can be simplified. Until then, this allows datatypes to be dynamically changed without recompilation.
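A minimal sketch of how these integer codes get used (the constant values and the "disable" key are illustrative, not necessarily the exact definitions in the PR): the target datatype is looked up as an integer ahead of time and fed to the compiled op as a scalar, so switching datatypes only changes an input value rather than forcing recompilation.

```python
# Illustrative constants; the real module defines SQNN_* codes for each supported dtype.
SQNN_DTYPE_TO_CODE = {
    "disable": 0,
    "int8": 1,
    "uint8": 2,
    "int32": 3,
}
# Reverse mapping lets lowering turn a runtime code back into a dtype string.
SQNN_CODE_TO_DTYPE = {v: k for k, v in SQNN_DTYPE_TO_CODE.items()}

# Switching the simulated datatype at calibration time is just a new scalar input.
dtype_code = SQNN_DTYPE_TO_CODE["int8"]
assert SQNN_CODE_TO_DTYPE[dtype_code] == "int8"
```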
@jwfromm Thanks for the contribution. One high-level question is - is it possible to have one op - say
I think we could do that, yes, but I would argue that having a clear analogue to
I see, yeah that should be ok. Maybe it's not relevant to this PR, but there is a slight catch about requantize. But I totally understand that it can make the design complicated and maybe also difficult to pattern match. Please give it a thought and see if this is helpful in the overall design (or this PR in any way). Otherwise, please feel free to ignore.
Adding simulated_requantize is definitely an interesting option that we might want to pursue further down the line. I think for now just having simulated quantize and dequantize is the right starting point though since they are much simpler to work with. We should be able to analyze how much accuracy is lost due to requantize integer division and decide to add the simulated version at a later time.
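For reference, a float-only numpy sketch of the composition being discussed (function names and the int8 clipping range are illustrative, not the PR's topi code): a requantize can be approximated by chaining the two simulated ops, with everything staying in float32.

```python
import numpy as np

def simulated_quantize(x, scale, zero_point):
    # Round, shift, and clip as a real int8 quantize would, but stay in float32.
    return np.clip(np.round(x / scale) + zero_point, -128, 127)

def simulated_dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

def simulated_requantize(x, in_scale, in_zp, out_scale, out_zp):
    # Dequantize with the input parameters, then re-quantize with the output ones.
    return simulated_quantize(simulated_dequantize(x, in_scale, in_zp), out_scale, out_zp)

x = np.random.uniform(-1, 1, (4,)).astype("float32")
q = simulated_quantize(x, 0.01, 0.0)
print(simulated_requantize(q, 0.01, 0.0, 0.02, 0.0))
```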
LGTM, with minor comments
A scalar tensor representing the scale to use when quantizing to integer datatypes.
When it contains more than a single value, N must match the number of channels in data.

output_zero_point: tvm.te.Tensor, optional
Typically, the zero points are scalar; even for asymmetric quantization they are scalar. This is done mostly for performance reasons. But since these ops are generic, it's better to keep it the way you have it.
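A small numpy illustration of the shape difference being discussed (layout and values are made up): a scalar scale broadcasts over the whole tensor, while a per-channel scale must have one value per channel along the chosen axis, matching the docstring quoted above.

```python
import numpy as np

data = np.random.uniform(-1, 1, (2, 4, 8, 8)).astype("float32")  # NCHW layout

scalar_scale = np.float32(0.05)
per_channel_scale = np.linspace(0.01, 0.04, 4, dtype="float32")   # axis=1 has 4 channels

q_scalar = np.round(data / scalar_scale)
# Reshape so the per-channel values line up with the channel axis for broadcasting.
q_per_channel = np.round(data / per_channel_scale.reshape(1, 4, 1, 1))
```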
# Use an if chain to dynamically return the proper quantization based on the input datatype.
# This allows the op to compile once but apply different quantization approaches
# using a variable datatype input.
def _dispatch_sim_quantize(value):
+1, clever trick.
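For readers unfamiliar with the pattern, here is a stripped-down sketch of the idea in tvm.te (tensor names, the dispatch code value, and the omission of clipping are illustrative, not the PR's actual compute): the datatype code arrives as a runtime tensor and tir.if_then_else selects the branch, so one compiled kernel can serve several datatypes.

```python
import tvm
from tvm import te

# Illustrative convention: code 1 selects int8-style quantization, anything else passes through.
n = te.var("n")
data = te.placeholder((n,), dtype="float32", name="data")
scale = te.placeholder((1,), dtype="float32", name="scale")
dtype_code = te.placeholder((1,), dtype="int32", name="dtype_code")

out = te.compute(
    (n,),
    lambda i: tvm.tir.if_then_else(
        dtype_code[0] == 1,             # the code is a runtime value, not a compile-time constant
        te.round(data[i] / scale[0]),   # quantize math, kept in float32 (clipping omitted)
        data[i],                        # passthrough branch
    ),
    name="sim_quantize_sketch",
)
```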
TVM_REGISTER_NODE_TYPE(SimulatedQuantizeAttrs);

bool SimulatedQuantizeRel(const Array<Type>& types, int num_inputs, const Attrs& attrs,
Minor suggestion (feel free to ignore) - how about sharing the TypeRel functions for Quantize and Dequantize? They seem to be the same.
A solid suggestion, but to do that cleanly I'd have to put both simulated ops in one file, which would break parity with how the regular quantize and dequantize ops are written. I slightly prefer a little code duplication but clearer structure.
Some minor concerns about floating value defaults. I might be missing something.
python/tvm/topi/nn/qnn.py (outdated)
The channel axis for quantization. Default value is -1 which corresponds to the last axis.

"""
# Since all simulated outputs are in float32, we can just return the input tensor for fp32.
I'm not sure I understand this. Shouldn't we still shift and scale with float inputs?
Basically act as a passthrough in case you don't want to (de)quantize?
Yeah exactly, this is allowing us to turn off quantization / dequantization if we want to.
I don't think we should be doing this based on dtype. As you mentioned in the type relation function, we might want to pass in something that isn't float32 and run this against its own dtype. What's wrong with allowing the user to pass in scale=1, zp=0, dtype=data.dtype if they want a passthrough?
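To make the two options concrete, a hedged numpy sketch (the function name and the float32-as-disable convention are illustrative; see the later comment about clarifying the disable mode for how the PR settled this):

```python
import numpy as np

def simulated_quantize(data, scale, zero_point, dtype):
    # Option 1: an explicit disable/passthrough mode keyed on the requested dtype.
    if dtype == "float32":
        return data
    # Otherwise do the usual scale/shift/clip, staying in float.
    qmin, qmax = np.iinfo(dtype).min, np.iinfo(dtype).max
    return np.clip(np.round(data / scale) + zero_point, qmin, qmax)

x = np.random.uniform(-1, 1, (8,)).astype("float32")
assert np.allclose(simulated_quantize(x, 1.0, 0.0, "float32"), x)
# Option 2 (scale=1, zp=0, dtype=data.dtype) avoids special-casing the dtype, but is only
# an exact no-op when rounding and clipping leave the data unchanged.
```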
python/tvm/topi/nn/qnn.py (outdated)
"""
# Since all simulated inputs are in float32, we can just return the input tensor for fp32.
def _compute_fp32(value, *indices):
Same as above, shouldn't we still shift and scale?
}

// assign output type
reporter->Assign(types[4], TensorType(data->shape, data->dtype));
If the output of dequantize is fp32, doesn't this assume the input is always fp32? What if I had a scenario where I was trying to simulate the quantization of int32->int8, and the dequantization of int8->int32?
Maybe I should clarify the docs. There's no need for the inputs/outputs to explicitly be float32. The simulated ops will return whatever the input data type is. I think this is good behavior to have, since it lets them be inserted into any graph without introducing type issues.
@mbrookhart are you talking about the requantize operation?
For your example of quantizing int32 -> int8, is the input a quantized tensor with scale and zero point, or just a plain int32 tensor?
- If it is just a plain int32 tensor, should we even quantize it? From a definition standpoint, quantize (dequantize) always has a float32 input (output).
- However, if the input is a quantized integer representation, then you are doing a requantize operation (which in this case can be represented by a sequence of simulated_quantize - simulated_dequantize ops).
I was thinking just a plain int32 input, not a quantized version. I'm not sure if we'll hit this in real models, but the possibility is always there, and I'd rather not make assumptions about inputs.
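A conceptual numpy sketch of that scenario under the dtype rule described above (the cast back to the input dtype is my reading of "returns whatever the input data type is", not the actual kernel): a plain int32 tensor stays int32 through both simulated ops.

```python
import numpy as np

def simulated_quantize(x, scale, zp):
    # Quantize math done in float, then cast back so the output dtype matches the input.
    return np.clip(np.round(x / scale) + zp, -128, 127).astype(x.dtype)

def simulated_dequantize(q, scale, zp):
    return ((q - zp) * scale).astype(q.dtype)

x_int32 = np.arange(-512, 512, 128, dtype="int32")   # plain int32 data, not pre-quantized
q = simulated_quantize(x_int32, 4.0, 0)               # simulate int32 -> int8
back = simulated_dequantize(q, 4.0, 0)                # simulate int8 -> int32
print(x_int32.dtype, q.dtype, back.dtype)             # all int32: graph types never change
```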
}

// assign output type
reporter->Assign(types[4], TensorType(data->shape, data->dtype));
Same confusion about input types as above.
@mbrookhart I cleaned up the docs to make the datatype behavior more clear and changed
LGTM
@ZihengJiang I'd love to hear your take before this gets merged.
@masahi it would also be great to hear what you think about this implementation of simulated qnn.
@jwfromm Looks great, thanks!
thanks @jwfromm @anijain2305 @mbrookhart
* Add initial implementation of flexible simulated qnn ops.
* Added proper topi testing and fixed qnn axis bug.
* Add injective schedule wrapping.
* Stuck on typerel problem.
* Relay integration fully working.
* Simulated quantize totally finished.
* Change dtype to be a scalar rather than tensor.
* Undo change to quantize.
* formatting.
* Fix attritubes.
* Fix negative axis dequantize bug.
* Add topi simulated dequantize.
* Add simulated_dequantize op to topi and relay.
* Formatting.
* Test negative axis perchannel dequantization.
* Lint formatting.
* Change import order to make lint happy.
* Fix pytest.
* Directly return make call.
* Clarify disable mode for simulated qnn ops and fix typos.
* Line too long oops.

Co-authored-by: Ubuntu <jwfromm@jwfromm-cpu-dev.itxhlkosmouevgkdrmwxfbs5qh.xx.internal.cloudapp.net>
This PR adds simulated_quantize and simulated_dequantize to the QNN library in relay. These operators are primarily meant to support the pass-based quantization framework proposed in this Discuss post; however, these new ops can be cleanly broken into their own PR and can be useful for other applications. The obvious benefit of simulated qnn ops is that they mimic real quantization in floating point. The more interesting benefit of this approach is that it allows switching between per-channel and scalar QNN parameters and changing the datatype without recompilation. This has major compute-time benefits when doing calibration or eventually trying to do quantization-aware training.

I also found that using qnn quantize and dequantize with per-channel parameters and negative axes caused an error; I fixed it and changed a test case to catch it going forward.
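To illustrate why recompilation-free parameter switching matters for calibration, a schematic sketch (pure Python; compiled_simulated_quantize is a hypothetical stand-in for a graph compiled once with simulated qnn ops inside, not a TVM API): because the scale, zero point, and datatype code are runtime inputs, a calibration loop can sweep scalar and per-channel parameters against the same compiled artifact.

```python
import numpy as np

def compiled_simulated_quantize(data, scale, zero_point, dtype_code):
    # Hypothetical stand-in: dispatch on a runtime datatype code, as the real ops do.
    if dtype_code == 0:                        # illustrative "disable" code
        return data
    q = np.round(data / scale) + zero_point    # scale may be scalar or per-channel
    return np.clip(q, -128, 127)               # illustrative int8 range

data = np.random.uniform(-1, 1, (1, 4, 8, 8)).astype("float32")

# Sweep candidate scales during calibration without ever recompiling.
candidate_scales = (
    np.float32(0.05),                                                  # scalar
    np.linspace(0.01, 0.04, 4, dtype="float32").reshape(1, 4, 1, 1),   # per-channel
)
for scale in candidate_scales:
    simulated = compiled_simulated_quantize(data, scale, 0.0, 1)
```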