[Relay][QNN] Simulated Quantize and Dequantize #7613

Merged 21 commits into apache:main on Mar 11, 2021

Conversation

@jwfromm (Contributor, Author) commented Mar 9, 2021

This PR adds simulated_quantize and simulated_dequantize to the QNN library in relay. These operators are primarily meant to support the pass-based quantization framework proposed in this Discuss post. However, these new ops can be cleanly broken into their own PR and can be useful for other applications. The obvious benefit of simulated qnn ops is that they mimic real quantization in floating point. The more interesting benefit of this approach is that it allows switching between per-channel and scalar QNN parameters and changing the datatype without recompilation. This has major compute time benefits when doing calibration or eventually trying to do quantization aware training.

I also found that using qnn quantize and dequantize with per-channel parameters and negative axes caused an error; I fixed it and changed a test case to catch regressions going forward.
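For readers new to the idea, here is a minimal NumPy sketch of what "simulated" quantization means here: values are rounded and clamped onto the integer grid of the target dtype, but the result stays in float32. This is only an illustration of the concept under assumed function names and signatures, not the TVM implementation.

```python
import numpy as np

def sim_quantize(data, scale, zero_point, dtype="int8"):
    # Round onto the integer grid and clamp to the dtype's range,
    # but keep the result in float32 (hence "simulated").
    qmin, qmax = np.iinfo(dtype).min, np.iinfo(dtype).max
    q = np.round(data / scale + zero_point)
    return np.clip(q, qmin, qmax).astype("float32")

def sim_dequantize(qdata, scale, zero_point):
    # Map the simulated integer values back to real values, still in float32.
    return ((qdata - zero_point) * scale).astype("float32")

x = np.random.uniform(-1, 1, size=(2, 4)).astype("float32")
roundtrip = sim_dequantize(sim_quantize(x, 0.02, 0), 0.02, 0)
print(np.abs(x - roundtrip).max())  # the rounding error introduced in float
```

Because everything stays in float32, the scales, zero points, and even the target datatype can be swapped at run time without recompiling the graph, which is the main motivation described above.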

@jwfromm (Contributor, Author) commented Mar 9, 2021

@electriclilies @anijain2305 @masahi @mbrookhart can you guys take a look at this PR?

@anijain2305 (Contributor) commented:
cc @ZihengJiang

"int32": SQNN_INT32,
}

SQNN_CODE_TO_DTYPE = {v: k for k, v in SQNN_DTYPE_TO_CODE.items()}
@jwfromm (Contributor, Author) commented Mar 9, 2021:

Note that the use of integer codes to map to datatypes is a hack since relay doesn't currently support string variables. Once it does, this can be simplified. Until then, this allows datatypes to be dynamically changed without recompilation.
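As a rough sketch of how the integer-code workaround plays out (the specific code values below are assumptions for illustration; the real map is in the diff above):

```python
import numpy as np

# Hypothetical dtype codes; see the SQNN_DTYPE_TO_CODE map in the diff for the real ones.
SQNN_DTYPE_TO_CODE = {"disable": 0, "int8": 1, "uint8": 2, "int32": 3}
SQNN_CODE_TO_DTYPE = {v: k for k, v in SQNN_DTYPE_TO_CODE.items()}

# Because the dtype travels through the graph as a runtime value rather than a
# compile-time string attribute, switching the simulated dtype is just feeding
# a different scalar; the compiled kernel stays the same.
dtype_code = np.array(SQNN_DTYPE_TO_CODE["int8"], dtype="int32")
print(SQNN_CODE_TO_DTYPE[int(dtype_code)])  # "int8"
```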

@anijain2305 (Contributor) commented:

@jwfromm Thanks for the contribution. One high-level question: is it possible to have one op, say simulate_q, that has both in_dtype and out_dtype params and can act as simulated_quantize or simulated_dequantize depending on the dtypes? The input/output will always be of dtype fp32, but the in_dtype and out_dtype params can be int/float and tell whether our intent is to simulate quantize or dequantize.

@jwfromm (Contributor, Author) commented Mar 9, 2021

I think we could do that, yes, but I would argue that having a clear analogue to qnn.quantize and qnn.dequantize is conceptually much cleaner. It also has some benefits when moving from simulation to real qnn through pattern matching.

@anijain2305 (Contributor) commented Mar 9, 2021

I see, yeah that should be ok.

Maybe it's not relevant to this PR, but there is a slight catch with requantize. I was hoping that requantize could also be represented using the simulate_q (or simulated_requantize) op, where both in_dtype and out_dtype are integers (while the actual in/out tensors are fp32). The issue with representing requantize as a sequence of quantize-dequantize is that requantize's integer-only computation gives different results than the quantize-dequantize sequence (and the deviation grows further for <8 bits). If there were just one op, we could simulate requantize with hopefully more fidelity (which should help QAT in the future as well).

But I totally understand that it can make the design complicated and perhaps also difficult to pattern match. Please give it a thought and see if this is helpful to the overall design (or this PR in any way). Otherwise, feel free to ignore.
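For intuition, a rough NumPy sketch of the deviation being described (not TVM's requantize; the scales, zero points of 0, and fixed-point precision below are arbitrary assumptions): an integer-only requantize approximates the scale ratio with a fixed-point multiplier and shift, while a float quantize-dequantize round trip uses the exact ratio.

```python
import numpy as np

s_in, s_out = 0.043, 0.017        # assumed input/output scales (zero points = 0)
q_in = np.arange(-128, 128, dtype=np.int64)

# Float path: dequantize with s_in, then quantize with s_out.
float_path = np.round(q_in * (s_in / s_out)).astype(np.int64)

# Integer-only path: approximate s_in / s_out with an 8-bit fixed-point
# multiplier and a rounding right shift.
shift = 8
multiplier = int(round(s_in / s_out * (1 << shift)))
int_path = (q_in * multiplier + (1 << (shift - 1))) >> shift

# The two paths can disagree near rounding boundaries, and the gap tends to
# grow as the fixed-point precision or the target bit width shrinks.
print(np.abs(float_path - int_path).max())
```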

@jwfromm (Contributor, Author) commented Mar 9, 2021

Adding simulated_requantize is definitely an interesting option that we might want to pursue further down the line. I think for now just having simulated quantize and dequantize is the right starting point, though, since they are much simpler to work with. We should be able to analyze how much accuracy is lost due to requantize integer division and decide to add the simulated version at a later time.

@anijain2305 (Contributor) left a review:
LGTM, with minor comments

A scalar tensor representing the scale to use when quantizing to integer datatypes.
When it contains more than a single value, N must match the number of channels in data.

output_zero_point: tvm.te.Tensor, optional
A contributor commented:

Typically, the zero points are scalar; even for asymmetric quantization, they are scalar. This is done mostly for performance reasons. But since these ops are generic, it's better to keep it the way you have it.

# Use an if chain to dynamically return the proper quantization based on the input datatype.
# This allows the op to compile once but apply different quantization approaches
# using a variable datatype input.
def _dispatch_sim_quantize(value):
A contributor commented:
+1, clever trick.
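To make the trick concrete, here is a simplified plain-Python sketch of the dispatch idea (not the actual TE compute in this PR; the code values and helper name are assumptions): because the dtype arrives as an integer code at run time, one compiled kernel can branch to the right clamp range, or pass the data through untouched, without recompiling.

```python
import numpy as np

# Hypothetical dtype codes mirroring the map shown earlier.
SQNN_DISABLE, SQNN_INT8, SQNN_UINT8, SQNN_INT32 = 0, 1, 2, 3
CODE_TO_RANGE = {
    SQNN_INT8: (-128, 127),
    SQNN_UINT8: (0, 255),
    SQNN_INT32: (np.iinfo("int32").min, np.iinfo("int32").max),
}

def dispatch_sim_quantize(data, scale, zero_point, dtype_code):
    # Passthrough branch: quantization is effectively turned off.
    if dtype_code == SQNN_DISABLE:
        return data
    # Otherwise quantize onto the grid implied by the runtime dtype code,
    # staying in float32 as before.
    qmin, qmax = CODE_TO_RANGE[int(dtype_code)]
    q = np.round(data / scale + zero_point)
    return np.clip(q, qmin, qmax).astype("float32")
```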


TVM_REGISTER_NODE_TYPE(SimulatedQuantizeAttrs);

bool SimulatedQuantizeRel(const Array<Type>& types, int num_inputs, const Attrs& attrs,
A contributor commented:
Minor suggestion (feel free to ignore): how about sharing the TypeRel functions for Quantize and Dequantize? They seem to be the same.

@jwfromm (Author) replied:
A solid suggestion, but to do that cleanly I'd have to put both simulated ops in one file, which would break parity with how the regular quantize and dequantize ops are written. I slightly prefer a little code duplication with a clearer structure.

@mbrookhart (Contributor) left a review:
Some minor concerns about floating value defaults. I might be missing something.

The channel axis for quantization. Default value is -1 which corresponds to the last axis.

"""
# Since all simulated outputs are in float32, we can just return the input tensor for fp32.
A contributor commented:
I'm not sure I understand this. Shouldn't we still shift and scale with float inputs?

A contributor commented:
Basically, it acts as a passthrough in case you don't want to (de)quantize?

@jwfromm (Author) replied:
Yeah, exactly: this allows us to turn off quantization/dequantization if we want to.

A contributor commented:
I don't think we should be doing this based on dtype. As you mentioned in the type relation function, we might want to pass in something that isn't float32 and run this against its own dtype. What's wrong with letting the user pass in scale=1, zp=0, dtype=data.dtype if they want a passthrough?


"""
# Since all simulated inputs are in float32, we can just return the input tensor for fp32.
def _compute_fp32(value, *indices):
A contributor commented:
Same as above, shouldn't we still shift and scale?

}

// assign output type
reporter->Assign(types[4], TensorType(data->shape, data->dtype));
A contributor commented:
If the output of dequantize is fp32, doesn't this assume the input is always fp32? What if I had a scenario where I was trying to simulate the quantization of int32->int8 and the dequantization of int8->int32?

@jwfromm (Author) replied:
Maybe I should clarify the docs. There's no need for the inputs and outputs to explicitly be float32; the simulated ops return whatever the input data type is. I think this is good behavior to have since it lets them be inserted into any graph without introducing type issues.

A contributor commented:
@mbrookhart are you talking about the requantize operation?

For your example of quantizing int32 -> int8, is the input a quantized tensor with a scale and zero point, or just a plain int32 tensor?

  • If it is just a plain int32 tensor, should we even quantize it? By definition, quantize (dequantize) always has a float32 input (output).

  • However, if the input is a quantized integer representation, then you are doing a requantize operation (which in this case can be represented by a sequence of simulated_quantize - simulated_dequantize ops).

A contributor commented:
I was thinking just a plain int32 input, not a quantized version. I'm not sure if we'll hit this in real models, but the possibility is always there, and I'd rather not make assumptions about inputs.

}

// assign output type
reporter->Assign(types[4], TensorType(data->shape, data->dtype));
A contributor commented:
Same confusion about input types as above.

@jwfromm (Contributor, Author) commented Mar 9, 2021

@mbrookhart I cleaned up the docs to make the datatype behavior clearer and changed SQNN_FP32 to SQNN_DISABLE to align more closely with its functionality. What do you think?

@mbrookhart (Contributor) left a review:
LGTM

@jwfromm (Contributor, Author) commented Mar 10, 2021

@ZihengJiang I'd love to hear your take before this gets merged.

@jwfromm (Contributor, Author) commented Mar 11, 2021

@masahi it would also be great to hear what you think about this implementation of simulated qnn.

@masahi (Member) commented Mar 11, 2021

@jwfromm Looks great, thanks!

@masahi merged commit e9e014b into apache:main on Mar 11, 2021
@masahi (Member) commented Mar 11, 2021

thanks @jwfromm @anijain2305 @mbrookhart

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request May 6, 2021
* Add initial implementation of flexible simulated qnn ops.

* Added proper topi testing and fixed qnn axis bug.

* Add injective schedule wrapping.

* Stuck on typerel problem.

* Relay integration fully working.

* Simulated quantize totally finished.

* Change dtype to be a scalar rather than tensor.

* Undo change to quantize.

* formatting.

* Fix attritubes.

* Fix negative axis dequantize bug.

* Add topi simulated dequantize.

* Add simulated_dequantize op to topi and relay.

* Formatting.

* Test negative axis perchannel dequantization.

* Lint formatting.

* Change import order to make lint happy.

* Fix pytest.

* Directly return make call.

* Clarify disable mode for simulated qnn ops and fix typos.

* Line too long oops.

Co-authored-by: Ubuntu <jwfromm@jwfromm-cpu-dev.itxhlkosmouevgkdrmwxfbs5qh.xx.internal.cloudapp.net>
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request May 11, 2021
@jwfromm deleted the simulated_qnn branch April 12, 2023 15:57