[Relay][Quantization] Per-Channel FQ2I #8883
Conversation
Force-pushed from 37abf7e to 5d73612
@@ -478,7 +478,7 @@ def _impl_v1(cls, inputs, attr, params):
attr["dilations"] = [1] + list(attr["dilations"])
if "pads" in attr:
    attr["pads"] = [0, attr["pads"][0], 0, attr["pads"][1]]

attr["channels"] = kernel_shapes[0][0]
Does this change relate to the subject of the PR, or is it a side fix? I don't see where the "channels" attribute is used in the other changes of this PR.
This is a side fix; I used it for an intermediate solution for conv, but I guess I don't technically need it. It doesn't seem to be required by many of the relay passes.
Just to double-check the assumptions for this PR: we allow per-channel scales on the weight inputs only, right? Other stuff like the binary ops is only a helper for this main purpose?
Yeah, the way the math works out, we can only really do per-channel quantization on weights and still end up with an integer conv/dense op; if we have per-channel quantization on the data going into one of these ops, the lowering breaks. All the unary/binary stuff is to help with fusing ops that show up in the graph after the contraction but before the requantize, i.e., things we want to fuse into the contraction.
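(A minimal numpy sketch of the point above, not code from the PR; zero points are omitted for brevity, and all shapes, names, and values are made up for illustration.)

import numpy as np

s_x = 0.05                       # per-tensor scale for the data
s_w = np.array([0.1, 0.2, 0.3])  # per-output-channel scales for the weights
q_x = np.random.randint(-128, 128, size=(4,)).astype(np.int8)    # quantized data, shape (K,)
q_w = np.random.randint(-128, 128, size=(3, 4)).astype(np.int8)  # quantized weights, shape (J, K)

# The accumulation stays in integers because s_w is indexed by the output
# channel j, not the reduction axis k, so it factors out of the sum over k.
acc = q_w.astype(np.int32) @ q_x.astype(np.int32)  # int32 accumulator, shape (J,)
y = s_x * s_w * acc                                # scale (and requantize) afterwards

# Reference float computation: dequantize first, then contract.
y_ref = (s_w[:, None] * q_w) @ (s_x * q_x)
assert np.allclose(y, y_ref)

# If the *data* had a per-channel scale s_x[k], it could not be pulled out of
# the sum over k, so the contraction could no longer be done purely in integers.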
Force-pushed from 5871a8d to c31512d
@@ -18,13 +18,22 @@
import tvm
from tvm import relay
from tvm.ir import TensorAffineType, TupleAffineType
from tvm.tir import bijective_layout
from ..op import register_fake_quantization_to_integer


def fold_constant(expr):
Hmm, can you use the functions in python/tvm/relay/frontend/common.py, or modify them slightly there to support your need?
Unfortunately, since this is in transforms and the frontends call the transforms namespace extensively, importing that will cause an import loop.
from ..op import register_fake_quantization_to_integer


def fold_constant(expr):
    return relay.transform.FoldConstantExpr(expr, tvm.IRModule())


def get_zeros(scale):
    return fold_constant(relay.op.cast(relay.op.zeros_like(scale), "int32"))
Should this be a cast_like or do we always want things to be int32?
zero_points in qnn are always int32
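(A small illustrative use of the helpers quoted above, not a test from the PR: for a per-channel scale of length three, the cast-and-fold produces a matching int32 zero-point constant.)

import tvm
from tvm import relay

scale = relay.const([0.1, 0.2, 0.3], "float32")
zeros = relay.op.cast(relay.op.zeros_like(scale), "int32")
folded = relay.transform.FoldConstantExpr(zeros, tvm.IRModule())
print(folded)  # a relay Constant of shape (3,) with dtype int32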
Force-pushed from b43ae30 to 607ba7c
We need to change the descriptions of qnn.conv2d and qnn.dense to describe which zero points and scales are expected. python/tvm/relay/qnn/op/qnn.py is not included in this PR, so I'm adding a separate note.
For example, for conv2d we currently have:
input_zero_point : tvm.relay.Expr
    The zero point of the data distribution.
kernel_zero_point : tvm.relay.Expr
    The zero point of the quantized_kernel distribution.
input_scale : tvm.relay.Expr
    The scale for the input tensor. The scale for the input tensor is
    stored purely for convenience here. See more commentary below.
kernel_scale : tvm.relay.Expr
    The scale for the weight tensor. The scale for the weight tensor is
    stored for access to this during relay. This information is not
    needed in the pass pipeline after qnn.conv2d is lowered to the
    sequence of steps as in nn.conv2d. See also input_scale in Requantize.
while we allow only a scalar for the input zero point and input scale, and any(?) zero point and scale for the weights.
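(One possible wording for such a docstring update, based on the constraints described in this thread; this is a suggestion, not the text that was eventually merged.)

input_zero_point : tvm.relay.Expr
    The zero point of the data distribution. Expected to be a scalar.
input_scale : tvm.relay.Expr
    The scale of the input tensor. Expected to be a scalar.
kernel_zero_point : tvm.relay.Expr
    The zero point of the quantized kernel distribution. May be a scalar or a
    1-D tensor with one value per output channel.
kernel_scale : tvm.relay.Expr
    The scale of the weight tensor. May be a scalar or a 1-D tensor with one
    value per output channel (per-channel quantization).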
arg = expr.args[0]
t = type_map[arg]
scale_shape = infer_shape(t.scale)
z_p = t.zero_point
We might have four situations here:
- scale is scalar, zp is scalar
- scale is scalar, zp is not scalar
- scale is not scalar, zp is scalar
- scale is not scalar, zp is not scalar
Cases 3 and 4 are covered by the next if: we broadcast zp to the scale shape along the axis from the AffineType structure. Q: will zp be updated in place in the TensorAffineType map after the broadcast?
Case 1 is ok.
Case 2 - Q: don't we need to handle this explicitly and broadcast the scale?
On the first Q: no, the broadcast z_p will not be updated in place; it's only used in the computation.
On the second Q, that's an interesting point. I guess I haven't ever seen it, but it's feasible. QNN currently supports scalar scale and scalar zp, OR vector scale and scalar zp, OR vector scale and vector zp, which matches all of the combinations I've ever seen in the wild. What do you think, should we try to support that in QNN?
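(A hedged numpy sketch of case 3 above, a per-channel scale with a scalar zero point; the helper name, shapes, and values are illustrative, not the PR's exact code.)

import numpy as np

def broadcast_zp(zero_point, scale, axis, data_ndim):
    # Expand a scalar zero point so it lines up with a per-channel scale,
    # then reshape it so it broadcasts against the data along `axis`.
    zp = np.broadcast_to(zero_point, np.shape(scale))
    shape = [1] * data_ndim
    shape[axis] = -1
    return zp.reshape(shape)

zp = broadcast_zp(0, np.array([0.1, 0.2, 0.3]), axis=1, data_ndim=4)
print(zp.shape)  # (1, 3, 1, 1), ready to broadcast against an NCHW tensor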
* Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests.
* Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests. * Fix requantize shape bug.
* Fix legalization for non spatial operators. * Fix axis checks for end2end functionality.
fix lint fix lint again
* WIP support per-channel quantization * more WIP * More WIP * fix issue with per-channel bias_add * Fix fake quantize tests (#4) * Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests. * Add Relu * One more little one (#5) * Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests. * Fix requantize shape bug. * Non-working Per-channel Dense * Fix legalization for non spatial operators. (#6) * Fix legalization for non spatial operators. * Fix axis checks for end2end functionality. * fix axis normalization fix lint fix lint again * Fix bug in requantize dimension expansion. * Format. Co-authored-by: Josh Fromm <jwfromm@octoml.ai>
Force-pushed from 607ba7c to b2b3d1f
respond to review comments
Force-pushed from b2b3d1f to e5d80e5
* WIP support per-channel quantization * more WIP * More WIP * fix issue with per-channel bias_add * Fix fake quantize tests (apache#4) * Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests. * Add Relu * One more little one (apache#5) * Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests. * Fix requantize shape bug. * Non-working Per-channel Dense * Fix legalization for non spatial operators. (apache#6) * Fix legalization for non spatial operators. * Fix axis checks for end2end functionality. * fix axis normalization fix lint fix lint again * Per channel fq2i (apache#8) * WIP support per-channel quantization * more WIP * More WIP * fix issue with per-channel bias_add * Fix fake quantize tests (apache#4) * Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests. * Add Relu * One more little one (apache#5) * Fixed fake quantize issues. * Formatting. * Cleanup unused imports * Fix real int8 tests. * Fix requantize shape bug. * Non-working Per-channel Dense * Fix legalization for non spatial operators. (apache#6) * Fix legalization for non spatial operators. * Fix axis checks for end2end functionality. * fix axis normalization fix lint fix lint again * Fix bug in requantize dimension expansion. * Format. Co-authored-by: Josh Fromm <jwfromm@octoml.ai> * respond to review comments respond to review comments Co-authored-by: Josh Fromm <jwfromm@octoml.ai>
This extends the FQ2I pass to support per-channel quantization, which is needed for a number of TF models I've run into recently.
Many thanks to @jwfromm for help finding and fixing issues in the QNN ops with per-channel quantization.
cc @masahi @anijain2305 @elvin-n
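For reference, a minimal sketch of invoking the pass this PR extends; the toy relu graph and per-tensor scales below are illustrative, not a test from the PR.

import tvm
from tvm import relay

x = relay.var("x", shape=(1, 3, 8, 8), dtype="int8")
scale = relay.const(0.1, "float32")
zp = relay.const(0, "int32")

# A fake-quantized region: dequantize -> float op -> quantize.
deq = relay.qnn.op.dequantize(x, scale, zp)
act = relay.nn.relu(deq)
out = relay.qnn.op.quantize(act, scale, zp, out_dtype="int8")

mod = tvm.IRModule.from_expr(relay.Function([x], out))
mod = relay.transform.InferType()(mod)
mod = relay.transform.FakeQuantizationToInteger()(mod)
print(mod)  # the fake-quantized region is rewritten to operate on integers directly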