[Feature] Adaptation of the new quantization method for mkldnn. #37422
Hi @baoachun, I cannot download this model; the URL does not look reachable from outside. Could you please try dragging the model into the comment box?
Python-to-C++ migration moved to a new PR: #38643
@lidanqing-intel @wozna I am wondering whether we can refer to the GPU quantization pass when designing the mkldnn quantization pass. The current Python script does a lot of weight processing, which is not in line with the intended role of a pass, and the code is difficult to maintain. For example, this scheme involves transferring data between multiple passes; as far as I know, that is currently not supported, so I can only store the information in the graph. In addition, the pass performs many weight operations, such as reshape and transpose, which will add a huge maintenance burden later. In view of this, I hope we can re-discuss and plan the implementation. Thanks!
To be refined:
Steps:
Confirmed: oneDNN does support asymmetric quantization (https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html). We should just reuse the GPU passes:
Notes from the 5/20 meeting:
Hi @baoachun @yeliang2258, I'm working on this issue, building on the changes Danqing made previously in #42106, where there is still an accuracy problem. From what I can see, the gathering of scales is missing there, and that is probably why it doesn't work properly. WIP
@wozna Hi, ScaleFilePath stores the scale information of all tensors in the quantized model. This file was added recently. The missing-scale problem you mentioned may be solved by loading this file.
This new quantization method was implemented by @yeliang2258 in #45416. ZeroPoint is not handled yet. oneDNN has this option, so it is possible to add it. @yeliang2258, do you know whether any of the new models use this ZeroPoint value for data shift?
@wozna I consulted @yeliang2258 and he confirmed that PaddleSlim (the team that produces quantized models) supports only symmetric quantization, so all models should have a zero_point of 0.
`quantize_linear` and `dequantize_linear`

`quantize_linear`

The quantization process requires two parameters, `scale` and `zero_point`, both of which are 1-dimensional tensors.

The quantization formula is: `y = saturate(round(x / scale) + zero_point)`.

Attributes:
- `quant_axis`: INT32, optional. In the per-axis quantization method, the axis along which quantization is applied. If this attribute is not set, the quantization method defaults to per-layer quantization. For a convolution input `[batch, channel, H, W]`, the channel-wise quantization method uses `quant_axis` = 1.
- `bit_length`: INT32, default 8. The number of bits of quantized numerical precision; currently only quantization to a signed integer is supported.

Inputs:
- `X`: FP32.
- `Scale`: FP32. When the quantization method is layer-wise, the size of `Scale` is 1. When the quantization method is axis-wise, the size of `Scale` equals the size of the input tensor along the `quant_axis` dimension.
- `ZeroPoint`: the size is the same as `Scale`. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.

Outputs:
- `Y`: INT32. The shape of `Y` is the same as `X`. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.
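For clarity, here is a minimal NumPy sketch of the semantics described above; the function signature, rounding mode, and broadcasting details are illustrative assumptions, not the actual Paddle kernel:

```python
import numpy as np

def quantize_linear(x, scale, zero_point, bit_length=8, quant_axis=None):
    # y = saturate(round(x / scale) + zero_point), saturating to the signed
    # bit_length range (e.g. [-128, 127] when bit_length is 8).
    qmin = -(2 ** (bit_length - 1))
    qmax = 2 ** (bit_length - 1) - 1
    if quant_axis is not None:
        # Per-axis (e.g. channel-wise) quantization: reshape the 1-D scale and
        # zero_point so they broadcast along quant_axis. For a conv input of
        # shape [batch, channel, H, W], quant_axis is 1.
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        scale = np.asarray(scale).reshape(shape)
        zero_point = np.asarray(zero_point).reshape(shape)
    # Note: np.round rounds half to even; the framework's rounding mode may differ.
    y = np.round(x / scale) + zero_point
    return np.clip(y, qmin, qmax).astype(np.int32)  # output dtype is INT32 per the spec
```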
`dequantize_linear`

According to `scale`, `zero_point` and `quant_axis`, the low-precision tensor is dequantized back into a high-precision tensor.

The dequantization formula is: `y = (x - zero_point) * scale`.

Attributes:
- `quant_axis`: INT32, optional. In the per-axis dequantization method, the axis along which dequantization is applied. If this attribute is not set, the dequantization method defaults to per-layer dequantization. For a convolution input `[batch, channel, H, W]`, the channel-wise dequantization method uses `quant_axis` = 1.
- `bit_length`: INT32, default 8. The number of bits of quantized numerical precision; currently only quantization to a signed integer is supported.

Inputs:
- `X`: INT32. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.
- `Scale`: FP32. When the dequantization method is layer-wise, the size of `Scale` is 1. When the dequantization method is axis-wise, the size of `Scale` equals the size of the input tensor along the `quant_axis` dimension.
- `ZeroPoint`: the size is the same as `Scale`. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.

Outputs:
- `Y`: FP32. The shape of `Y` is the same as `X`.
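Continuing the sketch above, a matching dequantization helper plus a round-trip usage example under the symmetric assumption (zero_point == 0) mentioned earlier in this thread; the per-channel scale computation is an illustrative assumption:

```python
def dequantize_linear(x, scale, zero_point, quant_axis=None):
    # y = (x - zero_point) * scale, producing an FP32 tensor with the same shape as x.
    scale = np.asarray(scale, dtype=np.float32)
    zero_point = np.asarray(zero_point)
    if quant_axis is not None:
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        scale = scale.reshape(shape)
        zero_point = zero_point.reshape(shape)
    return ((x - zero_point) * scale).astype(np.float32)

# Round trip with channel-wise scales (quant_axis=1) and symmetric zero_point == 0.
x = np.random.randn(1, 3, 4, 4).astype(np.float32)
scale = np.abs(x).max(axis=(0, 2, 3)) / 127.0  # one scale per channel
zp = np.zeros(3, dtype=np.int32)
q = quantize_linear(x, scale, zp, quant_axis=1)
x_rec = dequantize_linear(q, scale, zp, quant_axis=1)
print(np.abs(x - x_rec).max())  # small quantization error
```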
Testing model: https://paddle-inference-dist.bj.bcebos.com/temp/quantized_mobilenetv1.tar.gz
Refer to the definitions of the quantization operations in ONNX:
QuantizeLinear
DequantizeLinear