[Feature] Adaptation of the new quantization method for mkldnn. #37422
Hi @baoachun, I cannot download this model; the URL does not look reachable from outside. Could you please try dragging the model into the comment box?
Python-to-C++ migration moved to a new PR: #38643
@lidanqing-intel @wozna I am wondering whether we can refer to the GPU quantization pass when designing the mkldnn quantization pass. The current Python script does a lot of weight processing, which is not in line with the intended role of a pass, and the code is difficult to maintain. For example, this scheme involves transferring data between multiple passes; as far as I know, that is currently not supported, so I can only store the information in the graph. In addition, the pass performs many weight operations, such as reshape and transpose, which will add a huge maintenance burden later. In view of this, I hope we can re-discuss and plan the implementation. Thanks!
To be refined:
Steps:
Confirmed: oneDNN does support asymmetric quantization (https://oneapi-src.github.io/oneDNN/dev_guide_convolution.html). We should just reuse the GPU passes:
Notes from the 5/20 meeting:
Hi @baoachun @yeliang2258, I'm working on this issue, building on the changes Danqing made previously in #42106, where there is still an accuracy problem. From what I can see, the gathering of scales is missing there, and that is probably why it doesn't work properly. WIP
@wozna Hi, ScaleFilePath stores the scale information of all tensors in the quantized model. This file was added recently. The missing-scale problem you mentioned may be solved by loading this file.
This new quantization method was implemented by @yeliang2258 in #45416. ZeroPoint is not handled yet. oneDNN has this option, so it is possible to add it. @yeliang2258, do you know whether any of the new models use this ZeroPoint value for data shift?
@wozna I consulted @yeliang2258 and he confirmed that PaddleSlim (the team that produces quantized models) supports only symmetric quantization, so all models should have a zero_point of 0.
`quantize_linear` and `dequantize_linear`

`quantize_linear`

The quantization process requires two parameters, `scale` and `zero_point`, both of which are 1-dimensional tensors.

The quantization formula is: `y = saturate(round(x / scale) + zero_point)`.

Attributes:
- `quant_axis`: INT32, optional. In the per-axis quantization method, the axis along which quantization is applied. If this attribute is not set, the quantization method defaults to per-layer quantization. For a convolution input `[batch, channel, H, W]`, the channel-wise quantization method uses `quant_axis` = 1.
- `bit_length`: INT32, default 8. The number of bits of quantized numerical precision; currently only quantization to a signed integer is supported.

Inputs:
- `X`: FP32.
- `Scale`: FP32. When the quantization method is layer-wise, the size of `Scale` is 1. When the quantization method is axis-wise, the size of `Scale` equals the size of the input tensor along the `quant_axis` dimension.
- `ZeroPoint`: the size is the same as `Scale`. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.

Outputs:
- `Y`: INT32. The shape of `Y` is the same as `X`. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.
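For clarity, here is a minimal NumPy sketch of the semantics described above; the function signature, rounding mode, and broadcasting details are illustrative assumptions, not the actual Paddle kernel:

```python
import numpy as np

def quantize_linear(x, scale, zero_point, bit_length=8, quant_axis=None):
    # y = saturate(round(x / scale) + zero_point), saturating to the signed
    # bit_length range (e.g. [-128, 127] when bit_length is 8).
    qmin = -(2 ** (bit_length - 1))
    qmax = 2 ** (bit_length - 1) - 1
    if quant_axis is not None:
        # Per-axis (e.g. channel-wise) quantization: reshape the 1-D scale and
        # zero_point so they broadcast along quant_axis. For a conv input of
        # shape [batch, channel, H, W], quant_axis is 1.
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        scale = np.asarray(scale).reshape(shape)
        zero_point = np.asarray(zero_point).reshape(shape)
    # Note: np.round rounds half to even; the framework's rounding mode may differ.
    y = np.round(x / scale) + zero_point
    return np.clip(y, qmin, qmax).astype(np.int32)  # output dtype is INT32 per the spec
```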
`dequantize_linear`

According to `scale`, `zero_point` and `quant_axis`, the low-precision tensor is dequantized back into a high-precision tensor.

The dequantization formula is: `y = (x - zero_point) * scale`.

Attributes:
- `quant_axis`: INT32, optional. In the per-axis dequantization method, the axis along which dequantization is applied. If this attribute is not set, the dequantization method defaults to per-layer dequantization. For a convolution input `[batch, channel, H, W]`, the channel-wise dequantization method uses `quant_axis` = 1.
- `bit_length`: INT32, default 8. The number of bits of quantized numerical precision; currently only quantization to a signed integer is supported.

Inputs:
- `X`: INT32. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.
- `Scale`: FP32. When the dequantization method is layer-wise, the size of `Scale` is 1. When the dequantization method is axis-wise, the size of `Scale` equals the size of the input tensor along the `quant_axis` dimension.
- `ZeroPoint`: the size is the same as `Scale`. The value range depends on `bit_length`; when `bit_length` is 8, the values lie within the Int8 range.

Outputs:
- `Y`: FP32. The shape of `Y` is the same as `X`.
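Continuing the sketch above, a matching dequantization helper plus a round-trip usage example under the symmetric assumption (zero_point == 0) mentioned earlier in this thread; the per-channel scale computation is an illustrative assumption:

```python
def dequantize_linear(x, scale, zero_point, quant_axis=None):
    # y = (x - zero_point) * scale, producing an FP32 tensor with the same shape as x.
    scale = np.asarray(scale, dtype=np.float32)
    zero_point = np.asarray(zero_point)
    if quant_axis is not None:
        shape = [1] * x.ndim
        shape[quant_axis] = -1
        scale = scale.reshape(shape)
        zero_point = zero_point.reshape(shape)
    return ((x - zero_point) * scale).astype(np.float32)

# Round trip with channel-wise scales (quant_axis=1) and symmetric zero_point == 0.
x = np.random.randn(1, 3, 4, 4).astype(np.float32)
scale = np.abs(x).max(axis=(0, 2, 3)) / 127.0  # one scale per channel
zp = np.zeros(3, dtype=np.int32)
q = quantize_linear(x, scale, zp, quant_axis=1)
x_rec = dequantize_linear(q, scale, zp, quant_axis=1)
print(np.abs(x - x_rec).max())  # small quantization error
```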
Testing model: https://paddle-inference-dist.bj.bcebos.com/temp/quantized_mobilenetv1.tar.gz
Refer to the definitions of the quantization operations in ONNX:
QuantizeLinear
DequantizeLinear