Basic support of quantization in TE #1

Open
wants to merge 3 commits into master

Conversation

Guobing-Chen (Owner) commented Jul 10, 2022

This PR aims to provide basic support for quantization in NNC, covering the whole path from the TE fuser through OP lowering to TE IR.

The basic philosophy is to leverage the FP32 path in TE as much as possible for OPs that have no dedicated hardware acceleration or benefit, which lets PyTorch support quantization as quickly and efficiently as possible, while keeping dedicated TE IR implementations for the hardware-beneficial OPs.

Under this philosophy, OPs like Conv/Matmul should definitely get dedicated TE IR implementations, since they benefit from CPU acceleration such as AVX512-VNNI. Most of the other OPs fall into the first category and reuse the corresponding FP32 OPs in TE.

The mechanism used to leverage FP32 OPs is 'Decompose': a quantized OP is decomposed into an OP sequence of dequantize / FP32 OP / quantize. For example, quantized::mul becomes aten::dequantize / aten::mul / aten::quantize_per_tensor. After the TE fuser gets the sub-graph, it runs this 'Decompose' pass as part of its graph optimization phase, so all quantization OPs are replaced with the corresponding sequences except the quantized Conv/Matmul OPs, which are then lowered into TE IR directly.
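
As a concrete illustration of the decomposition, the sketch below (not code from this PR; it assumes a PyTorch build with a quantized CPU backend such as fbgemm) checks that quantized::mul is numerically equivalent to the dequantize / aten::mul / quantize_per_tensor sequence the pass emits:

```python
# Minimal sketch of the 'Decompose' equivalence for quantized::mul.
import torch

a, b = torch.randn(2, 3), torch.randn(2, 3)
qa = torch.quantize_per_tensor(a, scale=0.1, zero_point=10, dtype=torch.quint8)
qb = torch.quantize_per_tensor(b, scale=0.1, zero_point=10, dtype=torch.quint8)

# Fused quantized OP (the kind kept as-is only when hardware-beneficial).
out_fused = torch.ops.quantized.mul(qa, qb, 0.2, 0)

# Decomposed sequence that the existing FP32 TE path can already handle.
out_decomposed = torch.quantize_per_tensor(
    torch.dequantize(qa) * torch.dequantize(qb), 0.2, 0, torch.quint8)

# Both paths agree up to quantization rounding (one quantum = 0.2 here).
print(torch.allclose(out_fused.dequantize(), out_decomposed.dequantize(), atol=0.2))
```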

Besides the basic philosophy, another major part of quantization support is embedding quantization information such as qscale and qzero_point into TE execution, for both TE kernel compilation and kernel runs. This includes both TE framework support and TE OP level support.
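
For reference, the per-tensor affine scheme these parameters describe is q = clamp(round(x / scale) + zero_point, qmin, qmax) and x ~= (q - zero_point) * scale. The short sketch below (plain PyTorch tensor APIs, not code from this PR) shows the values a compiled kernel has to pick up from its quantized inputs at run time:

```python
# Reading per-tensor affine parameters from a quantized input at run time.
import torch

x = torch.randn(4, 4)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=64, dtype=torch.quint8)

print(qx.qscheme())       # torch.per_tensor_affine
print(qx.q_scale())       # ~0.05 -- not known at kernel-compile time
print(qx.q_zero_point())  # 64    -- not known at kernel-compile time

# Round-trip check of the affine mapping: x ~= (q - zero_point) * scale.
manual = (qx.int_repr().float() - qx.q_zero_point()) * qx.q_scale()
print(torch.allclose(manual, qx.dequantize()))
```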

More details are listed below:

  • Added a dedicated interface (C++/Python) to control enabling/disabling quantization support in NNC (a usage sketch follows this list).
torch/csrc/jit/passes/tensorexpr_fuser.cpp
torch/csrc/jit/passes/tensorexpr_fuser.h
torch/csrc/jit/python/init.cpp

  • Enabled the quantization path in the TE fuser and added a list of supported TE quantization OPs
torch/csrc/jit/passes/tensorexpr_fuser.cpp
  • Decompose quantized OPs into non-quantized OPs during the graph optimization pass
torch/csrc/jit/tensorexpr/graph_opt.cpp
torch/csrc/jit/tensorexpr/graph_opt.h
  • TE kernel compilation and runtime support for quantization OPs
torch/csrc/jit/tensorexpr/kernel.cpp
torch/csrc/jit/tensorexpr/kernel.h
torch/csrc/jit/tensorexpr/codegen.h
torch/csrc/jit/tensorexpr/llvm_codegen.h
torch/csrc/jit/tensorexpr/llvm_codegen.cpp
  • Updated existing NNC quantization lowering functions to obtain qscale and qzero from runtime
torch/csrc/jit/tensorexpr/operators/quantization.cpp
torch/csrc/jit/tensorexpr/operators/quantization.h
  • Added tests for the decompose graph pass and a new quantized OP 'addrelu' to showcase the decompose capability
test/cpp/tensorexpr/test_graph_opt.cpp
test/cpp/tensorexpr/test_quantization.cpp
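
A hypothetical usage sketch of the enable/disable interface mentioned in the first bullet. The binding name below is an assumption inferred from the existing TE fuser toggles (e.g. torch._C._jit_set_texpr_fuser_enabled()) and the texpr_quant_enabled flag mentioned in review; the actual name added by this PR may differ:

```python
# Hypothetical Python-side toggle for NNC/TE quantization support.
# '_jit_set_texpr_quant_enabled' is an assumed binding name, guarded with hasattr.
import torch

if hasattr(torch._C, "_jit_set_texpr_quant_enabled"):
    torch._C._jit_set_texpr_quant_enabled(True)   # let the TE fuser accept quantized OPs
    # ... script/trace and run the quantized model here ...
    torch._C._jit_set_texpr_quant_enabled(False)  # switch quantization support back off
```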

Added dedicated interface(c/python) to control the enable/disable of
quantization support in NNC.

Enable quantization path in TE fuser and a list of supported NNC
quantization OPs.

Decompose quantized OP into non-quant OPs during Graph optimization
pass.

NNC lowering support for quantization OPs.

Updated existing NNC quantization lowering functions to support obtaining
qscale and qzero from runtime.
Comment on lines 1047 to 1048
qtensorInputIndex_.emplace(input_name_map_[input] + "_scale", runArgs_idx);
runArgs_idx += 2;

Collaborator:
Is it possible to use the size of bufferArgs_ to calculate the runArgs_idx instead of explicitly tracking it with an extra variable?

@@ -16,6 +16,36 @@ TORCH_API ScalarType immQDType(const BufHandle& qx);

TORCH_API bool isQuantized(const BufHandle& qx);

TORCH_API bool isChannelsLast(const BufHandle& buf);

Collaborator:
BufHandle already supports checking whether its node is channels-last contiguous. Why do we need to expose this interface?
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/expr.h#L400

@@ -16,6 +16,36 @@ TORCH_API ScalarType immQDType(const BufHandle& qx);

TORCH_API bool isQuantized(const BufHandle& qx);

TORCH_API bool isChannelsLast(const BufHandle& buf);

TORCH_API BufHandle makeQBufHandleContiguous(

Collaborator:
What are the use cases that require exposing these interfaces?

Comment on lines +772 to +773
ExprHandle qx_qscale = DoubleImm::make(0.0f);
ExprHandle qx_qzero = LongImm::make(1l);

Collaborator:
Is there any special reason that the default value needs to be modified?

Comment on lines +95 to +102
"quantized::add(Tensor qa, Tensor qb, float scale, int zero_point) -> Tensor qc",
"quantized::mul(Tensor qa, Tensor qb, float scale, int zero_point)-> Tensor qc",
"quantized::matmul(Tensor qa, Tensor qb, float scale, int zero_point)-> Tensor qc",
"quantized::add_relu(Tensor qa, Tensor qb, float scale, int zero_point) -> Tensor qc",
"quantized::conv2d.new(Tensor qx, __torch__.torch.classes.quantized.Conv2dPackedParamsBase packed_weight, float output_scale, int output_zero_point) -> (Tensor)",
"quantized::conv2d_relu.new(Tensor qx, __torch__.torch.classes.quantized.Conv2dPackedParamsBase packed_weight, float output_scale, int output_zero_point) -> (Tensor)",
"quantized::linear(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> (Tensor Y)",
"quantized::linear_relu(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> (Tensor Y)",

Collaborator:
These operators serve the FX front end, so another option is to exclude these operators from the quantization set and enable texpr_quant_enabled by default.

Collaborator:
Then we can combine the IPEX front end and NNC quantization into a mature solution and enhance IPEX INT8 out-of-the-box (OOB) performance.

Collaborator:
Meanwhile, I'd recommend exposing an interface to get this quantization operation set so that it can be extended, just like getCustomOperatorSet.

c10::optional<bool> pin_memory_opt,
double scale,
int64_t zero_point,
c10::optional<c10::MemoryFormat> optional_memory_format) {

Collaborator:
optional_memory_format => memory_format

Collaborator:
The only interface difference between empty_strided and empty_strided_quantized should be scale and zero_point. Hence, we should also pass memory_format to LLVMCodeGen::empty_strided.

Collaborator:
It is more elegant to put c10::optional parameters together.

runArgs.emplace_back(inputTensor.data_ptr());
if (inputTensor.is_quantized()) {
at::QuantizerPtr quantizer = inputTensor.quantizer();
TORCH_INTERNAL_ASSERT(quantizer->qscheme() == c10::QScheme::PER_TENSOR_AFFINE,

Collaborator:
It might be more general to use a tensor to store scale and zero_point. I'm thinking about how to support PER_CHANNEL_AFFINE in the future.

Collaborator:
It also supports PER_TENSOR_SYMMETRIC. Is that correct?
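
To illustrate the reviewers' point about PER_CHANNEL_AFFINE, the sketch below (plain PyTorch APIs, not code from this PR) shows why scale and zero_point become tensor-valued rather than scalar once per-channel schemes are involved:

```python
# Per-tensor affine carries one (scale, zero_point) pair; per-channel affine
# carries one pair per channel along a chosen axis, so scalars are not enough.
import torch

w = torch.randn(8, 3, 3, 3)                      # e.g. a conv weight
scales = torch.rand(8) * 0.1 + 0.01              # one scale per output channel
zero_points = torch.zeros(8, dtype=torch.int64)  # one zero point per output channel

qw = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

print(qw.qscheme())                    # torch.per_channel_affine
print(qw.q_per_channel_scales())       # tensor of 8 scales
print(qw.q_per_channel_zero_points())  # tensor of 8 zero points
print(qw.q_per_channel_axis())         # 0
```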

bufferArgs_.emplace_back(zero_point);
inBuffer.node()->set_qscale(scale.node());
inBuffer.node()->set_qzero(zero_point.node());
qtensorInputIndex_.emplace(input_name_map_[input] + "_scale", runArgs_idx);

Collaborator:
Could we use a map to replace runArgs_idx?

Updated the quantization operation set to support user customization of the
TE quantization operation set.

A new interface, getQuantizationOperationSet(), is added for this
customization.

Added the scalar scale/zero_point version of quantize_per_tensor to the
default quantization operation set.

Added a new test case to exercise the new interface.
Guobing-Chen pushed a commit that referenced this pull request Dec 5, 2022
Summary:
The deleter of the operator's unique_ptr doesn't get called unless the unique_ptr is created after the op has been created.

This fixes the problem reported in
https://fb.workplace.com/groups/pytorch.edge.users/posts/1210708329799458/

Test Plan:
# Testing memory leak fix

**With test code added in D41487340:**
```
cd ~/fbsource/xplat
buck run caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test:qsoftmax_test
```

Before this diff:

```
==2060866==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 608 byte(s) in 1 object(s) allocated from:
    #0 0x41bcd27 in calloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcd27)
    #1 0x405b692 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:77

Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x41bcb7f in malloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcb7f)
    #1 0x405b6a8 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:85

SUMMARY: AddressSanitizer: 1632 byte(s) leaked in 2 allocation(s).
```

After this diff:
- No errors
___

# Testing op correctness

```
cd ~/fbsource/fbcode
buck test caffe2/test/quantization:quantization -- test_qsoftmax
```
Passes
- https://www.internalfb.com/intern/testinfra/testconsole/testrun/2814749908834332/

Differential Revision: D41487341

Pull Request resolved: pytorch#89544
Approved by: https://github.com/mcr229