Basic support of quantization in TE #1

Open
wants to merge 3 commits into master

Conversation

Guobing-Chen (Owner) commented Jul 10, 2022

This PR aims to provide basic support for quantization in NNC, covering the whole path from the TE fuser through OP lowering to TE IR.

The basic philosophy is to leverage the FP32 path in TE as much as possible for OPs that have no dedicated hardware acceleration or benefit, which lets PyTorch support quantization as quickly and efficiently as possible, while keeping dedicated TE IR implementations for the hardware-beneficial OPs.

Under this philosophy, OPs like Conv/Matmul should definitely get dedicated TE IR implementations, since they benefit from CPU acceleration such as AVX512-VNNI. Most of the other OPs fall into the first category and reuse the corresponding FP32 OPs in TE.

The mechanism used to leverage FP32 OPs is 'Decompose': a quantized OP is decomposed into an OP sequence of dequantize / FP32 OP / quantize. For example, quantized::mul becomes aten::dequantize / aten::mul / aten::quantize_per_tensor. After the TE fuser gets the sub-graph, it runs this 'Decompose' pass as part of its graph optimization phase, so all quantization OPs are replaced with the corresponding sequences except the quantized Conv/Matmul OPs, which are then lowered into TE IR directly.
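
As a concrete illustration of the decomposition, the sketch below (not code from this PR; it assumes a PyTorch build with a quantized CPU backend such as fbgemm) checks that quantized::mul is numerically equivalent to the dequantize / aten::mul / quantize_per_tensor sequence the pass emits:

```python
# Minimal sketch of the 'Decompose' equivalence for quantized::mul.
import torch

a, b = torch.randn(2, 3), torch.randn(2, 3)
qa = torch.quantize_per_tensor(a, scale=0.1, zero_point=10, dtype=torch.quint8)
qb = torch.quantize_per_tensor(b, scale=0.1, zero_point=10, dtype=torch.quint8)

# Fused quantized OP (the kind kept as-is only when hardware-beneficial).
out_fused = torch.ops.quantized.mul(qa, qb, 0.2, 0)

# Decomposed sequence that the existing FP32 TE path can already handle.
out_decomposed = torch.quantize_per_tensor(
    torch.dequantize(qa) * torch.dequantize(qb), 0.2, 0, torch.quint8)

# Both paths agree up to quantization rounding (one quantum = 0.2 here).
print(torch.allclose(out_fused.dequantize(), out_decomposed.dequantize(), atol=0.2))
```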

Besides the basic philosophy, another major part of quantization support is embedding quantization information such as qscale and qzero_point into TE execution, for both TE kernel compilation and kernel runs. This includes both TE framework support and TE OP level support.
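
For reference, the per-tensor affine scheme these parameters describe is q = clamp(round(x / scale) + zero_point, qmin, qmax) and x ~= (q - zero_point) * scale. The short sketch below (plain PyTorch tensor APIs, not code from this PR) shows the values a compiled kernel has to pick up from its quantized inputs at run time:

```python
# Reading per-tensor affine parameters from a quantized input at run time.
import torch

x = torch.randn(4, 4)
qx = torch.quantize_per_tensor(x, scale=0.05, zero_point=64, dtype=torch.quint8)

print(qx.qscheme())       # torch.per_tensor_affine
print(qx.q_scale())       # ~0.05 -- not known at kernel-compile time
print(qx.q_zero_point())  # 64    -- not known at kernel-compile time

# Round-trip check of the affine mapping: x ~= (q - zero_point) * scale.
manual = (qx.int_repr().float() - qx.q_zero_point()) * qx.q_scale()
print(torch.allclose(manual, qx.dequantize()))
```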

More details are listed below:

  • Added a dedicated interface (C++/Python) to control enabling/disabling quantization support in NNC (a usage sketch follows this list).
torch/csrc/jit/passes/tensorexpr_fuser.cpp
torch/csrc/jit/passes/tensorexpr_fuser.h
torch/csrc/jit/python/init.cpp

  • Enabled the quantization path in the TE fuser and added a list of supported TE quantization OPs
torch/csrc/jit/passes/tensorexpr_fuser.cpp
  • Decompose quantized OPs into non-quantized OPs during the graph optimization pass
torch/csrc/jit/tensorexpr/graph_opt.cpp
torch/csrc/jit/tensorexpr/graph_opt.h
  • TE kernel compilation and runtime support for quantization OPs
torch/csrc/jit/tensorexpr/kernel.cpp
torch/csrc/jit/tensorexpr/kernel.h
torch/csrc/jit/tensorexpr/codegen.h
torch/csrc/jit/tensorexpr/llvm_codegen.h
torch/csrc/jit/tensorexpr/llvm_codegen.cpp
  • Updated existing NNC quantization lowering functions to obtain qscale and qzero from runtime
torch/csrc/jit/tensorexpr/operators/quantization.cpp
torch/csrc/jit/tensorexpr/operators/quantization.h
  • Added tests for the decompose graph pass and a new quantized OP 'addrelu' to showcase the decompose capability
test/cpp/tensorexpr/test_graph_opt.cpp
test/cpp/tensorexpr/test_quantization.cpp
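
A hypothetical usage sketch of the enable/disable interface mentioned in the first bullet. The binding name below is an assumption inferred from the existing TE fuser toggles (e.g. torch._C._jit_set_texpr_fuser_enabled()) and the texpr_quant_enabled flag mentioned in review; the actual name added by this PR may differ:

```python
# Hypothetical Python-side toggle for NNC/TE quantization support.
# '_jit_set_texpr_quant_enabled' is an assumed binding name, guarded with hasattr.
import torch

if hasattr(torch._C, "_jit_set_texpr_quant_enabled"):
    torch._C._jit_set_texpr_quant_enabled(True)   # let the TE fuser accept quantized OPs
    # ... script/trace and run the quantized model here ...
    torch._C._jit_set_texpr_quant_enabled(False)  # switch quantization support back off
```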

Added dedicated interface(c/python) to control the enable/disable of
quantization support in NNC.

Enable quantization path in TE fuser and a list of supported NNC
quantization OPs.

Decompose quantized OP into non-quant OPs during Graph optimization
pass.

NNC lowering support for quantization OPs.

Updated existing NNC quantization lowering functions to support obtaining
qscale and qzero from runtime.
Comment on lines 1047 to 1048
qtensorInputIndex_.emplace(input_name_map_[input] + "_scale", runArgs_idx);
runArgs_idx += 2;

Collaborator:
Is it possible to use the size of bufferArgs_ to calculate the runArgs_idx instead of explicitly tracking it with an extra variable?

@@ -16,6 +16,36 @@ TORCH_API ScalarType immQDType(const BufHandle& qx);

TORCH_API bool isQuantized(const BufHandle& qx);

TORCH_API bool isChannelsLast(const BufHandle& buf);

Collaborator:
BufHandle already supports checking whether its node is channels-last contiguous. Why do we need to expose this interface?
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/expr.h#L400

@@ -16,6 +16,36 @@ TORCH_API ScalarType immQDType(const BufHandle& qx);

TORCH_API bool isQuantized(const BufHandle& qx);

TORCH_API bool isChannelsLast(const BufHandle& buf);

TORCH_API BufHandle makeQBufHandleContiguous(

Collaborator:
What are the use cases that require exposing these interfaces?

Comment on lines +772 to +773
ExprHandle qx_qscale = DoubleImm::make(0.0f);
ExprHandle qx_qzero = LongImm::make(1l);

Collaborator:
Is there any special reason that the default value needs to be modified?

Comment on lines +95 to +102
"quantized::add(Tensor qa, Tensor qb, float scale, int zero_point) -> Tensor qc",
"quantized::mul(Tensor qa, Tensor qb, float scale, int zero_point)-> Tensor qc",
"quantized::matmul(Tensor qa, Tensor qb, float scale, int zero_point)-> Tensor qc",
"quantized::add_relu(Tensor qa, Tensor qb, float scale, int zero_point) -> Tensor qc",
"quantized::conv2d.new(Tensor qx, __torch__.torch.classes.quantized.Conv2dPackedParamsBase packed_weight, float output_scale, int output_zero_point) -> (Tensor)",
"quantized::conv2d_relu.new(Tensor qx, __torch__.torch.classes.quantized.Conv2dPackedParamsBase packed_weight, float output_scale, int output_zero_point) -> (Tensor)",
"quantized::linear(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> (Tensor Y)",
"quantized::linear_relu(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> (Tensor Y)",

Collaborator:
These operators serve the FX front end, so another option is to exclude these operators from the quantization set and enable texpr_quant_enabled by default.

Collaborator:
Then we can combine the IPEX front end and NNC quantization into a mature solution and enhance IPEX INT8 out-of-the-box (OOB) performance.

Collaborator:
Meanwhile, I'd recommend exposing an interface to get this quantization operation set so that it can be extended, just like getCustomOperatorSet.

c10::optional<bool> pin_memory_opt,
double scale,
int64_t zero_point,
c10::optional<c10::MemoryFormat> optional_memory_format) {

Collaborator:
optional_memory_format => memory_format

Collaborator:
The only interface difference between empty_strided and empty_strided_quantized should be scale and zero_point. Hence, we should also pass memory_format to LLVMCodeGen::empty_strided.

Collaborator:
It is more elegant to put c10::optional parameters together.

runArgs.emplace_back(inputTensor.data_ptr());
if (inputTensor.is_quantized()) {
at::QuantizerPtr quantizer = inputTensor.quantizer();
TORCH_INTERNAL_ASSERT(quantizer->qscheme() == c10::QScheme::PER_TENSOR_AFFINE,

Collaborator:
It might be more general to use a tensor to store scale and zero_point. I'm thinking about how to support PER_CHANNEL_AFFINE in the future.

Collaborator:
It also supports PER_TENSOR_SYMMETRIC. Is that correct?
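
To illustrate the reviewers' point about PER_CHANNEL_AFFINE, the sketch below (plain PyTorch APIs, not code from this PR) shows why scale and zero_point become tensor-valued rather than scalar once per-channel schemes are involved:

```python
# Per-tensor affine carries one (scale, zero_point) pair; per-channel affine
# carries one pair per channel along a chosen axis, so scalars are not enough.
import torch

w = torch.randn(8, 3, 3, 3)                      # e.g. a conv weight
scales = torch.rand(8) * 0.1 + 0.01              # one scale per output channel
zero_points = torch.zeros(8, dtype=torch.int64)  # one zero point per output channel

qw = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

print(qw.qscheme())                    # torch.per_channel_affine
print(qw.q_per_channel_scales())       # tensor of 8 scales
print(qw.q_per_channel_zero_points())  # tensor of 8 zero points
print(qw.q_per_channel_axis())         # 0
```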

bufferArgs_.emplace_back(zero_point);
inBuffer.node()->set_qscale(scale.node());
inBuffer.node()->set_qzero(zero_point.node());
qtensorInputIndex_.emplace(input_name_map_[input] + "_scale", runArgs_idx);

Collaborator:
Could we use a map to replace runArgs_idx?

Updated the quantization operation set to support user customization of the
TE quantization operation set.

A new interface, getQuantizationOperationSet(), is added for this
customization.

Added the scalar scale/zero_point version of quantize_per_tensor to the
default quantization operation set.

Added a new test case to exercise the new interface.
Guobing-Chen pushed a commit that referenced this pull request Dec 5, 2022
Summary:
The deleter of the operator's unique_ptr doesn't get called unless the unique_ptr is created after the op has been created.

This fixes the problem reported in
https://fb.workplace.com/groups/pytorch.edge.users/posts/1210708329799458/

Test Plan:
# Testing memory leak fix

**With test code added in D41487340:**
```
cd ~/fbsource/xplat
buck run caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test:qsoftmax_test
```

Before this diff:

```
==2060866==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 608 byte(s) in 1 object(s) allocated from:
    #0 0x41bcd27 in calloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcd27)
    #1 0x405b692 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:77

Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x41bcb7f in malloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcb7f)
    #1 0x405b6a8 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:85

SUMMARY: AddressSanitizer: 1632 byte(s) leaked in 2 allocation(s).
```

After this diff:
- No errors
___

# Testing op correctness

```
cd ~/fbsource/fbcode
buck test caffe2/test/quantization:quantization -- test_qsoftmax
```
Passes
- https://www.internalfb.com/intern/testinfra/testconsole/testrun/2814749908834332/

Differential Revision: D41487341

Pull Request resolved: pytorch#89544
Approved by: https://github.com/mcr229