Basic support of quantization in TE #1
base: master
Conversation
- Added a dedicated interface (C++/Python) to control enabling/disabling quantization support in NNC.
- Enabled the quantization path in the TE fuser, together with a list of supported NNC quantization OPs.
- Decomposed quantized OPs into non-quantized OPs during the graph optimization pass.
- Added NNC lowering support for quantization OPs.
- Updated the existing NNC quantization lowering functions to obtain qscale and qzero at runtime.
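For illustration only, a minimal sketch of what such an enable/disable switch could look like; the names `texprQuantEnabled`/`setTexprQuantEnabled` and the use of a process-wide flag are assumptions for this example, not necessarily what the PR implements:

```cpp
// Hypothetical sketch of a global quantization toggle for the TE fuser.
// The real interface added by this PR may use different names; a pybind11
// binding would expose the same pair of functions to Python.
#include <atomic>

namespace torch {
namespace jit {
namespace tensorexpr {

static std::atomic<bool> texpr_quant_enabled{false};  // off by default

bool texprQuantEnabled() {
  return texpr_quant_enabled.load();
}

void setTexprQuantEnabled(bool enabled) {
  texpr_quant_enabled.store(enabled);
}

} // namespace tensorexpr
} // namespace jit
} // namespace torch
```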
torch/csrc/jit/tensorexpr/kernel.cpp
Outdated
qtensorInputIndex_.emplace(input_name_map_[input] + "_scale", runArgs_idx);
runArgs_idx += 2;
Is it possible to use the size of `bufferArgs_` to calculate `runArgs_idx` instead of explicitly tracking it with an extra variable?
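A standalone sketch of that suggestion, using simplified stand-in types and assuming each quantized input pushes its scale and zero-point entries onto `bufferArgs_` immediately before the indices are recorded:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Illustration only: derive the run-argument index of a quantized input's
// scale/zero_point from the size of the buffer-arg container instead of
// tracking a separate runArgs_idx counter. The names mirror kernel.cpp but
// the types here are simplified stand-ins.
int main() {
  std::vector<std::string> bufferArgs;                 // stand-in for bufferArgs_
  std::unordered_map<std::string, size_t> qtensorInputIndex;

  auto addQuantizedInput = [&](const std::string& name) {
    bufferArgs.push_back(name);                        // the tensor buffer itself
    bufferArgs.push_back(name + "_scale");             // qscale run argument
    bufferArgs.push_back(name + "_zero_point");        // qzero run argument
    // The index is recoverable from the container size; no extra counter needed.
    qtensorInputIndex.emplace(name + "_scale", bufferArgs.size() - 2);
    qtensorInputIndex.emplace(name + "_zero_point", bufferArgs.size() - 1);
  };

  addQuantizedInput("x");
  addQuantizedInput("y");
  return 0;
}
```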
@@ -16,6 +16,36 @@ TORCH_API ScalarType immQDType(const BufHandle& qx);

TORCH_API bool isQuantized(const BufHandle& qx);

TORCH_API bool isChannelsLast(const BufHandle& buf);
`BufHandle` already supports querying whether its node is channels-last contiguous. Why do we need to expose this interface?
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/expr.h#L400
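A small sketch of the alternative the reviewer points at, assuming the `Buf::is_contiguous(at::MemoryFormat)` query in expr.h behaves as its name suggests (that assumption is the substance of the comment above):

```cpp
#include <c10/core/MemoryFormat.h>
#include <torch/csrc/jit/tensorexpr/expr.h>

// Sketch only: instead of a separate isChannelsLast() helper, callers could ask
// the underlying Buf node directly (assuming is_contiguous(memory_format) has
// these semantics, per the expr.h link above).
bool isChannelsLastSketch(const torch::jit::tensorexpr::BufHandle& buf) {
  return buf.node()->is_contiguous(c10::MemoryFormat::ChannelsLast);
}
```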
@@ -16,6 +16,36 @@ TORCH_API ScalarType immQDType(const BufHandle& qx);

TORCH_API bool isQuantized(const BufHandle& qx);

TORCH_API bool isChannelsLast(const BufHandle& buf);

TORCH_API BufHandle makeQBufHandleContiguous(
What are the use cases that require exposing these interfaces?
ExprHandle qx_qscale = DoubleImm::make(0.0f);
ExprHandle qx_qzero = LongImm::make(1l);
Is there any special reason that the default value needs to be modified?
"quantized::add(Tensor qa, Tensor qb, float scale, int zero_point) -> Tensor qc", | ||
"quantized::mul(Tensor qa, Tensor qb, float scale, int zero_point)-> Tensor qc", | ||
"quantized::matmul(Tensor qa, Tensor qb, float scale, int zero_point)-> Tensor qc", | ||
"quantized::add_relu(Tensor qa, Tensor qb, float scale, int zero_point) -> Tensor qc", | ||
"quantized::conv2d.new(Tensor qx, __torch__.torch.classes.quantized.Conv2dPackedParamsBase packed_weight, float output_scale, int output_zero_point) -> (Tensor)", | ||
"quantized::conv2d_relu.new(Tensor qx, __torch__.torch.classes.quantized.Conv2dPackedParamsBase packed_weight, float output_scale, int output_zero_point) -> (Tensor)", | ||
"quantized::linear(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> (Tensor Y)", | ||
"quantized::linear_relu(Tensor X, __torch__.torch.classes.quantized.LinearPackedParamsBase W_prepack, float Y_scale_i, int Y_zero_point_i) -> (Tensor Y)", |
These operators serve the FX front end. So I think another option is to exclude these operators from the quantization set and enable `texpr_quant_enabled` by default.
Then we can combine the IPEX front end and NNC quantization into a mature solution, and then enhance IPEX INT8 out-of-the-box (OOB) performance.
Meanwhile, I'd recommend exposing an interface to get this quantization operation set so it can be extended, just like `getCustomOperatorSet`.
c10::optional<bool> pin_memory_opt,
double scale,
int64_t zero_point,
c10::optional<c10::MemoryFormat> optional_memory_format) {
`optional_memory_format` => `memory_format`
The interface difference between `empty_strided` and `empty_strided_quantized` should only be `scale` and `zero_point`. Hence, we should also pass `memory_format` to `LLVMCodeGen::empty_strided`.
It is more elegant to put the `c10::optional` parameters together.
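Combining the two suggestions above, a hedged sketch of how the two signatures could line up, with `memory_format` passed through, the `c10::optional` parameters kept together, and `scale`/`zero_point` as the only extra arguments (these declarations are illustrative, not copied from the PR):

```cpp
#include <ATen/ATen.h>
#include <c10/core/MemoryFormat.h>
#include <c10/util/Optional.h>

// Non-quantized overload shown first so the only difference is the trailing
// scale/zero_point pair; all c10::optional parameters stay adjacent.
at::Tensor empty_strided_sketch(
    at::IntArrayRef size,
    at::IntArrayRef stride,
    c10::optional<at::ScalarType> dtype_opt,
    c10::optional<at::Layout> layout_opt,
    c10::optional<at::Device> device_opt,
    c10::optional<bool> pin_memory_opt,
    c10::optional<c10::MemoryFormat> memory_format);

at::Tensor empty_strided_quantized_sketch(
    at::IntArrayRef size,
    at::IntArrayRef stride,
    c10::optional<at::ScalarType> dtype_opt,
    c10::optional<at::Layout> layout_opt,
    c10::optional<at::Device> device_opt,
    c10::optional<bool> pin_memory_opt,
    c10::optional<c10::MemoryFormat> memory_format,  // grouped with the other optionals
    double scale,                                    // the only additions vs. empty_strided
    int64_t zero_point);
```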
runArgs.emplace_back(inputTensor.data_ptr());
if (inputTensor.is_quantized()) {
  at::QuantizerPtr quantizer = inputTensor.quantizer();
  TORCH_INTERNAL_ASSERT(quantizer->qscheme() == c10::QScheme::PER_TENSOR_AFFINE,
It might be more general to use a tensor to store `scale` and `zero_point`. I'm thinking about how to support `PER_CHANNEL_AFFINE` in the future.
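A hedged sketch of the "store them in tensors" idea using existing ATen accessors; whether NNC would actually consume the parameters this way is an open design question rather than something this PR implements:

```cpp
#include <ATen/ATen.h>
#include <utility>

// Normalize per-tensor and per-channel quantization parameters into tensors so
// downstream code can treat both schemes uniformly: per-tensor parameters become
// 1-element tensors, per-channel parameters are already tensors.
std::pair<at::Tensor, at::Tensor> quantParamsAsTensors(const at::Tensor& qx) {
  TORCH_CHECK(qx.is_quantized(), "expected a quantized tensor");
  switch (qx.qscheme()) {
    case c10::kPerTensorAffine:
    case c10::kPerTensorSymmetric:
      return {at::tensor({qx.q_scale()}, at::kDouble),
              at::tensor({qx.q_zero_point()}, at::kLong)};
    case c10::kPerChannelAffine:
      return {qx.q_per_channel_scales(), qx.q_per_channel_zero_points()};
    default:
      TORCH_CHECK(false, "unsupported qscheme");
  }
}
```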
It also supports `PER_TENSOR_SYMMETRIC`. Is that correct?
torch/csrc/jit/tensorexpr/kernel.cpp
Outdated
bufferArgs_.emplace_back(zero_point);
inBuffer.node()->set_qscale(scale.node());
inBuffer.node()->set_qzero(zero_point.node());
qtensorInputIndex_.emplace(input_name_map_[input] + "_scale", runArgs_idx);
Could we use a map to replace `runArgs_idx`?
Updated the quantization operation set to support custom modification of the TE quantization operation set. A new interface, `getQuantizationOperationSet()`, is added for this customization. Added the scalar scale/zero_point version of `quantize_per_tensor` to the default quantization operation set. Added a new test case to test the new interface.
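A short usage sketch of the new hook, assuming `getQuantizationOperationSet()` follows the same pattern as `getCustomOperatorSet()` (returning a mutable `OperatorSet&`); the header, namespace, and the extra schema literal below are assumptions for illustration:

```cpp
#include <torch/csrc/jit/passes/tensorexpr_fuser.h>

// Sketch: extend the default TE quantization operation set before compiling,
// analogous to how getCustomOperatorSet() is extended for the TE fuser.
// The namespace and return type of getQuantizationOperationSet() are assumed.
void registerExtraQuantOpSketch() {
  auto& quant_ops = torch::jit::tensorexpr::getQuantizationOperationSet();
  // The schema literal must exactly match a registered quantized operator;
  // this one is only an example.
  quant_ops.insert({
      "quantized::mul_scalar(Tensor qa, Scalar b) -> Tensor qc",
  });
}
```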
Summary: The deleter of the operator's unique_ptr doesn't get called unless the unique_ptr is created after the op has been created. This fixes the problem reported in https://fb.workplace.com/groups/pytorch.edge.users/posts/1210708329799458/

Test Plan:

# Testing memory leak fix

**With test code added in D41487340:**

```
cd ~/fbsource/xplat
buck run caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test:qsoftmax_test
```

Before this diff:

```
==2060866==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 608 byte(s) in 1 object(s) allocated from:
    #0 0x41bcd27 in calloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcd27)
    #1 0x405b692 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:77

Indirect leak of 1024 byte(s) in 1 object(s) allocated from:
    #0 0x41bcb7f in malloc (/data/users/salilsdesai/fbsource/buck-out/gen/aab7ed39/xplat/caffe2/aten/src/ATen/native/quantized/cpu/qsoftmax_test/qsoftmax_test+0x41bcb7f)
    #1 0x405b6a8 in pytorch_qnnp_create_softargmax_nc_q8 xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/src/softargmax.c:85

SUMMARY: AddressSanitizer: 1632 byte(s) leaked in 2 allocation(s).
```

After this diff:
- No errors

___

# Testing op correctness

```
cd ~/fbsource/fbcode
buck test caffe2/test/quantization:quantization -- test_qsoftmax
```

Passes - https://www.internalfb.com/intern/testinfra/testconsole/testrun/2814749908834332/

Differential Revision: D41487341

Pull Request resolved: pytorch#89544
Approved by: https://github.com/mcr229
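The ordering issue that commit message describes is a generic C++ pitfall; a condensed illustration with a made-up `create_op`/`release_op` API (not the real QNNPACK calls) looks like this:

```cpp
#include <memory>

// Stand-ins for a C-style create/destroy API such as the QNNPACK operator calls
// mentioned above (names here are hypothetical).
struct op_t { int state; };
int create_op(op_t** out) { *out = new op_t{0}; return 0; }
void release_op(op_t* op) { delete op; }

void leaky() {
  op_t* raw = nullptr;
  // Bug pattern: the guard is constructed while `raw` is still null, so it owns
  // nothing; the operator created afterwards is never released.
  std::unique_ptr<op_t, decltype(&release_op)> guard(raw, &release_op);
  create_op(&raw);
}  // <-- the operator pointed to by `raw` leaks here

void fixed() {
  op_t* raw = nullptr;
  create_op(&raw);
  // Fix: create the unique_ptr only after the operator exists, so its deleter
  // actually sees the allocated object.
  std::unique_ptr<op_t, decltype(&release_op)> guard(raw, &release_op);
}  // <-- deleter runs, no leak
```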
This PR aims to provide basic support for quantization in NNC, covering the whole path from the TE fuser down to OP lowering into TE IR.
The basic philosophy is to leverage the FP32 path in TE as much as possible for OPs that have no dedicated hardware acceleration benefit. This enables PyTorch to support quantization as quickly and efficiently as possible, while OPs that do benefit from hardware acceleration keep dedicated TE IR implementations.
With this philosophy, OPs like Conv/Matmul should definitely have dedicated TE IR implementations, since they map onto CPU acceleration such as AVX512-VNNI. Most other OPs fall into the first category and leverage the corresponding FP32 OPs in TE.
The way to leverage an FP32 OP is to 'Decompose' it -- that is, a quantized OP is decomposed into a sequence of dequantize / FP32 OP / quantize. For example: quantized::mul => aten::dequantize / aten::mul / aten::quantize_per_tensor. After the TE fuser obtains the sub-graph, it runs this 'Decompose' pass as part of its graph optimization phase; all quantization OPs are then replaced with their corresponding sequences, except the quantized Conv/Matmuls, which are lowered into TE IR directly.
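For illustration, here is a minimal sketch of what such a 'Decompose' rewrite could look like when expressed with the generic `SubgraphRewriter` pass; the actual pass in this PR may be implemented differently, and the dtype constant (13, assumed to correspond to quint8) is chosen only to make the example concrete:

```cpp
#include <memory>
#include <string>

#include <torch/csrc/jit/ir/ir.h>
#include <torch/csrc/jit/passes/subgraph_rewrite.h>

// Sketch: rewrite quantized::mul into dequantize / aten::mul / quantize_per_tensor.
// The PR's real decompose pass may differ; this only illustrates the idea.
void decomposeQuantizedMulSketch(std::shared_ptr<torch::jit::Graph>& graph) {
  const std::string pattern = R"IR(
    graph(%qa, %qb, %scale, %zero_point):
      %qc = quantized::mul(%qa, %qb, %scale, %zero_point)
      return (%qc))IR";

  // 13 == ScalarType::QUInt8 (assumption for this example; a real pass would
  // derive the output dtype from the inputs instead of hard-coding it).
  const std::string replacement = R"IR(
    graph(%qa, %qb, %scale, %zero_point):
      %a = aten::dequantize(%qa)
      %b = aten::dequantize(%qb)
      %c = aten::mul(%a, %b)
      %dtype : int = prim::Constant[value=13]()
      %qc = aten::quantize_per_tensor(%c, %scale, %zero_point, %dtype)
      return (%qc))IR";

  torch::jit::SubgraphRewriter rewriter;
  rewriter.RegisterRewritePattern(pattern, replacement);
  rewriter.runOnGraph(graph);
}
```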
Besides this basic philosophy, the other major part of quantization support is embedding quantization information such as qscale and qzero_point into TE execution, both when compiling the TE kernel and when running it. This requires both TE framework support and TE OP-level support.
More details are listed below: