[MXNET-133] Model Quantization with Calibration #9552
Conversation
hi @ZihengJiang, which layers use the quantized version to produce the accuracy numbers in quantization_github.pptx? And have you analyzed the time spent on the quantize, dequantize, and requantize ops? Thank you
python/mxnet/quantization.py
    return hist


def _get_optimal_threshold(arr, num_bins=8001, num_quantized_bins=255):
Can you provide pointers that explain the significance of num_bins and num_quantized_bins? How are they used to compute the thresholds?
- num_quantized_bins represents the number of values in the int8 range. If we wanted to use 4 bits for quantized values, num_quantized_bins would be 15.
- num_bins: I tried different numbers of bins from 500 to 40,000; it has little effect on the optimal thresholds, so I picked a value in between. Too small a value might not be suitable considering that the tensors are large, and too large a value leads to more compute time for the KL divergence. Here is a good article explaining rules for choosing the number of bins: http://www.statisticshowto.com/choose-bin-sizes-statistics/
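For readers who want to see how the two parameters fit together, here is a rough, self-contained sketch (my own illustration, not the PR's exact code, which additionally smooths the distributions): the activations are histogrammed into num_bins bins, candidate thresholds are swept outward in steps of num_quantized_bins, and the threshold whose re-binned (quantized) distribution stays closest in KL divergence to the clipped reference distribution wins. The function name and the scipy dependency are assumptions for illustration only.

import numpy as np
from scipy import stats

def sketch_optimal_threshold(arr, num_bins=8001, num_quantized_bins=255):
    """Pick the |threshold| minimizing KL(P || Q), where P is the clipped
    histogram of arr and Q is P merged into num_quantized_bins bins."""
    max_abs = float(np.abs(arr).max())
    hist, edges = np.histogram(arr, bins=num_bins, range=(-max_abs, max_abs))
    best_kl, best_threshold = np.inf, max_abs
    # Candidate windows are centered and widened in steps of num_quantized_bins.
    for width in range(num_quantized_bins, num_bins + 1, 2 * num_quantized_bins):
        lo = (num_bins - width) // 2
        hi = lo + width
        p = hist[lo:hi].astype(np.float64)
        p[0] += hist[:lo].sum()    # clip left outliers into the edge bin
        p[-1] += hist[hi:].sum()   # clip right outliers into the edge bin
        # Build Q: merge p into num_quantized_bins groups and spread each
        # group's mass uniformly over its non-empty source bins.
        groups = p.reshape(num_quantized_bins, -1)
        merged = groups.sum(axis=1)
        nonzero = groups != 0
        counts = nonzero.sum(axis=1)
        q = np.where(nonzero, (merged / np.maximum(counts, 1))[:, None], 0.0).ravel()
        if q.sum() == 0:
            continue
        kl = stats.entropy(p / p.sum(), q / q.sum())  # KL(P || Q)
        if kl < best_kl:
            best_kl, best_threshold = kl, float(edges[hi])
    return best_threshold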
Thanks for the explanation!
Hi @reminisce, may I ask a few questions:
@wentingj The quantized ops used in the benchmarks are convolution, fully-connected, avg_pooling, max_pooling, and flatten. The quantize, dequantize, and requantize ops each take up about 5-10% of the runtime per epoch.
From the perspective of a resource-constrained edge device, could there be different optimal values? What platform factors could play an important role here?
@marcoabreu The optimal values are determined by the calibration datasets, so they are independent of platforms. So long as a platform supports basic int8 addition and multiplication, it would be able to run quantized models. We would of course need to write dedicated int8 operators for a specific platform; the current implementation only works on Nvidia GPUs with the dp4a instruction.
Oh sorry, this question was targeted towards the
@marcoabreu Oh, I see. Since calibration is conducted offline, it's not constrained by the hardware resources of edge devices. I believe there is an optimal value of num_bins for each layer; it could become a hyperparameter for users to tune.
Ah, that sounds great! Thanks for the explanation
python/mxnet/quantization.py
@@ -0,0 +1,467 @@
# Licensed to the Apache Software Foundation (ASF) under one
Put this in contrib
Good catch. I will do that.
Hello Jun, I have created a slave at http://jenkins.mxnet-ci.amazon-ml.com/computer/mxnet-linux-p3-gpu10/. The label is 'mxnetlinux-gpu-p3'. You can create a job in the Jenkinsfile and set
Note: This slave is entirely experimental and I had no chance to validate it, but feel free to play around with it.
Reviewers: Please do NOT merge this PR as long as it contains
Thanks @marcoabreu for setting up the testing environment for the PR. I will try to run the tests on it.
    assert cond == 0

    check_quantized_pooling((3, 4, 56, 56), (3, 3), 'max', (0, 0), (2, 2), False)
    check_quantized_pooling((3, 4, 56, 56), (3, 3), 'max', (0, 0), (2, 2), True)
When global_pool is set to True, why check stride > 1?
stride is not used for shape inference in global pooling. It's just a dummy parameter.
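A quick way to see this (a small check using the standard mx.sym.Pooling symbol; the shapes are just the ones from the test above): with global_pool=True the output spatial dims collapse to 1x1 regardless of the stride value passed.

import mxnet as mx

data = mx.sym.Variable('data')
pool = mx.sym.Pooling(data=data, kernel=(3, 3), stride=(2, 2), pool_type='max',
                      global_pool=True)
# Output shape is (3, 4, 1, 1) no matter which stride is passed in.
print(pool.infer_shape(data=(3, 4, 56, 56))[1])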
[](const NodeAttrs& attrs) {
  return std::vector<ResourceRequest>(1, ResourceRequest::kTempSpace);
})
.set_attr<FNeedRequantize>("FNeedRequantize", [](const NodeAttrs& attrs) { return true; })
The MKL-DNN int8 convolution API supports s8 and u8 output besides s32; it can shrink the range inside the API, so we may add a switch here after adding CPU support.
That might be difficult to do, since we don't know whether the output type of a quantized op is int8 or int32 when quantizing its FP32 version, and we don't know whether the op is going to run on GPU or CPU. We just assume it's int32 and do the requantization later. We should think about how to distinguish quantized ops for CPU and GPU.
I have two questions regarding the shrinking MKL-DNN does internally:
- How does it choose the thresholds for shrinking? We find the thresholds are essential for the final inference accuracy; that's why we introduced the calibration stage.
- What's the time difference between a quantized conv with int8 output and a quantized conv with int32 output plus requantizing it to int8?
We can first focus on keeping the inference accuracy under control for CPU and then think about optimizing the flow.
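To make the int32-versus-int8 output discussion concrete, here is a minimal numpy sketch of the requantization step being compared (my own illustration, not MXNet's actual requantize kernel): the quantized conv/FC emits int32 accumulations together with the float range they represent, and requantize rescales them into int8 using the calibrated output threshold.

import numpy as np

def requantize(data_int32, in_float_range, out_float_range):
    """data_int32 represents floats in [-in_float_range, in_float_range] mapped
    onto the full int32 range; re-express it as int8 over the calibrated range
    [-out_float_range, out_float_range]."""
    int32_max = float(np.iinfo(np.int32).max)
    # Recover approximate float values, then map them onto the int8 range.
    float_vals = data_int32.astype(np.float64) * (in_float_range / int32_max)
    scaled = float_vals * (127.0 / out_float_range)
    return np.clip(np.rint(scaled), -127, 127).astype(np.int8)

# Example: int32 conv output whose calibrated output threshold is 6.0.
conv_out = np.array([2_000_000_000, -1_500_000_000, 3_000_000], dtype=np.int32)
print(requantize(conv_out, in_float_range=20.0, out_float_range=6.0))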
I agree we can focus on the accuracy first and then the optimized flow :)
@wentingj will answer your other questions later.
@wentingj ping
0e2b772 to 07e205c
Hey @reminisce, looking forward to this one on the edge team (if you can't tell). If you're going to test on the p3 instance I recommend cherry-picking this commit: #9684.
@reminisce and all, this is an awesome PR 👍 Our team (@wentingj @jinhuang415) is also working on an INT8 solution based on the MKL-DNN library. I have uploaded a slide deck introducing the overview and status of our solution: Intel INT8 Solution for MXNet.pptx. Feel free to let us know your questions, comments, and suggestions.
@KellenSunderland Thank you for the note. I will either cherry-pick the PR or rebase with master once your PR is merged.
@pengzhao-intel Thank you guys for implementing quantized ops for CPU computing. We look forward to seeing and benchmarking the implementation. I propose that your team work on top of this PR and submit a separate PR of your work after this one is merged. This is already a big PR (>3000 lines of code), and adding more code would make the review process overwhelming. Please also know that we still need to wait for the P3 instances in the CI to be officially ready to fully test this PR.
@marcoabreu It looks like the cuDNN version (5.0) is too low for building the quantization implementation. Do we have a plan to upgrade the lib?
Feel free to create a new dockerfile that uses a CUDA 9-based layer as its base.
@reminisce It makes sense. We will submit a new PR for the CPU implementation. If there are big design or code changes before this PR is merged, please kindly let us know (maybe write a simple summary) so we can adjust our local code. We will share more info on CPU accuracy and performance later.
@pengzhao-intel I will definitely let you know if there are breaking changes. For testing inference, you can use the script
@reminisce btw, is there a time schedule for merging?
For CI, the plan is to add P3 slaves by the end of February.
49f1bcf to ef22953
@@ -0,0 +1 @@
../image-classification/common |
Is this a symlink? Does it work on Windows?
Has this been addressed?
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate a calibrated quantized model from a FP32 model')
    parser.add_argument('--model', type=str, required=True,
                        help='currently only supports imagenet1k-resnet-152 or imagenet1k-inception-bn')
Consider using the choices option for argparse: https://docs.python.org/2/library/argparse.html#choices
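For example (a small sketch of the suggestion; the two model names are taken from the help string in the reviewed snippet):

import argparse

parser = argparse.ArgumentParser(
    description='Generate a calibrated quantized model from a FP32 model')
parser.add_argument('--model', type=str, required=True,
                    choices=['imagenet1k-resnet-152', 'imagenet1k-inception-bn'],
                    help='model to be quantized')
# argparse now rejects unsupported model names with a clear error message.
args = parser.parse_args(['--model', 'imagenet1k-resnet-152'])
print(args.model)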
include/mxnet/c_api.h
@@ -1237,8 +1237,28 @@ MXNET_DLL int MXSymbolInferType(SymbolHandle sym,
                                const int **aux_type_data,
                                int *complete);


MXNET_DLL int MXQuantizeSymbol(SymbolHandle sym_handle,
Missing documentation for this function?
include/mxnet/op_attr_types.h
@@ -261,6 +260,10 @@ using FInferStorageType = std::function<bool (const NodeAttrs& attrs,
                                              std::vector<int>* in_attrs,
                                              std::vector<int>* out_attrs)>;

using FQuantizedOp = std::function<nnvm::NodePtr (const NodeAttrs& attrs)>;
Missing Doc?
def _get_optimal_thresholds(nd_dict, num_bins=8001, num_quantized_bins=255, logger=None):
    """Given a ndarray dict, find the optimal threshold for quantizing each value of the key."""
    if stats is None:
Is it better to put this check inside _get_optimal_threshold?
python/mxnet/contrib/quantization.py
label_name : str
    Label name required for creating a Module object to run forward propagation on the
    calibration dataset.
logger : Object
Add doc for the output?
Returns
-------
xxx
xxx
  return Min(Abs(static_cast<float>(a)), Abs(static_cast<float>(b)));
}

#if 0
Is this not used?
ecc8466 to 3087be6
83a7041 to 7be4936
LGTM, awesome work, Jun! Thanks a lot for the great collaboration
* [Quantization] 8bit Quantization and GPU Support [Quantization] CuDNN 8bit quantized relu v0.1 [Quantization] CuDNN 8bit quantized max_pool v0.1 [Quantization] CuDNN 8bit quantized lrn v0.1 [Quantization] CuDNN 8bit quantized convolution v0.1 [Quantization] CuDNN 8bit quantized fully connected v0.1 [Quantization] Small fix [Quantization] Implement backward method [Quantization] Convolution backward method [Quantization] Add range for matmul and conv [Quantization] New types in ndarray.py [Quantization] 8bit conv works [Quantization] conv support multiple type [Quantization] matmul works now [Quantization] matmul works well [Quantization] efactor quantization operators [Quantization] Op: quantize_down_and_shrink_range [Quantization] Complete quantize_graph_pass [Quantization] Add example [Quantization] Take zero-center quantize, accuracy fixed [Quantization] Multiple layers MLP pass [Quantization] Make quantized_conv same as Convolution [Quantization] quantized_conv works [Quantization] Fix bug [Quantization] lenet works now [Quantization] Add quantized_flatten [Quantization] Quantized max pool works well [Quantization] Make quantized_conv support NHWC [Quantization] add max_pool [Quantization] add ignore_symbols [Quantization] Save change [Quantization] Reorganize tests, 8 layers resnet works on cifar [Quantization] Support for 'NHWC' max pool [Quantization] Support for 'NHWC' quantized max pool [Quantization] Fix speed of quantize_down_and_shrink_range [Quantization] script for resnet on imagenet [Quantization] refactor for quantize offline [Quantization] Fix infershape [Quantization] Update test [Quantization] Update example [Quantization] Fix build error * [Quantization] Add calibration flow and refactor code Rebase with dmlc/master Add quantize_down_and_shrink by threshold Don't assign resource when threshold is available for quantize_down_and_shrink Fix quantize_down_and_shrink saturation Implement pass for setting calib table to node attrs Rebase with upstream master Change threshold to min/max quantized params Add c-api for setting calib table to graph Add calibration front end function Bug fixes and add unit test Add data iter type to calibration Fix bug in calibrate_quantized_model Bug fix and add example Add the second calibration approach and benchmark Fix Fix infer error and add benchmark for conv Add benchmark script Change output names and argument names Remove commented out code Change name Add layout to benchmark_convolution Remove redundant comment Remove common and add soft link More fix and benchmark Add scripts to plot images Minor fix More fix More fix and util tools Tools and support bias in quantized_conv2d Add script for getting the optimal thresholds using kl divergence Add kl divergence for optimizing thresholds Add benchmark scripts Fix compile after rebasing on master Allocate temp space only once for quantized_conv2d Change quantize_down_and_shrink_range to allocate temp space once No temp space for calib model Refactor quantize_down_and_shrink_range into requantize Refactor quantized convolution using nnvm interfaces Fix quantized_conv bug Use ConvolutionParam for QuantizedCuDNNConvOp Refactor quantized fc using nnvm interfaces Change TQuantizationNeedShrink to FNeedRequantize Refactor quantized_pooling Simplify FQuantizedOp interface Better naming Fix shape and type inference for quantized_flatten Clean up quantization frontend APIs and examples Delete quantized lrn and relu Add python script for generating quantized models Add script for running inference 
Add inference example Remove redundant files from example/quantization Simplify user-level python APIs Add logger Improve user-level python api Fix coding style Add unit test for quantized_conv Fix bugs in quantized_fully_connected and add unit test Add unit test for requantize Fix a bug and add python api unit tests Import test_quantization in test_operator_gpu.py Rebase with master Remove redundant files Fix test case for python3 and fix doc Fix unit tests Fix unit tests for python3 Release used ndarrays in calibration for saving memory usage Simplify releasing memory of used ndarrays for calibration Fix a bug Revert "Fix a bug" This reverts commit f7853f2. Revert "Simplify releasing memory of used ndarrays for calibration" This reverts commit 70b9e38. Clean up benchmark script and improve example Add API and example documentation and fix bugs Remove redundant test file and improve error message Merge quantize and dequantize with master impl Remove commented code Hide monitor interface from users Remove interface from Module Add license header Move quantization unittests to a separate folder so that it can be only run on P3 instances Remove quantization unittests from test_operator_gpu.py Move quantization to contrib Fix lint Add mxnetlinux-gpu-p3 to jenkins Fix jenkins Fix CI build Fix CI Update jenkins file Use cudnn7 for ci Add docker file for quantization unit test only Correctly skip build with cudnn < 6 Add doc for quantize symbol api Fix lint Fix python3 and add doc Try to fix cudnn build problem * Fix compile error * Fix CI * Remove tests that should not run on P3 * Remove unnecessary docker file * Fix registering quantized nn ops * Reformat Jenkinsfile and switch quantization to CUDA 9 (apache#9) * Address interface change cr * Address comments and fix bugs * Make unit test stable * Improve unit test * Address cr * Address cr * Fix flaky unit test layer_norm * Fix doc
Thank you very much for sharing the INT8 quantization implementation. Choosing the calibration dataset is clearly very important; in our project we use entropy calibration to find the thresholds. We have two detection models, one calibrated with a suitable dataset and the other with an ugly dataset, and the accuracy loss is quite different.
Ugly dataset:
Could you please add a quantized version of depthwise convolution so that MobileNet V1 and V2 can be quantized? Maybe using cuDNN's group convolution? Thank you.
@JingrenChen Thanks for the proposal. Currently, we don't have plans to add ops using cuDNN because it was found to lack a performance advantage over the FP32 version. In the long term, we may consider adding optimized op kernels generated by TVM. Nevertheless, we still welcome community contributions of more quantized ops using cuDNN and MKL-DNN. Please feel free to submit a PR if you would like to.
@BUG1989 Sorry I didn't notice your message earlier. Thanks for sharing the results. Could you clarify the meaning of the numbers in each row and of the suitable/ugly datasets? If you are interested in discussing further, shall we create an issue ticket and continue there to avoid spamming other subscribers?
Description
This PR implements model quantization by adopting TensorFlow's quantization approach, combined with a calibration step whose idea is borrowed from Nvidia's TensorRT. The focus of this work is on keeping the inference accuracy loss of quantized models (ConvNets for now) under control when compared to their corresponding FP32 models. It also provides a framework in MXNet for easily plugging in high-performance operators for low-bit computation generated with TVM.
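As a small conceptual illustration of the zero-centered, threshold-based INT8 scheme this PR implements (plain numpy, my own sketch rather than the PR's operators): float values within a calibrated [-threshold, threshold] range are mapped to int8 and back.

import numpy as np

def quantize(x, threshold):
    """Map float32 values in [-threshold, threshold] to int8."""
    scale = 127.0 / threshold
    return np.clip(np.rint(x * scale), -127, 127).astype(np.int8)

def dequantize(q, threshold):
    """Map int8 values back to approximate float32 values."""
    return q.astype(np.float32) * (threshold / 127.0)

x = np.random.uniform(-3, 3, size=(2, 3)).astype(np.float32)
x_hat = dequantize(quantize(x, threshold=3.0), threshold=3.0)
print(np.max(np.abs(x - x_hat)))  # error is bounded by ~threshold / 254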
This is a joint work of @ZihengJiang and @reminisce.
Details
Please see the following slides for more details on implementation and benchmark results.
quantization_github.pptx
Code Structure
- src/operator/quantization/ contains quantized operators, the quantization and calibration flow, and quantization util functions.
- python/mxnet/quantization.py contains one user API for generating quantized models from FP32 models (a hypothetical usage sketch follows this list).
- example/quantization/ contains examples of generating quantized models and using quantized models for inference.
- tests/python/quantization/ contains unit tests for quantization.
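Below is a hypothetical end-to-end usage sketch. The module path, the function name quantize_model, and its argument names are illustrative assumptions about the user-level API listed above, not its exact signature; please refer to the scripts in example/quantization/ for the real entry points.

import mxnet as mx
from mxnet.contrib import quantization  # assumed location of the user API

# Load the FP32 model to be quantized.
sym, arg_params, aux_params = mx.model.load_checkpoint('imagenet1k-resnet-152', 0)

# A small calibration dataset drawn from the validation set.
calib_data = mx.io.ImageRecordIter(path_imgrec='data/val_256_q90.rec',
                                   batch_size=32, data_shape=(3, 224, 224))

# Hypothetical call: quantize the symbol/params and calibrate the thresholds
# with the KL-divergence ('entropy') approach described in this PR.
qsym, qarg_params, qaux_params = quantization.quantize_model(
    sym, arg_params, aux_params,
    ctx=mx.gpu(0),            # dp4a-capable GPU
    calib_mode='entropy',
    calib_data=calib_data,
    num_calib_examples=500)

mx.model.save_checkpoint('imagenet1k-resnet-152-quantized', 0,
                         qsym, qarg_params, qaux_params)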
Notes
- Quantized convolution is currently implemented by calling cudnnConvolutionForward and uses the NHWC layout for INT8 data. In addition, we have noticed that even without transposing data layouts, the INT8 convolution in NHWC is slower than the FP32 convolution in NCHW for big images such as (64, 56, 56). In the future, we hope to leverage the strength of TVM to generate high-performance INT8 operators to replace the current implementation of calling cuDNN for quantized convolution.
- The unit tests are placed under tests/python/quantization because they need a P3 instance to run. @marcoabreu is working on setting up the testing environment. Once it's done, we will submit the unit tests under that folder to a different label from the commonly used one.

We would like to thank all the following people for discussion, suggestions, providing datasets, and guidance on configuring examples: @mli @piiswrong @zhreshold @astonzhang @szha @eric-haibin-lin @srochel @madjam @bhavinthaker @marcoabreu
We would appreciate everyone's effort in reviewing this PR.
@cjolivier01 @anirudh2290 @rahul003