[MXNET-133] Model Quantization with Calibration #9552
Conversation
hi @ZihengJiang, which layers use the quantized version to produce the accuracy numbers in quantization_github.pptx? And have you analyzed the time spent on the quantize, dequantize, and requantize ops? Thank you
python/mxnet/quantization.py
    return hist


def _get_optimal_threshold(arr, num_bins=8001, num_quantized_bins=255):
Can you provide pointers that explain the significance of num_bins and num_quantized_bins? How are they used to compute the thresholds?
- num_quantized_bins represents the number of values in the int8 range. If we wanted to use 4 bits for quantized values, num_quantized_bins would be 15.
- num_bins: I tried different numbers of bins from 500 to 40,000; it has little effect on the optimal thresholds, so I picked a value in between. Too small a value might not be suitable considering that the tensors are large, and too large a value leads to more compute time for the KL divergence. Here is a good article explaining rules for choosing the number of bins: http://www.statisticshowto.com/choose-bin-sizes-statistics/
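For readers who want to see how the two parameters fit together, here is a rough, self-contained sketch (my own illustration, not the PR's exact code, which additionally smooths the distributions): the activations are histogrammed into num_bins bins, candidate thresholds are swept outward in steps of num_quantized_bins, and the threshold whose re-binned (quantized) distribution stays closest in KL divergence to the clipped reference distribution wins. The function name and the scipy dependency are assumptions for illustration only.

import numpy as np
from scipy import stats

def sketch_optimal_threshold(arr, num_bins=8001, num_quantized_bins=255):
    """Pick the |threshold| minimizing KL(P || Q), where P is the clipped
    histogram of arr and Q is P merged into num_quantized_bins bins."""
    max_abs = float(np.abs(arr).max())
    hist, edges = np.histogram(arr, bins=num_bins, range=(-max_abs, max_abs))
    best_kl, best_threshold = np.inf, max_abs
    # Candidate windows are centered and widened in steps of num_quantized_bins.
    for width in range(num_quantized_bins, num_bins + 1, 2 * num_quantized_bins):
        lo = (num_bins - width) // 2
        hi = lo + width
        p = hist[lo:hi].astype(np.float64)
        p[0] += hist[:lo].sum()    # clip left outliers into the edge bin
        p[-1] += hist[hi:].sum()   # clip right outliers into the edge bin
        # Build Q: merge p into num_quantized_bins groups and spread each
        # group's mass uniformly over its non-empty source bins.
        groups = p.reshape(num_quantized_bins, -1)
        merged = groups.sum(axis=1)
        nonzero = groups != 0
        counts = nonzero.sum(axis=1)
        q = np.where(nonzero, (merged / np.maximum(counts, 1))[:, None], 0.0).ravel()
        if q.sum() == 0:
            continue
        kl = stats.entropy(p / p.sum(), q / q.sum())  # KL(P || Q)
        if kl < best_kl:
            best_kl, best_threshold = kl, float(edges[hi])
    return best_threshold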
Thanks for the explanation!
Hi @reminisce, may I ask a few questions:
@wentingj The quantized ops used in the benchmarks are convolution, fully-connected, avg_pooling, max_pooling, and flatten. The quantize, dequantize, and requantize ops each take up about 5-10% of the runtime per epoch.
From the perspective of a resource-constrained edge device, could there be different optimal values? What platform factors could play an important role here?
@marcoabreu The optimal values are determined by the calibration datasets, so they are independent of platforms. So long as a platform supports basic int8 addition and multiplication, it would be able to run quantized models. We would of course need to write dedicated int8 operators for a specific platform; the current implementation only works on Nvidia GPUs with the dp4a instruction.
Oh sorry, this question was targeted towards the
@marcoabreu Oh, I see. Since calibration is conducted offline, it's not constrained by the hardware resources of edge devices. I believe there is an optimal value of num_bins for each layer; it could become a hyperparameter for users to tune.
Ah, that sounds great! Thanks for the explanation
python/mxnet/quantization.py
@@ -0,0 +1,467 @@
# Licensed to the Apache Software Foundation (ASF) under one
Put this in contrib
Good catch. I will do that.
Hello Jun, I have created a slave at http://jenkins.mxnet-ci.amazon-ml.com/computer/mxnet-linux-p3-gpu10/. The label is 'mxnetlinux-gpu-p3'. You can create a job in the Jenkinsfile and set
Note: This slave is entirely experimental and I had no chance to validate it, but feel free to play around with it.
Reviewers: Please do NOT merge this PR as long as it contains
Thanks @marcoabreu for setting up the testing environment for the PR. I will try to run the tests on it.
    assert cond == 0

    check_quantized_pooling((3, 4, 56, 56), (3, 3), 'max', (0, 0), (2, 2), False)
    check_quantized_pooling((3, 4, 56, 56), (3, 3), 'max', (0, 0), (2, 2), True)
When global_pool is set to True, why check stride > 1?
stride is not used for shape inference in global pooling. It's just a dummy parameter.
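A quick way to see this (a small check using the standard mx.sym.Pooling symbol; the shapes are just the ones from the test above): with global_pool=True the output spatial dims collapse to 1x1 regardless of the stride value passed.

import mxnet as mx

data = mx.sym.Variable('data')
pool = mx.sym.Pooling(data=data, kernel=(3, 3), stride=(2, 2), pool_type='max',
                      global_pool=True)
# Output shape is (3, 4, 1, 1) no matter which stride is passed in.
print(pool.infer_shape(data=(3, 4, 56, 56))[1])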
[](const NodeAttrs& attrs) {
  return std::vector<ResourceRequest>(1, ResourceRequest::kTempSpace);
})
.set_attr<FNeedRequantize>("FNeedRequantize", [](const NodeAttrs& attrs) { return true; })
The MKL-DNN int8 convolution API supports s8 and u8 output besides s32; it can shrink the range inside the API, so we may add a switch here after adding CPU support.
That might be difficult to do, since we don't know whether the output type of a quantized op is int8 or int32 when quantizing its FP32 version, and we don't know whether the op is going to run on GPU or CPU. We just assume it's int32 and do the requantization later. We should think about how to distinguish quantized ops for CPU and GPU.
I have two questions regarding the shrinking MKL-DNN does internally:
- How does it choose the thresholds for shrinking? We find the thresholds are essential for the final inference accuracy; that's why we introduced the calibration stage.
- What's the time difference between a quantized conv with int8 output and a quantized conv with int32 output plus requantizing it to int8?
We can first focus on keeping the inference accuracy under control for CPU and then think about optimizing the flow.
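To make the int32-versus-int8 output discussion concrete, here is a minimal numpy sketch of the requantization step being compared (my own illustration, not MXNet's actual requantize kernel): the quantized conv/FC emits int32 accumulations together with the float range they represent, and requantize rescales them into int8 using the calibrated output threshold.

import numpy as np

def requantize(data_int32, in_float_range, out_float_range):
    """data_int32 represents floats in [-in_float_range, in_float_range] mapped
    onto the full int32 range; re-express it as int8 over the calibrated range
    [-out_float_range, out_float_range]."""
    int32_max = float(np.iinfo(np.int32).max)
    # Recover approximate float values, then map them onto the int8 range.
    float_vals = data_int32.astype(np.float64) * (in_float_range / int32_max)
    scaled = float_vals * (127.0 / out_float_range)
    return np.clip(np.rint(scaled), -127, 127).astype(np.int8)

# Example: int32 conv output whose calibrated output threshold is 6.0.
conv_out = np.array([2_000_000_000, -1_500_000_000, 3_000_000], dtype=np.int32)
print(requantize(conv_out, in_float_range=20.0, out_float_range=6.0))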
I agree we can focus on the accuracy first and then the optimized flow :)
@wentingj will answer your other questions later.
@wentingj ping
0e2b772 to 07e205c
Hey @reminisce, looking forward to this one on the edge team (if you can't tell). If you're going to test on the p3 instance I recommend cherry-picking this commit: #9684.
@reminisce and all, this is an awesome PR 👍 Our team (@wentingj @jinhuang415) is also working on an INT8 solution based on the MKL-DNN library. I have uploaded a slide deck introducing the overview and status of our solution: Intel INT8 Solution for MXNet.pptx. Feel free to let us know your questions, comments, and suggestions.
@KellenSunderland Thank you for the note. I will either cherry-pick the PR or rebase with master once your PR is merged.
@pengzhao-intel Thank you guys for implementing quantized ops for CPU computing. We look forward to seeing and benchmarking the implementation. I propose that your team work on top of this PR and submit a separate PR of your work after this one is merged. This is already a big PR (>3000 lines of code), and adding more code would make the review process overwhelming. Please also know that we still need to wait for the P3 instances in the CI to be officially ready to fully test this PR.
@marcoabreu It looks like the cuDNN version (5.0) is too low for building the quantization implementation. Do we have a plan to upgrade the lib?
Feel free to create a new dockerfile that uses a CUDA 9-based layer as its base.
@reminisce It makes sense. We will submit a new PR for the CPU implementation. If there are big design or code changes before this PR is merged, please kindly let us know (maybe write a simple summary) so we can adjust our local code. We will share more info on CPU accuracy and performance later.
@pengzhao-intel I will definitely let you know if there are breaking changes. For testing inference, you can use the script
@reminisce btw, is there a time schedule for merging?
For CI, the plan is to add P3 slaves by the end of February.
49f1bcf to ef22953
@@ -0,0 +1 @@
../image-classification/common |
Is this a symlink? Does it work on Windows?
Has this been addressed?
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Generate a calibrated quantized model from a FP32 model')
    parser.add_argument('--model', type=str, required=True,
                        help='currently only supports imagenet1k-resnet-152 or imagenet1k-inception-bn')
Consider using the choices option for argparse: https://docs.python.org/2/library/argparse.html#choices
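For example (a small sketch of the suggestion; the two model names are taken from the help string in the reviewed snippet):

import argparse

parser = argparse.ArgumentParser(
    description='Generate a calibrated quantized model from a FP32 model')
parser.add_argument('--model', type=str, required=True,
                    choices=['imagenet1k-resnet-152', 'imagenet1k-inception-bn'],
                    help='model to be quantized')
# argparse now rejects unsupported model names with a clear error message.
args = parser.parse_args(['--model', 'imagenet1k-resnet-152'])
print(args.model)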
include/mxnet/c_api.h
@@ -1237,8 +1237,28 @@ MXNET_DLL int MXSymbolInferType(SymbolHandle sym,
                                const int **aux_type_data,
                                int *complete);


MXNET_DLL int MXQuantizeSymbol(SymbolHandle sym_handle,
Missing documentation for this function?
include/mxnet/op_attr_types.h
@@ -261,6 +260,10 @@ using FInferStorageType = std::function<bool (const NodeAttrs& attrs,
                                              std::vector<int>* in_attrs,
                                              std::vector<int>* out_attrs)>;

using FQuantizedOp = std::function<nnvm::NodePtr (const NodeAttrs& attrs)>;
Missing Doc?
def _get_optimal_thresholds(nd_dict, num_bins=8001, num_quantized_bins=255, logger=None):
    """Given a ndarray dict, find the optimal threshold for quantizing each value of the key."""
    if stats is None:
Is it better to put this check inside _get_optimal_threshold?
python/mxnet/contrib/quantization.py
label_name : str
    Label name required for creating a Module object to run forward propagation on the
    calibration dataset.
logger : Object
Add doc for the output?
Returns
-------
xxx
xxx
  return Min(Abs(static_cast<float>(a)), Abs(static_cast<float>(b)));
}

#if 0
Is this not used?
ecc8466 to 3087be6
83a7041 to 7be4936
LGTM, awesome work, Jun! Thanks a lot for the great collaboration
* [Quantization] 8bit Quantization and GPU Support [Quantization] CuDNN 8bit quantized relu v0.1 [Quantization] CuDNN 8bit quantized max_pool v0.1 [Quantization] CuDNN 8bit quantized lrn v0.1 [Quantization] CuDNN 8bit quantized convolution v0.1 [Quantization] CuDNN 8bit quantized fully connected v0.1 [Quantization] Small fix [Quantization] Implement backward method [Quantization] Convolution backward method [Quantization] Add range for matmul and conv [Quantization] New types in ndarray.py [Quantization] 8bit conv works [Quantization] conv support multiple type [Quantization] matmul works now [Quantization] matmul works well [Quantization] efactor quantization operators [Quantization] Op: quantize_down_and_shrink_range [Quantization] Complete quantize_graph_pass [Quantization] Add example [Quantization] Take zero-center quantize, accuracy fixed [Quantization] Multiple layers MLP pass [Quantization] Make quantized_conv same as Convolution [Quantization] quantized_conv works [Quantization] Fix bug [Quantization] lenet works now [Quantization] Add quantized_flatten [Quantization] Quantized max pool works well [Quantization] Make quantized_conv support NHWC [Quantization] add max_pool [Quantization] add ignore_symbols [Quantization] Save change [Quantization] Reorganize tests, 8 layers resnet works on cifar [Quantization] Support for 'NHWC' max pool [Quantization] Support for 'NHWC' quantized max pool [Quantization] Fix speed of quantize_down_and_shrink_range [Quantization] script for resnet on imagenet [Quantization] refactor for quantize offline [Quantization] Fix infershape [Quantization] Update test [Quantization] Update example [Quantization] Fix build error * [Quantization] Add calibration flow and refactor code Rebase with dmlc/master Add quantize_down_and_shrink by threshold Don't assign resource when threshold is available for quantize_down_and_shrink Fix quantize_down_and_shrink saturation Implement pass for setting calib table to node attrs Rebase with upstream master Change threshold to min/max quantized params Add c-api for setting calib table to graph Add calibration front end function Bug fixes and add unit test Add data iter type to calibration Fix bug in calibrate_quantized_model Bug fix and add example Add the second calibration approach and benchmark Fix Fix infer error and add benchmark for conv Add benchmark script Change output names and argument names Remove commented out code Change name Add layout to benchmark_convolution Remove redundant comment Remove common and add soft link More fix and benchmark Add scripts to plot images Minor fix More fix More fix and util tools Tools and support bias in quantized_conv2d Add script for getting the optimal thresholds using kl divergence Add kl divergence for optimizing thresholds Add benchmark scripts Fix compile after rebasing on master Allocate temp space only once for quantized_conv2d Change quantize_down_and_shrink_range to allocate temp space once No temp space for calib model Refactor quantize_down_and_shrink_range into requantize Refactor quantized convolution using nnvm interfaces Fix quantized_conv bug Use ConvolutionParam for QuantizedCuDNNConvOp Refactor quantized fc using nnvm interfaces Change TQuantizationNeedShrink to FNeedRequantize Refactor quantized_pooling Simplify FQuantizedOp interface Better naming Fix shape and type inference for quantized_flatten Clean up quantization frontend APIs and examples Delete quantized lrn and relu Add python script for generating quantized models Add script for running inference 
Add inference example Remove redundant files from example/quantization Simplify user-level python APIs Add logger Improve user-level python api Fix coding style Add unit test for quantized_conv Fix bugs in quantized_fully_connected and add unit test Add unit test for requantize Fix a bug and add python api unit tests Import test_quantization in test_operator_gpu.py Rebase with master Remove redundant files Fix test case for python3 and fix doc Fix unit tests Fix unit tests for python3 Release used ndarrays in calibration for saving memory usage Simplify releasing memory of used ndarrays for calibration Fix a bug Revert "Fix a bug" This reverts commit f7853f2. Revert "Simplify releasing memory of used ndarrays for calibration" This reverts commit 70b9e38. Clean up benchmark script and improve example Add API and example documentation and fix bugs Remove redundant test file and improve error message Merge quantize and dequantize with master impl Remove commented code Hide monitor interface from users Remove interface from Module Add license header Move quantization unittests to a separate folder so that it can be only run on P3 instances Remove quantization unittests from test_operator_gpu.py Move quantization to contrib Fix lint Add mxnetlinux-gpu-p3 to jenkins Fix jenkins Fix CI build Fix CI Update jenkins file Use cudnn7 for ci Add docker file for quantization unit test only Correctly skip build with cudnn < 6 Add doc for quantize symbol api Fix lint Fix python3 and add doc Try to fix cudnn build problem * Fix compile error * Fix CI * Remove tests that should not run on P3 * Remove unnecessary docker file * Fix registering quantized nn ops * Reformat Jenkinsfile and switch quantization to CUDA 9 (apache#9) * Address interface change cr * Address comments and fix bugs * Make unit test stable * Improve unit test * Address cr * Address cr * Fix flaky unit test layer_norm * Fix doc
Thank you very much for sharing the INT8 quantization implementation. Choosing the calibration dataset is clearly very important; in our project we use entropy calibration to find the thresholds. We have two detection models, one calibrated with a suitable dataset and the other with an ugly dataset, and the accuracy loss is quite different.
Ugly dataset:
Could you please add a quantized version of depthwise convolution so that MobileNet V1 and V2 can be quantized? Maybe using cuDNN's group convolution? Thank you.
@JingrenChen Thanks for the proposal. Currently, we don't have plans to add ops using cuDNN because it was found to lack a performance advantage over the FP32 version. In the long term, we may consider adding optimized op kernels generated by TVM. Nevertheless, we still welcome community contributions of more quantized ops using cuDNN and MKL-DNN. Please feel free to submit a PR if you would like to.
@BUG1989 Sorry I didn't notice your message earlier. Thanks for sharing the results. Could you clarify the meaning of the numbers in each row and of the suitable/ugly datasets? If you are interested in discussing further, shall we create an issue ticket and continue there to avoid spamming other subscribers?
Description
This PR implements model quantization by adopting TensorFlow's quantization approach, combined with a calibration step whose idea is borrowed from Nvidia's TensorRT. The focus of this work is on keeping the inference accuracy loss of quantized models (ConvNets for now) under control when compared to their corresponding FP32 models. It also provides a framework in MXNet for easily plugging in high-performance operators for low-bit computation generated with TVM.
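As a small conceptual illustration of the zero-centered, threshold-based INT8 scheme this PR implements (plain numpy, my own sketch rather than the PR's operators): float values within a calibrated [-threshold, threshold] range are mapped to int8 and back.

import numpy as np

def quantize(x, threshold):
    """Map float32 values in [-threshold, threshold] to int8."""
    scale = 127.0 / threshold
    return np.clip(np.rint(x * scale), -127, 127).astype(np.int8)

def dequantize(q, threshold):
    """Map int8 values back to approximate float32 values."""
    return q.astype(np.float32) * (threshold / 127.0)

x = np.random.uniform(-3, 3, size=(2, 3)).astype(np.float32)
x_hat = dequantize(quantize(x, threshold=3.0), threshold=3.0)
print(np.max(np.abs(x - x_hat)))  # error is bounded by ~threshold / 254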
This is a joint work of @ZihengJiang and @reminisce.
Details
Please see the following slides for more details on implementation and benchmark results.
quantization_github.pptx
Code Structure
- src/operator/quantization/ contains quantized operators, the quantization and calibration flow, and quantization util functions.
- python/mxnet/quantization.py contains one user API for generating quantized models from FP32 models (a hypothetical usage sketch follows this list).
- example/quantization/ contains examples of generating quantized models and using quantized models for inference.
- tests/python/quantization/ contains unit tests for quantization.
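Below is a hypothetical end-to-end usage sketch. The module path, the function name quantize_model, and its argument names are illustrative assumptions about the user-level API listed above, not its exact signature; please refer to the scripts in example/quantization/ for the real entry points.

import mxnet as mx
from mxnet.contrib import quantization  # assumed location of the user API

# Load the FP32 model to be quantized.
sym, arg_params, aux_params = mx.model.load_checkpoint('imagenet1k-resnet-152', 0)

# A small calibration dataset drawn from the validation set.
calib_data = mx.io.ImageRecordIter(path_imgrec='data/val_256_q90.rec',
                                   batch_size=32, data_shape=(3, 224, 224))

# Hypothetical call: quantize the symbol/params and calibrate the thresholds
# with the KL-divergence ('entropy') approach described in this PR.
qsym, qarg_params, qaux_params = quantization.quantize_model(
    sym, arg_params, aux_params,
    ctx=mx.gpu(0),            # dp4a-capable GPU
    calib_mode='entropy',
    calib_data=calib_data,
    num_calib_examples=500)

mx.model.save_checkpoint('imagenet1k-resnet-152-quantized', 0,
                         qsym, qarg_params, qaux_params)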
Notes
- Quantized convolution is currently implemented by calling cudnnConvolutionForward and uses the NHWC layout for INT8 data. In addition, we have noticed that even without transposing data layouts, the INT8 convolution in NHWC is slower than the FP32 convolution in NCHW for big images such as (64, 56, 56). In the future, we hope to leverage the strength of TVM to generate high-performance INT8 operators to replace the current implementation of calling cuDNN for quantized convolution.
- The unit tests are placed under tests/python/quantization because they need a P3 instance to run. @marcoabreu is working on setting up the testing environment. Once it's done, we will submit the unit tests under that folder to a different label from the commonly used one.

We would like to thank all the following people for discussion, suggestions, providing datasets, and guidance on configuring examples: @mli @piiswrong @zhreshold @astonzhang @szha @eric-haibin-lin @srochel @madjam @bhavinthaker @marcoabreu
We would appreciate everyone's effort in reviewing this PR.
@cjolivier01 @anirudh2290 @rahul003