[RFC] Quantization Workflow #2259
Comments
Given that quantization is a broad topic, it would be great to list all the possible alternatives (asymmetric and symmetric, choices of bits), discuss the pros and cons of these alternatives, and how we can support them.
@ajtulloch do you mind elaborating on what is needed for asymmetric support, and its pros and cons?
This article is a systematic summary of current quantization: https://arxiv.org/pdf/1806.08342.pdf Usually asymmetry is introduced by activations, like ReLU6. What this paper does not cover is percentage quantization, which is not widely used today but is more reasonable. Percentage quantization is also a non-symmetric method.
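The symmetric/asymmetric distinction discussed above can be sketched in NumPy. This is an illustrative sketch, not code from the thread or from TVM; the function names are made up, and the asymmetric variant follows the usual affine scheme (scale plus zero point), which lets a one-sided activation range like ReLU6's `[0, 6]` use all quantization levels:

```python
import numpy as np

def quantize_symmetric(x, nbit=8):
    # Signed symmetric: zero point is fixed at 0, range is [-(2^(n-1)-1), 2^(n-1)-1].
    qmax = 2 ** (nbit - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_asymmetric(x, nbit=8):
    # Asymmetric affine: a zero point shifts the range, so one-sided
    # activations (e.g. ReLU6 outputs in [0, 6]) use all 2^n levels.
    qmax = 2 ** nbit - 1
    scale = (x.max() - x.min()) / qmax
    zero_point = int(np.round(-x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

x = np.random.uniform(0.0, 6.0, size=100).astype(np.float32)  # ReLU6-like range
q_s, s_s = quantize_symmetric(x)
q_a, s_a, zp = quantize_asymmetric(x)
```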
One thing we also want to discuss in detail is the mapping of the arithmetic in the quantized view onto normal operators that are already defined; here are some examples:

The general goal is to avoid creating a huge list of `quantized_xxx` ops, but instead to directly lower them to standard normal ops that operate on i8/i32.
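The lowering idea above can be illustrated with a NumPy sketch: a "quantized dense" expressed purely in standard ops (cast, matmul, shift, clip), with no dedicated `quantized_dense` op. The function name and the power-of-two requantization shift are illustrative assumptions, not part of the proposal:

```python
import numpy as np

def dense_i8_lowered(x_q, w_q, out_shift):
    # A hypothetical "quantized dense" expressed with standard ops only:
    # cast i8 -> i32, matmul accumulating in i32, then requantize by a
    # right shift (a power-of-two scale) back into the i8 range.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)   # i32 accumulator
    out = np.right_shift(acc, out_shift)                # requantize
    return np.clip(out, -127, 127).astype(np.int8)

x_q = np.random.randint(-127, 128, size=(4, 16), dtype=np.int8)
w_q = np.random.randint(-127, 128, size=(16, 8), dtype=np.int8)
y_q = dense_i8_lowered(x_q, w_q, out_shift=8)
```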
It depends on how many hardware backends are considered. E.g., if hardware is designed to execute with i8/u8 -> i16 -> u8 mixed inputs, then standard i8/i32 will not work for this hardware. i8/u8 is quite common in ISP-based quantization. I think that before making a decision on the API, it would be better to decide what kind of hardware will be covered.
The general goal is to be able to cover all the cases in the scaffolding, and optionally extend them. The current proposal does not exclude mixed signed/unsigned input, though. The simple pattern matching could be done as (i8 cast to i16) mul (u8 cast to i16), which can still be lowered to normal ops.
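The mixed-sign pattern described above, sketched in NumPy (illustrative only): both operands are widened to a common i16 type first, so the multiply itself is still a standard op, and any i8 × u8 product fits in i16 without overflow:

```python
import numpy as np

# Mixed i8/u8 inputs are each cast to i16 before multiplying, so the
# multiply is a plain i16 op. The extreme product -128 * 255 = -32640
# still fits in i16 (whose range is [-32768, 32767]).
a_i8 = np.array([-128, -1, 5, 127], dtype=np.int8)
b_u8 = np.array([255, 200, 3, 2], dtype=np.uint8)

prod_i16 = a_i8.astype(np.int16) * b_u8.astype(np.int16)
```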
Could you please check the definition of …
@grwlf I think you are right, cc @ZihengJiang
I think …
The sign bit affects the effective number of bits available at a given `nbit`, and the way people do clipping.
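A small illustration of the point above (a sketch; it assumes `nbit` counts the total bits including the sign bit): the sign bit costs one bit of magnitude, which is exactly what the clipping bounds must account for.

```python
def quant_range(nbit, signed):
    # Representable range for an nbit integer. With a sign bit, only
    # nbit - 1 bits are left for magnitude, so the clip bounds differ.
    if signed:
        return -(2 ** (nbit - 1)), 2 ** (nbit - 1) - 1
    return 0, 2 ** nbit - 1

# 8 bits: signed covers [-128, 127], unsigned covers [0, 255].
```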
X-posting my comment from the PR thread as it is more relevant here: |
@ZihengJiang I am currently playing around with some ways to implement mixed-precision quantization here. Currently it is a little tricky to give the user fine-grained control. Do you think something like tagging each annotation (e.g., with some id) and potentially giving QConfigNode a mapping of id -> quantization bit widths is reasonable? A simple first approach is to use the conv_counter as the id, but fancier things can also be done; this seems to be modular (orthogonal to the current PR).
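The id -> bit-width mapping idea in the comment above might look like the following sketch. Note this is hypothetical: `MixedPrecisionConfig`, `bit_map`, and `nbit_for` are made-up names, not the actual TVM QConfig API, and the id is assumed to be the conv counter as suggested.

```python
# Hypothetical sketch of mapping annotation ids to bit widths, using the
# conv counter as the id. Not the actual TVM QConfig API.
class MixedPrecisionConfig:
    def __init__(self, default_nbit=8, bit_map=None):
        self.default_nbit = default_nbit
        self.bit_map = bit_map or {}   # annotation id -> bit width

    def nbit_for(self, conv_id):
        # Fall back to the global default when a layer has no override.
        return self.bit_map.get(conv_id, self.default_nbit)

# First and fourth convs get custom widths; everything else uses 8 bits.
cfg = MixedPrecisionConfig(default_nbit=8, bit_map={0: 16, 3: 4})
```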
Closed by #2116. Let us open new threads for new discussions on supporting more data types.
Goal
Here are two feasible approaches to support running a quantized model with TVM:

- …
- Implement `quantize`/`dequantize` with TVM, then transform the quantized graph directly.

Actually, these two methods are not contradictory and we can achieve both. The issue is whether the second approach is necessary and worth the extra effort.
The problem is that different hardware may have different constraints: we may have different choices of bits, and some hardware may only support shifts. We also have multiple choices of quantization scheme, like symmetric, asymmetric, etc. We want to make this procedure easier and more flexible for hardware developers, based on Relay and TVM. Again, what we want to do is not to propose "the only right way to achieve quantization in TVM". What we want is a workflow that can be flexibly customized for different hardware and different quantization schemes. Signed symmetric quantization is just one demo of this workflow.
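The "hardware may only support shift" constraint mentioned above can be made concrete with a small sketch (illustrative; the function name is made up): a float rescaling factor is approximated by the nearest power of two, so the runtime rescale becomes a plain right shift.

```python
import math

def scale_to_shift(scale):
    # For hardware that only supports shifts, approximate a float
    # rescaling factor by the nearest power of two: scale ~= 2 ** -shift.
    shift = round(-math.log2(scale))
    approx = 2.0 ** (-shift)
    return shift, approx

# A scale near 1/256 becomes an 8-bit right shift.
shift, approx = scale_to_shift(0.0039)
```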
Current Design
The current quantization workflow is composed of three passes on the Relay IR.
Annotate
Given a float32 graph, it returns a graph that simulates the error introduced by the current quantization scheme. The implementation centers around a rewrite function for each operator and a `simulated_quantize` op. Let us review the definition of `simulated_quantize` first:

For every operator, an `AnnotateRewrite` function can be registered, which rewrites the operator in the original graph. For example, it will rewrite the subgraph `data->conv->` to `data->sq->conv->`. It can be overridden by users for different quantization schemes.

Calibrate
The `calibrate` procedure will try to determine the values of `dom_scale`, `nbit`, `clip_min`, and `clip_max` for every `simulated_quantize` operator. Currently, we use a quite naive approach, setting them to the upper/lower bounds that the default bit setting allows. There is a lot of room to explore how to set these fields more smartly.

Realize
The `realize` pass will transform the simulated quantized graph, which actually still computes with float32, into a real low-bit integer graph. It will replace `simulated_quantize` with several fine-grained operators like `add`, `multiply`, and `shift` as much as possible, for performance (fusion, etc.).

Demonstration
This workflow should be able to support different choices in terms of the number of bits and the quantization scheme; one just needs to override the registered `AnnotateRewrite` function for each operator. (TODO)

- Support for different numbers of bits
- Support for different quantization schemes
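The `simulated_quantize` semantics described in the Annotate section can be sketched as an illustrative NumPy model (the actual op definition lives in the Relay source; this is an approximation for exposition): scale down by `dom_scale`, round, clip to the integer range, then scale back up, all while still computing in float32. The bounds below follow the naive calibration described above, where `clip_min`/`clip_max` are simply what the default bit setting allows.

```python
import numpy as np

def simulated_quantize(data, dom_scale, clip_min, clip_max):
    # Simulates quantization error while still computing in float32:
    # scale down, round, clip to the integer range, then scale back up.
    scaled = data / dom_scale
    clipped = np.clip(np.round(scaled), clip_min, clip_max)
    return (clipped * dom_scale).astype(np.float32)

# Naive calibration for signed 8 bits: bounds are just what nbit allows.
nbit = 8
clip_min, clip_max = -(2 ** (nbit - 1) - 1), 2 ** (nbit - 1) - 1

x = np.linspace(-1.0, 1.0, 11).astype(np.float32)
y = simulated_quantize(x, dom_scale=1.0 / 127, clip_min=clip_min, clip_max=clip_max)
```

The Realize pass would then replace this float32 simulation with the equivalent integer ops (`multiply`, `shift`, `clip`), as described above.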