
[RFC] Quantization Workflow #2259

Closed · ZihengJiang opened this issue Dec 8, 2018 · 14 comments
ZihengJiang (Contributor) commented Dec 8, 2018

Goal

Here are two feasible approaches to support running quantized models with TVM:

  • Get quantized models from other frontend frameworks like TF, C2, MX. We need to add support for low-bit kernels and operators like quantize/dequantize in TVM, then transform the quantized graph directly.
  • Implement the quantization algorithm based on Relay, i.e., take over the quantization procedure with Relay. This approach also requires support for low-bit TVM kernels.

Actually, these two methods are not contradictory and we can achieve both. The question is whether the second approach is necessary and worth the extra effort.

The problem is that different hardware targets may have different constraints: we may have different choices of bits, and some hardware may only support shifts. We also have multiple choices of quantization scheme, like symmetric, asymmetric, etc. We want to make this procedure easier and more flexible for hardware developers, based on Relay and TVM. Again, what we want is not to propose "the only right way to achieve quantization in TVM". What we want is a workflow that can be flexibly customized for different hardware targets and different quantization schemes; signed symmetric quantization is just one demo of this workflow.

Current Design

The current quantization workflow is composed of three passes on the Relay IR.

Annotate

Given a float32 graph, this pass returns a graph that simulates the error introduced by the current quantization scheme. The implementation centers around a rewrite function for each operator and a simulated quantize op. Let us review the definition of simulated_quantize first:

def simulated_quantize(data, dom_scale, nbit, clip_min, clip_max, sign=True, rounding='round'):
    """Simulate the rounding error and saturation error."""
    scaled_data = data / dom_scale
    # select round scheme `round`/`floor`/`ceil`/`statistical_round` according to attribute `rounding`
    round_data = round(scaled_data)
    clipped_data = clip(round_data, clip_min, clip_max)
    # recover the data in the original float domain
    ret_data = clipped_data * dom_scale
    return ret_data
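
As a usage illustration, here is a minimal NumPy sketch of the same computation on a concrete tensor, assuming an 8-bit signed setting; the helper name and the constants are illustrative only, not part of the actual op:

import numpy as np

def simulated_quantize_np(data, dom_scale, clip_min, clip_max):
    scaled = data / dom_scale
    rounded = np.round(scaled)                  # rounding='round'
    clipped = np.clip(rounded, clip_min, clip_max)
    # the result is still float32 but carries the quantization error
    return (clipped * dom_scale).astype("float32")

x = np.array([0.013, -0.4, 1.7], dtype="float32")
print(simulated_quantize_np(x, dom_scale=1 / 64.0, clip_min=-127, clip_max=127))
# -> [ 0.015625 -0.40625   1.703125]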

Every operator can register an AnnotateRewrite function, which rewrites that operator in the original graph. For example, it will rewrite a subgraph data -> conv into data -> sq -> conv. It can be overridden by users for different quantization schemes (see the override sketch after the example below).

# a naive pseudo example of registering a rewrite function for conv2d
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite(ref_call, new_args, ctx):
    lhs, rhs = new_args
    # attach a simulated_quantize op to both the data and the weight
    lhs = attach_simulated_quantize(lhs, sign=True, rounding='round')
    rhs = attach_simulated_quantize(rhs, sign=True, rounding='round')
    return conv2d(lhs, rhs, ref_call.attrs)
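
As a hedged illustration of the override point, a user could register a different rewrite that keeps the weight signed but annotates the activation as unsigned (e.g. after ReLU); attach_simulated_quantize and the decorator are the same helpers as in the pseudo example above, and sign=False is only assumed to be supported:

# hypothetical override: unsigned activation, signed symmetric weight
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite_unsigned_act(ref_call, new_args, ctx):
    data, weight = new_args
    # activations after ReLU are non-negative, so an unsigned range wastes no bit
    data = attach_simulated_quantize(data, sign=False, rounding='round')
    weight = attach_simulated_quantize(weight, sign=True, rounding='round')
    return conv2d(data, weight, ref_call.attrs)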

Calibrate

The calibrate procedure tries to determine the values of dom_scale, nbit, clip_min, clip_max for every simulated_quantize operator. Currently, we use a quite naive approach: setting them to the upper/lower bounds that the default bit setting allows (a sketch follows below). There is a lot of room to explore how to set these fields more intelligently.
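
For concreteness, a hedged sketch of that naive bound setting under a signed symmetric scheme; naive_calibrate and the max_abs default are illustrative names and values, not the actual TVM code:

def naive_calibrate(nbit, sign=True, max_abs=8.0):
    # clip bounds come purely from the bit width, i.e. the upper/lower
    # bound that the default bit setting allows
    if sign:
        clip_max = 2 ** (nbit - 1) - 1   # e.g. 127 for nbit=8
        clip_min = -clip_max
    else:
        clip_max = 2 ** nbit - 1         # e.g. 255 for nbit=8
        clip_min = 0
    # dom_scale maps an assumed dynamic range onto the integer range;
    # max_abs=8.0 is a placeholder, not a framework default
    dom_scale = max_abs / clip_max
    return dom_scale, clip_min, clip_max

print(naive_calibrate(8))   # (0.06299..., -127, 127)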

Realize

The realize pass transforms the simulated quantized graph, which still computes with float32, into a real low-bit integer graph. It replaces each simulated_quantize with fine-grained operators like add, multiply, and shift as much as possible for performance (fusion, etc.).
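
To make the lowering concrete, here is a hedged NumPy sketch (not the actual pass output) of what a realized data -> sq -> conv fragment boils down to: the simulated_quantize becomes multiply, round, clip, and cast on the storage type, and the conv-like multiply accumulates in int32:

import numpy as np

def quantize_int8(data_fp32, dom_scale):
    # realized form of simulated_quantize on an input tensor:
    # scale, round, saturate, then cast to the int8 storage type
    q = np.round(data_fp32 / dom_scale)
    return np.clip(q, -127, 127).astype("int8")

x = np.random.randn(6).astype("float32")
w = np.random.randn(6).astype("float32")
xq = quantize_int8(x, 1 / 16.0)
wq = quantize_int8(w, 1 / 64.0)
# the multiply runs on int8 inputs with an int32 accumulator; its result
# lives in the (1/16) * (1/64) scale domain, and a later multiply/shift
# would bring it back to the output scale
acc = xq.astype("int32") * wq.astype("int32")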

Demonstration

This workflow should be able to support different choices in terms of number of bits and quantization scheme; one only needs to override the registered AnnotateRewrite function for each operator.

(TODO)

Support for different bits

  • i8->i32
  • i16->i32
  • i8->i24
  • i5->i16

Support for different quantization schemes

  • Symmetric
  • Asymmetric
  • Channel-wise Scale

tqchen changed the title from "[RFC] A Low-bit Quantization Workflow Can be Flexibly Customized" to "[RFC] Quantization Workflow" on Dec 8, 2018
tqchen (Member) commented Dec 8, 2018

Given that quantization is a broad topic, it would be great to list all the possible alternatives (asymmetric and symmetric, choices of bits), discuss their pros and cons, and how we can support them.

tqchen (Member) commented Dec 10, 2018

@ajtulloch do you mind elaborating on what is needed for asymmetric support and its pros and cons?

antinucleon (Contributor) commented:

This article is a systematic summary of current quantization approaches: https://arxiv.org/pdf/1806.08342.pdf

Usually asymmetry is introduced by activations, like ReLU6.

What this paper does not cover is percentage quantization, which is not widely used today but is more reasonable. Percentage quantization is also a non-symmetric method.

tqchen (Member) commented Dec 10, 2018

One thing we also want to discuss in detail is the mapping between the arithmetic in the quantized view and the normal operators that are already defined. Here are some examples:

  • quantize: can usually be represented by mul -> round -> maximum/minimum (clip) -> cast
  • q_add: can usually be translated directly to add in the storage type (int8 add)
  • q_max_pool: can usually be translated directly to pooling in the storage type (int8 max_pool)
  • q_mul: depending on symmetric or asymmetric, there are some questions here
    • Usually needs mixed-precision support, which means it will be cast then mul (this can be pattern-matched by LLVM to generate mixed-precision instructions)
  • Change of scale in the domain (s1 -> s2) in the symmetric case:
    • If scale = s2/s1 is not a power of two, we can perform y = round(x / scale)
    • If the scale is a power of two, scale = 2^k with k >= 1, we can do y = (x + 0.5 * 2^k) >> k; note that right shift rounds to -inf, so the additional constant keeps things balanced

The general goal is to avoid creating a huge list of quantized_xxx ops, but instead directly lower them to standard normal ops that operate on i8/i32; the sketch below illustrates the power-of-two rescale case.
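
A small NumPy check of the power-of-two case above; rescale_pow2 is just an illustrative name, and the added constant 2^(k-1) = 0.5 * 2^k compensates for the shift rounding toward -inf:

import numpy as np

def rescale_pow2(x, k):
    # y = (x + 0.5 * 2^k) >> k, i.e. divide by 2^k rounding to nearest
    return (x + (1 << (k - 1))) >> k

x = np.array([101, -101, 37, -37], dtype="int32")
print(rescale_pow2(x, 3))   # [ 13 -13   5  -5], i.e. round(x / 8)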

antinucleon (Contributor) commented Dec 10, 2018

> The general goal is to avoid creating a huge list of quantized_xxx ops, but instead directly lower them to standard normal ops that operate on i8/i32

It depends on how many hardware backends are considered. E.g., if the hardware is designed to execute with i8/u8 -> i16 -> u8 mixed input, then standard i8/i32 will not work for it. i8/u8 is quite common in ISP-based quantization.

I think before making a decision on the API, it would be better to decide what kinds of hardware will be covered.

tqchen (Member) commented Dec 11, 2018

The general goal is to be able to cover all the cases in the scaffolding, and optionally extend them. The current proposal does not exclude mixed signed/unsigned input though: the simple pattern could be expressed as (i8 cast to i16) mul (u8 cast to i16), which can still be lowered to normal ops, as in the small sketch below.
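
A minimal NumPy illustration of that pattern, assuming i8 for one operand and u8 for the other, both widened to i16 before the multiply; the names are illustrative:

import numpy as np

a = np.array([-120, 7, 56], dtype="int8")    # signed 8-bit operand
b = np.array([250, 3, 200], dtype="uint8")   # unsigned 8-bit operand
# widen both sides to i16 and multiply with normal ops; a backend can
# pattern-match this into a mixed-precision multiply instruction
prod = a.astype("int16") * b.astype("int16")
print(prod, prod.dtype)                      # [-30000 21 11200] int16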

sergei-mironov (Contributor) commented:

Could you please check the definition of simulated_quantize in the "Current Design" section? In the line ret_data = data * dom_scale, should we change data to clipped_data?

tqchen (Member) commented Dec 26, 2018

@grwlf I think you are right, cc @ZihengJiang

Vooblin (Contributor) commented Jan 11, 2019

I think simulated_quantize shouldn't have the attribute sign. Only the attribute nbit and the method of calculating dom_scale affect the rounding error, not the quantization scheme in general, don't they?

tqchen (Member) commented Jan 15, 2019

The sign bit affects the effective number of bits available out of nbit and the way people do clipping.

eqy (Contributor) commented Jan 18, 2019

X-posting my comment from the PR thread as it is more relevant here:
Is there support for mixed quantization levels? It looks like we currently specify only a global weight and activation precision; there seems to be only one QConfigNode object. Since we can already do things like skip the first k conv layers when quantizing, this seems like it would be a useful generalization.

eqy (Contributor) commented Jan 23, 2019

@ZihengJiang I am currently playing around with some ways to implement mixed-precision quantization here. Currently it is a little tricky to give the user fine-grained control. Do you think something like tagging each annotation (e.g., with some id) and giving QConfigNode a mapping of id -> quantization bit width is reasonable (see the sketch below)? A simple first approach is to use the conv_counter as the id, but fancier things can also be done; this seems to be modular (orthogonal) to the current PR.
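
To make the suggestion concrete, a purely hypothetical sketch of such a mapping (none of these names exist in the PR), using the conv_counter as the annotation id:

# hypothetical per-annotation bit widths keyed by conv_counter
mixed_precision_plan = {
    0: 8,   # keep the first conv at 8 bits (or skip it entirely)
    1: 6,
    2: 4,   # later convs tolerate more aggressive quantization
}

def nbit_for(conv_counter, default_nbit=8):
    # fall back to the global QConfig setting when no override is given
    return mixed_precision_plan.get(conv_counter, default_nbit)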

tqchen (Member) commented Feb 18, 2019

Closed by #2116. Let us open new threads for new discussions on supporting more data types.
