
[RFC] Quantization Workflow #2259

Closed · ZihengJiang opened this issue Dec 8, 2018 · 14 comments
ZihengJiang (Contributor) commented Dec 8, 2018

Goal

Here are two feasible approaches to support running quantized models with TVM:

  • Get quantized models from other frontend frameworks like TF, C2, MX. We need to add support for low-bit kernels and operators like quantize/dequantize in TVM, then transform the quantized graph directly.
  • Implement the quantization algorithm based on Relay, i.e., take over the quantization procedure with Relay. This approach also requires support for low-bit TVM kernels.

Actually, these two methods are not contradictory and we can achieve both. The question is whether the second approach is necessary and worth the extra effort.

The problem is that different hardware targets may have different constraints: we may have different choices of bits, and some hardware may only support shifts. We also have multiple choices of quantization scheme, like symmetric, asymmetric, etc. We want to make this procedure easier and more flexible for hardware developers, based on Relay and TVM. Again, what we want is not to propose "the only right way to achieve quantization in TVM". What we want is a workflow that can be flexibly customized for different hardware targets and different quantization schemes; signed symmetric quantization is just one demo of this workflow.

Current Design

The current quantization workflow is composed of three passes on the Relay IR.

Annotate

Given a float32 graph, this pass returns a graph that simulates the error introduced by the current quantization scheme. The implementation centers around a rewrite function for each operator and a simulated quantize op. Let us review the definition of simulated_quantize first:

def simulated_quantize(data, dom_scale, nbit, clip_min, clip_max, sign=True, rounding='round'):
    """Simulate the rounding error and saturation error."""
    scaled_data = data / dom_scale
    # select round scheme `round`/`floor`/`ceil`/`statistical_round` according to attribute `rounding`
    round_data = round(scaled_data)
    clipped_data = clip(round_data, clip_min, clip_max)
    # recover the data in the original float domain
    ret_data = clipped_data * dom_scale
    return ret_data
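
As a usage illustration, here is a minimal NumPy sketch of the same computation on a concrete tensor, assuming an 8-bit signed setting; the helper name and the constants are illustrative only, not part of the actual op:

import numpy as np

def simulated_quantize_np(data, dom_scale, clip_min, clip_max):
    scaled = data / dom_scale
    rounded = np.round(scaled)                  # rounding='round'
    clipped = np.clip(rounded, clip_min, clip_max)
    # the result is still float32 but carries the quantization error
    return (clipped * dom_scale).astype("float32")

x = np.array([0.013, -0.4, 1.7], dtype="float32")
print(simulated_quantize_np(x, dom_scale=1 / 64.0, clip_min=-127, clip_max=127))
# -> [ 0.015625 -0.40625   1.703125]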

Every operator can register an AnnotateRewrite function, which rewrites that operator in the original graph. For example, it will rewrite a subgraph data -> conv into data -> sq -> conv. It can be overridden by users for different quantization schemes (see the override sketch after the example below).

# a naive pseudo example of registering a rewrite function for conv2d
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite(ref_call, new_args, ctx):
    lhs, rhs = new_args
    # attach a simulated_quantize op to both the data and the weight
    lhs = attach_simulated_quantize(lhs, sign=True, rounding='round')
    rhs = attach_simulated_quantize(rhs, sign=True, rounding='round')
    return conv2d(lhs, rhs, ref_call.attrs)
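
As a hedged illustration of the override point, a user could register a different rewrite that keeps the weight signed but annotates the activation as unsigned (e.g. after ReLU); attach_simulated_quantize and the decorator are the same helpers as in the pseudo example above, and sign=False is only assumed to be supported:

# hypothetical override: unsigned activation, signed symmetric weight
@register_annotate_rewrite("nn.conv2d")
def conv2d_rewrite_unsigned_act(ref_call, new_args, ctx):
    data, weight = new_args
    # activations after ReLU are non-negative, so an unsigned range wastes no bit
    data = attach_simulated_quantize(data, sign=False, rounding='round')
    weight = attach_simulated_quantize(weight, sign=True, rounding='round')
    return conv2d(data, weight, ref_call.attrs)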

Calibrate

The calibrate procedure tries to determine the values of dom_scale, nbit, clip_min, clip_max for every simulated_quantize operator. Currently, we use a quite naive approach: setting them to the upper/lower bounds that the default bit setting allows (a sketch follows below). There is a lot of room to explore how to set these fields more intelligently.
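
For concreteness, a hedged sketch of that naive bound setting under a signed symmetric scheme; naive_calibrate and the max_abs default are illustrative names and values, not the actual TVM code:

def naive_calibrate(nbit, sign=True, max_abs=8.0):
    # clip bounds come purely from the bit width, i.e. the upper/lower
    # bound that the default bit setting allows
    if sign:
        clip_max = 2 ** (nbit - 1) - 1   # e.g. 127 for nbit=8
        clip_min = -clip_max
    else:
        clip_max = 2 ** nbit - 1         # e.g. 255 for nbit=8
        clip_min = 0
    # dom_scale maps an assumed dynamic range onto the integer range;
    # max_abs=8.0 is a placeholder, not a framework default
    dom_scale = max_abs / clip_max
    return dom_scale, clip_min, clip_max

print(naive_calibrate(8))   # (0.06299..., -127, 127)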

Realize

The realize pass transforms the simulated quantized graph, which still computes with float32, into a real low-bit integer graph. It replaces each simulated_quantize with fine-grained operators like add, multiply, and shift as much as possible for performance (fusion, etc.).
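
To make the lowering concrete, here is a hedged NumPy sketch (not the actual pass output) of what a realized data -> sq -> conv fragment boils down to: the simulated_quantize becomes multiply, round, clip, and cast on the storage type, and the conv-like multiply accumulates in int32:

import numpy as np

def quantize_int8(data_fp32, dom_scale):
    # realized form of simulated_quantize on an input tensor:
    # scale, round, saturate, then cast to the int8 storage type
    q = np.round(data_fp32 / dom_scale)
    return np.clip(q, -127, 127).astype("int8")

x = np.random.randn(6).astype("float32")
w = np.random.randn(6).astype("float32")
xq = quantize_int8(x, 1 / 16.0)
wq = quantize_int8(w, 1 / 64.0)
# the multiply runs on int8 inputs with an int32 accumulator; its result
# lives in the (1/16) * (1/64) scale domain, and a later multiply/shift
# would bring it back to the output scale
acc = xq.astype("int32") * wq.astype("int32")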

Demonstration

This workflow should be able to support different choices in terms of number of bits and quantization scheme; one only needs to override the registered AnnotateRewrite function for each operator.

(TODO)

Support for different bits

  • i8->i32
  • i16->i32
  • i8->i24
  • i5->i16

Support for different quantization schemes

  • Symmetric
  • Asymmetric
  • Channel-wise Scale

tqchen changed the title from "[RFC] A Low-bit Quantization Workflow Can be Flexibly Customized" to "[RFC] Quantization Workflow" on Dec 8, 2018
tqchen (Member) commented Dec 8, 2018

Given that quantization is a broad topic, it would be great to list all the possible alternatives (asymmetric and symmetric, choices of bits), discuss their pros and cons, and how we can support them.

tqchen (Member) commented Dec 10, 2018

@ajtulloch do you mind elaborating on what is needed for asymmetric support and its pros and cons?

antinucleon (Contributor) commented:

This article is a systematic summary of current quantization approaches: https://arxiv.org/pdf/1806.08342.pdf

Usually asymmetry is introduced by activations, like ReLU6.

What this paper does not cover is percentage quantization, which is not widely used today but is more reasonable. Percentage quantization is also a non-symmetric method.

tqchen (Member) commented Dec 10, 2018

One thing we also want to discuss in detail is the mapping between the arithmetic in the quantized view and the normal operators that are already defined. Here are some examples:

  • quantize: can usually be represented by mul -> round -> maximum/minimum (clip) -> cast
  • q_add: can usually be translated directly to add in the storage type (int8 add)
  • q_max_pool: can usually be translated directly to pooling in the storage type (int8 max_pool)
  • q_mul: depending on symmetric or asymmetric, there are some questions here
    • Usually needs mixed-precision support, which means it will be cast then mul (this can be pattern-matched by LLVM to generate mixed-precision instructions)
  • Change of scale in the domain (s1 -> s2) in the symmetric case:
    • If scale = s2/s1 is not a power of two, we can perform y = round(x / scale)
    • If the scale is a power of two, scale = 2^k with k >= 1, we can do y = (x + 0.5 * 2^k) >> k; note that right shift rounds to -inf, so the additional constant keeps things balanced

The general goal is to avoid creating a huge list of quantized_xxx ops, but instead directly lower them to standard normal ops that operate on i8/i32; the sketch below illustrates the power-of-two rescale case.
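
A small NumPy check of the power-of-two case above; rescale_pow2 is just an illustrative name, and the added constant 2^(k-1) = 0.5 * 2^k compensates for the shift rounding toward -inf:

import numpy as np

def rescale_pow2(x, k):
    # y = (x + 0.5 * 2^k) >> k, i.e. divide by 2^k rounding to nearest
    return (x + (1 << (k - 1))) >> k

x = np.array([101, -101, 37, -37], dtype="int32")
print(rescale_pow2(x, 3))   # [ 13 -13   5  -5], i.e. round(x / 8)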

antinucleon (Contributor) commented Dec 10, 2018

> The general goal is to avoid creating a huge list of quantized_xxx ops, but instead directly lower them to standard normal ops that operate on i8/i32

It depends on how many hardware backends are considered. E.g., if the hardware is designed to execute with i8/u8 -> i16 -> u8 mixed input, then standard i8/i32 will not work for it. i8/u8 is quite common in ISP-based quantization.

I think before making a decision on the API, it would be better to decide what kinds of hardware will be covered.

tqchen (Member) commented Dec 11, 2018

The general goal is to be able to cover all the cases in the scaffolding, and optionally extend them. The current proposal does not exclude mixed signed/unsigned input though: the simple pattern could be expressed as (i8 cast to i16) mul (u8 cast to i16), which can still be lowered to normal ops, as in the small sketch below.
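
A minimal NumPy illustration of that pattern, assuming i8 for one operand and u8 for the other, both widened to i16 before the multiply; the names are illustrative:

import numpy as np

a = np.array([-120, 7, 56], dtype="int8")    # signed 8-bit operand
b = np.array([250, 3, 200], dtype="uint8")   # unsigned 8-bit operand
# widen both sides to i16 and multiply with normal ops; a backend can
# pattern-match this into a mixed-precision multiply instruction
prod = a.astype("int16") * b.astype("int16")
print(prod, prod.dtype)                      # [-30000 21 11200] int16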

sergei-mironov (Contributor) commented:

Could you please check the definition of simulated_quantize in the "Current Design" section? In the line ret_data = data * dom_scale, should we change data to clipped_data?

tqchen (Member) commented Dec 26, 2018

@grwlf I think you are right, cc @ZihengJiang

Vooblin (Contributor) commented Jan 11, 2019

I think simulated_quantize shouldn't have the attribute sign. Only the attribute nbit and the method of calculating dom_scale affect the rounding error, not the quantization scheme in general, don't they?

tqchen (Member) commented Jan 15, 2019

The sign bit affects the effective number of bits available out of nbit and the way people do clipping.

eqy (Contributor) commented Jan 18, 2019

X-posting my comment from the PR thread as it is more relevant here:
Is there support for mixed quantization levels? It looks like we currently specify only a global weight and activation precision; there seems to be only one QConfigNode object. Since we can already do things like skip the first k conv layers when quantizing, this seems like it would be a useful generalization.

eqy (Contributor) commented Jan 23, 2019

@ZihengJiang I am currently playing around with some ways to implement mixed-precision quantization here. Currently it is a little tricky to give the user fine-grained control. Do you think something like tagging each annotation (e.g., with some id) and giving QConfigNode a mapping of id -> quantization bit width is reasonable (see the sketch below)? A simple first approach is to use the conv_counter as the id, but fancier things can also be done; this seems to be modular (orthogonal) to the current PR.
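
To make the suggestion concrete, a purely hypothetical sketch of such a mapping (none of these names exist in the PR), using the conv_counter as the annotation id:

# hypothetical per-annotation bit widths keyed by conv_counter
mixed_precision_plan = {
    0: 8,   # keep the first conv at 8 bits (or skip it entirely)
    1: 6,
    2: 4,   # later convs tolerate more aggressive quantization
}

def nbit_for(conv_counter, default_nbit=8):
    # fall back to the global QConfig setting when no override is given
    return mixed_precision_plan.get(conv_counter, default_nbit)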

tqchen (Member) commented Feb 18, 2019

Closed by #2116. Let us open new threads for new discussions on supporting more data types.
