
Remove quantization-specific params from OperandDescriptor #44

Closed

wchao1115 opened this issue Feb 13, 2020 · 3 comments

@wchao1115
Collaborator

Keep OperandDescriptor tightly scoped and versioning-friendly by removing quantization-specific params (e.g. scale and zeroPoint) while keeping the OperandType enum values as a straight data-type enum. Quantization-specific params could instead become arguments of a new Operand-creating overload.
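
A minimal TypeScript sketch of what that split could look like; the `createQuantizedOperand` name and the `LinearQuantizationParams` shape are assumptions for illustration, not part of the spec:

```ts
// Hypothetical sketch: OperandDescriptor stays a plain dtype + shape record.
type OperandType = "float32" | "float16" | "int32" | "uint8" | "int8";

interface OperandDescriptor {
  type: OperandType;      // straight data-type enum, no quantization fields
  dimensions?: number[];
}

// Hypothetical quantization record; scale/zeroPoint are artifacts of the
// linear quantization scheme specifically.
interface LinearQuantizationParams {
  scale: number;
  zeroPoint: number;
}

// The quantization-specific params become arguments of a separate,
// dedicated factory instead of living on every descriptor.
interface ModelBuilderSketch {
  createOperand(desc: OperandDescriptor): object;
  createQuantizedOperand(
    desc: OperandDescriptor,
    quantization: LinearQuantizationParams
  ): object;
}
```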

@kpu

kpu commented Sep 8, 2020

The alternative you appear to be proposing, in which operators take extra quantization arguments, is worse. You end up with MXNet's 6-9 tensors just to call matrix multiplication (https://github.com/apache/incubator-mxnet/blob/master/src/operator/quantization/quantized_fully_connected.cc), which is very hard to keep track of. And users end up doing their own scaling calculations, which are easy to get wrong. Carrying quantization information with the tensor is much better.

Were there a base class without quantization information and an inherited class with quantization information, that would be acceptable.
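
One way to read that suggestion, sketched in TypeScript; all names here are hypothetical:

```ts
// Hypothetical reading of the base/derived split: plain tensors use the
// base descriptor; quantized tensors extend it with the extra fields.
interface TensorDescriptor {
  type: "float32" | "int32" | "uint8";
  dimensions: number[];
}

interface QuantizedTensorDescriptor extends TensorDescriptor {
  scale: number;
  zeroPoint: number;
}

// Operators that don't care about quantization accept the base type;
// quantization-aware code can narrow to the derived type when needed.
function isQuantized(d: TensorDescriptor): d is QuantizedTensorDescriptor {
  return "scale" in d && "zeroPoint" in d;
}
```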

@wchao1115
Collaborator Author

wchao1115 commented Sep 8, 2020

The key issues in my mind with coupling quantization data to the tensor data are, first, that the quantization process itself may produce new tensors, and second, that there are other kinds of quantization methods besides the linear function most frameworks support today. The artifacts from the first issue are precisely why, as you pointed out, more tensors are needed to compute a quantized matmul operation. The scale factors and zero-point adjustments are just the artifacts of the linear quantization function; many other functions would produce a different set of artifacts.

By folding these specialized artifacts of a given quantization process into the notion of a tensor, which is very specific and rather rudimentary, we are asserting that any tensor can be thought of as a bag of properties that may also carry a set of compute-specific transformation data. This means that every operator, unless somehow exempted, must rationalize its own unique computation requirements against whatever set of transforms is presented in one of its tensor operands. That is a very unsustainable situation for API longevity.

It would be much more manageable to instead dedicate additional operators to a given type of quantization function while keeping the notion of a tensor as simple as it was originally intended to be, from the data-structure standpoint. ONNX took this approach rather successfully in its support for linear quantization, as sketched below. Certainly, more tensor arguments go into the linear-quantized version of matmul, but that is just the natural consequence of a process that does indeed need more tensors to compute the result.
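
A hedged TypeScript sketch of that operator-level approach, modeled loosely on ONNX's QLinearMatMul; the interface and method names here are assumptions, not a real API:

```ts
// Hypothetical sketch in the style of ONNX's QLinearMatMul: the
// quantization artifacts are explicit operands of a dedicated operator,
// and the tensor type itself stays a plain dtype + shape.
type Operand = object; // opaque handle standing in for a graph node

interface QuantizedOpsSketch {
  // Eight inputs: two quantized matrices plus scale and zero-point
  // tensors for each input and for the output.
  quantizedLinearMatMul(
    a: Operand, aScale: Operand, aZeroPoint: Operand,
    b: Operand, bScale: Operand, bZeroPoint: Operand,
    outScale: Operand, outZeroPoint: Operand
  ): Operand;
}
```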

I would also point out that in the majority of post-training quantized models, not every operator needs to be aware of quantized data; only a few do. It's common, and much more efficient, for a quantized model to route the quantization scale and zero-point tensors separately from the main quantized data so that the final dequantization (where float values are recovered) can take place at the very end of the graph with all the quantized tensors combined, rather than passing them throughout the graph and risking unintended copies.

@wchao1115
Collaborator Author

Per PR #94.
