Remove quantization-specific params from OperandDescriptor #44
Comments
The alternative you appear to be proposing, where operators take extra quantization arguments, is worse. You end up with MXNet's 6-9 tensors just to call a matrix multiplication (https://github.com/apache/incubator-mxnet/blob/master/src/operator/quantization/quantized_fully_connected.cc), which is very hard to keep track of, and users end up doing their own scaling calculations, which are easy to get wrong. Carrying the quantization information with the tensor is much better. A base class without quantization information plus an inherited class that adds it would also be acceptable.
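To make the contrast concrete, here is a minimal TypeScript sketch of the two API shapes under discussion. The names (quantizedMatmul, QuantizedOperand, and the parameter lists) are hypothetical illustrations, not part of any existing spec.

```typescript
// Shape 1: quantization parameters passed as extra operator arguments.
// Every quantized call carries its own scales and zero points, so a single
// matmul balloons into many arguments (the MXNet-style problem noted above).
interface Operand { /* opaque handle to tensor data */ }

declare function quantizedMatmul(
  a: Operand, b: Operand,
  aScale: number, aZeroPoint: number,
  bScale: number, bZeroPoint: number,
  outScale: number, outZeroPoint: number
): Operand;

// Shape 2: quantization parameters carried on the operand itself.
// The operator signature stays small because each operand already knows
// how its integer data maps back to real values.
interface QuantizedOperand extends Operand {
  scale: number;
  zeroPoint: number;
}

declare function matmul(
  a: Operand | QuantizedOperand,
  b: Operand | QuantizedOperand
): Operand;
```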
The key issues in my mind with coupling quantization data with the tensor data are, first, that the quantization process itself may produce new tensors, and second, that there are other kinds of quantization methods besides the linear function supported by most frameworks today. The artifact from the first issue is precisely why, as you pointed out, more tensors are needed to compute a quantized matmul operation. The scale factors and zero-point adjustments are just the artifacts of the linear quantization function; other functions would produce a different set of artifacts.

By folding these specialized artifacts from a given quantization process into the notion of a tensor, which is intentionally specific and rather rudimentary, we assert that any tensor can be thought of as a bag of properties that may also carry a set of compute-specific transformation data. Every operator, unless somehow exempted, must then reconcile its own computation requirements with the transforms presented in one of its tensor operands. That is an unsustainable situation for API longevity.

It would be much more manageable to instead dedicate additional operators that deal with a certain type of quantization function in their calculations, while keeping the notion of a tensor as simple as it was originally intended to be from the data structure standpoint. ONNX took this approach rather successfully in its support for linear quantization. Certainly, more tensor arguments go into the linear-quantized version of matmul, but that is just a natural consequence of a process that genuinely needs more tensors to compute its result.

I would also point out that in the majority of post-training quantized models, not every operator needs to be aware of quantized data; only a few do. It is common, and much more efficient, for a quantized model to route the scale and zero-point tensors separately from the main quantized data, so that the final dequantization (where float values are reconstituted) can take place at the very end of the graph with all the quantized tensors combined, rather than passing them throughout the graph and risking unintended copies.
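For reference, the linear quantization function discussed above maps a real value x to an integer q via q = round(x / scale) + zeroPoint, and back via x ≈ (q - zeroPoint) * scale. Below is a small TypeScript sketch of those two steps; the function names and the uint8 range are illustrative assumptions, not part of any API.

```typescript
// Linear (affine) quantization: the scale and zero point are the only
// "artifacts" this particular scheme produces; other schemes produce others.

// Quantize a real value to an 8-bit integer (assumed uint8 range [0, 255]).
function quantizeLinear(x: number, scale: number, zeroPoint: number): number {
  const q = Math.round(x / scale) + zeroPoint;
  return Math.min(255, Math.max(0, q)); // clamp to the integer range
}

// Dequantize back to a real value. Because this mapping is affine, it can be
// deferred to the very end of the graph, as long as the scale and zero point
// are routed alongside (not inside) the quantized data.
function dequantizeLinear(q: number, scale: number, zeroPoint: number): number {
  return (q - zeroPoint) * scale;
}
```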
Per PR #94.
Keep OperandDescriptor scoped and versioning-friendly by removing quantization-specific params, e.g. scale and zeroPoint, while keeping the OperandType enum values as a straight data-type enum. Quantization-specific params could instead be made arguments of a new Operand-making overload.
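A minimal TypeScript sketch of the proposed split, assuming a hypothetical builder method (quantizedConstant) purely for illustration; the actual overload name and shape would be decided in the spec:

```typescript
// OperandDescriptor stays a plain data-type + dimensions record,
// with no quantization-specific fields.
interface OperandDescriptor {
  type: "float32" | "float16" | "int32" | "uint8"; // straight data-type enum
  dimensions?: number[];
}

interface Operand { /* opaque handle returned by the graph builder */ }

// Hypothetical builder surface: the quantization params move into a
// dedicated operand-making overload instead of living on the descriptor.
interface GraphBuilder {
  constant(desc: OperandDescriptor, data: ArrayBufferView): Operand;
  quantizedConstant(
    desc: OperandDescriptor,
    data: ArrayBufferView,
    scale: number,
    zeroPoint: number
  ): Operand;
}
```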