[RFC] Add Auto-Round support #533
Comments
This is great to see, awesome work @yiliu30! For integrations I think the flow makes sense. I'm also wondering if it would make sense to have even tighter integration in terms of the quantized kernels and quantization primitive ops. Today in torchao we have Affine Quantization: https://github.com/pytorch/ao/tree/main/torchao/quantization#affine-quantization. There are a few things in the stack (from highest level to lowest level): tensor subclasses, quant primitive ops, and uint dtype tensors.
So ideally I think we would like to integrate with the stack as high as possible (tensor subclass > quant primitive op > uint dtype tensors). The benefit is that your implementation will be able to pick up whatever perf improvements we make in these lower-level infrastructures, we can optimize these things together as a community, and all optimization work can benefit other projects as well. For dtype tensors (e.g., a uint4 dtype tensor), I think you should always be able to integrate when that is ready. If this step is successful, we could probably reuse the same infrastructure. I just took a brief look, so there might be things that I missed in terms of the difficulty or feasibility of reusing our quant primitives; please let me know if you have any thoughts on this.
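For example, integrating at the quant primitive level might look roughly like the sketch below (signatures paraphrased from the torchao docs; exact argument names and defaults may differ):

```python
import torch
from torchao.quantization.quant_primitives import (
    MappingType,
    choose_qparams_affine,
    quantize_affine,
    dequantize_affine,
)

weight = torch.randn(64, 128)
block_size = (1, 32)  # per-group quantization along the input dimension

# Emulate uint4 inside a uint8 container via quant_min/quant_max.
scale, zero_point = choose_qparams_affine(
    weight, MappingType.ASYMMETRIC, block_size,
    target_dtype=torch.uint8, quant_min=0, quant_max=15,
)
q = quantize_affine(weight, block_size, scale, zero_point, torch.uint8,
                    quant_min=0, quant_max=15)
w_hat = dequantize_affine(q, block_size, scale, zero_point, torch.uint8,
                          quant_min=0, quant_max=15)
```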
Thank you for the thorough issue @yiliu30! For now we are hoping ao can continue being a project with no dependencies (#371), since this makes it easier for other repos to take a dependency on us, but here are a few other options instead.
Lmk if that makes sense!
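For illustration, one common way to avoid a hard dependency is a guarded import, so auto-round is only required when the Auto-Round flow is actually invoked (a generic sketch, not tied to any specific torchao module):

```python
# Generic optional-dependency pattern; `apply_auto_round` is an illustrative name.
def apply_auto_round(model, *args, **kwargs):
    try:
        import auto_round  # only needed for this one flow
    except ImportError as e:
        raise ImportError(
            "Auto-Round support requires the `auto-round` package: "
            "pip install auto-round"
        ) from e
    ...  # call into auto_round here
```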
Hi @jerryzh168 and @msaroufim, thanks for sharing your knowledge about the stack. How about we divide the integration into two phases?
I can raise a PR to demonstrate more details.
I can take a look at 1, but I can't guarantee I'll merge it. Some considerations I'll worry about are the size of the auto-round package and whether it causes any issues when installing or importing torchao. But if it doesn't take you long to produce that example, it might make the discussion on 2 go faster.
Hi, this is the INC team from Intel. Thank you for developing this amazing project.
Motivation
Our team has developed Auto-Round, a new weight-only quantization algorithm. It has achieved superior accuracy compared to GPTQ, AWQ, and OmniQuant across 11 tasks, particularly excelling in low-bit quantization (e.g., 2-bit and 3-bit). Auto-Round supports quantization from 2 to 8 bits, has a low tuning cost, and imposes no additional overhead during inference. Key results are summarized below, with detailed information available in our paper, GitHub repository, and Hugging Face low-bit quantization leaderboard.
We would like to contribute this quantization algorithm to torchao to let users benefit from its high accuracy.
The Key Idea of Auto-Round
To quantize a given tensor, Auto-Round introduces three trainable parameters (`V`, α, and β) to adjust the rounding value and the clipping range. For a given transformers model, Auto-Round quantizes the decoder blocks one by one, using the block-wise output reconstruction error as the loss to train these parameters.
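A simplified sketch of this quantizer in plain PyTorch is shown below (illustrative only: the function name and the exact way α and β enter the scale computation are simplifications, not Auto-Round's actual implementation):

```python
import torch

def autoround_fake_quant(w, V, alpha, beta, num_bits=4):
    # Simplified sketch: alpha and beta rescale the clipping range, and the
    # learned tensor V (one entry per weight) perturbs the value before rounding.
    qmax = 2 ** num_bits - 1
    wmax = w.max() * alpha
    wmin = w.min() * beta
    scale = ((wmax - wmin) / qmax).clamp(min=1e-5)
    zero_point = torch.round(-wmin / scale)
    # torch.round is not differentiable; training uses a straight-through estimator.
    q = torch.clamp(torch.round(w / scale + V) + zero_point, 0, qmax)
    return scale * (q - zero_point)  # dequantized weight for the block forward pass
```

During tuning, the loss is the reconstruction error between the original block's output and the output computed with these fake-quantized weights.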
The Modeling User API
We propose the following flow for quantizing a model with Auto-Round, which is similar to the flow of static quantization that requires calibration:
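A sketch of what this flow could look like (`prepare_model_for_autoround` and `convert_to_quantized_model` are placeholder names for discussion, not a finalized API):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # any decoder-only transformers model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 1. Swap each decoder block for an ObservedBlock that records its inputs.
prepared_model = prepare_model_for_autoround(model)  # placeholder name

# 2. Calibration: run a few batches so each block captures real inputs.
with torch.no_grad():
    for text in ["hello world", "auto-round is a weight-only quant algorithm"]:
        inputs = tokenizer(text, return_tensors="pt")
        prepared_model(**inputs)

# 3. Train V, alpha, beta per block on the captured inputs, then replace
#    the Linear layers with QuantizedLinear layers.
quantized_model = convert_to_quantized_model(prepared_model)  # placeholder name
```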
Implementation Overview
The high-level idea to implement the above flow is:

1. Replace each decoder block with an `ObservedBlock` for capturing the block's input.
2. Optimize `V`, α, and β, then replace the `Linear` layers in the observed block with `QuantizedLinear` layers by applying the optimal `V`, α, and β.

The main functions and classes to implement this flow are defined below:
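A minimal skeleton of these pieces is sketched below (only `ObservedBlock` and `QuantizedLinear` come from the flow above; everything else is illustrative):

```python
import torch

class ObservedBlock(torch.nn.Module):
    """Wraps a decoder block and records the inputs it sees during calibration."""
    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block
        self.captured_inputs = []  # list of (args, kwargs) pairs

    def forward(self, *args, **kwargs):
        self.captured_inputs.append((args, kwargs))
        return self.block(*args, **kwargs)


class QuantizedLinear(torch.nn.Module):
    """Linear layer holding the weight quantized with the optimal V, alpha, beta."""
    def __init__(self, int_weight, scale, zero_point, bias=None):
        super().__init__()
        self.register_buffer("int_weight", int_weight)
        self.register_buffer("scale", scale)
        self.register_buffer("zero_point", zero_point)
        self.bias = bias

    def forward(self, x):
        # Dequantize-then-matmul for clarity; in practice this would dispatch
        # to a fused low-bit kernel instead.
        w = self.scale * (self.int_weight.float() - self.zero_point)
        return torch.nn.functional.linear(x, w, self.bias)
```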
Your feedback is important. Please feel free to comment on the flow mentioned above or suggest additional approaches :). Thank you in advance!
cc @thuang6 @ftian1 @wenhuach21 @hshen14 @jgong5