Support QuaRot quantization scheme #6444
It's an interesting approach that we should explore. As far as I understand, the model weights are pre-processed (rotated) and then the inference is augmented with a few extra (two?) operations to reverse the effects of the rotation. We could start by implementing the latter and use existing QuaRot models to evaluate the approach.
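For context, the invariance this relies on can be shown in a few lines. This is only a toy numpy sketch of "rotate the weight offline, undo the rotation on the activation online"; nothing here is llama.cpp or QuaRot code, and all names and shapes are illustrative:

```python
import numpy as np

# Any orthogonal Q works for the illustration; QuaRot uses randomized Hadamard matrices.
rng = np.random.default_rng(0)
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # Q @ Q.T == I

W = rng.standard_normal((d, d))   # a linear layer's weight (illustrative)
x = rng.standard_normal(d)        # an incoming activation

W_rot = W @ Q                     # offline: rotate (and later quantize) the weight
y_ref = W @ x
y_rot = W_rot @ (Q.T @ x)         # online: undo the rotation on the activation

assert np.allclose(y_ref, y_rot)  # the layer's output is unchanged by the rotation
```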
Similar to "Training Transformers with 4-bit Integers", except that work only used Hadamard transforms. Hadamard alone might be enough: QuaRot did an ablation for Q alone, but not for Hadamard alone.
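Part of why Hadamard-only is attractive: an orthonormal Hadamard transform of size n can be applied in O(n log n) using only additions and subtractions, instead of a dense n×n matmul. A minimal, unoptimized Python sketch of the fast Walsh-Hadamard transform, just to show the structure:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.array(x, dtype=np.float64)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)   # orthonormal scaling: the transform is its own inverse

v = np.arange(8, dtype=np.float64)
assert np.allclose(fwht(fwht(v)), v)
```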
Thanks for your interest in our work. I am the main author of QuaRot and would be happy to discuss/plan this and help integrate it into the repo. I think the general steps of QuaRot are:
With the above steps, you can quantize the model easily. Optionally, you can apply additional rotations for quantizing the keys in the attention module (please check the Method section in our paper). Please let me know if you need any help/support from our side through my email: saleh.ashkboos@inf.ethz.ch
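As a rough intuition for why rotating before quantizing helps, here is a toy numpy/scipy illustration, not the QuaRot pipeline itself; the outlier, the dimension, and the per-tensor int4 scheme are all made up for the example:

```python
import numpy as np
from scipy.linalg import hadamard   # requires the dimension to be a power of two

def fake_quant_int4(x):
    """Symmetric round-to-nearest 4-bit quantization, returned in dequantized form."""
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -7, 7) * scale

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d)
x[3] = 40.0                          # a single activation outlier blows up the scale

H = hadamard(d) / np.sqrt(d)         # orthonormal Hadamard rotation

err_plain   = np.linalg.norm(fake_quant_int4(x) - x)
err_rotated = np.linalg.norm(H.T @ fake_quant_int4(H @ x) - x)
print(err_plain, err_rotated)        # the rotated version has a much smaller error
```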
@JohannesGaessler
Before I invest time into a specialized kernel for a given quantization method I would like to first see evidence that it's better than the methods currently on master.
😮💨
QuIP# seems interesting in and of itself. Weights are dequantised before use, so it can't use 4-bit math, which is the main attraction of QuaRot, but it does perform better. Also, it doesn't try to fold the rotations into the weights and preserve the rotated space whenever possible. For inference it's just transform -> matmul -> detransform. QuaRot is really elegant, but harder to follow in comparison. The QuIP# way will be easier to drop into the existing code.
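In case it helps, the transform -> matmul -> detransform structure mentioned above as a toy sketch; the lattice-codebook quantization that QuIP# actually performs is left out, and the single Hadamard stand-in for the transform is an assumption:

```python
import numpy as np
from scipy.linalg import hadamard

d = 64
rng = np.random.default_rng(1)
H = hadamard(d) / np.sqrt(d)       # stand-in for the randomized transform
W = rng.standard_normal((d, d))

W_stored = H @ W @ H.T             # offline: weight kept in the transformed space
                                   # (codebook quantization of W_stored omitted here)

def linear(x):
    # online: transform -> matmul on the (dequantised) weight -> detransform
    return H.T @ (W_stored @ (H @ x))

x = rng.standard_normal(d)
assert np.allclose(linear(x), W @ x)
```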
This issue was closed because it has been inactive for 14 days since being marked as stale.
Can we reopen this?
This issue was closed because it has been inactive for 14 days since being marked as stale.
Another relevant work you might be interested in.
This work learns the rotation matrices and achieves even better results than QuaRot with fewer online Hadamard rotations. 😝 If you're interested, we can chat and we can provide support if needed!
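For anyone curious what "learning the rotation" can look like in principle, here is a toy PyTorch sketch that optimizes an orthogonal matrix (kept exactly orthogonal via a Cayley parameterization) to reduce a layer's weight-quantization error. This is only an illustration of the general idea, not the actual training recipe of the work referenced above; the objective and hyperparameters are invented for the example:

```python
import torch

d = 16
# Cayley transform of a skew-symmetric A gives an exactly orthogonal R = (I - A)(I + A)^-1.
A_raw = torch.zeros(d, d, requires_grad=True)

def rotation():
    A = A_raw - A_raw.T                 # skew-symmetric
    I = torch.eye(d)
    return (I - A) @ torch.linalg.inv(I + A)

def fake_quant_int4(x):
    scale = x.abs().max() / 7.0
    q = torch.round(x / scale).clamp(-7, 7) * scale
    return x + (q - x).detach()         # straight-through estimator for the rounding

W = torch.randn(d, d)
opt = torch.optim.Adam([A_raw], lr=1e-2)
for _ in range(200):
    R = rotation()
    # reconstruction error of the rotated-then-quantized weight after un-rotating
    loss = ((fake_quant_int4(W @ R) @ R.T - W) ** 2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
```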
This issue was closed because it has been inactive for 14 days since being marked as stale.
Is llama.cpp planning/able to support SpinQuant? According to Meta, SpinQuant + QLoRA are enabling really great things, and it would be great to not have to use Meta's llama-stack to take advantage of them.
A new, interesting quantization scheme was published, which not only reduces memory consumption (like current quantization schemes) but also reduces computation.
I think it would be interesting to see whether this technique, or parts of it, could be adopted in llama.cpp to speed up inference of quantized models.
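For reference, the compute reduction comes from quantizing activations as well as weights, so the matmuls themselves can run on low-bit integers instead of dequantizing the weights to floats first. A toy numpy simulation of that idea, using int32 containers to stand in for 4-bit values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
x = rng.standard_normal(16)

# Quantize both the weight and the activation symmetrically to 4-bit levels...
sw = np.abs(W).max() / 7.0
sx = np.abs(x).max() / 7.0
Wq = np.clip(np.round(W / sw), -7, 7).astype(np.int32)
xq = np.clip(np.round(x / sx), -7, 7).astype(np.int32)

# ...so the matmul runs on small integers, with a single float rescale at the end.
y_int4 = (Wq @ xq) * (sw * sx)
print(np.abs(y_int4 - W @ x).max())   # small error for well-behaved (outlier-free) tensors
```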