Add gguf q4_k quantization #2001
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2001
Note: Links to docs will display an error until the docs builds have been completed. ✅ No Failures as of commit d4bb04d with merge base 3bbf42a. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Haven't looked at the code, but does it also implement the super-block scale?
Yeah, this is exactly what this PR is implementing :) Q4_K quant has two levels of quantization.
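For readers unfamiliar with the scheme, here is a minimal sketch of what the two levels mean for a Q4_K-style layout. The names, shapes, and function below are assumptions for illustration only, not this PR's actual code:

```python
import torch

# Illustrative sketch of the two quantization levels in a Q4_K-style layout.
# Names and shapes are assumptions for illustration, not this PR's actual code.
# Level 1: 4-bit codes per weight, with a scale and min per 32-weight sub-block.
# Level 2: the sub-block scales and mins are themselves stored as 6-bit codes,
#          scaled by one floating-point value per 256-weight super-block.

def dequantize_q4_k_like(q, scale_codes, min_codes, d_scale, d_min):
    # q:           (n_super, 8, 32) uint8 holding 4-bit codes in [0, 15]
    # scale_codes: (n_super, 8)     uint8 holding 6-bit codes in [0, 63]
    # min_codes:   (n_super, 8)     uint8 holding 6-bit codes in [0, 63]
    # d_scale:     (n_super, 1)     per-super-block scale for scale_codes
    # d_min:       (n_super, 1)     per-super-block scale for min_codes
    scales = scale_codes.float() * d_scale   # second-level dequantization
    mins = min_codes.float() * d_min
    # first-level dequantization: w = q * block_scale + block_min
    return q.float() * scales.unsqueeze(-1) + mins.unsqueeze(-1)
```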
import torch
from torchao.prototype.quantization.gguf import (
Validate this, btw, by actually creating a GGUF file for a model and then running the resulting GGUF file.
Haven't explored how to export yet; will do that in the next PR.
@dataclass
class GGUFWeightOnlyConfig(AOBaseConfig):
Have you checked that this is generic enough to capture all of their superblock affine schemes?
If it included the number of bits of the sub-block scales and mins (both are usually the same), then it would be easier to adapt to more types.
Some types use 4-bit sub-block scales and mins (Q2_K), others 6-bit (Q3_K, Q4_K, Q5_K), and others 8-bit (Q6_K).
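As a hypothetical sketch of what such a generalization could look like (the field names are invented here for illustration and do not match the actual GGUFWeightOnlyConfig in this PR):

```python
from dataclasses import dataclass

# Hypothetical generalization of the config, parameterized by the bit-widths
# involved. Field names are invented for illustration only and do not match
# the actual GGUFWeightOnlyConfig in this PR.
@dataclass
class GGUFKQuantConfigSketch:
    weight_bits: int = 4          # 4 for Q4_K, 2 for Q2_K, 6 for Q6_K, ...
    sub_block_size: int = 32      # weights per sub-block
    super_block_size: int = 256   # weights per super-block
    scale_min_bits: int = 6       # 4 for Q2_K; 6 for Q3_K/Q4_K/Q5_K; 8 for Q6_K

# e.g. a Q2_K-like variant: GGUFKQuantConfigSketch(weight_bits=2, scale_min_bits=4)
```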
Yeah, it will be easy to extend to other variations; I'm just starting with Q4_K for now.
@staticmethod
def __new__(
    cls,
Does GGUF have a utility that can construct their packed tensors from these values?
Not yet for k-quants, sorry. It is planned, but it will take some time.
But it would not be a drop-in replacement here anyway, since the gguf Python package uses NumPy for its calculations, not PyTorch (at least for now).
Thanks for the context @compilade. I think maybe we can try to see if we can use the export script from auto-round: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/convert.py#L1159-L1169
> I think maybe we can try to see if we can use the export script from autoround

This will likely be very, very slow (more than 20x the C version), since they re-implement make_qkx2_quants in NumPy: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L164.
Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.
I think the way you did it with separate tensors for the scales and mins does seem appropriate to avoid some of the complexity of the packing format (since from what I understand, this is intended to be some in-memory quantization used in QAT? Do correct me if I'm wrong). (The scales and mins in Q4_K are notoriously packed together in 12 bytes, but that's not relevant if this is intended as an in-memory quantization with different kernels than the ones in ggml.)
You only need to worry about the packing format if you are exporting to GGUF. (Dequantization is implemented in the upstream gguf Python package for most types (including k-quants and i-quants) already; it's only quantization which is currently limited to Q4_0, Q4_1, Q5_0, Q5_1, and Q8_0, because k-quants in Python were too slow to pack (although this could change after ggml-org/llama.cpp#12557, which might simplify the search for scales (and mins, if generalized)).)
It would be possible, though, to add an API to the upstream gguf Python package which would skip the search for the scales and mins but still allow quantizing to Q4_K and similar, but I'm not sure how it would/should be used.
> This will likely be very, very slow (more than 20x the C version) since they re-implement make_qkx2_quants in Numpy: intel/auto-round@eb79348/auto_round/export/export_to_gguf/quant.py#L164.
> Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.
Yes, that's correct I think. We may have to adapt that code a bit for torchao quants to work, but we'll be providing scale/min from torchao (current plan is to use GPTQ, AutoRound or QAT), so we won't need to run that path; we will need to make sure this path is taken instead: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L326
> if this is intended as an in-memory quantization with different kernels than the ones in ggml
We do want to target ggml kernels in the end, but the overall goal here is to leverage existing torchao post-training accuracy-preserving techniques like GPTQ and AutoRound, as well as quantization-aware training, to see if we can help improve the accuracy of various GGUF quantization schemes by composing with these existing techniques (and relying on user data).
Regarding the search algorithms in gguf, yeah, I feel they are quite complicated (make_qx_quant, make_qp_quant, etc.; I also haven't looked at the imatrix stuff), and it might be error-prone for me to port them here; I also didn't see them documented anywhere else. @compilade, can you share how these are derived at a high level, and how we can understand them better? i.e., what are the high-level motivation/direction/constraints that you are working with: are all of them trying to minimize the rounding error? What about clamping error? Why don't you use data to improve quantization accuracy?
Thanks, I'll merge now since CI is happy; will add more docs next time.
* Add gguf q4_k_s quantization
* fix
* test with phi4
* pre-commit run
* update
* run precommit
* format
Summary:
Didn't implement the algorithm to choose_qparams from gguf, since it's complicated, e.g. https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L744 and https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L827C14-L827C28.
Instead, implemented a simple choose_qparams that fits the gguf format: Q4_K: w = q * block_scale(6-bit) + block_min(6-bit).
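As a minimal sketch of that simple choose_qparams (the function name, block sizes used as defaults, rounding, and edge-case handling here are assumptions, not the PR's exact implementation):

```python
import torch

# Sketch of a simple choose_qparams fitting the Q4_K layout described above.
# Level 1: per-sub-block affine scale/min so that 4-bit codes cover the range.
# Level 2: quantize those scales/mins to 6-bit codes per 256-weight super-block.
# Names, block sizes, and edge-case handling are illustrative assumptions.

def choose_qparams_q4_k_like(w, sub_block=32, super_block=256, eps=1e-8):
    n_super = w.numel() // super_block          # assumes numel divides evenly
    w = w.reshape(n_super, super_block // sub_block, sub_block)

    # Level 1: w ~= q * scale + min with q in [0, 15]
    mins = w.amin(dim=-1).clamp(max=0.0)        # keep min <= 0 (simplification)
    scales = (w.amax(dim=-1) - mins).clamp(min=eps) / 15.0

    # Level 2: 6-bit codes in [0, 63] for the scales and mins of each super-block
    d_scale = scales.amax(dim=-1, keepdim=True) / 63.0
    d_min = mins.amin(dim=-1, keepdim=True).clamp(max=-eps) / 63.0
    scale_codes = torch.round(scales / d_scale).clamp(0, 63)
    min_codes = torch.round(mins / d_min).clamp(0, 63)
    return scale_codes, min_codes, d_scale, d_min
```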
The goal of porting the GGUF format into torchao is to compose it with our existing accuracy-preserving techniques like GPTQ, AutoRound, and QAT, to see if we can help improve accuracy.
Also produced https://huggingface.co/jerryzh168/phi4-mini-torchao-gguf-q4_k with this change, and verified with lm-eval that it has good accuracy.
Test Plan:
python test/prototype/test_gguf_quant.py
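For reference, a hedged usage sketch of how this config might be applied; the constructor arguments of GGUFWeightOnlyConfig are assumptions (defaults are assumed to target Q4_K), and test/prototype/test_gguf_quant.py is the authoritative usage:

```python
import torch
from torchao.quantization import quantize_
from torchao.prototype.quantization.gguf import GGUFWeightOnlyConfig

# Minimal usage sketch. Constructor arguments are assumptions (defaults are
# assumed to target Q4_K); see test/prototype/test_gguf_quant.py for the
# actual usage in this PR.
model = torch.nn.Sequential(torch.nn.Linear(256, 256)).eval()
quantize_(model, GGUFWeightOnlyConfig())

x = torch.randn(1, 256)
with torch.no_grad():
    y = model(x)
```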