
Add gguf q4_k quantization #2001

Merged
jerryzh168 merged 7 commits into pytorch:main on Apr 8, 2025
Conversation

jerryzh168 (Contributor) commented Apr 2, 2025

Summary:
Didn't implement the algorithm to choose_qparams from gguf, since it's complicated, e.g. https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L744 and https://github.com/ggml-org/llama.cpp/blob/f423981ac806bf031d83784bcb47d2721bc70f97/ggml/src/ggml-quants.c#L827C14-L827C28

but implemented a simple choose_qparams that fits the gguf format: Q4_K: w = q * block_scale (6-bit) + block_min (6-bit)
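
A minimal sketch of what such a simple choose_qparams can look like (plain per-block min/max affine quantization; illustrative only, not necessarily the exact code in this PR):

```python
import torch

def choose_qparams_simple(w: torch.Tensor, block_size: int = 32, n_bit: int = 4):
    # Sketch only: per-block min/max affine quantization, w ≈ q * scale + min.
    # block_size=32 is an assumption here (Q4_K sub-blocks in llama.cpp are 32 wide).
    wb = w.reshape(w.shape[0], -1, block_size)   # [out_features, n_blocks, block_size]
    w_min = wb.amin(dim=-1, keepdim=True)
    w_max = wb.amax(dim=-1, keepdim=True)
    qmax = 2**n_bit - 1                          # 15 for 4-bit weights
    scale = (w_max - w_min).clamp(min=1e-6) / qmax
    q = torch.clamp(torch.round((wb - w_min) / scale), 0, qmax)
    return q, scale, w_min                       # dequant: q * scale + w_min
```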

The goal of porting the gguf format into torchao is to compose it with our existing accuracy-preserving techniques like GPTQ, AutoRound, and QAT to see if we can help improve accuracy.

Also produced https://huggingface.co/jerryzh168/phi4-mini-torchao-gguf-q4_k with this change and verified with lm-eval that it has good accuracy.

Test Plan:
python test/prototype/test_gguf_quant.py

Reviewers:

Subscribers:

Tasks:

Tags:

pytorch-bot (bot) commented Apr 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2001

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d4bb04d with merge base 3bbf42a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Apr 2, 2025
@jerryzh168 changed the title from Add gguf q4_k_s quantization to Add gguf q4_k quantization Apr 2, 2025
@jerryzh168 requested a review from larryliu0820 April 2, 2025 05:41
@jerryzh168 added the topic: not user facing label Apr 2, 2025
kimishpatel (Contributor) commented:

Haven't looked at the code, but does it also implement the super block scale?

jerryzh168 (Contributor, Author) commented Apr 2, 2025

> haven't looked at the code, but does it also implement super block scale?

Yeah, this is exactly what this PR is implementing :) a Q4_K quant that has two levels of quantization.
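
The second level, roughly (a sketch of the idea only, assuming Q4_K's layout of eight 32-element sub-blocks per 256-element super-block; not the exact code in this PR):

```python
import torch

def quantize_scales_to_6bit(sub_scales: torch.Tensor, n_bit: int = 6):
    # sub_scales: [..., 8] non-negative per-sub-block scales within one super-block.
    # Second level: store each sub-block scale as a 6-bit integer against a single
    # floating-point super-block scale (the per-sub-block mins are handled analogously).
    qmax = 2**n_bit - 1                                    # 63 for 6-bit
    super_scale = sub_scales.amax(dim=-1, keepdim=True) / qmax
    q_scales = torch.clamp(
        torch.round(sub_scales / super_scale.clamp(min=1e-9)), 0, qmax
    )
    # reconstructed sub-block scales ≈ q_scales * super_scale
    return q_scales.to(torch.uint8), super_scale
```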


import torch

from torchao.prototype.quantization.gguf import (
Contributor commented:

validate this btw by actually creating a gguf for a model and then running the resulting gguf file

jerryzh168 (Author) replied:

haven't explored how to export yet, will do in next PR



@dataclass
class GGUFWeightOnlyConfig(AOBaseConfig):
Contributor commented:

Have you checked that this is generic enough to capture all of their superblock affine schemes?

Contributor commented:

If it included the number of bits per sub-block scales and mins (both are usually the same), then it would be easier to adapt to more types.

Some types use 4-bit sub-block scales and mins (Q2_K), others 6-bit (Q3_K, Q4_K, Q5_K) and others 8-bit (Q6_K).
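
For illustration only, a config exposing those bit-widths might look something like this (field names here are hypothetical, not the PR's actual API):

```python
from dataclasses import dataclass

@dataclass
class GGUFWeightOnlyConfigSketch:
    # Hypothetical fields, purely illustrative; the real GGUFWeightOnlyConfig
    # in this PR derives from AOBaseConfig and may be shaped differently.
    dtype: str = "q4_k"
    n_bit_weight: int = 4        # 4 for Q4_K, 2 for Q2_K, 6 for Q6_K, ...
    n_bit_scale_min: int = 6     # 4 for Q2_K; 6 for Q3_K/Q4_K/Q5_K; 8 for Q6_K
```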

jerryzh168 (Author) replied:

Yeah, it will be easy to extend to other variations; I'm just starting with Q4_K for now.


@staticmethod
def __new__(
cls,
Contributor commented:

Does GGUF have a utility that can construct their packed tensors from these values?

compilade replied:

Not yet for k-quants, sorry. It is planned though, but it will take some time.

But it would not be a drop-in replacement here anyway since the gguf Python package uses Numpy for its calculations, not PyTorch. (at least for now)

jerryzh168 (Author) replied:

Thanks for the context @compilade. I think we can try to see if we can use the export script from autoround: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/convert.py#L1159-L1169

compilade replied:

> I think maybe we can try to see if we can use the export script from autoround

This will likely be very, very slow (more than 20x slower than the C version) since they re-implement make_qkx2_quants in Numpy: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L164.

Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.

I think the way you did it with separate tensors for the scales and mins does seem appropriate to avoid some of the complexity of the packing format (since from what I understand, this is intended to be some in-memory quantization used in QAT? Do correct me if I'm wrong). (the scales and mins in Q4_K are notoriously packed together in 12 bytes, but that's not relevant if this is intended as an in-memory quantization with different kernels than the ones in ggml)

You only need to worry about the packing format if you are exporting to GGUF. Dequantization is already implemented in the upstream gguf Python package for most types (including k-quants and i-quants); it's only quantization which is currently limited to Q4_0, Q4_1, Q5_0, Q5_1, and Q8_0, because k-quants in Python were too slow to pack (although this could change after ggml-org/llama.cpp#12557, which might simplify the search for scales, and mins if generalized).

It would be possible, though, to add an API to the upstream gguf Python package which would skip the search for the scales and mins but still allow quantizing to Q4_K and similar, but I'm not sure how it would/should be used.
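
(For intuition on the 12 bytes: 8 sub-block scales plus 8 sub-block mins at 6 bits each is 96 bits = 12 bytes. A naive packing sketch, purely illustrative and not GGUF's actual bit layout, which lives in ggml-quants.c:)

```python
def pack_6bit_naive(values: list[int]) -> bytes:
    # Pack 16 six-bit values (8 scales + 8 mins) into 12 bytes by simple bit
    # concatenation. Illustrative only: Q4_K's real layout arranges the high
    # bits differently (see get_scale_min_k4 in ggml-quants.c).
    assert len(values) == 16 and all(0 <= v < 64 for v in values)
    bits = 0
    for i, v in enumerate(values):
        bits |= v << (6 * i)
    return bits.to_bytes(12, "little")
```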

jerryzh168 (Author) replied Apr 4, 2025:

> This will likely be very, very slow (more than 20x slower than the C version) since they re-implement make_qkx2_quants in Numpy: intel/auto-round@eb79348/auto_round/export/export_to_gguf/quant.py#L164.
> Technically, you may not necessarily need the search for the scales and mins to pack to Q4_K, assuming you already have them and/or they are trainable parameters.

Yes, that's correct I think. We may have to adapt that code a bit for torchao quants to work, but we'll be providing scale/min from torchao (the current plan is to use GPTQ, AutoRound, or QAT), so we won't need to run that path; we will need to make sure this path is taken instead: https://github.com/intel/auto-round/blob/eb793488638105980f18ed7a40301d29d5c75ca0/auto_round/export/export_to_gguf/quant.py#L326

> if this is intended as an in-memory quantization with different kernels than the ones in ggml

We do still want to target ggml kernels in the end, but the overall goal here is to leverage existing torchao post-training accuracy-preserving techniques like GPTQ and AutoRound, as well as quantization-aware training, to see if we can improve the accuracy of various gguf quantization schemes by composing with these existing techniques (and relying on user data).

Regarding the search algorithms in gguf: yeah, I feel they are quite complicated (make_qx_quant, make_qp_quant, etc.; I also haven't looked at the imatrix stuff), and it might be error-prone for me to port them here. I also didn't see them anywhere else. @compilade, can you share how these are derived at a high level, and how we can understand them better? I.e., what are the high-level motivations/directions/constraints you are working with? Are all of them trying to minimize the rounding error? What about clamping error? Why don't you use data to improve quantization accuracy?
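
(To make the question concrete: the most generic shape of the kind of search I mean, purely as an illustration and not a port of gguf's actual algorithms, is to try a small grid of candidate scales per block and keep whichever one minimizes the squared reconstruction error.)

```python
import torch

def search_scale(w_block: torch.Tensor, n_bit: int = 4, n_candidates: int = 16):
    # Illustrative grid search (not gguf's algorithm): pick the per-block scale
    # that minimizes squared reconstruction error under symmetric rounding.
    qmax = 2 ** (n_bit - 1) - 1                           # e.g. 7 for 4-bit symmetric
    base = w_block.abs().max().clamp(min=1e-8) / qmax     # absmax-derived starting scale
    best_scale, best_err = base, float("inf")
    for k in range(n_candidates):
        scale = base * (0.75 + 0.5 * k / n_candidates)    # candidates around the base scale
        q = torch.clamp(torch.round(w_block / scale), -qmax - 1, qmax)
        err = ((q * scale - w_block) ** 2).sum().item()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```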

jerryzh168 (Contributor, Author) commented:
Thanks, I'll merge now since CI is happy; will add more docs next time.

jerryzh168 merged commit ef10f34 into pytorch:main on Apr 8, 2025
18 checks passed
jainapurva pushed a commit that referenced this pull request Apr 8, 2025
* Add gguf q4_k_s quantization


* fix

* test with phi4

* pre-commit run

* update

* run precommit

* format
Labels: CLA Signed · topic: not user facing
5 participants