Add HQQ support #605

Merged: 21 commits merged into pytorch:main on Aug 15, 2024
Conversation

mobicham (Collaborator) commented Aug 6, 2024

Adds support for HQQ quantization without using the hqq lib.

Note:

The dequantized output produced by AffineQuantizedTensor is slightly worse than the output of the hqq lib; you can verify this by setting raw_output=True. The issue comes from the mid-point used in AffineQuantizedTensor's dequantization logic, which produces zero-point values with very low magnitude.
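For readers unfamiliar with the two parameterizations being compared, here is a minimal, self-contained sketch (not the AffineQuantizedTensor code; the variable names, 4-bit range, and mid-point formula are illustrative assumptions) showing how an integer zero-point maps onto a mid-point / float-zero form, and why the float zero can end up with small magnitude when the integer zero sits near the mid-point:

```python
import torch

# Illustrative 4-bit group; names and the mid-point formula are assumptions,
# not the actual AffineQuantizedTensor implementation.
nbits = 4
quant_min, quant_max = 0, 2**nbits - 1
mid_point = (quant_max + quant_min + 1) / 2  # 8.0 for 4-bit

W = torch.randn(2, 8)                                   # a tiny weight group
scale = (W.max() - W.min()) / (quant_max - quant_min)
zero = quant_min - W.min() / scale                      # integer-domain zero-point
Wq = torch.clamp(torch.round(W / scale + zero), quant_min, quant_max)

# Convention 1: integer zero-point, as optimized by the hqq lib
W_dq_int_zero = (Wq - zero) * scale

# Convention 2: mid-point with a float-domain zero (tinygemm-style).
# If the integer zero is close to the mid-point, the float zero below has
# very small magnitude, which is the accuracy concern noted above.
zero_float = (mid_point - zero) * scale
W_dq_mid = (Wq - mid_point) * scale + zero_float

# Both parameterizations reconstruct the same values (up to float rounding).
assert torch.allclose(W_dq_int_zero, W_dq_mid, atol=1e-5)
```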

pytorch-bot (bot) commented Aug 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/605

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0b511f5 with merge base ffa88a4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot commented

Hi @mobicham!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient, and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

facebook-github-bot commented

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 6, 2024
@supriyar supriyar requested a review from HDCharles August 6, 2024 23:19
mobicham (Collaborator, Author) commented Aug 8, 2024

[image: hqq_ao (attached lm-eval results)]

Llama3.1 8B Instruct lm-eval: the individual benchmark scores are a bit different (probably because of the mid-point issue), but the average score is the same.

HDCharles (Contributor) left a comment

Can you modify https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py#L384 to take a new argument use_hqq, defaulting to False, that applies the hqq quantization, so that it aligns with the existing API?
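For illustration, a self-contained sketch of the suggested API shape: the same int4_weight_only entry point with an extra use_hqq flag defaulting to False. The body is a toy min/max group-wise quantizer with the HQQ branch only marked; it is not the torchao implementation.

```python
import torch

def int4_weight_only(group_size: int = 128, use_hqq: bool = False):
    """Sketch of the suggested signature; not the torchao implementation."""
    def apply_int4_weight_only_quant(weight: torch.Tensor) -> torch.Tensor:
        # expects a 2D weight with in_features divisible by group_size
        out_features, in_features = weight.shape
        w = weight.reshape(-1, group_size)
        if use_hqq:
            # the real code would refine scale/zero with the HQQ optimizer
            # added by this PR before packing; omitted in this sketch
            pass
        qmin, qmax = 0, 15  # 4-bit unsigned range
        scale = (w.amax(dim=1, keepdim=True) - w.amin(dim=1, keepdim=True)) / (qmax - qmin)
        scale = scale.clamp(min=1e-8)
        zero = qmin - w.amin(dim=1, keepdim=True) / scale
        wq = torch.clamp(torch.round(w / scale + zero), qmin, qmax)
        # return the dequantized weight, for illustration only
        return ((wq - zero) * scale).reshape(out_features, in_features)
    return apply_int4_weight_only_quant

# usage: dq = int4_weight_only(group_size=64, use_hqq=False)(torch.randn(32, 128))
```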

jerryzh168 (Contributor) commented

> The issue comes from the mid-point used in AffineQuantizedTensor's dequantization logic, which produces zero-point values with very low magnitude.

Is this about accuracy? I can see you'd need this conversion for int4 tinygemm, but it's not required for the other cases. If the difference is significant, maybe you can expose raw_output as a flag in the top-level quant_api for hqq.

@mobicham mobicham marked this pull request as ready for review August 9, 2024 08:25
mobicham (Collaborator, Author) commented Aug 9, 2024

@jerryzh168 we can keep it this way for the moment. The other solution would be to modify the alternate minimization so it updates the quant parameters based on the mid-point format instead of (Wq - zero)*scale. For now, the average lm-eval score is about the same as hqq-lib, so I think it's mostly fine. If performance issues come up, I can revisit it and change the optimization steps to work with the mid-point approach, which is needed for the tinygemm kernel.
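For context, a simplified sketch of the kind of alternating minimization being referred to, written against the (Wq - zero)*scale convention. It follows the general half-quadratic scheme used by the hqq library (a shrinkage step on the residual, then a closed-form zero-point update); the hyper-parameters are illustrative, and early stopping and dtype handling are omitted, so this is not the actual torchao or hqq code.

```python
import torch

def optimize_zero_hqq(w, scale, zero, qmin=0, qmax=15,
                      lp_norm=0.7, beta=10.0, kappa=1.01, iters=20):
    """Simplified HQQ-style alternating update of the zero-point.

    Approximately minimizes ||w - (Wq - zero) * scale||_p with p < 1 via
    half-quadratic splitting: shrink the residual, then update the
    zero-point in closed form. The scale is kept fixed.
    """
    for _ in range(iters):
        wq = torch.clamp(torch.round(w / scale + zero), qmin, qmax)
        w_r = (wq - zero) * scale                      # current reconstruction
        x = w - w_r                                    # residual
        # generalized soft-thresholding (shrinkage) on the residual
        w_e = torch.sign(x) * torch.relu(
            torch.abs(x) - (lp_norm / beta) * torch.abs(x).pow(lp_norm - 1)
        )
        # closed-form zero-point update given the shrunken residual
        zero = torch.mean(wq - (w - w_e) / scale, dim=1, keepdim=True)
        beta *= kappa
    return zero

# usage (per-group, 4-bit): w is [n_groups, group_size], scale and zero are [n_groups, 1]
```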

jerryzh168 (Contributor) commented

@mobicham thanks. For tests, right now we support both unittest.TestCase and pytest (e.g. @pytest.mark.parametrize("bit_width", bit_widths)). What you have in the file makes sense, I think; since you have hardcoded errors for each dtype, I feel it's fine to just separate them into different tests.
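For reference, a minimal example of the pytest style mentioned above; the test body and names are illustrative, not the actual test file from this PR.

```python
import pytest
import torch

bit_widths = [4, 8]

@pytest.mark.parametrize("bit_width", bit_widths)
def test_affine_roundtrip(bit_width):
    qmin, qmax = 0, 2**bit_width - 1
    w = torch.randn(16, 64)
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero = qmin - w.min() / scale
    wq = torch.clamp(torch.round(w / scale + zero), qmin, qmax)
    w_dq = (wq - zero) * scale
    # round-to-nearest error is bounded by half a quantization step
    assert (w - w_dq).abs().max() <= scale / 2 + 1e-4
```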

mobicham (Collaborator, Author) commented

> @mobicham thanks. For tests, right now we support both unittest.TestCase and pytest [...]

I separated the tests into separate classes and added a check for the device; otherwise the tests would fail if no GPU is available.
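A minimal sketch of that kind of device guard in the unittest style; the class and test names are illustrative, not the ones used in the PR.

```python
import unittest
import torch

cuda_available = torch.cuda.is_available()

@unittest.skipIf(not cuda_available, "CUDA not available")
class TestHQQ4bit(unittest.TestCase):  # illustrative name
    def test_hqq_plain_4bit(self):
        w = torch.randn(32, 128, device="cuda")
        # ... quantize, dequantize, compare against a hardcoded error threshold ...
        self.assertTrue(w.is_cuda)

if __name__ == "__main__":
    unittest.main()
```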

mobicham (Collaborator, Author) commented

@jerryzh168 anything missing for the merge?

msaroufim (Member) commented

IIRC there was some test failure; I just retriggered CI, and if it's all green I'll merge.

mobicham (Collaborator, Author) commented Aug 15, 2024

@msaroufim the tests fail because of triton. I was about to delete torchao/prototype/hqq/core.py, since we moved the hqq core code into quantization, but I just saw that this PR uses it: #679

  test/hqq/test_hqq_affine.py:3: in <module>
      from torchao.prototype.hqq.core import HQQQuantizer
  /opt/conda/envs/venv/lib/python3.9/site-packages/torchao/prototype/hqq/__init__.py:1: in <module>
      from .mixed_mm import triton_mixed_mm, pack_2xint4
  /opt/conda/envs/venv/lib/python3.9/site-packages/torchao/prototype/hqq/mixed_mm.py:2: in <module>
      import triton.language as tl
  E   ModuleNotFoundError: No module named 'triton'
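One common way to avoid this kind of crash on machines without triton is to guard the import in the prototype package's __init__.py; a sketch under that assumption (the imported names come from the traceback above, but this is not necessarily the fix that was applied, and the HAS_TRITON flag is illustrative).

```python
# torchao/prototype/hqq/__init__.py (sketch)
# Only pull in the triton-backed kernels when triton is actually importable,
# so CPU-only environments can still import the package.
try:
    import triton  # noqa: F401
    from .mixed_mm import triton_mixed_mm, pack_2xint4
    HAS_TRITON = True
except ImportError:
    HAS_TRITON = False
```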

msaroufim (Member) commented Aug 15, 2024

I do see some legit-looking errors for GPU: https://github.com/pytorch/ao/actions/runs/10403091060/job/28816346879?pr=605#step:12:2570

For CPU, I'm not sure I follow the connection to the version-guard fix PR. Basically we need to ensure the HQQ quantizer doesn't get imported by accident on CPU machines, so we don't crash on the triton import, and that the HQQ quantizer tests are skipped on CPU instances.

Also, I gave you access to trigger CI yourself, so you'll get signal per commit (it's just an annoying restriction for first-time contributors).

Some of the differences you might see between local and CI runs are due to us running multiple PyTorch versions in CI, so feel free to add skips for older PyTorch versions in case your code doesn't work there.

EDIT: Discussed offline to skip def test_dynamic_quant_per_channel_numerics_cuda(self):
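A minimal sketch of such a version skip using a plain torch.__version__ check, so it stays self-contained (torchao also ships version-guard helpers, but their exact names are not shown in this thread; the class name and minimum version here are illustrative).

```python
import unittest
import torch

def torch_version_at_least(min_version: str) -> bool:
    # crude comparison on the numeric prefix of torch.__version__, e.g. "2.4.0+cu121"
    current = tuple(int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
    wanted = tuple(int(x) for x in min_version.split(".")[:2])
    return current >= wanted

@unittest.skipIf(not torch_version_at_least("2.3"), "requires PyTorch >= 2.3")
class TestNewerTorchOnly(unittest.TestCase):  # illustrative
    def test_something(self):
        self.assertTrue(True)
```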

mobicham (Collaborator, Author) commented

It looks like it's the tensorcore test that is failing, not the rest:

  test/hqq/test_hqq_affine.py::TestHQQ4bit::test_hqq_tensorcore_4bit FAILED

It was working fine just two days ago, I think; let me clone and recheck on an instance.

mobicham (Collaborator, Author) commented Aug 15, 2024

> For CPU, I'm not sure I follow the connection to the version-guard fix PR. Basically we need to ensure the HQQ quantizer doesn't get imported by accident on CPU machines, so we don't crash on the triton import.

It's not supposed to be imported; I wanted to delete torchao/prototype/hqq/core.py because we moved everything there to torchao/quantization/quant_primitives.py. I will take a look at it.

@msaroufim msaroufim merged commit 18e38f1 into pytorch:main Aug 15, 2024
16 checks passed
@@ -389,7 +389,7 @@ def int4_weight_only(group_size=128, layout_type=TensorCoreTiledLayoutType(inner
     size is more fine grained, choices are [256, 128, 64, 32]
     `layout_type`: layout type for quantized tensor, default is `TensorCoreTiledLayoutType(inner_k_tiles=8)`
     """
-    def apply_int4_weight_only_quant(weight):
+    def apply_int4_weight_only_quant(weight, use_hqq=False):
Contributor review comment

I just found that this flag is not used, so we don't really expose hqq to users right now. Are you planning to create a new function for hqq? cc @mobicham

mobicham (Collaborator, Author) replied

My understanding was that @HDCharles suggested putting it there and later turning it on by default.
It is exposed, though, via to_affine_quantized: https://github.com/pytorch/ao/pull/605/files#diff-a9708dc28f15bb9cf665417e6c66601f9e8e2f1f672d1858603b74fa879a3357R62
Let me know if there's another way of exposing it.
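For readers following along, the user-facing call shape that eventually lands (per the follow-up commit message quoted below) looks roughly like this; the toy model setup is an assumption and targets a CUDA device with bfloat16 weights for the tinygemm path.

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# toy model; any module containing nn.Linear layers works
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(device="cuda", dtype=torch.bfloat16)

# use_hqq=True switches the int4 weight-only path to HQQ-optimized quant params
quantize_(model, int4_weight_only(group_size=64, use_hqq=True))
```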

Contributor review comment

commented in #786 (comment)

jerryzh168 added a commit to jerryzh168/ao that referenced this pull request Aug 31, 2024
Summary:
att, this is a follow up for pytorch#605 to make hqq available in quantize_ API

`quantize_(model, int4_weight_only(group_size, use_hqq=True))`

Test Plan:

python generate.py --compile --quantization int4wo-hqq-64 --precision bfloat16
Average tokens/sec: 195.24
Average Bandwidth: 729.40 GB/s
Peak Memory Usage: 5.09 GB
Model Size: 3.74 GB

python eval.py --compile --quantization int4wo-hqq-64 --precision bfloat16

wikitext: {'word_perplexity,none': 12.823631773497512, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.611400903914048, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.6883154699192412, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

jerryzh168 added further commits to jerryzh168/ao that referenced this pull request on Sep 4 and Sep 6, 2024, with the same commit message as above.

jerryzh168 added a commit that referenced this pull request Sep 6, 2024: "Expose hqq through `int4_weight_only` API", with the same commit message as above.

andrewor14 pushed a commit that referenced this pull request Sep 6, 2024: "Expose hqq through `int4_weight_only` API", with the same commit message as above.

jainapurva pushed a commit that referenced this pull request Sep 9, 2024: "Expose hqq through `int4_weight_only` API", with the same commit message as above.