[CPU] Introduce Int4OpaqueTensor to replace Int4CPULayout in AQT #2798
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2798
Note: Links to docs will display an error until the docs builds have been completed. No failures as of commit e9a5fa7 with merge base 9056c46. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull Request Overview
This PR introduces Int4WoqCpuTensor as a replacement for AQT tensor with Int4CPULayout to support int4 weight-only quantization on CPU with groupwise quantization.
Key changes:
- Adds a new Int4WoqCpuTensor class for CPU-specific int4 weight-only quantization
- Integrates the new tensor type into the quantization API and workflow system
- Adds comprehensive test coverage for the new tensor implementation
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py | Implements the core Int4WoqCpuTensor class with CPU-optimized int4 quantization |
| torchao/quantization/quantize_/workflows/__init__.py | Adds export for the new tensor class |
| torchao/quantization/quantize_/common/packing_format.py | Adds INT4_WOQ_CPU packing format enum value |
| torchao/quantization/quant_api.py | Integrates new tensor into quantization workflow |
| torchao/quantization/__init__.py | Adds public API export for the tensor class |
| test/quantization/quantize_/workflows/int4/test_int4_woq_cpu_tensor.py | Comprehensive test suite for the new tensor implementation |
| """ | ||
| int4_woq_cpu is referring to the format used by int4 weight-only quantization on CPU, which is a groupwise quantization format. | ||
| """ | ||
| INT4_WOQ_CPU = "int4_woq_cpu" |
nit: we need to change the name to describe how the quantized data is laid out / packed I think
Sure. I have changed it to int4_tinygemm_cpu
| packed_weight = weight_tensor.qdata.contiguous()
| scale_and_zero = weight_tensor.qscale_and_zero.contiguous()
are these contiguous calls needed?
Removed. Thanks.
| @unittest.skipIf(not torch_version_at_least("2.6.0"), "Need pytorch 2.6+")
| class TestInt4WoqCpuTensor(TestCase):
|     @parametrize("group_size", [32, 64, 128])
|     def test_linear(self, group_size):
can you align with:
| def test_linear(self, config, sizes):
Sure. Updated.
| For data locality, we preshuffle the data in plain layout (N, K/2) to (N/block_n, K, block_n/2), where block_n = 64. And when packing
| the last dimension, data are shuffled by lanes before packing two int4 to one int8:
| block_n = 64 = 16 * 4, so we have 4 lanes, each lane has 16 int4s = [lane0, lane1, lane2, lane3]. We pack them as [lane0|lane2, lane1|lane3].
| See https://github.com/pytorch/pytorch/blob/32eee8ed225d9f10fbbcb38c24b8b44c24c0c97c/aten/src/ATen/native/cpu/int4mm_kernel.cpp#L583 for more details.
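To make the lane shuffle above concrete, here is a minimal sketch (not the kernel code) that packs one block of block_n = 64 int4 values; which lane lands in the low vs. high nibble is an assumption here, the actual nibble order is defined in int4mm_kernel.cpp.

```python
import torch

def pack_block_n64_lanes(int4_vals: torch.Tensor) -> torch.Tensor:
    """Sketch of the lane-shuffled packing for one block of block_n = 64 int4
    values (stored here as uint8 in [0, 15]).

    Assumption: lane0/lane1 go into the low nibble and lane2/lane3 into the
    high nibble; the real kernel may use the opposite order.
    """
    assert int4_vals.shape[-1] == 64 and int4_vals.dtype == torch.uint8
    # 64 = 16 * 4: split into 4 lanes of 16 int4 values each.
    lane0, lane1, lane2, lane3 = int4_vals.split(16, dim=-1)
    # Pack as [lane0|lane2, lane1|lane3]: two int4 values -> one byte.
    packed_02 = lane0 | (lane2 << 4)
    packed_13 = lane1 | (lane3 << 4)
    # 64 int4 values become 32 bytes, i.e. the block_n/2 of the (N/block_n, K, block_n/2) layout.
    return torch.cat([packed_02, packed_13], dim=-1)

vals = torch.randint(0, 16, (64,), dtype=torch.uint8)
print(pack_block_n64_lanes(vals).shape)  # torch.Size([32])
```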
also, if this is based on the hardware at quantization time, what do we do if users quantize the model on one CPU and want to run it on another CPU?
Thanks for the question. The packing and compute kernels can work together across different machines, but performance is not guaranteed.
we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since the packing format is supposed to have a fixed layout, will get back to you
Ok, sure.
BTW, how does CUDA handle data layout on different platforms? Does CUDA use the same layout on all platforms? Thanks.
packing is typically specific to the kernel I think. For int4, right now we have tensor core tiled packing (for the tinygemm kernel), preshuffled packing for the fbgemm preshuffled kernel, and another plain packing for the fbgemm non-preshuffled kernel. The main thing is that it's a fixed format that can be explained and understood in torchao.
I see. So how about requiring AVX512?
that sounds OK, although I saw you have other hardware situations as well, like AVX2 and non-vectorized, so we need one for each packing. Please check the Slack message; we need your input on the refactor effort.
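For reference, a small sketch (not torchao code) of how one could pick a packing path from the detected CPU capability at quantization time; the capability-to-packing mapping below is purely illustrative and the packing names are assumptions.

```python
import torch

# torch.backends.cpu.get_cpu_capability() reports the ISA level PyTorch was
# dispatched for, e.g. "AVX512", "AVX2", or "DEFAULT" (non-vectorized).
capability = torch.backends.cpu.get_cpu_capability()

# Hypothetical mapping: one fixed packing format per hardware situation.
if capability == "AVX512":
    packing = "int4_opaque_avx512"   # assumed name, not a real torchao format
elif capability == "AVX2":
    packing = "int4_opaque_avx2"     # assumed name, not a real torchao format
else:
    packing = "int4_opaque_scalar"   # assumed name, not a real torchao format

print(f"{capability} -> {packing}")
```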
| # groupwise int4 quantization
| groupsize = weight_tensor.block_size[1]
| y = torch.ops.aten._weight_int4pack_mm_for_cpu(
|     act_mat.contiguous(), packed_weight, groupsize, scale_and_zero
is this contiguous call needed?
I think so. We assume input is contiguous in the kernel.
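As a side note, a tiny illustration (unrelated to the PR code) of why the contiguous() call matters: a transposed or sliced activation is a strided view, and contiguous() materializes it into the row-major buffer that a kernel like this assumes.

```python
import torch

act = torch.randn(16, 8).t()    # transposed view: same storage, non-contiguous strides
print(act.is_contiguous())      # False

act_c = act.contiguous()        # copies into a fresh row-major buffer
print(act_c.is_contiguous())    # True: safe for a kernel that assumes contiguity
```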
| quantized_and_compiled = compiled_linear(input)
| self.assertTrue(compute_error(original, quantized_and_compiled) > 20)
|
| @parametrize("dtype", [torch.float, torch.bfloat16, torch.half])
nit: torch.float32, torch.bfloat16, torch.float16 will be clearer I feel
Thanks. Updated.
Hi @Xia-Weiwen can you use the Opaque packing format from #2878 for the tensor? since this does not have a fixed format
Thanks. Shall I use it or add a new one? What if we have more opaque formats in the future?
just use the same one I think, we can add more Opaque formats if more are needed I feel
It's OK for me. However, the name sounds too general. Anyway, I will change the names to opaque; please check if that looks good to you. Thanks.
Hi @jerryzh168 Please review again. Thanks.
sorry, this should be Int4OpaqueTensor, can you update?
Ok, sure.
It's updated
| This is an opaque tensor subclass, the packing format is not exposed to the rest of the system. See the note below for more details.
| Tensor Attributes:
| qdata: preshuffled and packed int4 weight for tinygemm, always viewed as a 2D (N, K/2) tensor, last dimension is packed
also tinygemm is a GPU library I think, is this really related to tinygemm?
We are reusing the "tinygemm" name for CPU in torch core, I think. I can change it to the following if it's OK with you:
qdata: preshuffled and packed int4 weight for CPU tinygemm, always viewed as a 2D (N, K/2) tensor, ...
Updated.
OK, sounds good
LGTM, thanks!
Summary
This PR adds Int4OpaqueTensor to replace the AQT tensor with Int4CPULayout since AQT will be deprecated.
Test plan
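For context, a hypothetical end-to-end sketch of how the new tensor might be reached through the quantize_ API; quantize_ and Int4WeightOnlyConfig are real torchao entry points, but the exact argument names used to select the opaque packing (packing_format, version) are assumptions and may differ from the merged API.

```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

model = torch.nn.Sequential(torch.nn.Linear(256, 512, dtype=torch.bfloat16))

# Assumed config arguments for selecting the opaque CPU packing; check the
# torchao tests for the actual names.
config = Int4WeightOnlyConfig(group_size=128, packing_format="opaque", version=2)
quantize_(model, config)

x = torch.randn(4, 256, dtype=torch.bfloat16)
y = model(x)
print(type(model[0].weight))  # expected: Int4OpaqueTensor
```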