
Conversation

@Xia-Weiwen (Collaborator) commented Aug 19, 2025

Summary
This PR adds Int4OpaqueTensor to replace the AQT tensor with Int4CPULayout since AQT will be deprecated.

Test plan

pytest -sv test/quantization/quantize_/workflows/int4/test_int4_opaque_tensor.py
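
For context, a minimal usage sketch of the intended workflow (not part of this PR's diff; the config name and the int4_packing_format argument are assumptions, so check torchao.quantization.quant_api for the actual API):

# Hypothetical usage sketch, not code from this PR.
# Int4WeightOnlyConfig and the int4_packing_format="opaque" argument are assumptions
# based on the files this PR touches; verify against the current torchao API.
import torch
from torchao.quantization import quantize_, Int4WeightOnlyConfig

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024, dtype=torch.bfloat16))
quantize_(model, Int4WeightOnlyConfig(group_size=128, int4_packing_format="opaque"))

x = torch.randn(4, 1024, dtype=torch.bfloat16)
with torch.no_grad():
    y = model(x)  # runs the CPU int4 weight-only path backed by Int4OpaqueTensor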

@pytorch-bot bot commented Aug 19, 2025

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2798

✅ No Failures

As of commit e9a5fa7 with merge base 9056c46, there are no failures. This comment was automatically generated by Dr. CI.

@meta-cla meta-cla bot added the CLA Signed label Aug 19, 2025
@Xia-Weiwen Xia-Weiwen requested a review from Copilot August 19, 2025 03:19
@Xia-Weiwen Xia-Weiwen added the topic: new feature label Aug 19, 2025


@Xia-Weiwen Xia-Weiwen requested a review from Copilot August 19, 2025 03:26
Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR introduces Int4WoqCpuTensor as a replacement for AQT tensor with Int4CPULayout to support int4 weight-only quantization on CPU with groupwise quantization.

Key changes:

  • Adds new Int4WoqCpuTensor class for CPU-specific int4 weight-only quantization
  • Integrates the new tensor type into the quantization API and workflow system
  • Adds comprehensive test coverage for the new tensor implementation

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • torchao/quantization/quantize_/workflows/int4/int4_woq_cpu_tensor.py: implements the core Int4WoqCpuTensor class with CPU-optimized int4 quantization
  • torchao/quantization/quantize_/workflows/__init__.py: adds an export for the new tensor class
  • torchao/quantization/quantize_/common/packing_format.py: adds the INT4_WOQ_CPU packing format enum value
  • torchao/quantization/quant_api.py: integrates the new tensor into the quantization workflow
  • torchao/quantization/__init__.py: adds a public API export for the tensor class
  • test/quantization/quantize_/workflows/int4/test_int4_woq_cpu_tensor.py: comprehensive test suite for the new tensor implementation


@Xia-Weiwen Xia-Weiwen marked this pull request as ready for review August 19, 2025 03:30
@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 August 19, 2025 03:30
"""
int4_woq_cpu is referring to the format used by int4 weight-only quantization on CPU, which is a groupwise quantization format.
"""
INT4_WOQ_CPU = "int4_woq_cpu"
Contributor:

nit: we need to change the name to describe how the quantized data is laid out / packed I think

Collaborator Author:

Sure. I have changed it to int4_tinygemm_cpu.

Comment on lines 168 to 169
packed_weight = weight_tensor.qdata.contiguous()
scale_and_zero = weight_tensor.qscale_and_zero.contiguous()
Contributor:

Are these contiguous calls needed?

Collaborator Author:

Removed. Thanks.

@unittest.skipIf(not torch_version_at_least("2.6.0"), "Need pytorch 2.6+")
class TestInt4WoqCpuTensor(TestCase):
    @parametrize("group_size", [32, 64, 128])
    def test_linear(self, group_size):
Contributor:

can you align with

to test more input shapes and also add a compile test as well

Collaborator Author:

Sure. Updated.

For data locality, we preshuffle the data from the plain layout (N, K/2) to (N/block_n, K, block_n/2), where block_n = 64. When packing
the last dimension, data are shuffled by lanes before packing two int4 values into one int8:
block_n = 64 = 16 * 4, so we have 4 lanes of 16 int4 values each = [lane0, lane1, lane2, lane3]. We pack them as [lane0|lane2, lane1|lane3].
See https://github.com/pytorch/pytorch/blob/32eee8ed225d9f10fbbcb38c24b8b44c24c0c97c/aten/src/ATen/native/cpu/int4mm_kernel.cpp#L583 for more details.
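
For illustration, a minimal Python sketch of the lane shuffle described above for a single block of block_n = 64 int4 values (not the actual kernel code; which lane lands in the low vs. high nibble is an assumption here, so see the linked int4mm_kernel.cpp for the authoritative layout):

# Illustrative sketch of the [lane0|lane2, lane1|lane3] packing for one block.
import torch

block_n = 64
vals = torch.arange(block_n, dtype=torch.uint8) & 0xF   # 64 fake int4 values
lanes = vals.reshape(4, 16)                              # [lane0, lane1, lane2, lane3]

packed = torch.cat(
    [
        lanes[0] | (lanes[2] << 4),   # first 16 bytes: lane0 paired with lane2
        lanes[1] | (lanes[3] << 4),   # next 16 bytes: lane1 paired with lane3
    ]
)
assert packed.shape == (block_n // 2,)   # matches the block_n/2 packed dimension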
Contributor:
Also, if this is based on the hardware at quantization time, what do we do if users quantize the model on one CPU and want to run it on another CPU?

Collaborator Author:

Thanks for the question. The packing and compute kernels can work together across different machines, but performance is not guaranteed.

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 August 20, 2025 02:43
@Xia-Weiwen Xia-Weiwen changed the title [CPU] Introduce Int4WoqCpuTensor to replace Int4CPULayout in AQT [CPU] Introduce Int4TinyGemmCpuTensor to replace Int4CPULayout in AQT Aug 20, 2025
@jerryzh168 (Contributor):

we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you

@Xia-Weiwen (Collaborator Author):

we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you

Ok, sure.

@Xia-Weiwen (Collaborator Author) commented Aug 21, 2025:

we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you

BTW, how does CUDA handle data layout on different platforms? Does CUDA use the same layout on all platforms? Thanks.
I think we can add a restriction that this layout is only available on platforms with AVX512 support.

@jerryzh168 (Contributor):

we will discuss what to do with this autopacking stuff, it doesn't fit into the packing format abstraction since packing format is supposed to have a fixed layout, will get back to you

BTW, how does CUDA handle data layout on different platforms? Does CUDA use the same layout on all platforms? Thanks. I think we can add a restriction that this layout is only available on platforms with AVX512 support.

Packing is typically specific to a kernel, I think. For int4, right now we have tensor-core tiled packing (for the tinygemm kernel), preshuffled packing for the fbgemm preshuffled kernel, and plain packing for the fbgemm non-preshuffled kernel.

main thing is it's a fixed format and can be explained and understood in torchao
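
To make the fixed-format idea concrete, here is a minimal illustrative sketch of how a packing-format enum can map each fixed format to the tensor implementation that understands it (names other than Int4OpaqueTensor are placeholders, not the actual torchao definitions):

# Illustrative sketch only; names and mappings are assumptions for illustration.
from enum import Enum

class PackingFormat(str, Enum):
    PLAIN = "plain"                            # unpacked layout, non-preshuffled kernel
    PRESHUFFLED = "preshuffled"                # fbgemm preshuffled kernel
    TILE_PACKED_TO_4D = "tile_packed_to_4d"    # tensor-core tiled packing (tinygemm, GPU)
    OPAQUE = "opaque"                          # layout not exposed, e.g. per-ISA CPU packing

def select_tensor_impl(fmt: PackingFormat) -> str:
    # A real dispatch would return a tensor subclass; strings keep the sketch self-contained.
    return {
        PackingFormat.PLAIN: "Int4PlainTensor",
        PackingFormat.PRESHUFFLED: "Int4PreshuffledTensor",
        PackingFormat.TILE_PACKED_TO_4D: "Int4TilePackedTensor",
        PackingFormat.OPAQUE: "Int4OpaqueTensor",
    }[fmt]

print(select_tensor_impl(PackingFormat.OPAQUE))  # -> Int4OpaqueTensor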

@Xia-Weiwen (Collaborator Author):

main thing is it's a fixed format and can be explained and understood in torchao

I see. So how about requiring AVX512?

@jerryzh168 (Contributor):

main thing is it's a fixed format and can be explained and understood in torchao

I see. So how about requiring AVX512?

That sounds OK, although I saw you have other hardware situations as well, like AVX2 and non-vectorized, so we would need one format for each packing. Please check the Slack message; we need your input on the refactor effort.

# groupwise int4 quantization
groupsize = weight_tensor.block_size[1]
y = torch.ops.aten._weight_int4pack_mm_for_cpu(
    act_mat.contiguous(), packed_weight, groupsize, scale_and_zero
Contributor:

Is this contiguous call needed?

Collaborator Author:

I think so. We assume input is contiguous in the kernel.
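
A quick illustration of why the call is cheap when the activation is already contiguous: PyTorch returns the same tensor object and only copies for non-contiguous inputs (standalone example, not code from this PR):

import torch

x = torch.randn(8, 16)
y = x.t()                       # transposed view, not contiguous
print(x.is_contiguous())        # True
print(y.is_contiguous())        # False
print(x.contiguous() is x)      # True: no-op for an already-contiguous tensor
print(y.contiguous() is y)      # False: a new contiguous copy is allocated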

        quantized_and_compiled = compiled_linear(input)
        self.assertTrue(compute_error(original, quantized_and_compiled) > 20)

    @parametrize("dtype", [torch.float, torch.bfloat16, torch.half])
Contributor:
nit: torch.float32, torch.bfloat16, torch.float16 will be clearer I feel

Collaborator Author:
Thanks. Updated.

@jerryzh168 (Contributor):

Hi @Xia-Weiwen, can you use the Opaque packing format from #2878 for the tensor, since this does not have a fixed format?

@Xia-Weiwen (Collaborator Author) commented Aug 26, 2025:

Hi @Xia-Weiwen, can you use the Opaque packing format from #2878 for the tensor, since this does not have a fixed format?

Thanks. Shall I use it or add a new one? What if we have more opaque formats in the future?

@jerryzh168 (Contributor):

Hi @Xia-Weiwen, can you use the Opaque packing format from #2878 for the tensor, since this does not have a fixed format?

Thanks. Shall I use it or add a new one? What if we have more opaque formats in the future?

Just use the same one, I think; we can add more Opaque formats if more are needed.

@Xia-Weiwen (Collaborator Author):

Just use the same one, I think; we can add more Opaque formats if more are needed.

It's OK with me. However, the name sounds too general. Anyway, I will change the names to opaque; then please check if that looks good to you. Thanks.

@Xia-Weiwen Xia-Weiwen changed the title [CPU] Introduce Int4TinyGemmCpuTensor to replace Int4CPULayout in AQT [CPU] Introduce OpaqueTensor to replace Int4CPULayout in AQT Aug 26, 2025
@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 August 26, 2025 02:34
@Xia-Weiwen (Collaborator Author):

Hi @jerryzh168 Please review again. Thanks.

Contributor:

Sorry, this should be Int4OpaqueTensor, can you update?

Collaborator Author:

Ok, sure.

Collaborator Author:

It's updated.

This is an opaque tensor subclass; the packing format is not exposed to the rest of the system. See the note below for more details.
Tensor Attributes:
qdata: preshuffled and packed int4 weight for tinygemm, always viewed as a 2D (N, K/2) tensor, last dimension is packed
Contributor:

Also, tinygemm is a GPU library, I think. Is this really related to tinygemm?

@Xia-Weiwen (Collaborator Author) commented Aug 26, 2025:

We are reusing the "tinygemm" name for CPU in torch core, I think. I can change it to the following if it's OK with you:
qdata: preshuffled and packed int4 weight for CPU tinygemm, always viewed as a 2D (N, K/2) tensor, ...

Collaborator Author:

Updated.

Contributor:
OK, sounds good

@Xia-Weiwen Xia-Weiwen requested a review from jerryzh168 August 26, 2025 05:07
@Xia-Weiwen Xia-Weiwen changed the title [CPU] Introduce OpaqueTensor to replace Int4CPULayout in AQT [CPU] Introduce Int4OpaqueTensor to replace Int4CPULayout in AQT Aug 26, 2025
@jerryzh168 (Contributor) left a comment:

LGTM, thanks!

@Xia-Weiwen Xia-Weiwen merged commit 6f035e8 into pytorch:main Aug 27, 2025
18 checks passed