Add Int4CPULayout and update int4 woq #1278

Merged
merged 5 commits into from
Nov 27, 2024

Conversation

yanbing-j
Contributor

pytorch/pytorch#139611 is merged into PyTorch main branch.

pytorch-bot bot commented Nov 13, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1278

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Nov 13, 2024
@yanbing-j marked this pull request as ready for review November 14, 2024 02:47
@jerryzh168
Contributor

By the way, we are refactoring the file structure in #1234. It might be good to rebase after that lands.


__torch_function__ = torch._C._disabled_torch_function_impl

def get_plain(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
Contributor

we have an unpack op for tensor core tiled layout now, so this can actually be replaced with a call to the op:

m.impl("torchao::unpack_tensor_core_tiled_layout", &_unpack_tensor_core_tiled_layout);
m.impl("torchao::dequantize_tensor_core_tiled_layout", &_dequantize_tensor_core_tiled_layout);

do you plan to write similar ops for cpu?

Contributor Author

I have noticed this, but I don't have the bandwidth to do it right now. If this feature is not urgent for you, I can take on the task later.

cc @mingfeima

Contributor

that would be great, thanks @yanbing-j , this is not urgent

Comment on lines 405 to 410
        # if int_data_device_type == "mps":
        #     int_data = int_data.cpu()
        if int_data_device_type != "cpu":
            int_data = (int_data[::, ::2] << 4 | int_data[::, 1::2]).to(torch.uint8)
        # if int_data_device_type == "mps":
        #     int_data = int_data.to(device="mps")
Contributor

please remove the code that's commented out

is this equivalent to previous code?

Contributor Author

According to #517 (comment), << works on the MPS backend, so there is no need to move the tensor to CPU and run the shift there. Since I don't have an MPS machine, I want to use CI to check whether this works. If it doesn't, I can switch to int_data = (torch.bitwise_left_shift(int_data[::, ::2], 4) | int_data[::, 1::2]).to(torch.uint8) instead.
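
For readers unfamiliar with the packing expression discussed here, the following is a minimal pure-Python sketch of what it computes for a single row: each pair of adjacent int4 values is stored in one byte, with the even-indexed value in the high nibble and the odd-indexed value in the low nibble. The helper names pack_int4_row and unpack_int4_row are hypothetical and only mirror the tensor expression; they are not torchao functions.

```python
def pack_int4_row(row):
    """Pack a row of int4 values (each in 0..15) into bytes, two values per byte."""
    if len(row) % 2 != 0:
        raise ValueError("row length must be even")
    if any(not 0 <= v <= 15 for v in row):
        raise ValueError("values must fit in 4 bits")
    # Mirrors (int_data[::, ::2] << 4 | int_data[::, 1::2]) for a single row.
    return [(hi << 4) | lo for hi, lo in zip(row[::2], row[1::2])]


def unpack_int4_row(packed):
    """Inverse of pack_int4_row: split each byte back into its two nibbles."""
    out = []
    for byte in packed:
        out.append(byte >> 4)    # high nibble: the even-indexed value
        out.append(byte & 0x0F)  # low nibble: the odd-indexed value
    return out


row = [1, 2, 15, 0, 7, 9]
packed = pack_int4_row(row)
print(packed)
print(unpack_int4_row(packed) == row)
```

In Python, << binds more tightly than |, so the unparenthesized a << 4 | b in the diff groups as (a << 4) | b, the same as in this sketch.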

Contributor

oh I see, makes sense

Contributor

@jerryzh168 left a comment

This can be a separate PR, but can you also help add support for converting between the int4 tensor core tiled layout and the int4 CPU layout? We may need a separate util for this, as we discussed in the issue: #1117 (comment)

right now we error out when converting between different devices

        if not is_device(torch.device(self.device).type, device):
            raise ValueError(
                f"TensorCoreTiledAQTTensorImpl does not support conversion from {self.device} to {device}"
            )

This is fine, I think. We just need separate utils if people want to do this move.

Tests can be added in

class TestAffineQuantized(TestCase):

@jerryzh168 added the topic: not user facing label Nov 15, 2024
@jerryzh168
Contributor

I think you should also unpin the PyTorch version to get the latest op changes: #1283

@yanbing-j
Contributor Author

Hi @jerryzh168, I have updated the PR to fix CI and include PyTorch nightly. Could you please take a look? I tested 2.3.0, 2.4.1, 2.5.1 and 2.6 on CPU locally.

@jerryzh168
Contributor

@yanbing-j we just landed a large refactor PR, can you rebase?

@yanbing-j
Contributor Author

@jerryzh168 I have rebased, could you please take a look?

@Jack-Khuu
Contributor

Thanks for looking into this @yanbing-j

Eagerly waiting to pick it up in pytorch/torchchat#1367

@jerryzh168
Contributor

@yanbing-j
Contributor Author

@jerryzh168 Please review again.

@yanbing-j
Contributor Author

@jerryzh168 Please review again.

@yanbing-j
Contributor Author

Hi @jerryzh168, the 2 failures in CUDA nightly cannot be reproduced on an A100 with torch 2.6.0.dev20241119+cu124, and the CPU nightly failure is related to GLIBC. I don't know how to fix it.

@@ -70,6 +70,12 @@ jobs:
torch-spec: 'torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121'
gpu-arch-type: "cuda"
gpu-arch-version: "12.1"
- name: CUDA Nightly
Contributor

@jerryzh168 Nov 20, 2024

why are these tests added? can you rebase on main? I think we have some recent changes to the CI jobs:

Contributor Author

@jerryzh168
Contributor

also do you know the issue with xpu job errors in current main: https://github.com/pytorch/ao/actions/runs/11942397686/job/33289365532

@yanbing-j
Contributor Author

yanbing-j commented Nov 21, 2024

also do you know the issue with xpu job errors in current main: https://github.com/pytorch/ao/actions/runs/11942397686/job/33289365532

@jerryzh168 It looks like these XPU failures are related to Windows. Please loop in @EikanWang.

@jerryzh168
Contributor

@yanbing-j
Contributor Author

yanbing-j commented Nov 21, 2024

the error still seems valid: https://github.com/pytorch/ao/actions/runs/11945348130/job/33299816407?pr=1278

It cannot be reproduced on an A100 with torch 2.6.0.dev20241119+cu124. Let me try again with the latest nightly.

@yanbing-j
Contributor Author

@jerryzh168 Sorry, I still cannot reproduce it on an A100. Could you please give it a try? Thanks!

$ python test/dtypes/test_affine_quantized.py TestAffineQuantizedBasic.test_flatten_unflatten_device_cpu_bfloat16
.
----------------------------------------------------------------------
Ran 1 test in 0.266s

OK

torch 2.6.0.dev20241120+cu124
torchao 0.7.0+git25b9460 /home/pt-gpu/yanbingj/ao (this is the commit on the yanbing/update_int4 branch, installed with pip install -e .)

@jerryzh168
Contributor

I think there is an issue with the PyTorch nightly version. I saw this in the log: Downloading https://download.pytorch.org/whl/nightly/cu121/torch-2.6.0.dev20241112%2Bcu121-cp39-cp39-linux_x86_64.whl (767.9 MB)

When I install locally, I also get: Successfully installed nvidia-cusparselt-cu12-0.6.2 torch-2.6.0.dev20241112+cu121

It looks like the latest cu121 nightly in https://download.pytorch.org/whl/nightly/torch/ is 1112+cu121 right now.

@yanbing-j
Contributor Author

I think there is an issue with the PyTorch nightly version. I saw this in the log: Downloading https://download.pytorch.org/whl/nightly/cu121/torch-2.6.0.dev20241112%2Bcu121-cp39-cp39-linux_x86_64.whl (767.9 MB)

When I install locally, I also get: Successfully installed nvidia-cusparselt-cu12-0.6.2 torch-2.6.0.dev20241112+cu121

It looks like the latest cu121 nightly in https://download.pytorch.org/whl/nightly/torch/ is 1112+cu121 right now.

Oh, you are right. For cu121, the latest nightly is 1112, which does not include pytorch/pytorch#139611 (merged into PyTorch on 20241112). For cu124, the latest is 1121, which is why I cannot reproduce the failure.
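
The version mismatch described here can be checked mechanically. The following is a rough sketch, assuming nightly version strings of the form seen in this thread (for example 2.6.0.dev20241112+cu121); parse_nightly and nightly_is_after are hypothetical helper names. Requiring the build date to be strictly later than the merge date is a heuristic based on the observation in this thread that the 20241112 nightly did not contain a change merged on 20241112.

```python
import re
from datetime import date

# Matches the ".devYYYYMMDD" segment and an optional "+tag" suffix such as "+cu121".
_NIGHTLY_RE = re.compile(r"\.dev(\d{4})(\d{2})(\d{2})(?:\+(\w+))?$")


def parse_nightly(version):
    """Return (build_date, local_tag) for a version like '2.6.0.dev20241112+cu121'."""
    match = _NIGHTLY_RE.search(version)
    if match is None:
        raise ValueError(f"not a nightly version string: {version!r}")
    year, month, day, tag = match.groups()
    return date(int(year), int(month), int(day)), tag


def nightly_is_after(version, merge_date):
    """Heuristic: treat a nightly as containing a change only if built after merge_date."""
    build_date, _ = parse_nightly(version)
    return build_date > merge_date


merge_date = date(2024, 11, 12)  # merge date of pytorch/pytorch#139611 as stated in this thread
for version in ("2.6.0.dev20241112+cu121", "2.6.0.dev20241120+cu124"):
    print(version, nightly_is_after(version, merge_date))
```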

So, can this PR be merged, since this is a platform-related issue that can be treated as a known issue until CI is upgraded to cu124? @jerryzh168

@jerryzh168
Contributor

jerryzh168 commented Nov 22, 2024

let's upgrade CI to use 12.4 first, I heard 12.1 is deprecated in newer pytorch versions: pytorch/pytorch#138609

jerryzh168 added a commit that referenced this pull request Nov 23, 2024
* Update nightly job to use 12.4 since 12.1 is deprecated

#1278 (comment)

* skip failed tests
sunjiweiswift pushed a commit to sunjiweiswift/ao that referenced this pull request Nov 25, 2024
* Update nightly job to use 12.4 since 12.1 is deprecated

pytorch#1278 (comment)

* skip failed tests
@jerryzh168 merged commit 719440e into pytorch:main Nov 27, 2024
3 checks passed
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
…ed values (pytorch#1359)

* Update cli.py to make --device/--dtype pre-empt quantize dict-specified values

Users may expect that cli parameters override the JSON, as per pytorch#1278.  
Invert logic - case split: 
1 - if none (no value) is specified, use value specified in quantize dict, if present; else
2 - if value is specified, override the respective handler if present.

* Fix typo in cli.py

fix typo

---------

Co-authored-by: Jack-Khuu <jack.khuu.7@gmail.com>
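
The case split in the commit message above can be sketched as follows. This is a simplified illustration, not the actual cli.py change: resolve_option is a hypothetical helper, and the quantize dict is assumed to be a flat mapping with "device" and "dtype" keys, which may not match the real structure.

```python
def resolve_option(cli_value, quantize_dict, key):
    """Pick the effective value for key ("device" or "dtype") per the case split."""
    if cli_value is None:
        # Case 1: nothing given on the command line, so fall back to the
        # quantize dict entry if present (None when the key is absent).
        return quantize_dict.get(key)
    # Case 2: an explicit command-line value pre-empts the dict entry.
    return cli_value


quantize = {"device": "cuda", "dtype": "bf16"}  # assumed flat layout, for illustration only
print(resolve_option(None, quantize, "device"))
print(resolve_option("cpu", quantize, "device"))
```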
@@ -383,3 +393,251 @@ def get_plain(self) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

def get_layout(self) -> Layout:
return self._layout


@dataclass(frozen=True)
Contributor

Oh sorry, I missed this one. It should have a separate file since it's a different layout. cc @yanbing-j, can you help move this to a separate file under the same directory (int4_cpu_layout.py)?

Contributor Author

@jerryzh168 Okay, here is the PR #1419.

@yanbing-j deleted the yanbing/update_int4 branch December 16, 2024 05:30
4 participants