Move non-NF4 tensor to device prior to quantization on copy #737

Merged
merged 2 commits into from
Aug 23, 2024

ebsmothers
Contributor

Addresses #642, based on @gau-nernst's suggestion. Tested in torchtune that (a) init time and memory are similar to those of torchtune's in-place copy, and (b) the FSDP2 recipes still succeed, which they do not when torchtune's copy is enabled.

pytorch-bot bot commented Aug 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/737

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5622179 with merge base 68e4643:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 23, 2024
@ebsmothers ebsmothers changed the title Nf4 to device Move non-NF4 tensor to device prior to quantization on copy Aug 23, 2024
@msaroufim msaroufim self-requested a review August 23, 2024 05:05
Member

@msaroufim msaroufim left a comment

Stamping so we can merge fast to unblock, but one thing that might help is a test, ideally here since we'd rather catch issues early, or at the very least in tune. Right now, if another load-time regression happens, we won't catch it either.

@ebsmothers
Contributor Author

> Stamping so we can merge fast to unblock, but one thing that might help is a test, ideally here since we'd rather catch issues early, or at the very least in tune. Right now, if another load-time regression happens, we won't catch it either.

Yeah, good point. We don't currently have any perf-related testing, but it shouldn't be too hard to add something like that on our end. Honestly, even general sanity checks for QLoRA around load time, peak memory, etc. would do us a lot of good on this front.
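
For illustration only, a sanity check of that kind could look roughly like the sketch below. It is not code from torchtune or torchao: `measure_load`, `check_load_budget`, and `fake_load` are hypothetical names, and the budgets are placeholders. It also uses only the standard library, so `tracemalloc` sees Python heap allocations, not GPU memory; a real QLoRA check would more likely read peak device memory through something like `torch.cuda.max_memory_allocated()`.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure_load(load_fn: Callable[[], Any]) -> Tuple[Any, float, int]:
    """Run load_fn once and return (result, elapsed seconds, peak traced bytes).

    Peak memory covers Python allocations seen by tracemalloc only.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = load_fn()
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def check_load_budget(
    load_fn: Callable[[], Any], max_seconds: float, max_peak_bytes: int
) -> Any:
    """Fail loudly if loading exceeds the given time or memory budget."""
    result, elapsed, peak = measure_load(load_fn)
    assert elapsed <= max_seconds, (
        f"load took {elapsed:.2f}s, budget is {max_seconds:.2f}s"
    )
    assert peak <= max_peak_bytes, (
        f"peak memory was {peak} bytes, budget is {max_peak_bytes} bytes"
    )
    return result


def fake_load() -> bytearray:
    # Hypothetical stand-in for model loading: allocates about 1 MB.
    return bytearray(1_000_000)
```

In a test suite, `fake_load` would be swapped for the actual model-loading call, with budgets set from a known-good run plus some slack so that normal noise does not trip the check.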

@msaroufim msaroufim merged commit 0ed3090 into pytorch:main Aug 23, 2024
16 checks passed
yanbing-j pushed a commit to yanbing-j/ao that referenced this pull request Dec 9, 2024
3 participants