LoftQ: Allow quantizing models loaded on the CPU for LoftQ initialization #1256
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I think the changes look good! Can you just run the styling checks? make style && make quality
done
Thanks a lot for providing this fix, it should be very useful. I got an error running your code; please check my suggestion.
Also, to test your code:
patch-issue-1256.txt
This test fails on CPU on the main branch but passes on your branch. WDYT? (You should be able to apply this diff as a patch; otherwise, if you update your branch against the latest main, I can also create a PR against your PR if you prefer.)
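For context, a rough sketch of the kind of test such a patch could add is shown below. This is not the actual content of patch-issue-1256.txt; the test name and model id are illustrative assumptions.

```python
import pytest
import torch
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model


@pytest.mark.skipif(not torch.cuda.is_available(), reason="LoftQ quantization needs a GPU")
def test_loftq_init_with_model_on_cpu():
    # Load the base model on the CPU, i.e. without device_map="auto".
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-OPTForCausalLM")
    loftq_config = LoftQConfig(loftq_bits=4)
    lora_config = LoraConfig(init_lora_weights="loftq", loftq_config=loftq_config)
    # On main this fails because the weights are never moved to the GPU for
    # quantization; with this PR it should pass.
    peft_model = get_peft_model(model, lora_config)
    assert peft_model is not None
```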
src/peft/utils/loftq_utils.py (Outdated)
@@ -209,8 +209,8 @@ def loftq_init(weight: Union[torch.Tensor, torch.nn.Parameter], num_bits: int, r
     if num_bits == 4 and is_bnb_4bit_available():
         qweight = bnb.nn.Params4bit(
             res.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
-        ).to(device)
+        ).cuda()
         dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
I got an error here: TypeError: Params4bit.cuda() missing 1 required positional argument: 'device'. Should this be .cuda(device=device.index)?
My bad, it should be .to("cuda") instead.
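In other words, the intermediate fix discussed here would look roughly like the sketch below (the hunk above with the correction applied, not necessarily the final merged code; bnb and res are the names used in loftq_utils.py).

```python
import bitsandbytes as bnb

# Create the 4-bit parameter from the CPU tensor, then move it to the GPU before
# dequantizing. .to("cuda") works where .cuda() fails, since Params4bit.cuda()
# expects a device argument in this bitsandbytes version.
qweight = bnb.nn.Params4bit(
    res.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
).to("cuda")
dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
```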
Okay, let's update then.
I am doing some tests.
I have completed all the necessary modifications and tested them in both CPU and GPU environments. Could you please rerun the unit tests?
Looks great, thanks! WDYT @BenjaminBossan?
@BenjaminBossan Sorry, I am not familiar with the tests; could you open a PR based on this?
That's okay, I'll just create a PR once this one is merged; that should be easiest. Let's leave this PR open for a day to see if Yixiao Li wants to add anything, otherwise I'll merge and add the tests.
Tests to complement PR #1256
System
peft 0.7.1.dev0
transformers 4.36.0
torch 2.0.1
bitsandbytes 0.41.3.post2
Description
If we use the example script https://github.com/huggingface/peft/blob/main/examples/loftq_finetuning/quantize_save_load.py to quantize a model for LoftQ initialization, it always requires loading the entire model on the GPU. However, with larger models and limited GPU resources (e.g. a 70B model and a single A100 80GB GPU), we want to load the model on the CPU instead. Simply removing device_map="auto" from the script results in an error.
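For reference, the intended usage with this change is roughly the following; the model id and exact arguments are illustrative assumptions, not taken verbatim from the example script.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# Load the full-precision model on the CPU, i.e. without device_map="auto".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # illustrative model id
    torch_dtype=torch.float16,
)

# LoftQ initialization; the quantization step itself still needs a GPU,
# but the model weights no longer have to start out there.
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(init_lora_weights="loftq", loftq_config=loftq_config)
peft_model = get_peft_model(model, lora_config)
```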
Implementation
The error occurs because the parameters are not adaptively moved to the GPU during quantization. With this PR, weights loaded on the CPU are moved to the GPU for the quantization step, so larger models can be quantized for LoftQ initialization even with limited GPU resources.
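One plausible shape of that adaptive device handling inside loftq_init is sketched below. This is an assumption about the approach, not necessarily the exact code that was merged; bnb, res, and weight refer to the names used in loftq_utils.py above.

```python
import torch
import bitsandbytes as bnb

# Quantize on the GPU even if the weight was loaded on the CPU, then move the
# dequantized result back to the weight's original device. Sketch only.
compute_device = "cuda" if torch.cuda.is_available() else "cpu"
qweight = bnb.nn.Params4bit(
    res.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
).to(compute_device)
dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
dequantized_weight = dequantized_weight.to(weight.device)
```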