
LoftQ: Allow quantizing models loaded on the CPU for LoftQ initialization #1256

Merged: 13 commits merged into huggingface:main on Dec 15, 2023

Conversation

@hiyouga (Contributor) commented Dec 12, 2023

System

peft 0.7.1.dev0
transformers 4.36.0
torch 2.0.1
bitsandbytes 0.41.3.post2

Description

Using the following example script to quantize a model for LoftQ initialization always requires loading the entire model onto the GPU:
https://github.com/huggingface/peft/blob/main/examples/loftq_finetuning/quantize_save_load.py

However, for bigger models and limited GPU resources (e.g. a 70B model and a single A100 80GB GPU), we want to load the model on the CPU instead. Removing device_map="auto" from the script directly results in the following error.

Traceback (most recent call last):
  File "/quantize_save_load.py", line 223, in <module>
    base_dir, lora_dir = quantize_and_save()
  File "/quantize_save_load.py", line 150, in quantize_and_save
    lora_model = get_peft_model(model, lora_config)
  File "lib/python3.10/site-packages/peft/mapping.py", line 133, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "lib/python3.10/site-packages/peft/peft_model.py", line 1041, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "lib/python3.10/site-packages/peft/peft_model.py", line 123, in __init__
    self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
  File "lib/python3.10/site-packages/peft/tuners/lora/model.py", line 111, in __init__
    super().__init__(model, config, adapter_name)
  File "lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 90, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 247, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optional_kwargs)
  File "lib/python3.10/site-packages/peft/tuners/lora/model.py", line 192, in _create_and_replace
    new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
  File "lib/python3.10/site-packages/peft/tuners/lora/model.py", line 312, in _create_new_module
    new_module = Linear(target, adapter_name, **kwargs)
  File "lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 264, in __init__
    self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
  File "lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 92, in update_layer
    self.loftq_init(adapter_name)
  File "lib/python3.10/site-packages/peft/tuners/lora/layer.py", line 197, in loftq_init
    qweight, lora_A, lora_B = loftq_init(weight, **kwargs)
  File "lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "lib/python3.10/site-packages/peft/utils/loftq_utils.py", line 217, in loftq_init
    dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
  File "lib/python3.10/site-packages/bitsandbytes/functional.py", line 1012, in dequantize_4bit
    assert absmax is not None and out is not None
AssertionError

Implementation

This is because the weight parameters are not moved to the GPU before quantization: a bnb.nn.Params4bit tensor is only actually quantized (and its quant_state populated) once it is placed on a CUDA device, so dequantize_4bit hits the assertion above when everything stays on the CPU. With this PR, each weight is moved to the GPU on the fly during LoftQ initialization, which makes it possible to quantize bigger models with limited GPU resources.

if weight.device.type != "cuda":
    weight = weight.to("cuda")
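
For reference, a minimal usage sketch of the workflow this PR enables, based on the LoftQ options exposed by peft (LoftQConfig and init_lora_weights="loftq"); the model name and LoRA hyperparameters below are placeholders, and a CUDA device is still required for the quantization step:

import torch
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# Load the full-precision model on the CPU only; no device_map="auto" required.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.float16
)

# LoftQ initialization: with this PR, each targeted weight is moved to the GPU
# on the fly, quantized, and used to compute the LoRA initialization.
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    init_lora_weights="loftq",
    loftq_config=loftq_config,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
lora_model = get_peft_model(model, lora_config)

Only the weight currently being processed has to sit on the GPU, which is what makes the 70B-model-on-a-single-A100 scenario from the description feasible.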


@younesbelkada (Contributor) left a comment

I think that the changes look good! Can you just run the styling checks? make style && make quality

@hiyouga (Contributor Author) commented Dec 12, 2023

> I think that the changes look good! Can you just run the styling checks? make style && make quality

done

@BenjaminBossan (Member) left a comment

Thanks a lot for providing this fix, it should be very useful. I got an error running your code, please check my suggestion.

Also, to test your code:
patch-issue-1256.txt

This test fails on CPU on the main branch but passes on your branch. WDYT? (You should be able to apply this diff as a patch; otherwise, if you update your branch against the latest main, I can also create a PR against your PR if you prefer.)
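
As an illustration only, a test for this scenario might look roughly like the sketch below (the actual test is the attached patch; the model name, target modules, and assertion here are placeholders, and a CUDA device is still needed for the bitsandbytes quantization step):

from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model


def test_loftq_init_with_model_loaded_on_cpu():
    # Load a tiny model on the CPU only, i.e. without device_map="auto".
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-OPTForCausalLM")
    lora_config = LoraConfig(
        task_type="CAUSAL_LM",
        init_lora_weights="loftq",
        loftq_config=LoftQConfig(loftq_bits=4),
        target_modules=["q_proj", "v_proj"],
    )
    # On the main branch this raises the AssertionError shown above;
    # with this PR the LoftQ initialization succeeds.
    peft_model = get_peft_model(model, lora_config)
    assert any("lora_A" in name for name, _ in peft_model.named_parameters())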

@@ -209,8 +209,8 @@ def loftq_init(weight: Union[torch.Tensor, torch.nn.Parameter], num_bits: int, r
     if num_bits == 4 and is_bnb_4bit_available():
         qweight = bnb.nn.Params4bit(
             res.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
-        ).to(device)
+        ).cuda()
         dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
@BenjaminBossan (Member):

I got an error here, TypeError: Params4bit.cuda() missing 1 required positional argument: 'device'. Should this be .cuda(device=device.index)?

@hiyouga (Contributor Author):

My bad, it should be ().to("cuda")

@BenjaminBossan (Member):

Okay, let's update then.

@hiyouga (Contributor Author):

I am doing some tests.
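
For context, a minimal sketch (not part of the PR) of the bitsandbytes round trip discussed in this thread, assuming a CUDA device is available:

import torch
import bitsandbytes as bnb

weight = torch.randn(256, 256)  # starts on the CPU, like a weight from a CPU-loaded model

# Quantization only happens when the parameter is moved to a CUDA device;
# if it stays on the CPU, quant_state remains None and dequantize_4bit fails
# with the AssertionError from the original traceback.
qweight = bnb.nn.Params4bit(
    weight.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
).to("cuda")
dequantized = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)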

@hiyouga (Contributor Author) commented Dec 14, 2023

I have completed all the necessary modifications and tested them in both CPU and GPU environments. Could you please rerun the unit tests?

@younesbelkada (Contributor) left a comment

Looks great, thanks! WDYT @BenjaminBossan?

@BenjaminBossan (Member) left a comment

LGTM, thanks a lot @hiyouga. Would you also be willing to apply the patch I attached above so that we have tests for this new functionality?

Also, just in case, ping @yxli2123 in case they want to take a look (not a requirement though).

@hiyouga (Contributor Author) commented Dec 14, 2023

@BenjaminBossan Sorry, I am not familiar with the tests, could you open a PR based on this?

@BenjaminBossan (Member)

> Sorry, I am not familiar with the tests, could you open a PR based on this?

That's okay, I'll just create a PR once this one is merged, that should be easiest. Let's leave this PR open for a day to see if Yixiao Li wants to add anything, otherwise I'll merge and add the tests.

@BenjaminBossan merged commit bd544bb into huggingface:main on Dec 15, 2023
14 checks passed
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Dec 15, 2023
BenjaminBossan added a commit that referenced this pull request Dec 18, 2023