LoftQ: Allow quantizing models loaded on the CPU for LoftQ initialization #1256
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I think the changes look good! Can you just run the styling checks? make style && make quality
done
Thanks a lot for providing this fix, it should be very useful. I got an error running your code; please check my suggestion.
Also, to test your code:
patch-issue-1256.txt
This test fails on CPU on the main branch but passes on your branch. WDYT? (You should be able to apply this diff as a patch; otherwise, if you update your branch against the latest main, I can also create a PR against your PR if you prefer.)
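For context, a rough sketch of the kind of test such a patch could add is shown below. This is not the actual content of patch-issue-1256.txt; the test name and model id are illustrative assumptions.

```python
import pytest
import torch
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model


@pytest.mark.skipif(not torch.cuda.is_available(), reason="LoftQ quantization needs a GPU")
def test_loftq_init_with_model_on_cpu():
    # Load the base model on the CPU, i.e. without device_map="auto".
    model = AutoModelForCausalLM.from_pretrained("hf-internal-testing/tiny-random-OPTForCausalLM")
    loftq_config = LoftQConfig(loftq_bits=4)
    lora_config = LoraConfig(init_lora_weights="loftq", loftq_config=loftq_config)
    # On main this fails because the weights are never moved to the GPU for
    # quantization; with this PR it should pass.
    peft_model = get_peft_model(model, lora_config)
    assert peft_model is not None
```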
src/peft/utils/loftq_utils.py (Outdated)
@@ -209,8 +209,8 @@ def loftq_init(weight: Union[torch.Tensor, torch.nn.Parameter], num_bits: int, r
     if num_bits == 4 and is_bnb_4bit_available():
         qweight = bnb.nn.Params4bit(
             res.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
-        ).to(device)
+        ).cuda()
         dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
I got an error here: TypeError: Params4bit.cuda() missing 1 required positional argument: 'device'. Should this be .cuda(device=device.index)?
My bad, it should be .to("cuda") instead.
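In other words, the intermediate fix discussed here would look roughly like the sketch below (the hunk above with the correction applied, not necessarily the final merged code; bnb and res are the names used in loftq_utils.py).

```python
import bitsandbytes as bnb

# Create the 4-bit parameter from the CPU tensor, then move it to the GPU before
# dequantizing. .to("cuda") works where .cuda() fails, since Params4bit.cuda()
# expects a device argument in this bitsandbytes version.
qweight = bnb.nn.Params4bit(
    res.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
).to("cuda")
dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
```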
Okay, let's update then.
I am doing some tests.
I have completed all the necessary modifications and tested them in both CPU and GPU environments. Could you please rerun the unit tests?
Looks great, thanks! WDYT @BenjaminBossan?
@BenjaminBossan Sorry, I am not familiar with the tests; could you open a PR based on this?
That's okay, I'll just create a PR once this one is merged; that should be easiest. Let's leave this PR open for a day to see if Yixiao Li wants to add anything, otherwise I'll merge and add the tests.
Tests to complement PR #1256
System
peft 0.7.1.dev0
transformers 4.36.0
torch 2.0.1
bitsandbytes 0.41.3.post2
Description
If we use the example script https://github.com/huggingface/peft/blob/main/examples/loftq_finetuning/quantize_save_load.py to quantize a model for LoftQ initialization, it always requires loading the entire model on the GPU. However, with larger models and limited GPU resources (e.g. a 70B model and a single A100 80GB GPU), we want to load the model on the CPU instead. Simply removing device_map="auto" from the script results in an error.
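For reference, the intended usage with this change is roughly the following; the model id and exact arguments are illustrative assumptions, not taken verbatim from the example script.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

# Load the full-precision model on the CPU, i.e. without device_map="auto".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # illustrative model id
    torch_dtype=torch.float16,
)

# LoftQ initialization; the quantization step itself still needs a GPU,
# but the model weights no longer have to start out there.
loftq_config = LoftQConfig(loftq_bits=4)
lora_config = LoraConfig(init_lora_weights="loftq", loftq_config=loftq_config)
peft_model = get_peft_model(model, lora_config)
```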
Implementation
The error occurs because the parameters are not adaptively moved to the GPU during quantization. With this PR, weights loaded on the CPU are moved to the GPU for the quantization step, so larger models can be quantized for LoftQ initialization even with limited GPU resources.
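One plausible shape of that adaptive device handling inside loftq_init is sketched below. This is an assumption about the approach, not necessarily the exact code that was merged; bnb, res, and weight refer to the names used in loftq_utils.py above.

```python
import torch
import bitsandbytes as bnb

# Quantize on the GPU even if the weight was loaded on the CPU, then move the
# dequantized result back to the weight's original device. Sketch only.
compute_device = "cuda" if torch.cuda.is_available() else "cpu"
qweight = bnb.nn.Params4bit(
    res.to("cpu"), requires_grad=False, compress_statistics=False, quant_type="nf4"
).to(compute_device)
dequantized_weight = bnb.functional.dequantize_4bit(qweight.data, qweight.quant_state)
dequantized_weight = dequantized_weight.to(weight.device)
```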