More convenient way to initialize LoftQ #1543

BenjaminBossan · 2024-03-07T12:16:54Z

Related to #1532

At the moment, using LoftQ is quite cumbersome, as shown in this example:

https://github.com/huggingface/peft/tree/7e84dec20b3106bdd0a90ba8e80187f0aec835b7/examples/loftq_finetuning

Essentially, users have to:

1. Load the non-quantized model with LoftQ (which can be quite huge)
2. Modify the PEFT config
3. Save the adapter
4. Unwrap the base model with custom functions
5. Save the base model with modified weights (i.e. a whole copy of the base model)
6. Load the base model from step 5 with bnb quantization
7. Load the adapter from step 3

Yes, there is a helper script to do this, but it still has the disadvantage that we need to load the non-quantized model and create a completely new model checkpoint with the modified weights.

This PR aims to make this process more convenient by adding a single function replace_lora_weights_loftq. This function takes the bnb-quantized LoRA model as input. Then it goes through each module with LoRA weights, lazily loads the corresponding non-quantized weights one at a time using safetensors, computes the quantization error, and replaces the LoRA weights with LoftQ-initialized LoRA weights.

This is much more convenient because we only require very little extra memory thanks to lazy loading, and we don't have to keep an extra copy of the whole model weights.

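from peft import LoraConfig, get_peft_model, replace_lora_weights_loftq
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model directly in 4-bit; no full-precision copy is kept in memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = get_peft_model(model, LoraConfig(task_type="CAUSAL_LM"))
# Replace the LoRA weights in place with LoftQ-initialized ones.
replace_lora_weights_loftq(model, model_path=...)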

While working on this, I still found that LoftQ initialization often did not seem to help a lot, as mentioned in #1532. I measured this by creating (1) logits with the base model, (2) logits with the quantized+LoRA model, and (3) logits with the quantized+LoRA+LoftQ model. The expectation is that (1) should be closer to (3) than to (2). This was often not the case.
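The comparison described above can be sketched as follows. This is a minimal illustration, not the code used in the PR: the logits are made-up toy numbers, and `mae` and `mse` are hypothetical helper names. In practice the three lists would be the flattened logits of the base model, the quantized+LoRA model, and the quantized+LoRA+LoftQ model on the same inputs.

```python
from statistics import fmean


def mae(a, b):
    """Mean absolute error between two equally long sequences."""
    return fmean([abs(x - y) for x, y in zip(a, b)])


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return fmean([(x - y) ** 2 for x, y in zip(a, b)])


# Toy stand-ins for flattened logits (assumed values, for illustration only).
logits_base = [2.0, -1.0, 0.5, 3.0]    # (1) non-quantized base model
logits_lora = [1.4, -0.2, 0.9, 2.1]    # (2) quantized + LoRA
logits_loftq = [1.9, -0.8, 0.6, 2.8]   # (3) quantized + LoRA + LoftQ

err_lora = mae(logits_base, logits_lora)
err_loftq = mae(logits_base, logits_loftq)

# LoftQ is only worth it if (3) is closer to (1) than (2) is.
loftq_helps = err_loftq < err_lora
```

With real models, `loftq_helps` is exactly the condition that was often not satisfied.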

I therefore added the possibility to run a check each time that we replace a LoRA weight with the LoftQ weights. If this check returns True, we keep the change and proceed to the next weight, otherwise we discard the change before proceeding. That way, we only make the replacement with LoftQ weights if we see a real improvement. Of course, this is only a form of greedy optimization, but it seems to work in practice. And since it's optional, users can choose not to use it.
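The keep-or-discard logic can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: `greedy_replace` and `make_check` are hypothetical names, scalar "weights" stand in for LoRA tensors, and the check compares a toy error against the best error seen so far.

```python
def greedy_replace(weights, candidates, check):
    """Try each candidate in turn; keep it only if check(...) returns True."""
    kept = []
    for name, new_value in candidates.items():
        old_value = weights[name]
        weights[name] = new_value          # tentatively apply the replacement
        if check(weights, name):
            kept.append(name)              # improvement: keep the change
        else:
            weights[name] = old_value      # no improvement: roll back
    return kept


def make_check(target, weights):
    """Build a check that accepts a change only if the total error shrinks."""
    def total_error(w):
        return sum(abs(w[k] - target[k]) for k in target)

    state = {"best": total_error(weights)}  # error before any replacement

    def check(w, name):
        err = total_error(w)
        if err < state["best"]:
            state["best"] = err
            return True
        return False

    return check


# Toy data (assumed values): the q_proj candidate helps, the v_proj one hurts.
target = {"q_proj": 1.0, "v_proj": 2.0}
weights = {"q_proj": 0.0, "v_proj": 1.9}
candidates = {"q_proj": 0.9, "v_proj": 1.5}

kept = greedy_replace(weights, candidates, make_check(target, weights))
```

Because each decision depends on the changes already accepted, the result can depend on the order in which the weights are visited, which is why this is only a greedy optimization.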

This PR is not yet finished since I ran into an issue with the key names from safetensors not matching.

Furthermore, for now this supports neither 8bit quantization nor the num_iter argument of LoftQ, which I'm not sure really works. However, I guess the replace_lora_weights_loftq function could be called multiple times in a row.

ping @yxli2123

HuggingFaceDocBuilderDev · 2024-03-07T12:20:42Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

review-notebook-app · 2024-03-07T15:44:44Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

stevhliu

Really cool function! 👏

I think we should either remove the section above or clarify when you should use each method, beyond the fact that replace_lora_weights_loftq is easier, in which case most users would probably just pick that way. 😛

docs/source/developer_guides/lora.md

younesbelkada

Thanks a lot @BenjaminBossan!
The API looks really great! Thinking about it, we should probably not worry about supporting .bin files for this API, but I left that as an open question, wdyt?

younesbelkada · 2024-03-11T09:45:51Z

src/peft/utils/loftq_utils.py

@@ -15,10 +15,16 @@
 # Reference code: https://github.com/yxli2123/LoftQ/blob/main/utils.py
 # Reference paper: https://arxiv.org/abs/2310.08659

+from __future__ import annotations


Why this import?

Yeah, I guess with the type annotations used in this file, it wouldn't be necessary, but I also don't think it hurts.

pacman100

Hello Benjamin, thank you for making it easier to use LoftQ. However, I have left a couple of comments.

pacman100 · 2024-03-19T07:04:16Z

src/peft/utils/loftq_utils.py

+    prefix = "base_model.model."
+    any_match = False
+
+    with safe_open(model_path, framework="pt", device="cpu") as f:


Will this work for models whose state dict is sharded across multiple files?

For example, https://huggingface.co/meta-llama/Llama-2-70b-hf

pacman100 · 2024-03-19T07:12:40Z

src/peft/utils/loftq_utils.py

+
+
+@torch.no_grad()
+def _loftq_init_new(qweight, weight, num_bits: int, reduced_rank: int):


As per the LoftQ algorithm, there should be alternating steps for T steps: quantize W − B_(t−1) A_(t−1) to obtain Q_t, then compute A_t, B_t ← SVD(W − Q_t). Here, we only do a single step, A_1, B_1 ← SVD(W − Q_bnb).

As discussed internally, since this approach intends not to adapt the quantized weights (to avoid having an extra copy of all these weights), only single step is implemented.
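Written out, the difference between the two variants looks roughly like this. The notation follows the comment above rather than any particular implementation, and the following are assumptions for illustration: $q(\cdot)$ denotes the quantizer, $\mathrm{SVD}_r$ a rank-$r$ truncated SVD, $B_0 A_0 = 0$, and $Q_{\mathrm{bnb}}$ the (dequantized) bitsandbytes weight.

```latex
\begin{align*}
\text{LoftQ, for } t = 1, \dots, T:\quad
  & Q_t \leftarrow q\!\left(W - B_{t-1} A_{t-1}\right), \\
  & B_t, A_t \leftarrow \mathrm{SVD}_r\!\left(W - Q_t\right), \\[4pt]
\text{single step, fixed } Q_{\mathrm{bnb}}:\quad
  & B_1, A_1 \leftarrow \mathrm{SVD}_r\!\left(W - Q_{\mathrm{bnb}}\right).
\end{align*}
```

Under these assumptions, the single step coincides with the first LoftQ iteration whenever $q$ is the bitsandbytes quantizer, since $Q_1 = q(W)$. What is skipped is the re-quantization in later iterations, which would require storing modified quantized weights.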

BenjaminBossan

The reviewer comments should be addressed.

@pacman100 loading sharded models should now work. It's maybe not the most elegant solution, but this is what I gleaned from combing through transformers code. I tested it on gemma-2b, which has 2 shards, and it worked.
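For readers unfamiliar with sharded checkpoints: a sharded safetensors checkpoint on the Hugging Face Hub ships a `model.safetensors.index.json` file whose `weight_map` maps each tensor name to the shard file that contains it. The sketch below shows how a LoRA module name could be resolved to its shard. It is not the code from this PR: `checkpoint_key` and `shard_for` are hypothetical helpers, the index content is a toy example, and the `.weight` suffix is an assumption about how linear-layer weights are named.

```python
import json

# Prefix PEFT adds in front of the wrapped model's module names
# (the same string as `prefix` in the diff hunk above).
PEFT_PREFIX = "base_model.model."


def checkpoint_key(module_name):
    """Map a PEFT module name to the tensor name used in the checkpoint."""
    if module_name.startswith(PEFT_PREFIX):
        module_name = module_name[len(PEFT_PREFIX):]
    return module_name + ".weight"  # assumed naming of linear-layer weights


def shard_for(module_name, index):
    """Look up which shard file holds the weight of the given module."""
    return index["weight_map"][checkpoint_key(module_name)]


# Toy index file content (assumed, heavily shortened).
index = json.loads("""
{
  "metadata": {"total_size": 0},
  "weight_map": {
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00002.safetensors"
  }
}
""")

shard = shard_for("base_model.model.model.layers.17.self_attn.q_proj", index)
```

Only the shard returned by the lookup then needs to be opened, for example with `safe_open`, to read that one tensor, so memory use stays low even for multi-shard models.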

Moreover, I adjusted the test to use all-linear; now the margin is much larger (> 1.5 for MAE and > 2.5 for MSE).

Please check again.

BenjaminBossan · 2024-03-19T16:11:21Z

src/peft/utils/loftq_utils.py

@@ -15,10 +15,16 @@
 # Reference code: https://github.com/yxli2123/LoftQ/blob/main/utils.py
 # Reference paper: https://arxiv.org/abs/2310.08659

+from __future__ import annotations


Yeah, I guess with the type annotations used in this file, it wouldn't be necessary, but I also don't think it hurts.

BenjaminBossan · 2024-03-19T16:12:15Z

src/peft/utils/loftq_utils.py

+
+
+@torch.no_grad()
+def _loftq_init_new(qweight, weight, num_bits: int, reduced_rank: int):


As discussed internally, since this approach intends not to adapt the quantized weights (to avoid having an extra copy of all these weights), only single step is implemented.

pacman100

Thank you @BenjaminBossan for iterating to support larger sharded models and for using all-linear, in line with reducing quantization error for all layers targeted by bitsandbytes. Much better results in tests and notebook example! 🔥🚀✨

Fix safetensors issue, document stuff

127bb44

Make style

897fd71

BenjaminBossan marked this pull request as ready for review March 7, 2024 16:48

BenjaminBossan changed the title ~~[WIP] More convenient way to initialize LoftQ~~ More convenient way to initialize LoftQ Mar 7, 2024

BenjaminBossan requested review from pacman100, younesbelkada and stevhliu March 7, 2024 16:48

stevhliu approved these changes Mar 7, 2024

younesbelkada approved these changes Mar 11, 2024

BenjaminBossan and others added 3 commits March 11, 2024 12:09

Apply suggestions from code review

2f287da

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

Merge branch 'main' into loftq-more-convenient-initialization

c52624a

Merge branch 'main' into loftq-more-convenient-initialization

fc11323

stevhliu reviewed Mar 12, 2024

Add API reference for replace_lora_weights_loftq

bc6e0ea

pacman100 reviewed Mar 19, 2024

BenjaminBossan added 5 commits March 19, 2024 16:07

Adjust tensor loading to work with sharded files

e0cf8d2

Make style

8038a6d

Re-order tests

3f39640

Makes more sense than from the automatic merge

Target all linear layers in test

0ee3e38

Better results, bigger margins.

Update notebook to use all-linear

d6f3d29

BenjaminBossan commented Mar 19, 2024

pacman100 approved these changes Mar 20, 2024

BenjaminBossan added 2 commits March 20, 2024 10:42

Merge remote-tracking branch 'origin/main' into loftq-more-convenient-initialization

eea22b1

Add a few more explanations to the docs

dd3cfab

BenjaminBossan merged commit 8e979fc into huggingface:main Mar 20, 2024
14 checks passed

BenjaminBossan deleted the loftq-more-convenient-initialization branch March 20, 2024 10:16
