ENH: Allow empty initialization of adapter weight #1961
Conversation
This PR allows initializing the adapter weights as empty, i.e. on the meta device. It is still a work in progress.

Why would this be useful? For PEFT training, it is indeed not useful, as we need the real weights in order to train the model. However, when loading a trained PEFT adapter, it is unnecessary to initialize the adapters for real, as we override them with the loaded weights later. In the grand scheme of things, loading the base model will typically be much slower, but if the user loads, say, dozens of adapters, the overhead could add up. Of course, besides loading the model, this has no performance impact and is thus not a high priority feature.

A lot is still missing from this PR:
- unit tests
- documentation
- can the argument be passed to all relevant methods yet?
- adapters other than LoRA also need to be changed
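To illustrate the basic idea, here is a minimal PyTorch-only sketch (not the PEFT implementation itself; it assumes a PyTorch version that supports the `torch.device` context manager):

```python
import torch
import torch.nn as nn

# Creating modules under the meta device records only shapes and dtypes;
# no real storage is allocated for the weights.
with torch.device("meta"):
    lora_A = nn.Linear(4096, 64, bias=False)
    lora_B = nn.Linear(64, 4096, bias=False)

print(lora_A.weight.device)   # meta
print(lora_A.weight.is_meta)  # True -- the tensor has no data backing it
```

The real values are only assigned later, when the trained adapter checkpoint is loaded.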
Here are the results from a first test (warm cache, M.2 SSD):

```python
import gc
import os
import time

import torch
from transformers import AutoModelForCausalLM

from peft import LoraConfig, PeftModel, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"
device = "cuda"  # cpu fails
dtype = torch.bfloat16
torch.manual_seed(0)
inputs = torch.arange(10).view(-1, 1).to(device)
path = f"/tmp/peft/test-empty-weights/{model_id.replace('/', '--')}"

# first save a peft adapter
if not os.path.exists(f"{path}/adapter_model.safetensors"):
    print("adapter does not exist yet, creating it")
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
    # choose high rank to make the effect stronger
    config = LoraConfig(r=64, init_lora_weights=False, target_modules="all-linear")
    model = get_peft_model(model, config).eval()
    with torch.inference_mode():
        outputs_before = model(inputs)
    model.save_pretrained(path)
    # save outputs
    torch.save(outputs_before, f"{path}/outputs_before.pt")
    del model
    torch.cuda.empty_cache()
    gc.collect()
else:
    print("adapter exists, loading")
    outputs_before = torch.load(f"{path}/outputs_before.pt")

# load without init_empty:
tic = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
toc = time.perf_counter()
print(f"loading the base model: {toc - tic:.2f}s")

tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=False)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_not_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_not_empty.logits)
print(f"loading LoRA without init_empty: {toc - tic:.2f}s")

del model
torch.cuda.empty_cache()
gc.collect()

# load with init_empty:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=True)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_empty.logits)
print(f"loading LoRA with init_empty: {toc - tic:.2f}s")

del model
torch.cuda.empty_cache()
gc.collect()
```
Hey @BenjaminBossan, to work on this PR, do we need a Mac to test? I could see other PRs mentioning speed-ups for Mac M2. I am interested in completing it by this weekend if a Mac specifically is not required.
I'd be happy if you could pick this up, thanks. Just be aware that this is going to require a bit of deeper PEFT knowledge to complete.
No, a Mac is not needed. M2 was referring to M.2 SSDs, I should have been more precise :)
> I'd be happy if you could pick this up, thanks. Just be aware that this is going to require a bit of deeper PEFT knowledge to complete.

Sure, @BenjaminBossan, could you please tell me, if I were to start on this, how I should approach it so that I can acquire deeper PEFT knowledge? Happy to read/work on it.
It's hard to tell how to best get started. The code base is big, but that doesn't mean you need to know it in its entirety to get up to speed (even I don't know everything). Learning by doing is probably going to work best, but I think it can be easy to get stuck, so you may want to start with a smaller PR if you notice that.

For this PR, I have listed the TODOs at the top. It would probably be best to start with a unit test to ensure that the feature is working as intended and that later changes don't break it. The unit tests don't need to be all-comprehensive, that can be added later in the PR.

For reference, the script I used to measure the times is the one posted above in the PR description; it could be adapted for a unit test. Of course, the base model needs to be much smaller than Llama 8B, and I'm not sure how flaky the measured times will be.
Thank you for the detailed response. I'll look into it.
@charchit7 Do you still plan on working on this?
Hey @BenjaminBossan, sorry, I was held up with work. Would you mind if I took some more time on this?
No, I don't mind, just let me know if you don't have the time to work on this. The reason I asked is that someone else might be able to work on this, and I want to avoid duplicate work.
Sure @BenjaminBossan, I'll start on this today. If I get stuck somewhere again, I will let you know for sure. Thank you for being patient with me.
Thanks a lot. Again, don't feel pressured, there is no deadline. I was just asking because we had contributors in the past who wanted to work on something but then ghosted -- which is fine, just let us know so that the work is not blocked.
Hey @BenjaminBossan, I have started working on the problem. The first thing I wanted to try was benchmarking the timings by running the tests you posted in huggingface/diffusers#8953 (comment).
Memory consumption for the LoRA + model:
First test, my configuration:
Second test, my configuration:
Third test (for this I installed peft from your branch, allow-empty-init-of-adapter-weights), my configuration:
Please let me know if you want me to test any similar runs. Next, I will try the above code you gave (#1961 (comment)). I have your branch installed, so I am able to see the init_empty: 'bool' = False argument you added.
Thanks for running some tests. Regarding the third test, it won't have any effect right now, even with this branch, because diffusers would also need to be changed. Basically, we'd have to ensure that diffusers passes the new argument. With the code from my comment above, you should, however, see the speedup when using this branch.
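To make "passing the argument through" concrete, here is a purely hypothetical sketch of what a downstream wrapper would need to do (the wrapper function is made up for illustration; only `inject_adapter_in_model` and its `init_empty` flag come from this PR):

```python
from torch import nn

from peft import LoraConfig, inject_adapter_in_model


def load_lora_into_model(model: nn.Module, lora_config: LoraConfig, init_empty: bool = False) -> nn.Module:
    # A downstream library has to forward the flag instead of swallowing it;
    # otherwise the adapter weights are still materialized for real before being overwritten.
    return inject_adapter_in_model(lora_config, model, init_empty=init_empty)
```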
Ahh, yeah, I missed it. Thanks for pointing that out.
I ran the above code using your branch with the same system config as the third test.
Also, in modelling_utils.py we have:
Shall we keep the default as False for both cases?
4.35 vs 5.64 isn't really a big difference. My guess is that this is I/O bound on your machine and the saved compute is only ~1 sec, similar to what I found. You could try running this a few times to see if that changes anything. Regardless of that, we can still proceed with the PR, though.
Are you referring to diffusers? Let's keep that part untouched for now. On the PEFT side, let's also keep the default as `False`.
Yeah, that makes sense.
Regarding this: I was running in a notebook; if I keep re-running that cell, it speeds up, but the memory grows by about 7 GB with each run. I'll start with the unit tests for the PR.
Just a reminder, @BenjaminBossan, that I will work on this. Sorry for the delay.
Feel free to ask any questions that may come up, @charchit7.
Hey @BenjaminBossan, any tips on speeding up loading a Flux LoRA weight on an API with: I'm seeing avg load times of Thank you.
@nachoal It's hard to say with the given information. Could you please create a separate issue and provide as much information as possible so that I can try to reproduce the error?
Thanks @BenjaminBossan, I just created an issue with more information here: #2055
Hey @BenjaminBossan, I am not able to find the time to work on this. Sadly, I keep getting caught up in multiple threads. I'm going to keep this open for anyone else who wants to continue it; otherwise I will complete it when I get time. Again, apologies for keeping this open for so long.
Thanks for letting me know @charchit7. I'll pick this PR back up then.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@sayakpaul I hope this is not too much to ask for a review. With this PR in PEFT, we can restore the initial speed of loading LoRAs in diffusers, as described in the initial comment.

When it comes to the review, a lot of files have been touched, but the majority consists of just the same small change applied to each PEFT method to account for the new argument. The core logic of this PR is actually in just these 3 lines; the rest is just passing the argument along, documentation, and tests.

Currently, this feature is strictly opt-in. However, for the use case of loading adapters, I think it should be safe to make this the default (i.e. switch to opt-out). My current plan is to leave this feature as opt-in for the upcoming release, with an announcement that the default will be switched and a call for users to test it. Then, for the next release after that, I would toggle the default unless users report errors.
@BenjaminBossan of course not! I will review it tomorrow, first thing my time.
Thanks for shipping this feature! I have mostly left some minor comments. My question is: do we want to call it `low_cpu_mem_usage` instead of `init_empty`? The former is used throughout `transformers` and `diffusers`.
```python
assert model.linear.lora_A["default"].weight.device.type == "meta"
set_peft_model_state_dict(model, peft_state_dict, init_empty=True)
model.linear.lora_A["default"].weight.device.type == "cpu"
```
Do we need to have the asserts in the example codebase? No strong opinions, though.
Changed.
Then let's have them within `print()`s and comment the expected values.
I added `# should be True`, not sure if that's what you meant.
```md
### Loading adapter weights is slow

Loading adapters like LoRA weights should generally be fast compared to loading the base model. However, there can be use cases where the adapter weights are quite large or where users need to load a large number of adapters -- the loading time can add up in this case. To speed up the loading time, you can pass the `init_empty=True` argument to [`~PeftModel.from_pretrained`] and [`~PeftModel.load_adapter`].
```
Could make sense to add a note on why there's an overhead on loading adapter layers (because they are first initialized and then they are populated with the pre-trained params).
src/peft/mapping.py (outdated)

```diff
@@ -210,6 +210,8 @@ def inject_adapter_in_model(
             The input model where the adapter will be injected.
         adapter_name (`str`, `optional`, defaults to `"default"`):
             The name of the adapter to be injected, if not provided, the default adapter name is used ("default").
+        init_empty (`bool`, `optional``, defaults to `False`):
+            Create empty adapter weights on meta device. Useful to speed up the process.
```
```diff
- Create empty adapter weights on meta device. Useful to speed up the process.
+ Create empty adapter weights on meta device. Useful to speed up the loading process.
```
There is no change in the suggestion.
Sorry, my bad. See now.
Ah I see, I fixed all instances.
```diff
@@ -801,6 +812,9 @@ def _move_adapter_to_device_of_base_layer(self, adapter_name: str, device: Optio
                 continue
             if adapter_name not in adapter_layer:
                 continue
+            if any(p.device == meta for p in adapter_layer.parameters()):
```
Similarly, do we have to handle the "else" part of this branch?
src/peft/utils/save_and_load.py (outdated)

```python
if init_empty:
    load_result = model.load_state_dict(peft_model_state_dict, strict=False, assign=True)
```
For my own knowledge: can you walk me through a scenario where `set_peft_model_state_dict()` would be required to be called with `init_empty=True`?
You replied yourself below :)
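For context on the underlying mechanism, a standalone PyTorch-only sketch (not PEFT code) of what `assign=True` changes; it assumes a PyTorch version where `load_state_dict` accepts the `assign` argument:

```python
import torch
import torch.nn as nn

# Stand-in for an adapter layer that was initialized empty (on the meta device).
with torch.device("meta"):
    layer = nn.Linear(8, 4, bias=False)

# Stand-in for weights loaded from a checkpoint.
state_dict = {"weight": torch.randn(4, 8)}

# The default copy-based loading cannot copy data into a meta tensor;
# with assign=True the meta parameter is replaced by the tensor from the state dict.
layer.load_state_dict(state_dict, assign=True)
print(layer.weight.device)  # cpu
```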
tests/test_initialization.py (outdated)

```python
# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
remapped_dict[new_key] = val.to(device)
errors = set_peft_model_state_dict(model, remapped_dict, init_empty=True)
```
Oh okay. So, `set_peft_model_state_dict()` has to be called with `init_empty=True` if the injection (`inject_adapter_in_model()`) was made with `init_empty=True`. I think that should be okay.
Yes, exactly. I guess we could try to automatically detect it in `set_peft_model_state_dict` to remove the argument, but I think since it's a very low-level API, it's okay to be explicit here (and maybe there is some edge case where users don't want it).
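As a rough sketch of that workflow on a toy module (written with the final argument name `low_cpu_mem_usage`, which was still called `init_empty` in this thread; the toy module and config values are placeholders):

```python
import torch
from torch import nn

from peft import (
    LoraConfig,
    get_peft_model_state_dict,
    inject_adapter_in_model,
    set_peft_model_state_dict,
)


class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)


config = LoraConfig(r=4, target_modules=["linear"], init_lora_weights=False)

# Reference injection with real weights; its state dict stands in for a trained adapter.
ref = inject_adapter_in_model(config, Toy())
peft_state_dict = get_peft_model_state_dict(ref)

# Empty injection: the adapter parameters are created on the meta device, nothing is allocated.
model = inject_adapter_in_model(config, Toy(), low_cpu_mem_usage=True)
print(model.linear.lora_A["default"].weight.device.type)  # "meta"

# Loading then also has to use low_cpu_mem_usage=True so the meta tensors are replaced.
set_peft_model_state_dict(model, peft_state_dict, low_cpu_mem_usage=True)
print(model.linear.lora_A["default"].weight.device.type)  # "cpu"
```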
tests/test_initialization.py (outdated)

```python
# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
```
Should this go away?
Right, done.
```python
# also test injecting directly
del model
model = self.transformers_class.from_pretrained(model_id).to(self.torch_device)
```
Would it make sense to do this for a diffusers model class as well?
I think this would require a big rewrite of the test without really improving the test coverage. Unless you think there is a specific reason why this would not work on diffusion models, I'd rather not do that. Since we plan to support this feature on the diffusers side, which will surely include tests, I think this will be covered there? WDYT?
Not sure if we would expose another argument for `low_cpu_mem_usage` for loading LoRAs. Would prefer to keep the `peft`-level atomic feature tests within `peft` itself. If the test suite needs a big rewrite, we could just create a tester class with a specific model class from `diffusers` and write a simple test case like this one. Will that work?
I added dedicated stable diffusion tests. These tests should more or less reflect how the code in diffusers would look in order to use this new feature. Most notably, renaming the argument.
Thanks a lot for the helpful comments, I think they should be addressed now.
When it comes to the new name, I agree that `low_cpu_mem_usage` corresponds mostly to what is implemented here. Personally, I find the name a bit confusing, but for consistency I agree it's the correct choice.
src/peft/peft_model.py (outdated)

```diff
@@ -1125,14 +1149,18 @@ def load_adapter(
                 raise ValueError("Cannot set a prompt learning adapter to trainable when loading pretrained adapter.")
             else:
                 peft_config.inference_mode = not is_trainable
-            self.add_adapter(adapter_name, peft_config)
+            self.add_adapter(adapter_name, peft_config, init_empty=init_empty)
```

Ad hoc, I can't think of a use case, but there might be. I think the description in the docstring of `add_adapter`, "Don't use this option when creating a new PEFT adapter for training", is sufficient here, as users would have to opt in to use the option.
Thanks for the additional feedback, I think your concerns have been addressed.
Thanks for shipping this. LGTM!
Thanks again, @BenjaminBossan, sorry for holding this up for such a long time.
This PR allows initializing the adapter weights as empty, i.e. on the meta device, by passing low_cpu_mem_usage=True. Why would this be useful? For PEFT training, it is indeed not useful, as we need the real weights in order to train the model. However, when loading a trained PEFT adapter, it is unnecessary to initialize the adapters for real, as we override them with the loaded weights later. In the grand scheme of things, loading the base model will typically be much slower, but if the user loads, say, dozens of adapters, the overhead could add up. Of course, besides loading the model, this has no performance impact and is thus not a high priority feature. For the time being, this is completely opt-in. However, it should be safe to make this the default for loading adapters. Therefore, we may change the default there in the future. --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
PEFT added support for low_cpu_mem_usage=True when loading adapters in huggingface/peft#1961. This feature is now available when installing PEFT v0.13.0. With this PR, this option is also supported when loading PEFT adapters directly into transformers models. Additionally, with this PR, huggingface/diffusers#9510 will be unblocked, which implements this option in diffusers.
…3725) * [PEFT] Support low_cpu_mem_usage for PEFT loading PEFT added support for low_cpu_mem_usage=True when loading adapters in huggingface/peft#1961. This feature is now available when installing PEFT v0.13.0. With this PR, this option is also supported when loading PEFT adapters directly into transformers models. Additionally, with this PR, huggingface/diffusers#9510 will be unblocked, which implements this option in diffusers. * Fix typo
This PR allows initializing the adapter weights as empty, i.e. on the meta device.
Description
Why would this be useful? For PEFT training, it is indeed not useful, as we need the real weights in order to train the model. However, when loading a trained PEFT adapter, it is unnecessary to initialize the adapters for real, as we override them with the loaded weights later.
In the grand scheme of things, loading the base model will typically be much slower, but if the user loads, say, dozens of adapters, the overhead could add up. Of course, besides loading the model, this has no performance impact and is thus not a high priority feature.
Progress
- `PeftMixedModel`
- `low_cpu_mem_usage` (for context, the first name was `init_empty`)

Diffusers issue
I ran the script from the original diffusers issue to confirm that this helps. On my machine, loading the LoRA adapter takes 2 sec. When I change the default to `low_cpu_mem_usage=True`, it goes down to 0.7 sec, which is in line with the previous diffusers results. Therefore, this PR indeed allows getting back to the old speed. What is still missing is a PR on the diffusers side to:
- change `inject_adapter_in_model` to use `low_cpu_mem_usage=True`
- change `set_peft_model_state_dict` to also use `low_cpu_mem_usage=True`
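As a closing illustration, a minimal end-to-end sketch of the user-facing API after the rename (the toy module and the path are placeholders; assumes a PEFT version that includes this PR):

```python
import torch
from torch import nn

from peft import LoraConfig, PeftModel, get_peft_model


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin0 = nn.Linear(16, 32)
        self.lin1 = nn.Linear(32, 2)

    def forward(self, x):
        return self.lin1(torch.relu(self.lin0(x)))


path = "/tmp/peft-low-cpu-mem-usage-demo"  # placeholder path

# Training-time path: real weights are needed, so low_cpu_mem_usage is not used here.
config = LoraConfig(r=8, target_modules=["lin0", "lin1"], init_lora_weights=False)
peft_model = get_peft_model(MLP(), config)
peft_model.save_pretrained(path)

# Loading path: the adapters are first created empty (meta device), then filled from the checkpoint.
base = MLP()
loaded = PeftModel.from_pretrained(base, path, low_cpu_mem_usage=True)
print(loaded(torch.randn(4, 16)).shape)  # torch.Size([4, 2])
```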