ENH: Allow empty initialization of adapter weight #1961
Conversation
This PR allows initializing the adapter weights as empty, i.e. on the meta device. It is still a work in progress.

Why would this be useful? For PEFT training, it is indeed not useful, as we need the real weights in order to train the model. However, when loading a trained PEFT adapter, it is unnecessary to initialize the adapters for real, as we override them with the loaded weights later. In the grand scheme of things, loading the base model will typically be much slower, but if the user loads, say, dozens of adapters, the overhead could add up. Of course, besides loading the model, this has no performance impact and is thus not a high priority feature.

A lot is still missing from this PR:
- unit tests
- documentation
- can the argument be passed to all relevant methods yet?
- adapters other than LoRA also need to be changed
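To illustrate the basic idea, here is a minimal PyTorch-only sketch (not the PEFT implementation itself; it assumes a PyTorch version that supports the `torch.device` context manager):

```python
import torch
import torch.nn as nn

# Creating modules under the meta device records only shapes and dtypes;
# no real storage is allocated for the weights.
with torch.device("meta"):
    lora_A = nn.Linear(4096, 64, bias=False)
    lora_B = nn.Linear(64, 4096, bias=False)

print(lora_A.weight.device)   # meta
print(lora_A.weight.is_meta)  # True -- the tensor has no data backing it
```

The real values are only assigned later, when the trained adapter checkpoint is loaded.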
Here are the results from a first test (warm cache, M.2 SSD):

```python
import gc
import os
import time

import torch
from transformers import AutoModelForCausalLM

from peft import LoraConfig, PeftModel, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"
device = "cuda"  # cpu fails
dtype = torch.bfloat16
torch.manual_seed(0)
inputs = torch.arange(10).view(-1, 1).to(device)
path = f"/tmp/peft/test-empty-weights/{model_id.replace('/', '--')}"

# first save a peft adapter
if not os.path.exists(f"{path}/adapter_model.safetensors"):
    print("adapter does not exist yet, creating it")
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
    # choose high rank to make the effect stronger
    config = LoraConfig(r=64, init_lora_weights=False, target_modules="all-linear")
    model = get_peft_model(model, config).eval()
    with torch.inference_mode():
        outputs_before = model(inputs)
    model.save_pretrained(path)
    # save outputs
    torch.save(outputs_before, f"{path}/outputs_before.pt")
    del model
    torch.cuda.empty_cache()
    gc.collect()
else:
    print("adapter exists, loading")
    outputs_before = torch.load(f"{path}/outputs_before.pt")

# load without init_empty:
tic = time.perf_counter()
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
toc = time.perf_counter()
print(f"loading the base model: {toc - tic:.2f}s")

tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=False)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_not_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_not_empty.logits)
print(f"loading LoRA without init_empty: {toc - tic:.2f}s")

del model
torch.cuda.empty_cache()
gc.collect()

# load with init_empty:
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=dtype)
tic = time.perf_counter()
model = PeftModel.from_pretrained(model, path, init_empty=True)
toc = time.perf_counter()
with torch.inference_mode():
    outputs_empty = model(inputs)
assert torch.allclose(outputs_before.logits, outputs_empty.logits)
print(f"loading LoRA with init_empty: {toc - tic:.2f}s")

del model
torch.cuda.empty_cache()
gc.collect()
```
Hey @BenjaminBossan, to work on this PR, do we need a Mac to test? I could see other PRs mentioning speed-ups for Mac M2. I am interested in completing it by this weekend if a Mac specifically is not required.
I'd be happy if you could pick this up, thanks. Just be aware that this is going to require a bit of deeper PEFT knowledge to complete.
No, a Mac is not needed. M2 was referring to M.2 SSDs, I should have been more precise :)
> I'd be happy if you could pick this up, thanks. Just be aware that this is going to require a bit of deeper PEFT knowledge to complete.

Sure, @BenjaminBossan, could you please tell me, if I were to start on this, how I should approach it so that I can acquire deeper PEFT knowledge? Happy to read/work on it.
It's hard to tell how to best get started. The code base is big, but that doesn't mean you need to know it in its entirety to get up to speed (even I don't know everything). Learning by doing is probably going to work best, but I think it can be easy to get stuck, so you may want to start with a smaller PR if you notice that.

For this PR, I have listed the TODOs at the top. It would probably be best to start with a unit test to ensure that the feature is working as intended and that later changes don't break it. The unit tests don't need to be all-comprehensive, that can be added later in the PR.

For reference, the script I used to measure the times is the one posted above in the PR description; it could be adapted for a unit test. Of course, the base model needs to be much smaller than Llama 8B, and I'm not sure how flaky the measured times will be.
Thank you for the detailed response. I'll look into it.
@charchit7 Do you still plan on working on this?
Hey @BenjaminBossan, sorry, I was held up with work. Would you mind if I took some more time on this?
No, I don't mind, just let me know if you don't have the time to work on this. The reason I asked is that someone else might be able to work on this, and I want to avoid duplicate work.
Sure @BenjaminBossan, I'll start on this today. If I get stuck somewhere again, I will let you know for sure. Thank you for being patient with me.
Thanks a lot. Again, don't feel pressured, there is no deadline. I was just asking because we had contributors in the past who wanted to work on something but then ghosted -- which is fine, just let us know so that the work is not blocked.
Hey @BenjaminBossan, I have started working on the problem. The first thing I wanted to try was benchmarking the timings by running the tests you posted in huggingface/diffusers#8953 (comment).
Memory consumption for the LoRA + model:
First test, my configuration:
Second test, my configuration:
Third test (for this I installed peft from your branch, allow-empty-init-of-adapter-weights), my configuration:
Please let me know if you want me to test any similar runs. Next, I will try the above code you gave (#1961 (comment)). I have your branch installed, so I am able to see the init_empty: 'bool' = False argument you added.
Thanks for running some tests. Regarding the third test, it won't have any effect right now, even with this branch, because diffusers would also need to be changed. Basically, we'd have to ensure that diffusers passes the new argument. With the code from my comment above, you should, however, see the speedup when using this branch.
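To make "passing the argument through" concrete, here is a purely hypothetical sketch of what a downstream wrapper would need to do (the wrapper function is made up for illustration; only `inject_adapter_in_model` and its `init_empty` flag come from this PR):

```python
from torch import nn

from peft import LoraConfig, inject_adapter_in_model


def load_lora_into_model(model: nn.Module, lora_config: LoraConfig, init_empty: bool = False) -> nn.Module:
    # A downstream library has to forward the flag instead of swallowing it;
    # otherwise the adapter weights are still materialized for real before being overwritten.
    return inject_adapter_in_model(lora_config, model, init_empty=init_empty)
```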
Ahh, yeah, I missed it. Thanks for pointing that out.
I ran the above code using your branch with the same system config as the third test.
Also, in modelling_utils.py we have:
Shall we keep the default as False for both cases?
4.35 vs 5.64 isn't really a big difference. My guess is that this is I/O bound on your machine and the saved compute is only ~1 sec, similar to what I found. You could try running this a few times to see if that changes anything. Regardless of that, we can still proceed with the PR, though.
Are you referring to diffusers? Let's keep that part untouched for now. On the PEFT side, let's also keep the default as `False`.
Yeah, that makes sense.
Regarding this: I was running in a notebook; if I keep re-running that cell, it speeds up, but the memory grows by about 7 GB with each run. I'll start with the unit tests for the PR.
Just a reminder, @BenjaminBossan, that I will work on this. Sorry for the delay.
Feel free to ask any questions that may come up, @charchit7.
Hey @BenjaminBossan, any tips on speeding up loading a Flux LoRA weight on an API with: I'm seeing avg load times of Thank you.
@nachoal It's hard to say with the given information. Could you please create a separate issue and provide as much information as possible so that I can try to reproduce the error?
Thanks @BenjaminBossan, I just created an issue with more information here: #2055
Hey @BenjaminBossan, I am not able to find the time to work on this. Sadly, I keep getting caught up in multiple threads. I'm going to keep this open for anyone else who wants to continue it; otherwise I will complete it when I get time. Again, apologies for keeping this open for so long.
Thanks for letting me know @charchit7. I'll pick this PR back up then.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@sayakpaul I hope this is not too much to ask for a review. With this PR in PEFT, we can restore the initial speed of loading LoRAs in diffusers, as described in the initial comment.

When it comes to the review, a lot of files have been touched, but the majority consists of just the same small change applied to each PEFT method to account for the new argument. The core logic of this PR is actually in just these 3 lines; the rest is just passing the argument along, documentation, and tests.

Currently, this feature is strictly opt-in. However, for the use case of loading adapters, I think it should be safe to make this the default (i.e. switch to opt-out). My current plan is to leave this feature as opt-in for the upcoming release, with an announcement that the default will be switched and a call for users to test it. Then, for the next release after that, I would toggle the default unless users report errors.
@BenjaminBossan of course not! I will review it tomorrow, first thing my time.
Thanks for shipping this feature! I have mostly left some minor comments. My question is: do we want to call it `low_cpu_mem_usage` instead of `init_empty`? The former is used throughout `transformers` and `diffusers`.
```python
assert model.linear.lora_A["default"].weight.device.type == "meta"
set_peft_model_state_dict(model, peft_state_dict, init_empty=True)
model.linear.lora_A["default"].weight.device.type == "cpu"
```
Do we need to have the asserts in the example codebase? No strong opinions, though.
Changed.
Then let's have them within `print()`s and comment the expected values.
I added `# should be True`, not sure if that's what you meant.
```md
### Loading adapter weights is slow

Loading adapters like LoRA weights should generally be fast compared to loading the base model. However, there can be use cases where the adapter weights are quite large or where users need to load a large number of adapters -- the loading time can add up in this case. To speed up the loading time, you can pass the `init_empty=True` argument to [`~PeftModel.from_pretrained`] and [`~PeftModel.load_adapter`].
```
Could make sense to add a note on why there's an overhead on loading adapter layers (because they are first initialized and then they are populated with the pre-trained params).
src/peft/mapping.py (outdated)

```diff
@@ -210,6 +210,8 @@ def inject_adapter_in_model(
             The input model where the adapter will be injected.
         adapter_name (`str`, `optional`, defaults to `"default"`):
             The name of the adapter to be injected, if not provided, the default adapter name is used ("default").
+        init_empty (`bool`, `optional``, defaults to `False`):
+            Create empty adapter weights on meta device. Useful to speed up the process.
```
```diff
- Create empty adapter weights on meta device. Useful to speed up the process.
+ Create empty adapter weights on meta device. Useful to speed up the loading process.
```
There is no change in the suggestion.
Sorry, my bad. See now.
Ah I see, I fixed all instances.
```diff
@@ -801,6 +812,9 @@ def _move_adapter_to_device_of_base_layer(self, adapter_name: str, device: Optio
                 continue
             if adapter_name not in adapter_layer:
                 continue
+            if any(p.device == meta for p in adapter_layer.parameters()):
```
Similarly, do we have to handle the "else" part of this branch?
src/peft/utils/save_and_load.py (outdated)

```python
if init_empty:
    load_result = model.load_state_dict(peft_model_state_dict, strict=False, assign=True)
```
For my own knowledge: can you walk me through a scenario where `set_peft_model_state_dict()` would be required to be called with `init_empty=True`?
You replied yourself below :)
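For context on the underlying mechanism, a standalone PyTorch-only sketch (not PEFT code) of what `assign=True` changes; it assumes a PyTorch version where `load_state_dict` accepts the `assign` argument:

```python
import torch
import torch.nn as nn

# Stand-in for an adapter layer that was initialized empty (on the meta device).
with torch.device("meta"):
    layer = nn.Linear(8, 4, bias=False)

# Stand-in for weights loaded from a checkpoint.
state_dict = {"weight": torch.randn(4, 8)}

# The default copy-based loading cannot copy data into a meta tensor;
# with assign=True the meta parameter is replaced by the tensor from the state dict.
layer.load_state_dict(state_dict, assign=True)
print(layer.weight.device)  # cpu
```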
tests/test_initialization.py (outdated)

```python
# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
remapped_dict[new_key] = val.to(device)
errors = set_peft_model_state_dict(model, remapped_dict, init_empty=True)
```
Oh okay. So, `set_peft_model_state_dict()` has to be called with `init_empty=True` if the injection (`inject_adapter_in_model()`) was made with `init_empty=True`. I think that should be okay.
Yes, exactly. I guess we could try to automatically detect it in `set_peft_model_state_dict` to remove the argument, but I think since it's a very low-level API, it's okay to be explicit here (and maybe there is some edge case where users don't want it).
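As a rough sketch of that workflow on a toy module (written with the final argument name `low_cpu_mem_usage`, which was still called `init_empty` in this thread; the toy module and config values are placeholders):

```python
import torch
from torch import nn

from peft import (
    LoraConfig,
    get_peft_model_state_dict,
    inject_adapter_in_model,
    set_peft_model_state_dict,
)


class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(16, 16)


config = LoraConfig(r=4, target_modules=["linear"], init_lora_weights=False)

# Reference injection with real weights; its state dict stands in for a trained adapter.
ref = inject_adapter_in_model(config, Toy())
peft_state_dict = get_peft_model_state_dict(ref)

# Empty injection: the adapter parameters are created on the meta device, nothing is allocated.
model = inject_adapter_in_model(config, Toy(), low_cpu_mem_usage=True)
print(model.linear.lora_A["default"].weight.device.type)  # "meta"

# Loading then also has to use low_cpu_mem_usage=True so the meta tensors are replaced.
set_peft_model_state_dict(model, peft_state_dict, low_cpu_mem_usage=True)
print(model.linear.lora_A["default"].weight.device.type)  # "cpu"
```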
tests/test_initialization.py (outdated)

```python
# module_name, _, param_name = new_key.rpartition(".")
# new_key = f"{module_name}.default.{param_name}"
```
Should this go away?
Right, done.
```python
# also test injecting directly
del model
model = self.transformers_class.from_pretrained(model_id).to(self.torch_device)
```
Would it make sense to do this for a diffusers model class as well?
I think this would require a big rewrite of the test without really improving the test coverage. Unless you think there is a specific reason why this would not work on diffusion models, I'd rather not do that. Since we plan to support this feature on the diffusers side, which will surely include tests, I think this will be covered there? WDYT?
Not sure if we would expose another argument for `low_cpu_mem_usage` for loading LoRAs. Would prefer to keep the `peft`-level atomic feature tests within `peft` itself. If the test suite needs a big rewrite, we could just create a tester class with a specific model class from `diffusers` and write a simple test case like this one. Will that work?
I added dedicated stable diffusion tests. These tests should more or less reflect how the code in diffusers would look in order to use this new feature. Most notably, renaming the argument.
Thanks a lot for the helpful comments, I think they should be addressed now.
When it comes to the new name, I agree that `low_cpu_mem_usage` corresponds mostly to what is implemented here. Personally, I find the name a bit confusing, but for consistency I agree it's the correct choice.
src/peft/peft_model.py (outdated)

```diff
@@ -1125,14 +1149,18 @@ def load_adapter(
                 raise ValueError("Cannot set a prompt learning adapter to trainable when loading pretrained adapter.")
             else:
                 peft_config.inference_mode = not is_trainable
-            self.add_adapter(adapter_name, peft_config)
+            self.add_adapter(adapter_name, peft_config, init_empty=init_empty)
```

Ad hoc, I can't think of a use case, but there might be. I think the description in the docstring of `add_adapter`, "Don't use this option when creating a new PEFT adapter for training", is sufficient here, as users would have to opt in to use the option.
Thanks for the additional feedback, I think your concerns have been addressed.
Thanks for shipping this. LGTM!
Thanks again, @BenjaminBossan, sorry for holding this up for such a long time.
This PR allows initializing the adapter weights as empty, i.e. on the meta device, by passing low_cpu_mem_usage=True. Why would this be useful? For PEFT training, it is indeed not useful, as we need the real weights in order to train the model. However, when loading a trained PEFT adapter, it is unnecessary to initialize the adapters for real, as we override them with the loaded weights later. In the grand scheme of things, loading the base model will typically be much slower, but if the user loads, say, dozens of adapters, the overhead could add up. Of course, besides loading the model, this has no performance impact and is thus not a high priority feature. For the time being, this is completely opt-in. However, it should be safe to make this the default for loading adapters. Therefore, we may change the default there in the future. --------- Co-authored-by: Sayak Paul <spsayakpaul@gmail.com>
PEFT added support for low_cpu_mem_usage=True when loading adapters in huggingface/peft#1961. This feature is now available when installing PEFT v0.13.0. With this PR, this option is also supported when loading PEFT adapters directly into transformers models. Additionally, with this PR, huggingface/diffusers#9510 will be unblocked, which implements this option in diffusers.
…3725) * [PEFT] Support low_cpu_mem_usage for PEFT loading PEFT added support for low_cpu_mem_usage=True when loading adapters in huggingface/peft#1961. This feature is now available when installing PEFT v0.13.0. With this PR, this option is also supported when loading PEFT adapters directly into transformers models. Additionally, with this PR, huggingface/diffusers#9510 will be unblocked, which implements this option in diffusers. * Fix typo
This PR allows initializing the adapter weights as empty, i.e. on the meta device.
Description
Why would this be useful? For PEFT training, it is indeed not useful, as we need the real weights in order to train the model. However, when loading a trained PEFT adapter, it is unnecessary to initialize the adapters for real, as we override them with the loaded weights later.
In the grand scheme of things, loading the base model will typically be much slower, but if the user loads, say, dozens of adapters, the overhead could add up. Of course, besides loading the model, this has no performance impact and is thus not a high priority feature.
Progress
- `PeftMixedModel`
- `low_cpu_mem_usage` (for context, the first name was `init_empty`)

Diffusers issue
I ran the script from the original diffusers issue to confirm that this helps. On my machine, loading the LoRA adapter takes 2 sec. When I change the default to `low_cpu_mem_usage=True`, it goes down to 0.7 sec, which is in line with the previous diffusers results. Therefore, this PR indeed allows getting back to the old speed. What is still missing is a PR on the diffusers side to:
- change `inject_adapter_in_model` to use `low_cpu_mem_usage=True`
- change `set_peft_model_state_dict` to also use `low_cpu_mem_usage=True`
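As a closing illustration, a minimal end-to-end sketch of the user-facing API after the rename (the toy module and the path are placeholders; assumes a PEFT version that includes this PR):

```python
import torch
from torch import nn

from peft import LoraConfig, PeftModel, get_peft_model


class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin0 = nn.Linear(16, 32)
        self.lin1 = nn.Linear(32, 2)

    def forward(self, x):
        return self.lin1(torch.relu(self.lin0(x)))


path = "/tmp/peft-low-cpu-mem-usage-demo"  # placeholder path

# Training-time path: real weights are needed, so low_cpu_mem_usage is not used here.
config = LoraConfig(r=8, target_modules=["lin0", "lin1"], init_lora_weights=False)
peft_model = get_peft_model(MLP(), config)
peft_model.save_pretrained(path)

# Loading path: the adapters are first created empty (meta device), then filled from the checkpoint.
base = MLP()
loaded = PeftModel.from_pretrained(base, path, low_cpu_mem_usage=True)
print(loaded(torch.randn(4, 16)).shape)  # torch.Size([4, 2])
```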