Loading lora weights for FLUX pipeline is extremely slow #2055
Comments
Thanks for the detailed description. I cannot reproduce the issue of slow LoRA loading; for me it takes 0.77 sec to load the LoRA weights (excluding download and extraction time). However, I do have a suspicion of what's going on. Based on your screenshot, your GPU does not have enough memory to load the whole model, so parts of it may end up on CPU. Just for reference, I'm using 2 GPUs with 24 GB of VRAM each and need to load the model like so:

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    device_map="balanced",
    max_memory={0: "24GB", 1: "20GB"},
)

When I check the devices after loading the LoRA adapter, I get:
Using this code:

attrs = ["text_encoder", "text_encoder_2", "transformer", "vae"]
for attr in attrs:
    obj = getattr(pipe, attr)
    if not obj:
        continue
    print(attr.upper())
    print(f"{obj.device=}")
    print("param devices:", {p.device for p in obj.parameters()})
    print("param dtypes: ", {p.dtype for p in obj.parameters()})
    print(f"num params: {sum(p.numel() for p in obj.parameters()):,}")

The PR you referenced should still help with loading times on CPU, but it's unfortunately not yet in a state where it can be tested. Inference would still be super slow. |
Hi @BenjaminBossan, I added your devices check before and after loading the LoRA and got CUDA, so I don't think the transformer is being loaded on the CPU. Note that times are still in the 200+ second range. I'm curious, did you use my example LoRA URL? Could not having a LoRA parameter file affect the loading part? I'm using an A100 SXM GPU, which has 80 GB of VRAM, so I don't think the issue is on that side.
|
Thanks for the extra info; my suspicion about the device turned out to be incorrect then. It is indeed very strange that it takes so long for you when it takes less than 1 sec for me. Could you please take separate timings for the download and extraction (which I skipped in my test) vs. the actual loading of the weights? Alternatively, could you just use a local file instead of downloading and deleting it each time? If you do the latter, please run the benchmark at least twice to check if that makes a difference. |
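For reference, the separate timings could be taken roughly like this (a minimal sketch; the URL and file name are placeholders, not the ones from the actual script):

from time import perf_counter

import requests
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# 1) time the download on its own (placeholder URL)
t0 = perf_counter()
resp = requests.get("https://example.com/my_custom_lora.safetensors")
with open("my_custom_lora.safetensors", "wb") as f:
    f.write(resp.content)
print(f"download took {perf_counter() - t0:.2f} s")

# 2) time only the actual weight loading from the local file
t1 = perf_counter()
pipe.load_lora_weights("my_custom_lora.safetensors", adapter_name="main_lora")
print(f"load_lora_weights took {perf_counter() - t1:.2f} s")
|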
Just want to chime in here, I'm stumped by a similar slowdown.

System information:

OS = linux (debian-slim:12.5)
GPU = A10G, 24GB VRAM
Python = 3.12.5

Here are my package versions:

"diffusers==0.30.2",
"transformers==4.44.2",
"accelerate==0.34.2",
"safetensors==0.4.4",
"torch==2.4.1",
"peft==0.12.0"

Here is my code:

from time import perf_counter

import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

MODEL_URL = "https://civitai.com/api/download/models/290640?type=Model&format=SafeTensor&size=pruned&fp=fp16"
MODEL_FILE_PATH = "/models/checkpoints/pony_diffusion_v6_xl.safetensors"
VAE_URL="https://civitai.com/api/download/models/290640?type=VAE&format=SafeTensor"
VAE_FILE_PATH="/models/vae/pony_diffusion_v6_xl_vae.safetensors"
STYLE_LORA_URL="https://civitai.com/api/download/models/820564?type=Model&format=SafeTensor&token=<YOUR_CIVITAI_TOKEN>"
STYLE_LORA_FILE_PATH="/models/loras/pony/pony_style_lora.safetensors"
# Load VAE
vae_load_start = perf_counter()
vae = AutoencoderKL.from_single_file(VAE_FILE_PATH, torch_dtype=torch.float16)
vae_load_end = perf_counter()
print(f"VAE loaded in {vae_load_end - vae_load_start} seconds")
# Load Model
model_load_start = perf_counter()
self.pipeline = StableDiffusionXLPipeline.from_single_file(
MODEL_FILE_PATH,
vae=vae,
safety_checker=None,
torch_dtype=torch.float16,
)
model_load_end = perf_counter()
print(f"Model loaded in {model_load_end - model_load_start} seconds")
self.pipeline.to("cuda")
# Set scheduler to Euler Ancestral Discrete Scheduler
self.pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(self.pipeline.scheduler.config)
# Load LoRAs
lora_load_start = perf_counter()
self.pipeline.load_lora_weights(STYLE_LORA_FILE_PATH, adapter_name="style")
lora_load_end = perf_counter()
print(f"LoRA loaded in {lora_load_end - lora_load_start} seconds")
# Add LoRAs to the pipeline
self.pipeline.set_adapters(["style"], adapter_weights=[0.8])
# Generate the image
image_generation_start = perf_counter()
full_prompt = f"score_9, score_8_up, score_7_up, score_6_up,source_anime, {prompt}"
image = self.pipeline(
prompt=full_prompt,
negative_prompt="bad quality, score_3, score_2, score_1",
height=1024,
width=1024,
num_inference_steps=25,
guidance_scale=8.5,
).images[0]
image_generation_end = perf_counter()
print(f"Image generated in {image_generation_end - image_generation_start} seconds")
# Unload LoRAs from the pipeline
unload_lora_start = perf_counter()
self.pipeline.unload_lora_weights()
unload_lora_end = perf_counter()
print(f"LoRA unloaded in {unload_lora_end - unload_lora_start} seconds") Sample benchmarks: Model loaded in 3.0992493440000004 seconds
LoRA loaded in 2.928138479000001 seconds
... These issues seem related, Please let me know if this is the right place to bring this up or if I should open a new issue. |
Judging by the numbers you get, I don't think the issues are related. But I do think that your issue could be the same as huggingface/diffusers#8953 and will be addressed by the work in #1961. |
@nachoal Hello, I have a small question. I fine-tuned FLUX dev on Replicate and I have 2 files, config.yaml and lora.safetensors. These are the files I got from Replicate. |
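For reference, a minimal sketch of loading such a locally downloaded lora.safetensors into the FLUX pipeline; the folder path, adapter name, and prompt are placeholders, and it assumes the file is in a diffusers/PEFT-compatible format (the training config.yaml is not needed for loading):

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# point load_lora_weights at the folder that holds the downloaded file
pipe.load_lora_weights(
    "path/to/replicate_output",          # placeholder folder containing lora.safetensors
    weight_name="lora.safetensors",
    adapter_name="my_lora",
)
pipe.set_adapters(["my_lora"], adapter_weights=[0.8])

image = pipe("a portrait photo", num_inference_steps=25).images[0]  # placeholder prompt
|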
Was there a resolution to this? I'm having similar difficulties to @nachoal: flux-dev loads relatively quickly, but loading any LoRA, regardless of size, takes multiple minutes. The slow loading persists when I load local files as well. I'm also using an 80GB A100. Example code:
|
Unfortunately, I also can't reproduce this issue. With the adapter downloaded, loading it took 2.5 sec for me on the first run and 0.5 sec on subsequent runs (probably due to disk caching). Some questions:
|
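For reference, separating the first load from subsequent loads of the same file can be done roughly like this (a minimal sketch; assumes `pipe` is an already-initialized FluxPipeline and the adapter file is local):

from time import perf_counter

# repeated loads of the same local file; the first run typically pays the
# disk read, later runs hit the OS page cache
for i in range(3):
    start = perf_counter()
    pipe.load_lora_weights("my_custom_lora.safetensors", adapter_name="main_lora")
    print(f"run {i}: load_lora_weights took {perf_counter() - start:.2f} s")
    pipe.unload_lora_weights()  # remove the adapter so each iteration starts clean
|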
|
@BenjaminBossan how can you load the LoRA weights, please? If you have any hands-on material, could you share it? |
I tried using these exact versions but my results still don't change: loading the LoRA takes ~0.5 sec and the whole loading script finishes in 9 sec.

Maybe you could check a couple of other LoRAs to see if it's just this one or if all of them are slow. But I think the next logical step would be to profile the loading step to figure out what exactly is so slow. Do you have some experience with profiling in Python? If I could reproduce this, I'd do it myself, but as it is I can't.

@BasmaElhoseny01 I'm not sure what you mean; in this issue you can find code snippets that show how LoRA is loaded. If you need more general info on how to use LoRA in diffusers, check the diffusers docs. |
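For reference, a first pass at profiling the loading step could wrap the load call in cProfile (a sketch; `extracted_folder` and the weight name are taken from the reproduction script and should be adjusted to your own files):

import cProfile
import pstats

# assumes `pipe` is the already-initialized FluxPipeline from the script above
with cProfile.Profile() as profiler:
    pipe.load_lora_weights(
        extracted_folder,
        weight_name="flux_train_latentgen.safetensors",
        adapter_name="main_lora",
    )

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(30)  # top 30 call paths by cumulative time
|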
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. |
System Info
Installed packages:
Python version:
3.10.12
System:
Linux 1936c0a77ae2 5.15.0-102-generic #112-Ubuntu SMP Tue Mar 5 16:50:32 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
nvidia-smi
Who can help?
@BenjaminBossan
Information
Tasks
Reproduction
Expected behavior
Issue
LoRA loading takes more than 200 seconds in the load_lora_weights step for custom LoRAs. The script above includes a working URL for a LoRA trained on a random person, which you can use for testing; you will see that load times are considerably long (more than 20 seconds on average) when running it, even with 100% GPU usage during the script.
Expected behavior
Lower load times
Reproduction steps
I'm running the script with the following command:
Then I get the following benchmarks:
Notice the time it takes to load the LoRA weights: the step

pipe.load_lora_weights(extracted_folder, weight_name="flux_train_latentgen.safetensors", adapter_name="main_lora")

is taking more than 290 seconds in total. I have tried both

pipe.enable_model_cpu_offload()
pipe.to("cuda")

and am currently using pipe.to("cuda").

Help
I would appreciate any pointers or optimizations, apart from leaving the LoRA in VRAM. (This is part of a larger process that receives a LoRA URL and processes images based on it, so at any given time I need to be able to load and unload FLUX LoRAs as fast as possible.) Thanks!
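A rough sketch of the intended load, generate, unload cycle (placeholder file name and prompt, not the actual production code):

from time import perf_counter

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

def generate_with_lora(lora_path: str, prompt: str):
    # load the requested LoRA, generate one image, then unload so the
    # next request starts from a clean pipeline
    t0 = perf_counter()
    pipe.load_lora_weights(lora_path, adapter_name="main_lora")
    print(f"load_lora_weights: {perf_counter() - t0:.2f} s")

    image = pipe(prompt, num_inference_steps=25).images[0]

    t1 = perf_counter()
    pipe.unload_lora_weights()
    print(f"unload_lora_weights: {perf_counter() - t1:.2f} s")
    return image

image = generate_with_lora("my_custom_lora.safetensors", "a portrait photo")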