
Loading LoRA weights in diffusers with a peft backend slows down as more paths are added to PYTHONPATH #1576

Closed
tisles opened this issue Mar 21, 2024 · 4 comments · Fixed by #1584


@tisles
Contributor

tisles commented Mar 21, 2024

System Info

accelerate==0.21.0
diffusers==0.26.3
peft==0.9.0
safetensors==0.3.3
tokenizers==0.15.2
torch==2.2.1
transformers==4.36.2

Who can help?

@sayakpaul

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder
  • My own task or dataset (give details below)

Reproduction

from diffusers import DiffusionPipeline
import time
import torch
import sys
import os
import shutil

pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")

loras = [
    {
        "adapter_name": "anime"
        "location": "./anime",
        "weight_name": "anime.safetensors",
        "token": "my_anime"
    },
]

def run_dynamic_lora_inference(lora):
    start_load_time = time.time()
    pipe.load_lora_weights(lora["location"], weight_name=lora["weight_name"], adapter_name=lora["adapter_name"])
    end_load_time = time.time()
    prompt = f"Illustration of a dog in the style of {lora["token"]}"

    start_fuse_time = time.time()
    pipe.fuse_lora()
    end_fuse_time = time.time()

    start_set_adapter_time = time.time()
    pipe.set_adapters(lora["adapter_name"])
    end_set_adapter_time = time.time()

    start_inference_time = time.time()
    image = pipe(
        prompt, num_inference_steps=30, generator=torch.manual_seed(0)
    ).images[0]
    end_inference_time = time.time()

    start_unfuse_time = time.time()
    pipe.unfuse_lora()
    end_unfuse_time = time.time()

    start_unload_time = time.time()
    pipe.unload_lora_weights()
    end_unload_time = time.time()

    image.save(f"./{lora['adapter_name']}.png")

    print("Load time:", end_load_time - start_load_time)
    print("Fuse time:", end_fuse_time - start_fuse_time)
    print("Set adapter time", end_set_adapter_time - start_set_adapter_time)
    print("Inference time:", end_inference_time - start_inference_time)
    print("Unfuse time:", end_unfuse_time - start_unfuse_time)
    print("Unload time:", end_unload_time - start_unload_time)

def add_to_python_path():
    root_path = "./folders"
    shutil.rmtree(root_path, ignore_errors=True)  # don't fail if the folder doesn't exist yet
    os.mkdir(root_path)

    folders = [f"folder_{x}" for x in range(10000)]
    for folder in folders:
        os.mkdir(os.path.join(root_path, folder))
        sys.path.append(os.path.join(root_path, folder))

def main():
    add_to_python_path()  # populate sys.path with the extra folders before loading
    for lora in loras:
        run_dynamic_lora_inference(lora)

main()

Flamegraph: [profile image attached to the issue]

Expected behavior

We run a system with a somewhat large PYTHONPATH that we can't truncate, and we are currently blocked from upgrading diffusers to any version that uses peft for LoRA inference.

The reproduction script is loosely based on this post: https://huggingface.co/blog/lora-adapters-dynamic-loading

We've observed that the time taken by load_lora_weights increases significantly as more paths are added to PYTHONPATH. This can be reproduced with the example provided: with 10,000 folders added to PYTHONPATH, we get the following latencies:

Load time: 291.78441095352173
Fuse time: 0.12406659126281738
Set adapter time 0.06171250343322754
Inference time: 9.685987710952759
Unfuse time: 0.08063459396362305
Unload time: 0.15737533569335938

Benchmarking against 1, 10, 100, 1000, 10000 and 50000 entries in the PYTHONPATH, we get a pretty astounding increase in load latency:

[Chart: load_lora_weights latency vs. number of PYTHONPATH entries (1, 10, 100, 1000, 10000, 50000)]

Even at 100 entries, we're looking at an extra ~4 seconds per load call, which is a pretty significant increase.

We looked into it briefly and concluded that it comes down to the way peft checks for optional modules: the helper functions call importlib.util.find_spec repeatedly, and find_spec doesn't cache its results, so every call rescans the import path.
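
As a rough standalone illustration (not from the original report; the module name and folder layout here are made up for the sketch), a single find_spec call for a module that isn't installed has to walk every sys.path entry:

import importlib.util
import sys
import tempfile
import time
from pathlib import Path

# Build a long sys.path out of empty temp folders, mirroring the reproduction script above.
root = Path(tempfile.mkdtemp())
for i in range(10_000):
    folder = root / f"folder_{i}"
    folder.mkdir()
    sys.path.append(str(folder))

# find_spec has no result to reuse for a module that isn't installed,
# so each call scans every entry on sys.path.
start = time.time()
importlib.util.find_spec("module_that_is_not_installed")  # returns None
print("find_spec time:", time.time() - start)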

Instead of this behaviour, we'd expect load_lora_weights to take a roughly constant amount of time regardless of the length of our PYTHONPATH.

@BenjaminBossan
Member

Interesting, thanks for bringing this to our attention. My first instinct would be to add a cache to all the functions that use importlib.util.find_spec, since something like:

def is_bnb_available() -> bool:
    return importlib.util.find_spec("bitsandbytes") is not None

should be safe to cache. WDYT, would that solve your issue?

@tisles
Contributor Author

tisles commented Mar 21, 2024

Potentially, yeah - is it possible to do this once at a higher level in the code, rather than on every function call? Otherwise decorating them with @functools.cache might also help :)
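
For illustration, decorating one of those helpers might look something like this (just a sketch of the idea; not necessarily the exact change that ended up in #1584):

import importlib.util
from functools import lru_cache  # functools.cache is equivalent on Python 3.9+

@lru_cache(maxsize=None)
def is_bnb_available() -> bool:
    # The sys.path scan only happens on the first call; later calls return the cached boolean.
    return importlib.util.find_spec("bitsandbytes") is not None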

@BenjaminBossan
Member

is it possible to do this once at a higher level in the code

You mean at the caller site of these functions? Very unlikely, as they can be used in many different places. However, I think that a cache on these functions should be fast enough. Do you want to give this a try?

@tisles
Contributor Author

tisles commented Mar 25, 2024

Yup! Fix PR is at #1584, turned out to be a relatively simple one :)
