add cached load_lora_weight (#524)
Add a cache for loaded LoRAs on top of diffusers' load_lora_weights, to
avoid the cost of repeatedly loading the same LoRA from disk.

TODO:
- [x] support caching LoRAs loaded from local files
- [x] support caching LoRAs downloaded from the Hub
- [x] support unfuse lora
- [x] support custom offload
- [x] profile

In the original diffusers LoRA-loading path, the biggest time cost is the parameter
initialization of the LoRA modules. That step is not needed for inference, so it is the main optimization opportunity.

examples/text_to_image_sdxl_lora.py now demonstrates several ways of using LoRA:
1. Use load_lora_weights only. This changes the computation path of the Linear forward, and therefore the computation graph. The upside is that no fusing is
required, since the LoRA computation is deferred to inference time; the downside is lower inference performance.
2. Use load_lora_weights plus fuse_lora. The upside is that inference performance is unchanged; the downside is that loading the LoRA takes some time.
3. Use load_and_fuse_lora, developed in this PR, which keeps inference performance intact while minimizing the cost of loading and switching LoRAs.
The idea is to add a cache that keeps a CPU-offloaded copy of each LoRA, so
the next load reads it from memory rather than from disk (see the sketch after this list). In addition, the fuse step is rewritten by hand to skip the LoRA module
parameter initialization, which saves most of the time.
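
A minimal sketch of the two ideas, assuming a safetensors checkpoint; `load_lora_state_dict_cached` and `fuse_lora_into_linear` are hypothetical helpers for illustration, not the PR's actual code:

```python
from typing import Dict

import torch
from safetensors.torch import load_file

# In-memory cache of LoRA state dicts, keyed by checkpoint path.
_LORA_CACHE: Dict[str, Dict[str, torch.Tensor]] = {}


def load_lora_state_dict_cached(checkpoint_path: str) -> Dict[str, torch.Tensor]:
    """Return the LoRA state dict, hitting the disk only on the first call."""
    if checkpoint_path not in _LORA_CACHE:
        # First load: read from disk and keep the tensors resident on the CPU.
        _LORA_CACHE[checkpoint_path] = load_file(checkpoint_path, device="cpu")
    return _LORA_CACHE[checkpoint_path]


@torch.no_grad()
def fuse_lora_into_linear(
    linear: torch.nn.Linear, up: torch.Tensor, down: torch.Tensor, scale: float = 1.0
) -> None:
    """Fold a LoRA pair directly into the Linear weight: W += scale * (up @ down).

    No nn.Module is ever constructed for the LoRA, so the expensive parameter
    initialization step is skipped entirely.
    """
    delta = (up @ down).to(device=linear.weight.device, dtype=linear.weight.dtype)
    linear.weight += scale * delta
```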

Inference and loading speed profile (loading the LoRA dict from memory):
```
$ python3 text_to_image_sdxl_lora.py
Loading pipeline components...: 100%|████████████████████████████████████| 7/7 [00:01<00:00,  5.57it/s]
[1] Elapsed time: 0.9750442989170551 seconds
100%|██████████████████████████████████████████████████████████████████| 30/30 [01:08<00:00,  2.28s/it]
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  6.26it/s]
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
Loading pipeline components...: 100%|████████████████████████████████████| 7/7 [00:01<00:00,  5.51it/s]
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:39<00:00,  1.32s/it]
[2] Elapsed time: 4.074353616917506 seconds
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.18it/s]
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
[3] Elapsed time: 0.7907805619761348 seconds
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.16it/s]
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.14it/s]
```

The elapsed times for the three methods:
1. 0.9750442989170551 seconds
2. 4.074353616917506 seconds
3. 0.7907805619761348 seconds

Speed of loading three LoRAs (no inference, LoRA dict):
```
$ python3 /data/home/wangyi/workspace/temp/test.py
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  5.38it/s]
[1] Elapsed time: 3.8003906158264726 seconds
[2] Elapsed time: 5.7611241028644145 seconds
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
[3] Elapsed time: 2.2499090780038387 seconds
```
The loading times for the three methods:
1. 3.8003906158264726 seconds
2. 5.7611241028644145 seconds
3. 2.2499090780038387 seconds

Profiling the time breakdown shows that the cost, from highest to lowest, comes
from: getattr (a design issue in DualModule), linear fuse, and linear unfuse.
```
   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.258    0.258    1.390    1.390 /data/home/wangyi/workspace/diffusers/src/onediff/utils/lora.py:179(load_and_fuse_lora)
11999/7640    0.016    0.000    0.599    0.000 {built-in method builtins.getattr}
7996/4359    0.015    0.000    0.583    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:82(__getattr__)
     2322    0.025    0.000    0.500    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:120(__init__)
      722    0.058    0.000    0.322    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/utils/lora.py:30(linear_fuse_lora)
    11788    0.006    0.000    0.279    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:159(__init__)
    11788    0.016    0.000    0.273    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:21(__init__)
  1063466    0.160    0.000    0.160    0.000 {method 'replace' of 'str' objects}
    11788    0.006    0.000    0.145    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:157(get_mixed_dual_module)
    14110    0.136    0.000    0.145    0.000 /home/wangyi/miniconda3/envs/py10/lib/python3.10/site-packages/torch/nn/modules/module.py:437(__init__)
    11788    0.134    0.000    0.139    0.000 {built-in method builtins.__build_class__}
    23576    0.020    0.000    0.133    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:105(__setattr__)
    25978    0.067    0.000    0.127    0.000 /home/wangyi/miniconda3/envs/py10/lib/python3.10/site-packages/torch/nn/modules/module.py:1617(__setattr__)
      722    0.036    0.000    0.120    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/utils/lora.py:75(linear_unfuse_lora)
 1446/723    0.002    0.000    0.117    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:303(__getattr__)
```
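
A cumulative-time listing like the one above can be reproduced with Python's built-in cProfile; a generic sketch, assuming `pipe`, `LORA_MODEL_ID`, and `LORA_FILENAME` are set up as in the example script:

```python
import cProfile
import pstats

# Profile a single load_and_fuse_lora call.
profiler = cProfile.Profile()
profiler.enable()
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME, lora_scale=1.0)
profiler.disable()

# Sort by cumulative time to match the "Ordered by: cumulative time" listing.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```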
marigoold authored Jan 26, 2024
1 parent 07184c5 commit f8484d1
Showing 4 changed files with 567 additions and 17 deletions.
88 changes: 71 additions & 17 deletions examples/text_to_image_sdxl_lora.py
@@ -1,45 +1,99 @@
import torch
from pathlib import Path
from huggingface_hub import hf_hub_download
from diffusers import DiffusionPipeline
from diffusers.utils import DIFFUSERS_CACHE
from onediff.infer_compiler import oneflow_compile
from onediff.infer_compiler.utils import TensorInplaceAssign

try:
    from diffusers_extensions.utils.lora import load_and_fuse_lora, unfuse_lora
except ImportError:
    raise RuntimeError(
        "OneDiff diffusers_extensions is not installed. Please check onediff_diffusers_extensions/README.md to install diffusers_extensions."
    )

MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID, variant="fp16", torch_dtype=torch.float16
).to("cuda")
LORA_MODEL_ID = "hf-internal-testing/sdxl-1.0-lora"
LORA_FILENAME = "sd_xl_offset_example-lora_1.0.safetensors"
lora_file = Path(DIFFUSERS_CACHE) / LORA_FILENAME
if not lora_file.exists():
    hf_hub_download(
        repo_id=LORA_MODEL_ID,
        filename=LORA_FILENAME,
        local_dir=DIFFUSERS_CACHE,
    )

pipe.unet = oneflow_compile(pipe.unet)
pipe.load_lora_weights(lora_file)
generator = torch.manual_seed(0)

# There are three methods to load LoRA into a OneDiff-compiled model:
# 1. pipe.load_lora_weights (low performance)
# 2. pipe.load_lora_weights + TensorInplaceAssign + pipe.fuse_lora (deprecated)
# 3. onediff.utils.load_and_fuse_lora (RECOMMENDED)


# 1. pipe.load_lora_weights (low performance)
# Using load_lora_weights without fuse_lora is not recommended:
# it disrupts the attention optimization and slows down inference.
pipe.load_lora_weights(LORA_MODEL_ID, weight_name=LORA_FILENAME)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
images_fusion.save("test_sdxl_lora_method1.png")
pipe.unload_lora_weights()


# The UNet must be rebuilt because method 1 produces a different computation
# graph than the plain UNet.
generator = torch.manual_seed(0)
pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID, variant="fp16", torch_dtype=torch.float16
).to("cuda")
pipe.unet = oneflow_compile(pipe.unet)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]


# 2. pipe.load_lora_weights + TensorInplaceAssign + pipe.fuse_lora (deprecated)
# The 'fuse_lora' API is not available in diffusers versions prior to 0.21.0.
generator = torch.manual_seed(0)
pipe.load_lora_weights(LORA_MODEL_ID, weight_name=LORA_FILENAME)
if hasattr(pipe, "fuse_lora"):
    # TensorInplaceAssign is DEPRECATED and NOT RECOMMENDED, please use onediff.utils.load_and_fuse_lora
    with TensorInplaceAssign(pipe.unet):
        pipe.fuse_lora(lora_scale=1.0)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
images_fusion.save("test_sdxl_lora_method2.png")

if hasattr(pipe, "unfuse_lora"):
    with TensorInplaceAssign(pipe.unet):
        pipe.unfuse_lora()
pipe.unload_lora_weights()

# Load the LoRA again to check result consistency.
pipe.load_lora_weights(lora_file)
if hasattr(pipe, "fuse_lora"):
    with TensorInplaceAssign(pipe.unet):
        pipe.fuse_lora(lora_scale=1.0)

# 3. onediff.utils.load_and_fuse_lora (RECOMMENDED)
# load_and_fuse_lora is equivalent to load_lora_weights + fuse_lora
generator = torch.manual_seed(0)
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME, lora_scale=1.0)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]

images_fusion.save("test_sdxl_lora_method3.png")

# 4. unfuse_lora removes the LoRA weights and restores the original UNet weights
generator = torch.manual_seed(0)
unfuse_lora(pipe.unet)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
@@ -48,4 +102,4 @@
    num_inference_steps=30,
).images[0]

images_fusion.save("test_sdxl_lora.png")
images_fusion.save("test_sdxl_lora_without_lora.png")
41 changes: 41 additions & 0 deletions onediff_diffusers_extensions/README.md
@@ -101,6 +101,47 @@

If you possess a OneDiff Enterprise license key, you can access instructions on OneDiff quantization and related models by visiting [Hugging Face/siliconflow](https://huggingface.co/siliconflow). Alternatively, you can [contact](#contact) us to inquire about purchasing the OneDiff Enterprise license.

## LoRA loading and switching speed up

OneDiff provides a faster implementation of LoRA loading: by invoking `diffusers_extensions.utils.lora.load_and_fuse_lora`, you can load and fuse a LoRA into the pipeline.

```python
import torch
from diffusers import DiffusionPipeline
from onediff.infer_compiler import oneflow_compile
from diffusers_extensions.utils.lora import load_and_fuse_lora, unfuse_lora

MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID, variant="fp16", torch_dtype=torch.float16
).to("cuda")

LORA_MODEL_ID = "hf-internal-testing/sdxl-1.0-lora"
LORA_FILENAME = "sd_xl_offset_example-lora_1.0.safetensors"

pipe.unet = oneflow_compile(pipe.unet)

# use onediff load_and_fuse_lora
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME, lora_scale=1.0)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
images_fusion.save("test_sdxl_lora.png")
```

We compared the different methods of loading LoRA. The table below shows the time to load a LoRA once.

| Method                       | Loading time | Inference speed | LoRA loading speed |
|------------------------------|--------------|-----------------|--------------------|
| load_lora_weight             | 1.10s        | low             | high               |
| load_lora_weight + fuse_lora | 1.38s        | high            | low                |
| onediff load_and_fuse_lora   | 0.56s        | **high**        | **high**           |

If you want to unload a LoRA and load a new one, you only need to call `load_and_fuse_lora` again. There is no need to call `unfuse_lora` manually, because it is called implicitly inside `load_and_fuse_lora`. You can still call `unfuse_lora` yourself to restore the model's original weights, as shown below.
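
For example, switching to a second LoRA is a single call (a minimal sketch; `LORA_FILENAME_2` is a hypothetical second checkpoint used only for illustration):

```python
# Switching LoRAs: calling load_and_fuse_lora again implicitly unfuses the
# previous LoRA before fusing the new one. LORA_FILENAME_2 is a hypothetical
# second checkpoint name.
LORA_FILENAME_2 = "another-lora.safetensors"
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME_2, lora_scale=1.0)

# To go back to the original, LoRA-free weights, unfuse explicitly.
unfuse_lora(pipe.unet)
```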

## Contact

For users of OneDiff Community, please visit [GitHub Issues](https://github.com/siliconflow/onediff/issues) for bug reports and feature requests.