add cached load_lora_weight (#524)
Add a cache for loaded LoRAs on top of diffusers' load_lora_weights, to
avoid the cost of repeatedly loading the same LoRA from disk.

TODO:
- [x] support caching LoRAs loaded from local files
- [x] support caching LoRAs downloaded from the Hub
- [x] support unfuse lora
- [x] support custom offload
- [x] profile

In the original diffusers LoRA-loading path, the biggest time cost is the parameter
initialization of the LoRA modules. That step is not needed for inference, so it is the main optimization opportunity.

examples/text_to_image_sdxl_lora.py now demonstrates several ways of using LoRA:
1. Use load_lora_weights only. This changes the computation path of the Linear forward, and therefore the computation graph. The upside is that no fusing is
required, since the LoRA computation is deferred to inference time; the downside is lower inference performance.
2. Use load_lora_weights plus fuse_lora. The upside is that inference performance is unchanged; the downside is that loading the LoRA takes some time.
3. Use load_and_fuse_lora, developed in this PR, which keeps inference performance intact while minimizing the cost of loading and switching LoRAs.
The idea is to add a cache that keeps a CPU-offloaded copy of each LoRA, so
the next load reads it from memory rather than from disk (see the sketch after this list). In addition, the fuse step is rewritten by hand to skip the LoRA module
parameter initialization, which saves most of the time.
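
A minimal sketch of the two ideas, assuming a safetensors checkpoint; `load_lora_state_dict_cached` and `fuse_lora_into_linear` are hypothetical helpers for illustration, not the PR's actual code:

```python
from typing import Dict

import torch
from safetensors.torch import load_file

# In-memory cache of LoRA state dicts, keyed by checkpoint path.
_LORA_CACHE: Dict[str, Dict[str, torch.Tensor]] = {}


def load_lora_state_dict_cached(checkpoint_path: str) -> Dict[str, torch.Tensor]:
    """Return the LoRA state dict, hitting the disk only on the first call."""
    if checkpoint_path not in _LORA_CACHE:
        # First load: read from disk and keep the tensors resident on the CPU.
        _LORA_CACHE[checkpoint_path] = load_file(checkpoint_path, device="cpu")
    return _LORA_CACHE[checkpoint_path]


@torch.no_grad()
def fuse_lora_into_linear(
    linear: torch.nn.Linear, up: torch.Tensor, down: torch.Tensor, scale: float = 1.0
) -> None:
    """Fold a LoRA pair directly into the Linear weight: W += scale * (up @ down).

    No nn.Module is ever constructed for the LoRA, so the expensive parameter
    initialization step is skipped entirely.
    """
    delta = (up @ down).to(device=linear.weight.device, dtype=linear.weight.dtype)
    linear.weight += scale * delta
```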

Inference and loading speed profile (loading the LoRA dict from memory):
```
$ python3 text_to_image_sdxl_lora.py
Loading pipeline components...: 100%|████████████████████████████████████| 7/7 [00:01<00:00,  5.57it/s]
[1] Elapsed time: 0.9750442989170551 seconds
100%|██████████████████████████████████████████████████████████████████| 30/30 [01:08<00:00,  2.28s/it]
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  6.26it/s]
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
Loading pipeline components...: 100%|████████████████████████████████████| 7/7 [00:01<00:00,  5.51it/s]
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:39<00:00,  1.32s/it]
[2] Elapsed time: 4.074353616917506 seconds
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.18it/s]
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
[3] Elapsed time: 0.7907805619761348 seconds
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.16it/s]
100%|██████████████████████████████████████████████████████████████████| 30/30 [00:04<00:00,  7.14it/s]
```

The elapsed times for the three methods:
1. 0.9750442989170551 seconds
2. 4.074353616917506 seconds
3. 0.7907805619761348 seconds

Speed of loading three LoRAs (no inference, LoRA dict):
```
$ python3 /data/home/wangyi/workspace/temp/test.py
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:01<00:00,  5.38it/s]
[1] Elapsed time: 3.8003906158264726 seconds
[2] Elapsed time: 5.7611241028644145 seconds
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights,you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT.
[3] Elapsed time: 2.2499090780038387 seconds
```
The loading times for the three methods:
1. 3.8003906158264726 seconds
2. 5.7611241028644145 seconds
3. 2.2499090780038387 seconds

Profiling the time breakdown shows that the cost, from highest to lowest, comes
from: getattr (a design issue in DualModule), linear fuse, and linear unfuse.
```
   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.258    0.258    1.390    1.390 /data/home/wangyi/workspace/diffusers/src/onediff/utils/lora.py:179(load_and_fuse_lora)
11999/7640    0.016    0.000    0.599    0.000 {built-in method builtins.getattr}
7996/4359    0.015    0.000    0.583    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:82(__getattr__)
     2322    0.025    0.000    0.500    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:120(__init__)
      722    0.058    0.000    0.322    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/utils/lora.py:30(linear_fuse_lora)
    11788    0.006    0.000    0.279    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:159(__init__)
    11788    0.016    0.000    0.273    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:21(__init__)
  1063466    0.160    0.000    0.160    0.000 {method 'replace' of 'str' objects}
    11788    0.006    0.000    0.145    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:157(get_mixed_dual_module)
    14110    0.136    0.000    0.145    0.000 /home/wangyi/miniconda3/envs/py10/lib/python3.10/site-packages/torch/nn/modules/module.py:437(__init__)
    11788    0.134    0.000    0.139    0.000 {built-in method builtins.__build_class__}
    23576    0.020    0.000    0.133    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:105(__setattr__)
    25978    0.067    0.000    0.127    0.000 /home/wangyi/miniconda3/envs/py10/lib/python3.10/site-packages/torch/nn/modules/module.py:1617(__setattr__)
      722    0.036    0.000    0.120    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/utils/lora.py:75(linear_unfuse_lora)
 1446/723    0.002    0.000    0.117    0.000 /data/home/wangyi/workspace/diffusers/src/onediff/infer_compiler/with_oneflow_compile.py:303(__getattr__)
```
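
A cumulative-time listing like the one above can be reproduced with Python's built-in cProfile; a generic sketch, assuming `pipe`, `LORA_MODEL_ID`, and `LORA_FILENAME` are set up as in the example script:

```python
import cProfile
import pstats

# Profile a single load_and_fuse_lora call.
profiler = cProfile.Profile()
profiler.enable()
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME, lora_scale=1.0)
profiler.disable()

# Sort by cumulative time to match the "Ordered by: cumulative time" listing.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(15)
```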
marigoold authored Jan 26, 2024
1 parent 07184c5 commit f8484d1
Showing 4 changed files with 567 additions and 17 deletions.
88 changes: 71 additions & 17 deletions examples/text_to_image_sdxl_lora.py
@@ -1,45 +1,99 @@
import torch
from pathlib import Path
from huggingface_hub import hf_hub_download
from diffusers import DiffusionPipeline
from diffusers.utils import DIFFUSERS_CACHE
from onediff.infer_compiler import oneflow_compile
from onediff.infer_compiler.utils import TensorInplaceAssign

try:
    from diffusers_extensions.utils.lora import load_and_fuse_lora, unfuse_lora
except ImportError:
    raise RuntimeError(
        "OneDiff diffusers_extensions is not installed. Please check onediff_diffusers_extensions/README.md to install diffusers_extensions."
    )

MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID, variant="fp16", torch_dtype=torch.float16
).to("cuda")
LORA_MODEL_ID = "hf-internal-testing/sdxl-1.0-lora"
LORA_FILENAME = "sd_xl_offset_example-lora_1.0.safetensors"
lora_file = Path(DIFFUSERS_CACHE) / LORA_FILENAME
if not lora_file.exists():
    hf_hub_download(
        repo_id=LORA_MODEL_ID,
        filename=LORA_FILENAME,
        local_dir=DIFFUSERS_CACHE,
    )

pipe.unet = oneflow_compile(pipe.unet)
pipe.load_lora_weights(lora_file)
generator = torch.manual_seed(0)

# There are three methods to load LoRA into a OneDiff-compiled model:
# 1. pipe.load_lora_weights (low performance)
# 2. pipe.load_lora_weights + TensorInplaceAssign + pipe.fuse_lora (deprecated)
# 3. onediff.utils.load_and_fuse_lora (RECOMMENDED)


# 1. pipe.load_lora_weights (low performance)
# Using load_lora_weights without fuse_lora is not recommended:
# it disrupts the attention optimization and slows down inference.
pipe.load_lora_weights(LORA_MODEL_ID, weight_name=LORA_FILENAME)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
images_fusion.save("test_sdxl_lora_method1.png")
pipe.unload_lora_weights()


# The UNet must be rebuilt because method 1 produces a different computation
# graph than the plain UNet.
generator = torch.manual_seed(0)
pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID, variant="fp16", torch_dtype=torch.float16
).to("cuda")
pipe.unet = oneflow_compile(pipe.unet)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]


# 2. pipe.load_lora_weights + TensorInplaceAssign + pipe.fuse_lora (deprecated)
# The 'fuse_lora' API is not available in diffusers versions prior to 0.21.0.
generator = torch.manual_seed(0)
pipe.load_lora_weights(LORA_MODEL_ID, weight_name=LORA_FILENAME)
if hasattr(pipe, "fuse_lora"):
    # TensorInplaceAssign is DEPRECATED and NOT RECOMMENDED, please use onediff.utils.load_and_fuse_lora
    with TensorInplaceAssign(pipe.unet):
        pipe.fuse_lora(lora_scale=1.0)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
images_fusion.save("test_sdxl_lora_method2.png")

if hasattr(pipe, "unfuse_lora"):
    with TensorInplaceAssign(pipe.unet):
        pipe.unfuse_lora()
pipe.unload_lora_weights()

# Load the LoRA again to check result consistency.
pipe.load_lora_weights(lora_file)
if hasattr(pipe, "fuse_lora"):
    with TensorInplaceAssign(pipe.unet):
        pipe.fuse_lora(lora_scale=1.0)

# 3. onediff.utils.load_and_fuse_lora (RECOMMENDED)
# load_and_fuse_lora is equivalent to load_lora_weights + fuse_lora
generator = torch.manual_seed(0)
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME, lora_scale=1.0)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]

images_fusion.save("test_sdxl_lora_method3.png")

# 4. unfuse_lora removes the LoRA weights and restores the original UNet weights
generator = torch.manual_seed(0)
unfuse_lora(pipe.unet)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    generator=generator,
@@ -48,4 +102,4 @@
    num_inference_steps=30,
).images[0]

images_fusion.save("test_sdxl_lora.png")
images_fusion.save("test_sdxl_lora_without_lora.png")
41 changes: 41 additions & 0 deletions onediff_diffusers_extensions/README.md
@@ -101,6 +101,47 @@

If you possess a OneDiff Enterprise license key, you can access instructions on OneDiff quantization and related models by visiting [Hugging Face/siliconflow](https://huggingface.co/siliconflow). Alternatively, you can [contact](#contact) us to inquire about purchasing the OneDiff Enterprise license.

## LoRA loading and switching speed up

OneDiff provides a faster implementation of LoRA loading: by invoking `diffusers_extensions.utils.lora.load_and_fuse_lora`, you can load and fuse a LoRA into the pipeline.

```python
import torch
from diffusers import DiffusionPipeline
from onediff.infer_compiler import oneflow_compile
from diffusers_extensions.utils.lora import load_and_fuse_lora, unfuse_lora

MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(
    MODEL_ID, variant="fp16", torch_dtype=torch.float16
).to("cuda")

LORA_MODEL_ID = "hf-internal-testing/sdxl-1.0-lora"
LORA_FILENAME = "sd_xl_offset_example-lora_1.0.safetensors"

pipe.unet = oneflow_compile(pipe.unet)

# use onediff load_and_fuse_lora
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME, lora_scale=1.0)
images_fusion = pipe(
    "masterpiece, best quality, mountain",
    height=1024,
    width=1024,
    num_inference_steps=30,
).images[0]
images_fusion.save("test_sdxl_lora.png")
```

We compared the different methods of loading LoRA. The table below shows the time to load a LoRA once.

| Method                       | Loading time | Inference speed | LoRA loading speed |
|------------------------------|--------------|-----------------|--------------------|
| load_lora_weight             | 1.10s        | low             | high               |
| load_lora_weight + fuse_lora | 1.38s        | high            | low                |
| onediff load_and_fuse_lora   | 0.56s        | **high**        | **high**           |

If you want to unload a LoRA and load a new one, you only need to call `load_and_fuse_lora` again. There is no need to call `unfuse_lora` manually, because it is called implicitly inside `load_and_fuse_lora`. You can still call `unfuse_lora` yourself to restore the model's original weights, as shown below.
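
For example, switching to a second LoRA is a single call (a minimal sketch; `LORA_FILENAME_2` is a hypothetical second checkpoint used only for illustration):

```python
# Switching LoRAs: calling load_and_fuse_lora again implicitly unfuses the
# previous LoRA before fusing the new one. LORA_FILENAME_2 is a hypothetical
# second checkpoint name.
LORA_FILENAME_2 = "another-lora.safetensors"
load_and_fuse_lora(pipe, LORA_MODEL_ID, weight_name=LORA_FILENAME_2, lora_scale=1.0)

# To go back to the original, LoRA-free weights, unfuse explicitly.
unfuse_lora(pipe.unet)
```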

## Contact

For users of OneDiff Community, please visit [GitHub Issues](https://github.com/siliconflow/onediff/issues) for bug reports and feature requests.