
Fix onediff_comfy_nodes/sd3_demo/README.md #949

Merged (12 commits) on Jun 25, 2024
130 changes: 130 additions & 0 deletions onediff_comfy_nodes/sd3/README.md
@@ -0,0 +1,130 @@
## Accelerate SD3 with OneDiff
Hugging Face: https://huggingface.co/stabilityai/stable-diffusion-3-medium

## Environment setup
### Set up requirements
```shell
# python 3.10
COMFYUI_DIR=$(pwd)/ComfyUI
# install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git

# install onediff & onediff_comfy_nodes
git clone https://github.com/siliconflow/onediff.git
cd onediff && pip install -r onediff_comfy_nodes/sd3_demo/requirements.txt && pip install -e .
ln -s $(pwd)/onediff_comfy_nodes $COMFYUI_DIR/custom_nodes  # still inside onediff/ after the cd above
```

<details>
<summary> test_install.py </summary>

```python
# Compile arbitrary models (torch.nn.Module)
import torch
from onediff.utils.import_utils import is_nexfort_available
assert is_nexfort_available()

import onediff.infer_compiler as infer_compiler

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule().to("cuda").half()
with torch.inference_mode():
    compiled_mod = infer_compiler.compile(
        mod,
        backend="nexfort",
        options={"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True},
    )
    print(compiled_mod(torch.randn(10, 100, device="cuda").half()).shape)

print("Successfully installed~")
```

</details>

### Download relevant models

- Step 1: Get a User Access Token at https://huggingface.co/settings/tokens

- Step 2: Download the relevant models (run these commands from inside $COMFYUI_DIR so the models/ paths resolve)
```shell
export ACCESS_TOKEN="<your User Access Token>"
wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/sd3_medium.safetensors -O models/checkpoints/sd3_medium.safetensors

wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/clip_g.safetensors -O models/clip/clip_g.safetensors

wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/clip_l.safetensors -O models/clip/clip_l.safetensors

# wget --header="Authorization: Bearer $ACCESS_TOKEN" \
# https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/t5xxl_fp16.safetensors -O models/clip/t5xxl_fp16.safetensors

wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/t5xxl_fp8_e4m3fn.safetensors -O models/clip/t5xxl_fp8_e4m3fn.safetensors
```
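Alternatively, the same files can be fetched with `huggingface_hub` (a minimal sketch, assuming `pip install huggingface_hub` and that it runs from inside `$COMFYUI_DIR`; note that `hf_hub_download` preserves the repo-relative `text_encoders/` subfolder under `local_dir`, so move those files up into `models/clip/` afterwards):

```python
from huggingface_hub import hf_hub_download

REPO = "stabilityai/stable-diffusion-3-medium"
TOKEN = "<your User Access Token>"

for filename, local_dir in [
    ("sd3_medium.safetensors", "models/checkpoints"),
    ("text_encoders/clip_g.safetensors", "models/clip"),
    ("text_encoders/clip_l.safetensors", "models/clip"),
    ("text_encoders/t5xxl_fp8_e4m3fn.safetensors", "models/clip"),
]:
    # Downloads into local_dir, keeping the repo-relative path of each file.
    hf_hub_download(repo_id=REPO, filename=filename, local_dir=local_dir, token=TOKEN)
```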


## Usage Example
### Run ComfyUI
```shell
# Enable the Inductor FX graph cache to speed up recompilation
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
# Use a persistent TorchInductor cache directory
export TORCHINDUCTOR_CACHE_DIR=~/.torchinductor_cache
cd $COMFYUI_DIR && python main.py --gpu-only --disable-cuda-malloc
```
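Once the server is running, you can sanity-check that it responds before queuing any work (a minimal sketch; it assumes ComfyUI's default port 8188, so adjust the URL if you start the server with `--port`):

```python
import json
from urllib import request

# Query ComfyUI's system_stats endpoint; a JSON reply means the server is up.
with request.urlopen("http://127.0.0.1:8188/system_stats") as resp:
    stats = json.loads(resp.read())
print(stats["system"])
```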

### Workflow
Here is a very basic example of how to use it:
[workflow_sd3_speedup.json](https://github.com/user-attachments/files/15907863/sd3_suppedup.json)
![sd3_speedup_workflow](https://github.com/siliconflow/onediff/assets/109639975/c1e955ae-7cc5-4197-9635-7cc05d5fd7a6)


## Performance Comparison

- Tested on an NVIDIA GeForce RTX 4090 with an image size of 1024x1024 and 28 inference steps.
- OneDiff[Nexfort] compile mode:
  `max-optimize:max-autotune:low-precision` (see the sketch below)
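As a sketch of how this mode string is used, it is passed through the same `options` dict shown in `test_install.py` above (the `Linear` module here is just a hypothetical stand-in; in ComfyUI the SD3 model is compiled by the OneDiff speedup node):

```python
import torch
import onediff.infer_compiler as infer_compiler

mod = torch.nn.Linear(100, 10).to("cuda").half()  # hypothetical stand-in model
with torch.inference_mode():
    compiled = infer_compiler.compile(
        mod,
        backend="nexfort",
        options={"mode": "max-optimize:max-autotune:low-precision"},
    )
    print(compiled(torch.randn(4, 100, device="cuda").half()).shape)
```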


| Metric                                           | NVIDIA GeForce RTX 4090 (1024 * 1024) |
| ------------------------------------------------ | ------------------------------------- |
| Data update date (yyyy-mm-dd)                    | 2024-06-19                             |
| PyTorch E2E time                                 | 4.27 s                                 |
| OneDiff E2E time                                 | 3.17 s (-25.7%)                        |
| PyTorch Max Mem Used                             | 18.445 GiB                             |
| OneDiff Max Mem Used                             | 19.199 GiB                             |
| PyTorch Warmup with Run time                     | 10 s                                   |
| OneDiff Warmup with Compilation time<sup>1</sup> | 209 s                                  |
| OneDiff Warmup with Cache time                   | 45 s                                   |

<sup>1</sup> OneDiff warmup with compilation time is measured on an AMD EPYC 7543 32-core CPU. This is for reference only; it varies considerably across different CPUs.
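For reference, the E2E figure is the relative reduction in end-to-end time:

$$\frac{4.27\,\mathrm{s} - 3.17\,\mathrm{s}}{4.27\,\mathrm{s}} \approx 25.8\%$$

which matches the -25.7% in the table up to rounding of the measured times.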



## Dynamic shape for SD3

**Q: How to use different resolutions in a production environment?**

A: Warm up first: perform inference at each target resolution before deployment to ensure stability and performance (see the sketch below).
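A minimal warmup sketch over ComfyUI's `/prompt` API (mirroring `main.py` below; the node IDs `"135"` and `"6"` and port 9999 come from that script, and your `workflow_api.json` may use different IDs):

```python
import json
from urllib import request

with open("workflow_api.json") as fp:
    prompt = json.load(fp)

def queue_prompt(prompt):
    data = json.dumps({"prompt": prompt}).encode("utf-8")
    request.urlopen(request.Request("http://127.0.0.1:9999/prompt", data=data))

# Send one request per production resolution so each shape gets compiled/tuned.
for width, height in [(1024, 1024), (1024, 768), (768, 512), (512, 512)]:
    prompt["135"]["inputs"]["width"] = width
    prompt["135"]["inputs"]["height"] = height
    queue_prompt(prompt)
```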


**Q: Why is warmup necessary when switching resolutions?**

A: Warmup is necessary because autotuning (the `max-autotune` part of the compile mode) selects optimized GPU kernel configurations for each new shape during this process, which keeps inference efficient when switching resolutions.


## Quality

The following table compares image quality with seed=1: the baseline (non-optimized) output on the left and the OneDiff (optimized) output on the right.

| | |
| -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| ![sd3_baseline_00001_](https://github.com/siliconflow/onediff/assets/109639975/c86f2dc8-fc6f-4cc7-b85d-d4d973594ee6) | ![sd3_speedup_00001_](https://github.com/siliconflow/onediff/assets/109639975/c81b3fc9-d588-4ba1-9911-ae3a8a8d2454) |
78 changes: 78 additions & 0 deletions onediff_comfy_nodes/sd3/main.py
@@ -0,0 +1,78 @@
import json
from urllib import request

workflow_api_path = "./workflow_api.json"


def queue_prompt(prompt):
    p = {"prompt": prompt}
    data = json.dumps(p).encode("utf-8")
    req = request.Request(
        "http://127.0.0.1:9999/prompt", data=data
    )  # assumes ComfyUI was started with --port 9999
    request.urlopen(req)


with open(workflow_api_path, "r") as fp:
    prompt = json.load(fp)


def generate_texts(min_length=50, max_length=302):
    # Base prompt of roughly 50 words
    base_text = "a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She"

    # Additional words pool
    additional_words = [
        "gracefully",
        "beautifully",
        "elegant",
        "radiant",
        "mysteriously",
        "vibrant",
        "softly",
        "gently",
        "luminescent",
        "sparkling",
        "delicately",
        "glowing",
        "brightly",
        "shimmering",
        "enchanting",
        "gloriously",
        "magnificent",
        "majestic",
        "fantastically",
        "dazzlingly",
    ]
    # Yield progressively longer prompts, appending one word per iteration
    for i in range(min_length, max_length):
        idx = i % len(additional_words)
        base_text = base_text + " " + additional_words[idx]
        yield base_text


generated_texts = list(generate_texts(max_length=101))
generated_texts.reverse()  # longest prompt first

count = 0
dimensions = [
    (1024, 1024),
    (1024, 768),
    (1024, 576),
    (1024, 512),
    (512, 1024),
    (768, 512),
    (512, 512),
]

for width, height in dimensions:
    # Set the width and height in the prompt
    prompt["135"]["inputs"]["width"] = width
    prompt["135"]["inputs"]["height"] = height

    # Loop through each generated text and send the prompt to the server
    for text in generated_texts:
        prompt["6"]["inputs"]["text"] = text
        queue_prompt(prompt)
        print(f"{count=}")
        count += 1
        break  # send one prompt per resolution; remove to sweep every text length
97 changes: 0 additions & 97 deletions onediff_comfy_nodes/sd3_demo/README.md

This file was deleted.
