
Fix onediff_comfy_nodes/sd3_demo/README.md #949

Merged (12 commits) on Jun 25, 2024
130 changes: 130 additions & 0 deletions onediff_comfy_nodes/sd3/README.md
@@ -0,0 +1,130 @@
## Accelerate SD3 with OneDiff
Hugging Face: https://huggingface.co/stabilityai/stable-diffusion-3-medium

## Environment setup
### Set up requirements
```shell
# python 3.10
COMFYUI_DIR=$(pwd)/ComfyUI
# install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git

# install onediff & onediff_comfy_nodes
git clone https://github.com/siliconflow/onediff.git
cd onediff && pip install -r onediff_comfy_nodes/sd3_demo/requirements.txt && pip install -e .
ln -s $(pwd)/onediff_comfy_nodes $COMFYUI_DIR/custom_nodes  # still inside onediff/ after the cd above
```

<details>
<summary> test_install.py </summary>

```python
# Compile arbitrary models (torch.nn.Module)
import torch
from onediff.utils.import_utils import is_nexfort_available
assert is_nexfort_available()

import onediff.infer_compiler as infer_compiler

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule().to("cuda").half()
with torch.inference_mode():
    compiled_mod = infer_compiler.compile(
        mod,
        backend="nexfort",
        options={"mode": "max-autotune:cudagraphs", "dynamic": True, "fullgraph": True},
    )
    print(compiled_mod(torch.randn(10, 100, device="cuda").half()).shape)

print("Successfully installed~")
```

</details>

### Download relevant models

- Step 1: Get a User Access Token at https://huggingface.co/settings/tokens

- Step 2: Download the relevant models (run these commands from inside $COMFYUI_DIR so the models/ paths resolve)
```shell
export ACCESS_TOKEN="<your User Access Token>"
wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/sd3_medium.safetensors -O models/checkpoints/sd3_medium.safetensors

wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/clip_g.safetensors -O models/clip/clip_g.safetensors

wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/clip_l.safetensors -O models/clip/clip_l.safetensors

# wget --header="Authorization: Bearer $ACCESS_TOKEN" \
# https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/t5xxl_fp16.safetensors -O models/clip/t5xxl_fp16.safetensors

wget --header="Authorization: Bearer $ACCESS_TOKEN" \
https://huggingface.co/stabilityai/stable-diffusion-3-medium/resolve/main/text_encoders/t5xxl_fp8_e4m3fn.safetensors -O models/clip/t5xxl_fp8_e4m3fn.safetensors
```
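Alternatively, the same files can be fetched with `huggingface_hub` (a minimal sketch, assuming `pip install huggingface_hub` and that it runs from inside `$COMFYUI_DIR`; note that `hf_hub_download` preserves the repo-relative `text_encoders/` subfolder under `local_dir`, so move those files up into `models/clip/` afterwards):

```python
from huggingface_hub import hf_hub_download

REPO = "stabilityai/stable-diffusion-3-medium"
TOKEN = "<your User Access Token>"

for filename, local_dir in [
    ("sd3_medium.safetensors", "models/checkpoints"),
    ("text_encoders/clip_g.safetensors", "models/clip"),
    ("text_encoders/clip_l.safetensors", "models/clip"),
    ("text_encoders/t5xxl_fp8_e4m3fn.safetensors", "models/clip"),
]:
    # Downloads into local_dir, keeping the repo-relative path of each file.
    hf_hub_download(repo_id=REPO, filename=filename, local_dir=local_dir, token=TOKEN)
```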


## Usage Example
### Run ComfyUI
```shell
# Enable the Inductor FX graph cache to speed up recompilation
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
# Use a persistent TorchInductor cache directory
export TORCHINDUCTOR_CACHE_DIR=~/.torchinductor_cache
cd $COMFYUI_DIR && python main.py --gpu-only --disable-cuda-malloc
```
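Once the server is running, you can sanity-check that it responds before queuing any work (a minimal sketch; it assumes ComfyUI's default port 8188, so adjust the URL if you start the server with `--port`):

```python
import json
from urllib import request

# Query ComfyUI's system_stats endpoint; a JSON reply means the server is up.
with request.urlopen("http://127.0.0.1:8188/system_stats") as resp:
    stats = json.loads(resp.read())
print(stats["system"])
```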

### Workflow
Here is a very basic example of how to use it:
[workflow_sd3_speedup.json](https://github.com/user-attachments/files/15907863/sd3_suppedup.json)
![sd3_speedup_workflow](https://github.com/siliconflow/onediff/assets/109639975/c1e955ae-7cc5-4197-9635-7cc05d5fd7a6)


## Performance Comparison

- Tested on an NVIDIA GeForce RTX 4090 with an image size of 1024x1024 and 28 inference steps.
- OneDiff[Nexfort] compile mode:
  `max-optimize:max-autotune:low-precision` (see the sketch below)
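As a sketch of how this mode string is used, it is passed through the same `options` dict shown in `test_install.py` above (the `Linear` module here is just a hypothetical stand-in; in ComfyUI the SD3 model is compiled by the OneDiff speedup node):

```python
import torch
import onediff.infer_compiler as infer_compiler

mod = torch.nn.Linear(100, 10).to("cuda").half()  # hypothetical stand-in model
with torch.inference_mode():
    compiled = infer_compiler.compile(
        mod,
        backend="nexfort",
        options={"mode": "max-optimize:max-autotune:low-precision"},
    )
    print(compiled(torch.randn(4, 100, device="cuda").half()).shape)
```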


| Metric                                           | NVIDIA GeForce RTX 4090 (1024 * 1024) |
| ------------------------------------------------ | ------------------------------------- |
| Data update date (yyyy-mm-dd)                    | 2024-06-19                             |
| PyTorch E2E time                                 | 4.27 s                                 |
| OneDiff E2E time                                 | 3.17 s (-25.7%)                        |
| PyTorch Max Mem Used                             | 18.445 GiB                             |
| OneDiff Max Mem Used                             | 19.199 GiB                             |
| PyTorch Warmup with Run time                     | 10 s                                   |
| OneDiff Warmup with Compilation time<sup>1</sup> | 209 s                                  |
| OneDiff Warmup with Cache time                   | 45 s                                   |

<sup>1</sup> OneDiff warmup with compilation time is measured on an AMD EPYC 7543 32-core CPU. This is for reference only; it varies considerably across different CPUs.
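For reference, the E2E figure is the relative reduction in end-to-end time:

$$\frac{4.27\,\mathrm{s} - 3.17\,\mathrm{s}}{4.27\,\mathrm{s}} \approx 25.8\%$$

which matches the -25.7% in the table up to rounding of the measured times.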



## Dynamic shape for SD3

**Q: How to use different resolutions in a production environment?**

A: Warm up first: perform inference at each target resolution before deployment to ensure stability and performance (see the sketch below).
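A minimal warmup sketch over ComfyUI's `/prompt` API (mirroring `main.py` below; the node IDs `"135"` and `"6"` and port 9999 come from that script, and your `workflow_api.json` may use different IDs):

```python
import json
from urllib import request

with open("workflow_api.json") as fp:
    prompt = json.load(fp)

def queue_prompt(prompt):
    data = json.dumps({"prompt": prompt}).encode("utf-8")
    request.urlopen(request.Request("http://127.0.0.1:9999/prompt", data=data))

# Send one request per production resolution so each shape gets compiled/tuned.
for width, height in [(1024, 1024), (1024, 768), (768, 512), (512, 512)]:
    prompt["135"]["inputs"]["width"] = width
    prompt["135"]["inputs"]["height"] = height
    queue_prompt(prompt)
```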


**Q: Why is warmup necessary when switching resolutions?**

A: Warmup is necessary because autotuning (the `max-autotune` part of the compile mode) selects optimized GPU kernel configurations for each new shape during this process, which keeps inference efficient when switching resolutions.


## Quality

The following table compares image quality with seed=1: the baseline (non-optimized) output on the left and the OneDiff (optimized) output on the right.

| | |
| -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| ![sd3_baseline_00001_](https://github.com/siliconflow/onediff/assets/109639975/c86f2dc8-fc6f-4cc7-b85d-d4d973594ee6) | ![sd3_speedup_00001_](https://github.com/siliconflow/onediff/assets/109639975/c81b3fc9-d588-4ba1-9911-ae3a8a8d2454) |
78 changes: 78 additions & 0 deletions onediff_comfy_nodes/sd3/main.py
@@ -0,0 +1,78 @@
import json
from urllib import request

workflow_api_path = "./workflow_api.json"


def queue_prompt(prompt):
    p = {"prompt": prompt}
    data = json.dumps(p).encode("utf-8")
    req = request.Request(
        "http://127.0.0.1:9999/prompt", data=data
    )  # assumes ComfyUI was started with --port 9999
    request.urlopen(req)


with open(workflow_api_path, "r") as fp:
    prompt = json.load(fp)


def generate_texts(min_length=50, max_length=302):
    # Base prompt of roughly 50 words
    base_text = "a female character with long, flowing hair that appears to be made of ethereal, swirling patterns resembling the Northern Lights or Aurora Borealis. The background is dominated by deep blues and purples, creating a mysterious and dramatic atmosphere. The character's face is serene, with pale skin and striking features. She"

    # Additional words pool
    additional_words = [
        "gracefully",
        "beautifully",
        "elegant",
        "radiant",
        "mysteriously",
        "vibrant",
        "softly",
        "gently",
        "luminescent",
        "sparkling",
        "delicately",
        "glowing",
        "brightly",
        "shimmering",
        "enchanting",
        "gloriously",
        "magnificent",
        "majestic",
        "fantastically",
        "dazzlingly",
    ]
    # Yield progressively longer prompts, appending one word per iteration
    for i in range(min_length, max_length):
        idx = i % len(additional_words)
        base_text = base_text + " " + additional_words[idx]
        yield base_text


generated_texts = list(generate_texts(max_length=101))
generated_texts.reverse()  # longest prompt first

count = 0
dimensions = [
    (1024, 1024),
    (1024, 768),
    (1024, 576),
    (1024, 512),
    (512, 1024),
    (768, 512),
    (512, 512),
]

for width, height in dimensions:
    # Set the width and height in the prompt
    prompt["135"]["inputs"]["width"] = width
    prompt["135"]["inputs"]["height"] = height

    # Loop through each generated text and send the prompt to the server
    for text in generated_texts:
        prompt["6"]["inputs"]["text"] = text
        queue_prompt(prompt)
        print(f"{count=}")
        count += 1
        break  # send one prompt per resolution; remove to sweep every text length
97 changes: 0 additions & 97 deletions onediff_comfy_nodes/sd3_demo/README.md

This file was deleted.
