add diffusers nexfort example #998

Merged
merged 18 commits into from
Jul 12, 2024
Changes from 17 commits
9 changes: 9 additions & 0 deletions benchmarks/text_to_image.py
@@ -62,6 +62,7 @@ def parse_args():
    parser.add_argument("--input-image", type=str, default=INPUT_IMAGE)
    parser.add_argument("--control-image", type=str, default=CONTROL_IMAGE)
    parser.add_argument("--output-image", type=str, default=OUTPUT_IMAGE)
    parser.add_argument("--print-output", action="store_true")
    parser.add_argument("--throughput", action="store_true")
    parser.add_argument("--deepcache", action="store_true")
    parser.add_argument(
@@ -384,6 +385,14 @@ def get_kwarg_inputs():
    print(f"Max used CUDA memory : {cuda_mem_after_used:.3f}GiB")
    print("=======================================")

    if args.print_output:
        from onediff.utils.import_utils import is_nexfort_available
        if is_nexfort_available():
            from nexfort.utils.term_image import print_image

            for image in output_images:
                print_image(image, max_width=80)

    if args.output_image is not None:
        output_images[0].save(args.output_image)
    else:
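The hunk above only prints the terminal preview when nexfort is importable, so the benchmark still runs without it. A minimal sketch of the same guard using only the standard library follows. `optional_import` and `preview_images` are hypothetical names introduced for illustration; only the module path `nexfort.utils.term_image` and the call `print_image(image, max_width=80)` come from the diff.

```python
import importlib
import importlib.util


def optional_import(module_name, attr):
    """Return `attr` from `module_name`, or None if the module is missing."""
    try:
        if importlib.util.find_spec(module_name) is None:
            return None
    except ModuleNotFoundError:
        # find_spec raises when a parent package of a dotted name is absent.
        return None
    module = importlib.import_module(module_name)
    return getattr(module, attr, None)


def preview_images(images, printer=None):
    """Call `printer` on each image; return how many images were shown."""
    if printer is None:
        # Module path and function name as they appear in the diff above.
        printer = optional_import("nexfort.utils.term_image", "print_image")
    if printer is None:
        return 0  # nexfort is unavailable, so skip the preview
    for image in images:
        printer(image, max_width=80)
    return len(images)
```

Passing `printer` explicitly makes the helper easy to exercise without nexfort installed.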
Binary file added imgs/nexfort_sd1-5_demo.png
Binary file added imgs/nexfort_sd2_demo.png
Binary file added imgs/nexfort_sdxl_demo.png
108 changes: 108 additions & 0 deletions onediff_diffusers_extensions/examples/sd1.5/README.md
@@ -0,0 +1,108 @@
# Run SD1.5 with nexfort backend (Beta Release)

1. [Environment Setup](#environment-setup)
    - [Set Up OneDiff](#set-up-onediff)
    - [Set Up NexFort Backend](#set-up-nexfort-backend)
    - [Set Up Diffusers Library](#set-up-diffusers)
    - [Set Up SD1.5](#set-up-sd15)
2. [Execution Instructions](#run)
    - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
    - [Run With Compilation](#run-with-compilation)
3. [Performance Comparison](#performance-comparison)
4. [Dynamic Shape for SD1.5](#dynamic-shape-for-sd15)
5. [Quality](#quality)

## Environment setup
### Set up onediff
https://github.com/siliconflow/onediff?tab=readme-ov-file#installation

### Set up nexfort backend
https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort

### Set up diffusers

```shell
pip3 install --upgrade "diffusers[torch]"
```

### Set up SD1.5
Model version for diffusers: https://huggingface.co/runwayml/stable-diffusion-v1-5

HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/overview.md

## Run

### Run without compilation (Baseline)
```shell
python3 benchmarks/text_to_image.py \
--model runwayml/stable-diffusion-v1-5 \
--height 512 --width 512 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-v1-5.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler none \
--seed 1 \
--print-output
```

### Run with compilation

```shell
python3 benchmarks/text_to_image.py \
--model runwayml/stable-diffusion-v1-5 \
--height 512 --width 512 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-v1-5-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
--seed 1 \
--print-output
```
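The `--compiler-config` value is a JSON string, which is easy to mistype or misquote on the command line. The sketch below builds the same invocation from a Python dict, so that quoting and JSON syntax are generated rather than typed. All flags and values are copied from the command above; `build_command` is a hypothetical helper, not part of the benchmark script.

```python
import json
import shlex

# Values copied from the "Run with compilation" command above.
COMPILER_CONFIG = {
    "mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all",
    "memory_format": "channels_last",
    "options": {
        "inductor.optimize_linear_epilogue": False,
        "overrides.conv_benchmark": True,
        "overrides.matmul_allow_tf32": True,
    },
}


def build_command(config):
    """Return the benchmark invocation as an argument list (hypothetical helper)."""
    return [
        "python3", "benchmarks/text_to_image.py",
        "--model", "runwayml/stable-diffusion-v1-5",
        "--height", "512", "--width", "512",
        "--scheduler", "none",
        "--steps", "20",
        "--output-image", "./stable-diffusion-v1-5-compile.png",
        "--prompt", "beautiful scenery nature glass bottle landscape, , purple galaxy bottle,",
        "--compiler", "nexfort",
        "--compiler-config", json.dumps(config),
        "--seed", "1",
        "--print-output",
    ]


command = build_command(COMPILER_CONFIG)
print(shlex.join(command))  # shell-quoted form, ready to paste into a terminal
```

From the repository root, the list can also be passed directly to `subprocess.run(command, check=True)`, which avoids shell quoting altogether.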

## Performance comparison

Tested on NVIDIA GeForce RTX 3090 and RTX 4090 with an image size of 512\*512 and 20 steps:

| Metric | RTX3090, 512*512 | RTX4090, 512*512 |
| ------------------------------------ | --------------------- | --------------------- |
| Data update date (yyyy-mm-dd) | 2024-07-10 | 2024-07-10 |
| PyTorch iteration speed | 21.20 it/s | 34.46 it/s |
| OneDiff iteration speed | 48.00 it/s (+126.4%) | 81.81 it/s (+137.4%) |
| PyTorch E2E time | 1.07 s | 0.67 s |
| OneDiff E2E time | 0.48 s (-55.1%) | 0.28 s (-58.2%) |
| PyTorch Max Mem Used | 2.627 GiB | 2.616 GiB |
| OneDiff Max Mem Used | 2.587 GiB | 2.709 GiB |
| PyTorch Warmup with Run time | | |
| OneDiff Warmup with Compilation time | 233.61 s <sup>1</sup> | 177.321 s <sup>2</sup> |
| OneDiff Warmup with Cache time | 41.120 s | 30.019 s |

<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. It is given for reference only and varies considerably across CPUs.

<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
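The percentages in the table are relative changes against the PyTorch baseline. As a worked check, the sketch below recomputes the RTX 3090 figures; `relative_change` is a hypothetical helper name.

```python
def relative_change(baseline, optimized):
    """Percentage change from baseline to optimized, rounded to one decimal."""
    return round((optimized / baseline - 1.0) * 100.0, 1)


# RTX 3090 figures from the table above.
speedup = relative_change(21.20, 48.00)   # iteration speed in it/s
e2e_change = relative_change(1.07, 0.48)  # end-to-end time in s
print(f"{speedup:+.1f}% it/s, {e2e_change:+.1f}% E2E time")
# prints "+126.4% it/s, -55.1% E2E time"
```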

## Dynamic shape for SD1.5

<!-- TODO -->

Run:

```shell
python3 benchmarks/text_to_image.py \
--model runwayml/stable-diffusion-v1-5 \
--height 512 --width 512 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-v1-5-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
--run_multiple_resolutions 1
```
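Compared with the static-shape command, the `--compiler-config` used here drops the `benchmark` flag from `mode` and adds `"dynamic": true`. The sketch below derives the dynamic configuration from the static one to make that difference explicit. `to_dynamic` is a hypothetical helper, and the README does not state why `benchmark` is omitted, so the code only reproduces the difference between the two commands.

```python
import copy
import json

# --compiler-config from the "Run with compilation" command.
STATIC_CONFIG = {
    "mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all",
    "memory_format": "channels_last",
    "options": {
        "inductor.optimize_linear_epilogue": False,
        "overrides.conv_benchmark": True,
        "overrides.matmul_allow_tf32": True,
    },
}


def to_dynamic(config):
    """Return a copy with `benchmark` removed from mode and dynamic enabled."""
    dynamic = copy.deepcopy(config)
    flags = [flag for flag in dynamic["mode"].split(":") if flag != "benchmark"]
    dynamic["mode"] = ":".join(flags)
    dynamic["dynamic"] = True
    return dynamic


dynamic_config = to_dynamic(STATIC_CONFIG)
print(json.dumps(dynamic_config))
```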

## Quality
When nexfort is used as the backend for OneDiff compilation, the acceleration is lossless with respect to the quality of the generated images.

<p align="center">
<img src="../../../imgs/nexfort_sd1-5_demo.png">
</p>
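One way to check image quality on your own outputs is to compare the baseline and compiled images numerically, since both commands use the same seed and write to separate files. The sketch below computes the peak signal-to-noise ratio (PSNR) over two raw pixel buffers. It is an illustration only: the README does not name a metric or a threshold, and with the `low-precision` mode bit-identical output is not necessarily expected.

```python
import math


def psnr(reference, candidate, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel buffers."""
    if len(reference) != len(candidate):
        raise ValueError("buffers must have the same length")
    if not reference:
        raise ValueError("buffers must not be empty")
    squared_error = sum((a - b) ** 2 for a, b in zip(reference, candidate))
    if squared_error == 0:
        return math.inf  # the buffers are identical
    mse = squared_error / len(reference)
    return 10.0 * math.log10(max_value ** 2 / mse)
```

With Pillow installed, `Image.open(path).convert("RGB").tobytes()` yields a suitable buffer for each of the two output files.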
105 changes: 105 additions & 0 deletions onediff_diffusers_extensions/examples/sd2/README.md
@@ -0,0 +1,105 @@
# Run SD2 with nexfort backend (Beta Release)

1. [Environment Setup](#environment-setup)
    - [Set Up OneDiff](#set-up-onediff)
    - [Set Up NexFort Backend](#set-up-nexfort-backend)
    - [Set Up Diffusers Library](#set-up-diffusers)
    - [Set Up SD2](#set-up-sd2)
2. [Execution Instructions](#run)
    - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
    - [Run With Compilation](#run-with-compilation)
3. [Performance Comparison](#performance-comparison)
4. [Dynamic Shape for SD2](#dynamic-shape-for-sd2)
5. [Quality](#quality)

## Environment setup
### Set up onediff
https://github.com/siliconflow/onediff?tab=readme-ov-file#installation

### Set up nexfort backend
https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort

### Set up diffusers

```shell
pip3 install --upgrade "diffusers[torch]"
```

### Set up SD2
Model version for diffusers: https://huggingface.co/stabilityai/stable-diffusion-2-1

HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md

## Run

### Run without compilation (Baseline)
```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-2-1 \
--height 768 --width 768 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-2-1.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler none \
--print-output
```

### Run with compilation

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-2-1 \
--height 768 --width 768 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-2-1-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"triton.fuse_attention_allow_fp16_reduction": false, "inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
--print-output
```
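Because the whole configuration travels as one JSON string, a stray quote or a Python-style `False` is enough to break it. The sketch below parses the string from the command above before it is used and lists the options it sets. `check_config` and its shape checks are assumptions made for this illustration and do not reflect nexfort's own validation.

```python
import json

# The --compiler-config value from the command above, copied verbatim.
RAW_CONFIG = (
    '{"mode": "cudagraphs:benchmark:max-autotune:low-precision:cache-all", '
    '"memory_format": "channels_last", '
    '"options": {"triton.fuse_attention_allow_fp16_reduction": false, '
    '"inductor.optimize_linear_epilogue": false, '
    '"overrides.conv_benchmark": true, '
    '"overrides.matmul_allow_tf32": true}}'
)


def check_config(raw):
    """Parse the JSON string and verify its top-level shape (hypothetical helper)."""
    config = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    if not isinstance(config.get("mode"), str):
        raise ValueError("'mode' must be a string")
    if not isinstance(config.get("options", {}), dict):
        raise ValueError("'options' must be an object")
    return config


config = check_config(RAW_CONFIG)
for name, value in sorted(config["options"].items()):
    print(f"{name} = {value}")
```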

## Performance comparison

Tested on NVIDIA GeForce RTX 3090 and RTX 4090 with image sizes of 768\*768 and 512\*512 and 20 steps:

| Metric | RTX3090, 768*768 | RTX3090, 512*512 | RTX4090, 768*768 | RTX4090, 512*512 |
| ------------------------------------ | -------------------- | -------------------- | --------------------- | --------------------- |
| Data update date (yyyy-mm-dd) | 2024-07-10 | 2024-07-10 | 2024-07-10 | 2024-07-10 |
| PyTorch iteration speed | 10.45 it/s | 22.84 it/s | 12.34 it/s | 39.06 it/s |
| OneDiff iteration speed | 15.93 it/s (+52.4%) | 44.84 it/s (+96.3%) | 31.63 it/s (+156.3%) | 83.63 it/s (+114.1%) |
| PyTorch E2E time | 2.10 s | 0.97 s | 1.78 s | 0.58 s |
| OneDiff E2E time | 1.35 s (-35.7%) | 0.49 s (-49.5%) | 0.68 s (-61.8%) | 0.26 s (-55.2%) |
| PyTorch Max Mem Used | 3.767 GiB | 3.025 GiB | 3.767 GiB | 3.024 GiB |
| OneDiff Max Mem Used | 3.558 GiB | 3.018 GiB | 3.567 GiB | 3.016 GiB |
| PyTorch Warmup with Run time | | | | |
| OneDiff Warmup with Compilation time | 301.54 s <sup>1</sup> | 222.18 s <sup>1</sup> | 195.34 s <sup>2</sup> | 165.29 s <sup>1</sup> |
| OneDiff Warmup with Cache time | 113.04 s | 44.94 s | 32.41 s | 30.10 s |

<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. It is given for reference only and varies considerably across CPUs.

<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
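The warmup rows matter when deciding whether compilation pays off for a given workload. As a rough worked example, the sketch below estimates how many images must be generated before the per-image saving repays the one-off warmup. It assumes a constant E2E time per image and ignores the PyTorch warmup, which the table leaves blank; `break_even_images` is a hypothetical helper.

```python
import math


def break_even_images(warmup_s, baseline_e2e_s, compiled_e2e_s):
    """Images needed before per-image savings repay the one-off warmup."""
    saving = baseline_e2e_s - compiled_e2e_s
    if saving <= 0:
        raise ValueError("compiled run is not faster than the baseline")
    return math.ceil(warmup_s / saving)


# RTX 3090, 768*768 column of the table above.
cold = break_even_images(301.54, 2.10, 1.35)  # first compilation
warm = break_even_images(113.04, 2.10, 1.35)  # compilation cache reused
print(cold, warm)  # prints "403 151"
```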

## Dynamic shape for SD2

Run:

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-2-1 \
--height 768 --width 768 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-2-1-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
--run_multiple_resolutions 1
```

## Quality
When nexfort is used as the backend for OneDiff compilation, the acceleration is lossless with respect to the quality of the generated images.

<p align="center">
<img src="../../../imgs/nexfort_sd2_demo.png">
</p>
109 changes: 109 additions & 0 deletions onediff_diffusers_extensions/examples/sdxl/README.md
@@ -0,0 +1,109 @@
# Run SDXL with nexfort backend (Beta Release)

1. [Environment Setup](#environment-setup)
    - [Set Up OneDiff](#set-up-onediff)
    - [Set Up NexFort Backend](#set-up-nexfort-backend)
    - [Set Up Diffusers Library](#set-up-diffusers)
    - [Set Up SDXL](#set-up-sdxl)
2. [Execution Instructions](#run)
    - [Run Without Compilation (Baseline)](#run-without-compilation-baseline)
    - [Run With Compilation](#run-with-compilation)
3. [Performance Comparison](#performance-comparison)
4. [Dynamic Shape for SDXL](#dynamic-shape-for-sdxl)
5. [Quality](#quality)

## Environment setup
### Set up onediff
https://github.com/siliconflow/onediff?tab=readme-ov-file#installation

### Set up nexfort backend
https://github.com/siliconflow/onediff/tree/main/src/onediff/infer_compiler/backends/nexfort

### Set up diffusers

```shell
pip3 install --upgrade "diffusers[torch]"
```

### Set up SDXL
Model version for diffusers: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

HF pipeline: https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md

## Run

### Run without compilation (Baseline)
```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--height 1024 --width 1024 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-xl.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler none \
--variant fp16 \
--seed 1 \
--print-output
```

### Run with compilation

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--height 1024 --width 1024 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-xl-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "benchmark:cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}}' \
--variant fp16 \
--seed 1 \
--print-output
```

## Performance comparison

Tested on NVIDIA GeForce RTX 3090 and RTX 4090 with an image size of 1024\*1024 and 20 steps:

| Metric | RTX 3090 1024*1024 | RTX 4090 1024*1024 |
| ------------------------------------ | --------------------- | --------------------- |
| Data update date (yyyy-mm-dd) | 2024-07-10 | 2024-07-10 |
| PyTorch iteration speed | 4.08 it/s | 6.93 it/s |
| OneDiff iteration speed | 7.21 it/s (+76.7%) | 13.92 it/s (+100.9%) |
| PyTorch E2E time | 5.60 s | 3.23 s |
| OneDiff E2E time | 3.41 s (-39.1%) | 1.67 s (-48.3%) |
| PyTorch Max Mem Used | 10.467 GiB | 10.467 GiB |
| OneDiff Max Mem Used | 12.004 GiB | 12.021 GiB |
| PyTorch Warmup with Run time | | |
| OneDiff Warmup with Compilation time | 474.36 s <sup>1</sup> | 236.54 s <sup>2</sup> |
| OneDiff Warmup with Cache time | 306.84 s | 104.57 s |

<sup>1</sup> OneDiff warmup with compilation time was measured on an Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz. It is given for reference only and varies considerably across CPUs.

<sup>2</sup> Measured on an AMD EPYC 7543 32-Core Processor.
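Unlike the speed rows, the memory rows carry no percentages, and for SDXL the compiled run uses more memory than the baseline. The sketch below works out the overhead from the "Max Mem Used" rows; `memory_delta` is a hypothetical helper.

```python
def memory_delta(baseline_gib, compiled_gib):
    """Return (absolute difference in GiB, percentage change) vs. the baseline."""
    delta = compiled_gib - baseline_gib
    return round(delta, 3), round(delta / baseline_gib * 100.0, 1)


# "Max Mem Used" rows of the table above.
print(memory_delta(10.467, 12.004))  # RTX 3090, prints (1.537, 14.7)
print(memory_delta(10.467, 12.021))  # RTX 4090, prints (1.554, 14.8)
```

On both cards the compiled pipeline peaks roughly 1.5 GiB higher, which is worth allowing for on GPUs with little free memory.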


## Dynamic shape for SDXL

Run:

```shell
python3 benchmarks/text_to_image.py \
--model stabilityai/stable-diffusion-xl-base-1.0 \
--height 1024 --width 1024 \
--scheduler none \
--steps 20 \
--output-image ./stable-diffusion-xl-compile.png \
--prompt "beautiful scenery nature glass bottle landscape, , purple galaxy bottle," \
--compiler nexfort \
--compiler-config '{"mode": "cudagraphs:max-autotune:low-precision:cache-all", "memory_format": "channels_last", "options": {"inductor.optimize_linear_epilogue": false, "overrides.conv_benchmark": true, "overrides.matmul_allow_tf32": true}, "dynamic": true}' \
--run_multiple_resolutions 1
```

## Quality
When nexfort is used as the backend for OneDiff compilation, the acceleration is lossless with respect to the quality of the generated images.

<p align="center">
<img src="../../../imgs/nexfort_sdxl_demo.png">
</p>