Add nexfort cache docstr (#917)
This PR does the following:

Closes #900

- [x] Add nexfort cache doc for onediff.
lixiang007666 authored May 28, 2024
1 parent 34f98c4 commit fb025ba
Showing 2 changed files with 22 additions and 3 deletions.
7 changes: 4 additions & 3 deletions onediff_diffusers_extensions/examples/pixart_alpha/README.md
@@ -32,7 +32,8 @@ python3 ./benchmarks/text_to_image.py --model /data/hf_models/PixArt-XL-2-1024-M
## Performance comparison
### nexfort compile config
- compiler-config default is `{"mode": "max-optimize:max-autotune:freezing:benchmark:cudagraphs", "memory_format": "channels_last"}` in `/benchmarks/text_to_image.py`
- setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` will reduce compilation time to 57.863s and only slightly reduce performance
- setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` will reduce compilation time and only slightly reduce performance
- setting `--compiler-config '{"mode": "jit:disable-runtime-fusion", "memory_format": "channels_last"}'` will reduce compilation time to 21.832s, but will reduce performance
- fuse_qkv_projections: True
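The `--compiler-config` values above are JSON strings, so a quoting mistake on the command line will break parsing before compilation even starts. A minimal sketch for validating a config string before passing it to the benchmark script (the specific keys are taken from the settings above):

```python
import json

# A compiler-config string as it would be passed via --compiler-config.
cfg = '{"mode": "max-autotune", "memory_format": "channels_last"}'

# json.loads raises ValueError on malformed input, which is easier to
# debug here than inside the benchmark script.
parsed = json.loads(cfg)
print(parsed["mode"])           # max-autotune
print(parsed["memory_format"])  # channels_last
```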

### Metric
@@ -46,8 +47,8 @@ python3 ./benchmarks/text_to_image.py --model /data/hf_models/PixArt-XL-2-1024-M
| PyTorch Max Mem Used | 14.445GiB |
| OneDiff Max Mem Used | 13.855GiB |
| PyTorch Warmup with Run time | 4.100s |
| OneDiff Warmup with Compilation time<sup>1</sup> | 115.309s |
| OneDiff Warmup with Cache time | TODO |
| OneDiff Warmup with Compilation time<sup>1</sup> | 776.170s |
| OneDiff Warmup with Cache time | 111.563s |

<sup>1</sup> OneDiff Warmup with Compilation time is tested on an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz. Note that this is for reference only; it varies considerably on different CPUs.

18 changes: 18 additions & 0 deletions src/onediff/infer_compiler/backends/nexfort/README.md
@@ -31,3 +31,21 @@ Performance on NVIDIA A100-PCIE-40GB:
- Inference time: 2.045s
- Iterations per second: 10.743
- Max used CUDA memory: 13.855GiB

### Local cache speeds up recompilation

Setting the cache:
```
# Enable the Inductor FX graph cache (off by default).
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
# Set the Inductor autotuning cache directory (this cache is enabled by default).
export TORCHINDUCTOR_CACHE_DIR=~/.torchinductor
```
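If it is more convenient, the same environment variables can be set from Python, before `torch` is imported, instead of via `export` (a minimal sketch using the variable names from the export lines above):

```python
import os

# Enable the Inductor FX graph cache (off by default).
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"

# Point the Inductor autotuning cache at a persistent directory so
# autotuning results survive across processes.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = os.path.expanduser("~/.torchinductor")
```

These assignments must run before `torch` (or anything that imports it) is loaded, since Inductor reads the variables at import time.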

Clearing the cache:
```
python3 -m nexfort.utils.clear_inductor_cache
```

Advanced cache functionality is currently in development.
