Add nexfort cache docstr #917

Merged (2 commits), May 28, 2024
`onediff_diffusers_extensions/examples/pixart_alpha/README.md` (7 changes: 4 additions & 3 deletions)
@@ -32,7 +32,8 @@ python3 ./benchmarks/text_to_image.py --model /data/hf_models/PixArt-XL-2-1024-M
## Performance comparison
### nexfort compile config
- compiler-config default is `{"mode": "max-optimize:max-autotune:freezing:benchmark:cudagraphs", "memory_format": "channels_last"}` in `/benchmarks/text_to_image.py`
-- setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` will reduce compilation time to 57.863s and only slightly reduce performance
+- setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` will reduce compilation time and only slightly reduce performance (see the example invocation after this list)
+- setting `--compiler-config '{"mode": "jit:disable-runtime-fusion", "memory_format": "channels_last"}'` will reduce compilation time to 21.832s, but will also reduce performance
- fuse_qkv_projections: True
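
As a concrete sketch, overriding the compiler config looks like the following (the model path is illustrative, following the command at the top of this file):
```
# Override the default compiler config to trade a little runtime
# performance for a much shorter compile
python3 ./benchmarks/text_to_image.py \
    --model /data/hf_models/PixArt-XL-2-1024-MS \
    --compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'
```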

### Metric
@@ -46,8 +47,8 @@ python3 ./benchmarks/text_to_image.py --model /data/hf_models/PixArt-XL-2-1024-M
| PyTorch Max Mem Used | 14.445GiB |
| OneDiff Max Mem Used | 13.855GiB |
| PyTorch Warmup with Run time | 4.100s |
-| OneDiff Warmup with Compilation time<sup>1</sup> | 115.309s |
-| OneDiff Warmup with Cache time | TODO |
+| OneDiff Warmup with Compilation time<sup>1</sup> | 776.170s |
+| OneDiff Warmup with Cache time | 111.563s |

<sup>1</sup> OneDiff Warmup with Compilation time is measured on an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz. Note that this is just for reference; it varies a lot across different CPUs.

`src/onediff/infer_compiler/backends/nexfort/README.md` (18 changes: 18 additions & 0 deletions)
@@ -31,3 +31,21 @@ Performance on NVIDIA A100-PCIE-40GB:
- Inference time: 2.045s
- Iterations per second: 10.743
- Max used CUDA memory: 13.855GiB

### Local cache speeds up recompilation

Setting up the cache:
```
# Enable the Inductor FX graph cache (off by default)
export TORCHINDUCTOR_FX_GRAPH_CACHE=1

# Set the Inductor cache directory for autotuning results (this cache is enabled by default)
export TORCHINDUCTOR_CACHE_DIR=~/.torchinductor
```
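
For example, both variables can be set inline for a single run (a minimal sketch; the benchmark command and model path are illustrative, borrowed from the PixArt example above):
```
# First run: compiles and populates the cache.
# Later runs with the same configuration reuse the cached FX graphs and
# autotuning results, so warmup drops from full compilation time to
# cache-load time.
TORCHINDUCTOR_FX_GRAPH_CACHE=1 \
TORCHINDUCTOR_CACHE_DIR=~/.torchinductor \
python3 ./benchmarks/text_to_image.py \
    --model /data/hf_models/PixArt-XL-2-1024-MS
```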

Clearing the cache:
```
python3 -m nexfort.utils.clear_inductor_cache
```
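
To inspect how much the cache holds before clearing it (assuming the default directory configured above):
```
# Show the on-disk size of the Inductor cache directory
du -sh ~/.torchinductor
```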

Advanced cache functionality is currently in development.