Add nexfort cache docstr #917

Merged (2 commits), May 28, 2024
`onediff_diffusers_extensions/examples/pixart_alpha/README.md` (7 changes: 4 additions & 3 deletions)
@@ -32,7 +32,8 @@ python3 ./benchmarks/text_to_image.py --model /data/hf_models/PixArt-XL-2-1024-M
## Performance comparison
### nexfort compile config
- compiler-config default is `{"mode": "max-optimize:max-autotune:freezing:benchmark:cudagraphs", "memory_format": "channels_last"}` in `/benchmarks/text_to_image.py`
-- setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` will reduce compilation time to 57.863s and only slightly reduce performance
+- setting `--compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'` will reduce compilation time and only slightly reduce performance (see the example invocation after this list)
+- setting `--compiler-config '{"mode": "jit:disable-runtime-fusion", "memory_format": "channels_last"}'` will reduce compilation time to 21.832s, but will also reduce performance
- fuse_qkv_projections: True
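
As a concrete sketch, overriding the compiler config looks like the following (the model path is illustrative, following the command at the top of this file):
```
# Override the default compiler config to trade a little runtime
# performance for a much shorter compile
python3 ./benchmarks/text_to_image.py \
    --model /data/hf_models/PixArt-XL-2-1024-MS \
    --compiler-config '{"mode": "max-autotune", "memory_format": "channels_last"}'
```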

### Metric
@@ -46,8 +47,8 @@ python3 ./benchmarks/text_to_image.py --model /data/hf_models/PixArt-XL-2-1024-M
| PyTorch Max Mem Used | 14.445GiB |
| OneDiff Max Mem Used | 13.855GiB |
| PyTorch Warmup with Run time | 4.100s |
-| OneDiff Warmup with Compilation time<sup>1</sup> | 115.309s |
-| OneDiff Warmup with Cache time | TODO |
+| OneDiff Warmup with Compilation time<sup>1</sup> | 776.170s |
+| OneDiff Warmup with Cache time | 111.563s |

<sup>1</sup> OneDiff Warmup with Compilation time is measured on an Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz. Note that this is just for reference; it varies a lot across different CPUs.

`src/onediff/infer_compiler/backends/nexfort/README.md` (18 changes: 18 additions & 0 deletions)
@@ -31,3 +31,21 @@ Performance on NVIDIA A100-PCIE-40GB:
- Inference time: 2.045s
- Iterations per second: 10.743
- Max used CUDA memory: 13.855GiB

### Local cache speeds up recompilation

Setting up the cache:
```
# Enable the Inductor FX graph cache (off by default)
export TORCHINDUCTOR_FX_GRAPH_CACHE=1

# Set the Inductor cache directory for autotuning results (this cache is enabled by default)
export TORCHINDUCTOR_CACHE_DIR=~/.torchinductor
```
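
For example, both variables can be set inline for a single run (a minimal sketch; the benchmark command and model path are illustrative, borrowed from the PixArt example above):
```
# First run: compiles and populates the cache.
# Later runs with the same configuration reuse the cached FX graphs and
# autotuning results, so warmup drops from full compilation time to
# cache-load time.
TORCHINDUCTOR_FX_GRAPH_CACHE=1 \
TORCHINDUCTOR_CACHE_DIR=~/.torchinductor \
python3 ./benchmarks/text_to_image.py \
    --model /data/hf_models/PixArt-XL-2-1024-MS
```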

Clearing the cache:
```
python3 -m nexfort.utils.clear_inductor_cache
```
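
To inspect how much the cache holds before clearing it (assuming the default directory configured above):
```
# Show the on-disk size of the Inductor cache directory
du -sh ~/.torchinductor
```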

Advanced cache functionality is currently in development.