
🚀 Precompiled xFormers for CUDA 12.4 and PyTorch 2.4 Compatibility #1553

Open
sashaok123 opened this issue Aug 28, 2024 · 13 comments

@sashaok123

I have requested from the xformers developers a precompiled version of xFormers that is compatible with CUDA 12.4 and PyTorch 2.4:
facebookresearch/xformers#1079

They have built precompiled wheels for CUDA 12.4 and PyTorch 2.4:
https://github.com/facebookresearch/xformers/actions/runs/10559887009

Now xformers can be fully added to the current Forge.

@sashaok123
Author

(Viruses on board!)

How can I ban a user or send a complaint to the administrators?

@wuliaodexiaoluo

wuliaodexiaoluo commented Aug 28, 2024

> How can I ban a user or send a complaint to the administrators?

You can report inauthentic account activity to GitHub Support by visiting the account's home page, where there is a "Block or report" option under the account avatar.

EDIT: By the way, this account has since been deleted.

@sais-github

https://app.any.run/tasks/abb4419a-a8cb-4707-946d-e73a9d3561bb
The usual Lumma stealer...
I don't know if you get notifications for every message in an issue, so: @lllyasviel
bad files x.x

github-staff deleted a comment Aug 28, 2024
@dongxiat

For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4, and Python 3.10, read this.

Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009

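For reference, a minimal sketch of installing the downloaded wheel into Forge's existing venv on Windows (the filename below is illustrative; use the actual wheel from the CI run above):

venv\Scripts\python.exe -m pip install xformers-0.0.28.dev893+cu124-cp310-cp310-win_amd64.whl

After restarting Forge with --xformers, the startup log should report the new xformers version.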

Repository owner deleted a comment from Giribot Aug 28, 2024
@sais-github

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

@HMRMike

HMRMike commented Aug 28, 2024

> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. But it's also because of great developments in Forge.
Tested with Flux GGUF Q8, 20 steps, Euler simple, at 1024x1024, after 3-4 warm-up runs for the fastest possible time per image.
2f0555f: Queue/Shared was 31.7 sec. Async was 33.7 sec (didn't work well at the time).
d339600 --disable-xformers: Queue/Shared: 28.7 sec. Async/Shared: 28.5 sec.
d339600 + xformers: Queue/Shared: 26.3 sec. Async/Shared actually isn't any faster now, at 26.6-26.8 sec.

@adrianschubek

> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

Yes!
Did some benchmarks on my RTX 3070 with Flux Q8, 28 steps, Euler simple, 1024x1024:
Forge with CUDA 12.1 + PyTorch 2.3.1: 3.61 s/it
Forge with CUDA 12.4 + PyTorch 2.4: 3.05 s/it (15% faster)
Forge with CUDA 12.4 + PyTorch 2.4 + xformers: 2.85 s/it (21% faster)

@yamfun

yamfun commented Aug 30, 2024

wow

@l33tx0

l33tx0 commented Sep 2, 2024

> With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. […]

Can you share your command line args? I'm getting around 2.2 s/it with the same config,

using
COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream
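For reference, in a stock webui-user.bat these args go on the COMMANDLINE_ARGS line; a minimal sketch, assuming the standard Forge launcher layout:

set COMMANDLINE_ARGS=--xformers --skip-torch-cuda-test --cuda-stream
call webui.bat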

@HMRMike

HMRMike commented Sep 2, 2024


2.2 definitely seems a bit on the slow side for Q8.
I use pretty much the same args usually.
Out of curiosity I removed them all, so only --xformers remains. The speed was not impacted at all! Maybe it's just because of the simple generation settings?
Retested on my current updated commit (stuff changes in 5 days):
1024x1024, Euler Simple, 30 steps, Queue/Shared swap.
Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae
The console reports 1.3 s/it, and after "settling" for 2-3 runs the fastest time per image was 39.6 sec.

Versions from UI bottom:
version: f2.0.1v1.10.1-previous-495-g4f64f6da  •  python: 3.10.6  •  torch: 2.4.0+cu124  •  xformers: 0.0.28.dev893+cu124  •  gradio: 4.40.0  •  checkpoint: d9b5d2777c

@l33tx0

l33tx0 commented Sep 2, 2024


When I start, I get an error like this:

pytorch version: 2.4.0+cu124
WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
  File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available
    import triton # noqa
ModuleNotFoundError: No module named 'triton'
xformers version: 0.0.28.dev893+cu124
Set vram state to: NORMAL_VRAM
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: True
Using xformers cross attention
Using xformers attention for VAE

With SDXL I'm getting 3.57 it/s:
night rain ancient era
Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI

Can you confirm whether the issue is only with Flux, or with my installation?

@HMRMike

HMRMike commented Sep 3, 2024


Yeah, the Triton thing is apparently only for Linux. It's not a real issue on Windows; you can ignore this message.
AUTOMATIC1111/stable-diffusion-webui#7115
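If you want to double-check that xformers itself loaded fine despite that warning, one option is to query its info module from the Forge venv (a quick sketch, assuming the standard venv layout):

venv\Scripts\python.exe -m xformers.info

It prints the build version and which attention backends are available, so you can confirm the memory-efficient kernels are usable without Triton.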

I'm getting almost exactly the same speed with SDXL (it fluctuates, up to 3.6, but effectively identical to yours) with these settings, so that leaves something weird with Flux.
Just to make sure, I'm using the Q8 model from here:
https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main
Otherwise all the versions seem identical; we get the same startup output. Even without xformers it should be quite a bit faster.
Just as a sanity check in such cases, I like to git clone a fresh copy and see if there are any differences, and maybe erase the venv folder and let things rebuild if the fresh copy was indeed faster. It makes hunting for a specific issue less frustrating.

@Hujikuio

Hujikuio commented Nov 13, 2024

> For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4, and Python 3.10 […]

Can you explain how to install the wheel in Forge without a venv (the CUDA 12.4 / PyTorch 2.4 .zip on the main page)? I know it uses embedded Python and sets the paths via environment.bat, but I still can't get pip to work.

EDIT: I think I figured it out; it's the same as with ComfyUI's embedded Python.

The embedded python.exe is in system\python\python.exe, then you just add -m pip install after the .exe.
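For example, from the Forge root folder (substitute the path to the wheel you actually downloaded):

system\python\python.exe -m pip install path\to\xformers-<version>.whl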

You can laugh at me now.
