
🚀 Precompiled xFormers for CUDA 12.4 and PyTorch 2.4 Compatibility #1553

Open
sashaok123 opened this issue Aug 28, 2024 · 13 comments

@sashaok123

I have requested from the xformers developers a precompiled version of xFormers that is compatible with CUDA 12.4 and PyTorch 2.4:
facebookresearch/xformers#1079

They have built precompiled wheels for CUDA 12.4 and PyTorch 2.4:
https://github.com/facebookresearch/xformers/actions/runs/10559887009

Now xformers can be fully added to the current Forge.

@sashaok123
Author

(Viruses on board!)

How can I ban a user or send a complaint to the administrators?

@wuliaodexiaoluo

wuliaodexiaoluo commented Aug 28, 2024

> How can I ban a user or send a complaint to the administrators?

You can report inauthentic account activity to GitHub Support by visiting the account's home page, where there is a "Block or report" option under the account avatar.

EDIT: By the way, this account has since been deleted.

@sais-github

https://app.any.run/tasks/abb4419a-a8cb-4707-946d-e73a9d3561bb
The usual Lumma stealer...
I don't know if you get notifications for every message in an issue, so: @lllyasviel
bad files x.x

github-staff deleted a comment Aug 28, 2024
@dongxiat

For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4, and Python 3.10, read this.

Link to download the .whl file: https://github.com/facebookresearch/xformers/actions/runs/10559887009

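For reference, a minimal sketch of installing the downloaded wheel into Forge's existing venv on Windows (the filename below is illustrative; use the actual wheel from the CI run above):

venv\Scripts\python.exe -m pip install xformers-0.0.28.dev893+cu124-cp310-cp310-win_amd64.whl

After restarting Forge with --xformers, the startup log should report the new xformers version.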

Repository owner deleted a comment from Giribot Aug 28, 2024
@sais-github

Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

@HMRMike

HMRMike commented Aug 28, 2024

> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. But it's also because of great developments in Forge.
Tested with Flux GGUF Q8, 20 steps, Euler simple, at 1024x1024, after 3-4 warm-up runs for the fastest possible time per image.
2f0555f: Queue/Shared was 31.7 sec. Async was 33.7 sec (didn't work well at the time).
d339600 --disable-xformers: Queue/Shared: 28.7 sec. Async/Shared: 28.5 sec.
d339600 + xformers: Queue/Shared: 26.3 sec. Async/Shared actually isn't any faster now, at 26.6-26.8 sec.

@adrianschubek

> Is there a noticeable enough performance boost to want to reinstall with newer pytorch?

Yes!
Did some benchmarks on my RTX 3070 with Flux Q8, 28 steps, Euler simple, 1024x1024:
Forge with CUDA 12.1 + PyTorch 2.3.1: 3.61 s/it
Forge with CUDA 12.4 + PyTorch 2.4: 3.05 s/it (15% faster)
Forge with CUDA 12.4 + PyTorch 2.4 + xformers: 2.85 s/it (21% faster)

@yamfun

yamfun commented Aug 30, 2024

wow

@l33tx0

l33tx0 commented Sep 2, 2024

> With an RTX 3090 I'm seeing a bit of a boost; when it takes 1.5 seconds per step, every little bit helps. […]

Can you share your command line args? I'm getting around 2.2 s/it with the same config,

using
COMMANDLINE_ARGS= --xformers --skip-torch-cuda-test --cuda-stream
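For reference, in a stock webui-user.bat these args go on the COMMANDLINE_ARGS line; a minimal sketch, assuming the standard Forge launcher layout:

set COMMANDLINE_ARGS=--xformers --skip-torch-cuda-test --cuda-stream
call webui.bat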

@HMRMike

HMRMike commented Sep 2, 2024


2.2 definitely seems a bit on the slow side for Q8.
I use pretty much the same args usually.
Out of curiosity I removed them all, so only --xformers remains. The speed was not impacted at all! Maybe it's just because of the simple generation settings?
Retested on my current updated commit (stuff changes in 5 days):
1024x1024, Euler Simple, 30 steps, Queue/Shared swap.
Model: flux1-dev-Q8_0, Module 1: t5-v1_1-xxl-encoder-Q8_0, Module 2: clip_l, Module 3: ae
The console reports 1.3 s/it, and after "settling" for 2-3 runs the fastest time per image was 39.6 sec.

Versions from UI bottom:
version: f2.0.1v1.10.1-previous-495-g4f64f6da  •  python: 3.10.6  •  torch: 2.4.0+cu124  •  xformers: 0.0.28.dev893+cu124  •  gradio: 4.40.0  •  checkpoint: d9b5d2777c

@l33tx0

l33tx0 commented Sep 2, 2024


When I start, I get an error like this:

pytorch version: 2.4.0+cu124
WARNING:xformers:A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
  File "..forge\venv\lib\site-packages\xformers\__init__.py", line 57, in _is_triton_available
    import triton # noqa
ModuleNotFoundError: No module named 'triton'
xformers version: 0.0.28.dev893+cu124
Set vram state to: NORMAL_VRAM
VAE dtype preferences: [torch.bfloat16, torch.float32] -> torch.bfloat16
CUDA Using Stream: True
Using xformers cross attention
Using xformers attention for VAE

With SDXL I'm getting 3.57 it/s:
night rain ancient era
Steps: 20, Sampler: DPM++ 2M SDE, Schedule type: Karras, CFG scale: 7.5, Seed: 3617511334, Size: 1024x1024, Model hash: 7b91764cf2, Model: copaxTimelessxlSDXL1_v122, Version: f2.0.1v1.10.1-previous-501-g668e87f9, Module 1: sdxl_vae_fp16_fixv2, Source Identifier: Stable Diffusion web UI

Can you confirm whether the issue is only with Flux, or with my installation?

@HMRMike

HMRMike commented Sep 3, 2024


Yeah, the Triton thing is apparently only for Linux. It's not a real issue on Windows; you can ignore this message.
AUTOMATIC1111/stable-diffusion-webui#7115
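If you want to double-check that xformers itself loaded fine despite that warning, one option is to query its info module from the Forge venv (a quick sketch, assuming the standard venv layout):

venv\Scripts\python.exe -m xformers.info

It prints the build version and which attention backends are available, so you can confirm the memory-efficient kernels are usable without Triton.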

I'm getting almost exactly the same speed with SDXL (it fluctuates, up to 3.6, but effectively identical to yours) with these settings, so that leaves something weird with Flux.
Just to make sure, I'm using the Q8 model from here:
https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main
Otherwise all the versions seem identical; we get the same startup output. Even without xformers it should be quite a bit faster.
Just as a sanity check in such cases, I like to git clone a fresh copy and see if there are any differences, and maybe erase the venv folder and let things rebuild if the fresh copy was indeed faster. It makes hunting for a specific issue less frustrating.

@Hujikuio

Hujikuio commented Nov 13, 2024

> For those who don't know how to install xformers with CUDA 12.4, PyTorch 2.4, and Python 3.10 […]

Can you explain how to install the wheel in Forge without a venv (the CUDA 12.4 / PyTorch 2.4 .zip on the main page)? I know it uses embedded Python and sets the paths via environment.bat, but I still can't get pip to work.

EDIT: I think I figured it out; it's the same as with ComfyUI's embedded Python.

The embedded python.exe is in system\python\python.exe, then you just add -m pip install after the .exe.
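For example, from the Forge root folder (substitute the path to the wheel you actually downloaded):

system\python\python.exe -m pip install path\to\xformers-<version>.whl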

You can laugh at me now.
