
CUTLASS Fused multi head attention #1112

Open
yoon5862 opened this issue Sep 25, 2024 · 2 comments
@yoon5862

❓ Questions and Help

Hello, I am looking at the fused multi-head attention example in 3rdparty/cutlass.
In cutlass/examples, the fused multi-head attention code was upstreamed from xFormers.
CUTLASS also says the fused multi-head attention example is the same as FlashAttention-2.
Is it true that the CUTLASS fused multi-head attention kernel and the FlashAttention-2 kernel are the same thing?
Thank you.

@danthe3rd
Contributor

CUTLASS also says the fused multi-head attention example is the same as FlashAttention-2.

I believe those are not the same thing. Where did you see that?
Flash-Attention 2 is built using the CUTLASS library, but what we call the "cutlass" implementation in xFormers, and what is in cutlass/examples, is something else.
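For illustration, here is a minimal sketch (not code from this thread, and assuming a recent xformers release where these op names exist) of how xFormers exposes the CUTLASS-example kernel and FlashAttention as two separate backends of the same memory_efficient_attention API:

```python
import torch
import xformers.ops as xops

# q, k, v are [batch, seq_len, num_heads, head_dim]; fp16 on CUDA so both
# backends can run (the flash backend requires fp16/bf16).
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

# "cutlass" backend: the implementation that was upstreamed to cutlass/examples.
out_cutlass = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionCutlassOp
)

# "flash" backend: a distinct kernel that wraps FlashAttention.
out_flash = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionFlashAttentionOp
)

# With op omitted, xFormers dispatches to a suitable backend automatically.
out_auto = xops.memory_efficient_attention(q, k, v)
```

The point of the sketch is only that the two are selectable independently, which is why they are not the same kernel.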

@yoon5862
Author

yoon5862 commented Oct 8, 2024

Thank you for the reply.
The CUTLASS examples say the code was upstreamed from xFormers:

Acknowledgement: Fixed-sequence-length FMHA code was upstreamed by Meta xFormers (https://github.com/facebookresearch/xformers).

Therefore, I think xFormers uses a custom CUTLASS kernel and tunes its kernels to pick the best configuration in each setting.
