
CUTLASS Fused multi head attention #1112

Open
yoon5862 opened this issue Sep 25, 2024 · 2 comments
@yoon5862

❓ Questions and Help

Hello, I am looking at the fused multi-head attention example in 3rdparty/cutlass.
In cutlass/examples, the fused multi-head attention code was upstreamed from xFormers.
CUTLASS also says the fused multi-head attention example is the same as FlashAttention-2.
Is it true that the CUTLASS fused multi-head attention kernel and the FlashAttention-2 kernel are the same thing?
Thank you.

@danthe3rd
Contributor

CUTLASS also says the fused multi-head attention example is the same as FlashAttention-2.

I believe those are not the same thing. Where did you see that?
Flash-Attention 2 is built using the CUTLASS library, but what we call the "cutlass" implementation in xFormers, and what is in cutlass/examples, is something else.
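For illustration, here is a minimal sketch (not code from this thread, and assuming a recent xformers release where these op names exist) of how xFormers exposes the CUTLASS-example kernel and FlashAttention as two separate backends of the same memory_efficient_attention API:

```python
import torch
import xformers.ops as xops

# q, k, v are [batch, seq_len, num_heads, head_dim]; fp16 on CUDA so both
# backends can run (the flash backend requires fp16/bf16).
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

# "cutlass" backend: the implementation that was upstreamed to cutlass/examples.
out_cutlass = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionCutlassOp
)

# "flash" backend: a distinct kernel that wraps FlashAttention.
out_flash = xops.memory_efficient_attention(
    q, k, v, op=xops.MemoryEfficientAttentionFlashAttentionOp
)

# With op omitted, xFormers dispatches to a suitable backend automatically.
out_auto = xops.memory_efficient_attention(q, k, v)
```

The point of the sketch is only that the two are selectable independently, which is why they are not the same kernel.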

@yoon5862
Author

yoon5862 commented Oct 8, 2024

Thank you for the reply.
The CUTLASS examples say the code was upstreamed from xFormers:

Acknowledgement: Fixed-sequence-length FMHA code was upstreamed by Meta xFormers (https://github.com/facebookresearch/xformers).

Therefore, I think xFormers uses a custom CUTLASS kernel and tunes its kernels to pick the best configuration in each setting.
