Implement Flash Attention Option #19

dustydecapod · 2023-03-11T18:57:36Z

Would love to see a faster, more memory-efficient attention implementation, like Flash Attention. :)

ggerganov · 2023-03-12T06:27:03Z

In whisper.cpp I tried using FA in the Decoder and it did not help (it does help a lot in the Encoder).
I guess it is a matter of the tensor sizes, but of course, maybe I didn't implement it properly.

ggml-org/whisper.cpp#284

Orevantum · 2023-03-13T09:16:30Z

Is it possible to implement multi-query attention then?

> In whisper.cpp I tried using FA in the Decoder and it did not help (it does help a lot in the Encoder). I guess it is a matter of the tensor sizes, but of course, maybe I didn't implement it properly.
>
> ggerganov/whisper.cpp#284

xloem · 2023-04-05T07:47:45Z

also note in flexgen they use top 10% sparse attention


jamesbiederbeck · 2023-08-21T03:12:13Z

> also note in flexgen they use top 10% sparse attention

Sparse attention is cool, but lossy. Flash attention is exact.
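To make the exact-versus-lossy distinction concrete, here is a minimal single-query sketch in plain Python. It is not the llama.cpp or whisper.cpp implementation, and it is not FlexGen's code: `naive_attention`, `tiled_attention`, `topk_sparse_attention`, `block`, and `keep_fraction` are hypothetical names chosen for illustration. The tiled version uses the running-max / running-sum ("online softmax") trick that Flash Attention builds on, so it never materializes the full score vector yet should match the naive result up to floating-point rounding. The top-k version simply drops low-scoring keys, which changes the result.

```python
import math


def _scores(q, keys):
    # Scaled dot-product scores of one query against a list of keys.
    d = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]


def naive_attention(q, keys, values):
    """Reference: softmax(q . K^T / sqrt(d)) V for a single query vector."""
    scores = _scores(q, keys)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dv = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dv)]


def tiled_attention(q, keys, values, block=4):
    """Same result, computed block by block with an online softmax.

    Only one block of scores is held at a time; earlier partial results
    are rescaled whenever a larger maximum score is encountered.
    """
    dv = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dv
    for start in range(0, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        scores = _scores(q, kb)
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        # (exp(-inf) == 0.0 on the first block, so nothing is carried over).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vb):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dv):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


def topk_sparse_attention(q, keys, values, keep_fraction=0.1):
    """Illustrative lossy variant: attend only to the top-scoring keys.

    Assumption: "top 10%" is modeled here as keeping the highest-scoring
    fraction of keys; the actual FlexGen scheme may differ.
    """
    scores = _scores(q, keys)
    k = max(1, int(len(keys) * keep_fraction))
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    return naive_attention(q, [keys[i] for i in top], [values[i] for i in top])
```

Calling `naive_attention` and `tiled_attention` on the same inputs should agree to within rounding for any `block` size, while `topk_sparse_attention` generally returns a different vector because the discarded keys no longer contribute to the softmax.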

ggerganov added the enhancement label Mar 12, 2023

SlyEcho pushed a commit to SlyEcho/llama.cpp that referenced this issue Jun 11, 2023

Merge pull request ggml-org#19 from lesaun/master

d6d263f

Clarify build instructions in README.

ggerganov closed this as completed Jul 28, 2023

rooprob pushed a commit to rooprob/llama.cpp that referenced this issue Aug 2, 2023

Merge pull request ggml-org#19 from mcognetta/master

7d62088

Simplify Softmax

Deadsg pushed a commit to Deadsg/llama.cpp that referenced this issue Dec 19, 2023

Add verbose flag. Closes ggml-org#19

c137789

Bearsaerker mentioned this issue Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
