[WIP] Support for cached multi-query attention towards speculative decoding #1679
Conversation
I've made a pull request to flash-attention that adds support for a blocked KV cache in flash-decoding, which supports MQA. The performance is nearly identical to the original. You might want to check it out.
@beginlner thanks for the info. Reading https://github.com/microsoft/DeepSpeed-Kernels/blob/main/dskernels/inf_flash_attn/blocked_flash/flash_fwd_kernel.h as well.
So far, is there any progress on enabling speculative decoding for vLLM? Additionally, I'm wondering if the implementation of this kernel might result in increased GPU memory usage.
When can this branch be merged? In the version I am currently using, there is:
Is the Flash operation supported only on HIP?
Initial prototype of cached multi-query attention that exploits implementation details of the single-query cached attention kernel to adapt it to the multi-query setting.
Given n sequences, each with a maximum draft length of k tokens to be verified, the prototype greedily caches all draft keys and values, then calls paged_attention on the n * k query vectors. Drafts of the same sequence are "symbolically linked" to the original's KV cache, and "future" tokens are masked out by interpolating the sequence_len passed to the paged attention kernel from context_len to context_len + draft_len.
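To make the interpolated-sequence-length trick concrete, here is a minimal sketch of how the per-sequence metadata might be expanded into per-draft-token metadata. The helper name (`build_mqa_inputs`) and tensor layouts are hypothetical; only the idea of per-query sequence lengths and shared ("symbolically linked") block tables comes from this PR.

```python
import torch

def build_mqa_inputs(context_lens: torch.Tensor,
                     draft_lens: torch.Tensor,
                     block_tables: torch.Tensor):
    """Expand per-sequence metadata to one entry per draft query vector.

    context_lens: (n,)            tokens already committed to each KV cache
    draft_lens:   (n,)            number of speculative tokens to verify
    block_tables: (n, max_blocks) KV-cache block indices per sequence
    (Hypothetical layouts, chosen for illustration.)
    """
    seq_lens, tables = [], []
    for i, (ctx, k) in enumerate(zip(context_lens.tolist(), draft_lens.tolist())):
        for j in range(k):
            # Draft token j may attend to the context plus drafts 0..j, so
            # passing ctx + j + 1 as its sequence length lets the unmodified
            # single-query kernel mask out the "future" draft tokens.
            seq_lens.append(ctx + j + 1)
            # Every draft query of sequence i reuses ("symbolically links"
            # to) the same block table, since the draft keys/values were
            # greedily appended to the original sequence's cache.
            tables.append(block_tables[i])
    return torch.tensor(seq_lens, dtype=torch.int32), torch.stack(tables)
```

With a fixed draft length this yields exactly n * k query vectors; variable draft lengths fall out of the same loop, which is where the masking overhead discussed below comes from.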
While this kernel supports dynamic draft lengths, it does so somewhat inefficiently via masking rather than dynamic shapes; there is potential room for improvement here.
Performance has yet to be profiled. The intention behind this PR is to serve as a reference implementation against which a more performant MQA kernel can be developed.