Add support for small page sizes #824
Conversation
Thanks for your great work! Small page sizes are important for LLM inference frameworks. I hope this PR can be merged soon.
Fixed issue with fused RoPE embeddings - should be ready for review.
*apply change from pull request : Dao-AILab/flash-attention#824
Hi, I am waiting for this PR! Is it planned to be merged soon? Also, may I ask when it is planned to be released?
Not sure - @tridao, if you have time, I would greatly appreciate a review so I can make any changes necessary to get this PR merged!
Hi @skrider, thanks for the great work! Based on my test, this kernel is 1.5-4x faster than the Triton equivalent. But when I use it for end-to-end testing in vLLM, I hit an illegal memory access error with the following code:

import torch
from flash_attn import flash_attn_with_kvcache

def cdiv(a, b):
    return (a + b - 1) // b

block_size = 16
num_blocks = 1000 * 16 // block_size
bs = 4
seq_len = 170
num_heads = 32
head_dim = 128
key_cache = torch.rand([num_blocks, block_size, num_heads, head_dim]).half().cuda()
value_cache = torch.rand([num_blocks, block_size, num_heads, head_dim]).half().cuda()
cache_seqlens = torch.zeros(bs, dtype=torch.int32).cuda()
for _ in range(1000):
    query = torch.rand([bs, seq_len, num_heads, head_dim], dtype=torch.float16, device="cuda")
    key = torch.rand([bs, seq_len, num_heads, head_dim], dtype=torch.float16, device="cuda")
    value = torch.rand([bs, seq_len, num_heads, head_dim], dtype=torch.float16, device="cuda")
    block_tables = torch.randint(0, num_blocks, size=(bs, cdiv(seq_len, block_size)), dtype=torch.int32, device="cuda")
    output = flash_attn_with_kvcache(
        query,
        key_cache,
        value_cache,
        k=key,
        v=value,
        cache_seqlens=cache_seqlens,
        block_table=block_tables,
        causal=True,
    )

Error message:
Some observations:
@ymwangg Thanks for the heads up - I will look into it. Reproducing with the provided code, I believe the error is due to an async copy not being properly awaited. Synchronizing after every launch and setting a manual seed does not get rid of the nondeterminism. Additionally, if I only run the kernel every two iterations, it never errors, and a small num_heads also gets rid of the issue. To me this suggests that the state of the L2 cache is correlated with the error. Signing off for tonight, but I will revisit when I have time.
@ymwangg Is it possible that passing k=key and v=value each iteration causes seq_len=170 tokens to be appended to the KV cache each time, which would overflow after a couple of iterations?
My understanding is that this function does not allocate new memory but rather writes the new keys/values into the preallocated cache in place.
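For illustration, here is a rough Python model of the append semantics being discussed (a sketch of my own, not code from this PR; it assumes the kernel writes k/v at the positions given by cache_seqlens and does not advance cache_seqlens itself, so with cache_seqlens fixed at zero the same slots are simply overwritten each iteration rather than overflowing):

```python
# Hedged sketch: a simplified Python model of the assumed paged-KV append,
# not the actual CUDA implementation.
import torch

def append_to_paged_cache(k_cache, k_new, block_table, cache_seqlens, block_size):
    # k_cache: [num_blocks, block_size, num_heads, head_dim]
    # k_new:   [bs, seq_len, num_heads, head_dim]
    bs, seq_len = k_new.shape[0], k_new.shape[1]
    for b in range(bs):
        for t in range(seq_len):
            pos = int(cache_seqlens[b]) + t                # logical position in sequence b
            page = int(block_table[b, pos // block_size])  # physical page holding that position
            k_cache[page, pos % block_size] = k_new[b, t]  # in-place write, no new allocation
    # cache_seqlens is left unchanged; the caller is responsible for advancing it.
```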
After inspecting locally, I found that the illegal memory access is caused by the table_diff calculation overflowing and propagating to later iterations. Since n_block is iterated in reverse order, the calculated virtual_page_idx_next of the page_table may be larger than the table allocated in the first round, yielding undetermined table_diffs that are never corrected by advancing tKgK.data() relatively. So the fix is straightforward: compute the page offset directly from the page table each iteration instead of advancing by the relative table_diff. Tested locally, the illegal access is gone and the flash_attn_kvcache tests pass.
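To make this failure mode concrete, here is a hedged sketch in plain Python (my simplified model, not the kernel code; kBlockN and the indexing are assumptions) contrasting relative advancement by table_diff with recomputing the offset from the page table each iteration:

```python
# Hedged sketch contrasting the two addressing strategies; not the CUDA kernel.
def offsets_relative(block_table, n_block_max, kBlockN, page_block_size):
    """Advance by the difference between consecutive page indices (table_diff).
    If one lookup reads past the valid table, the bogus diff is folded into the
    running offset and every later iteration stays wrong."""
    offsets = []
    off = block_table[(n_block_max - 1) * kBlockN // page_block_size] * page_block_size
    for n_block in range(n_block_max - 1, -1, -1):
        offsets.append(off)
        if n_block > 0:
            cur = block_table[n_block * kBlockN // page_block_size]
            nxt = block_table[(n_block - 1) * kBlockN // page_block_size]
            off += (nxt - cur) * page_block_size  # relative step: errors accumulate
    return offsets

def offsets_direct(block_table, n_block_max, kBlockN, page_block_size):
    """Recompute the offset from the page table every iteration; a single bad
    index can only corrupt that one iteration."""
    return [block_table[n_block * kBlockN // page_block_size] * page_block_size
            for n_block in range(n_block_max - 1, -1, -1)]
```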
Btw, do we plan to merge this soon?
The fix mentioned by @gnap was implemented by @ymwangg in this commit: ymwangg@7354198. @gnap, would you mind checking it to verify that's what you had in mind? @ymwangg told me the illegal access is gone with that commit on top of this PR. Could someone pull that fix into this PR and fix the conflicts so it can be merged?
@davidthomas426 @ymwangg I have checked that commit and the modification is identical to my local change. Currently I am conducting more tests with our internal inference engine, but if the vLLM community tests come back okay, feel free to commit it or ask @skrider to update this PR.
Thanks for your great work! Does this PR support varlen with KV block tables, as in 2a15840?
Thank you everyone for all the help! I will review locally and push the fix. I used the difference between page indices rather than calculating the offset directly because that's how it was done originally; besides saving a register, I am not sure there are any advantages to doing this. @gnap, curious what your process was for finding the bug?
In progress, expect it sometime next week.
These changes pass unit tests for the standard and varlen APIs as well as the example provided above by @ymwangg.
By running the …
Thanks so much for your work @skrider. Can you rebase and then I'll merge?
@skrider Are you going to rebase this so it can get merged?
@tridao Absolutely! Sorry, just seeing this; the notification fell through the cracks.
@skrider It looks like rebasing was pretty easy. Would you mind if I just create a PR against your branch? (I just ran git merge main, and there were no conflicts.)
// assumes that the tensor has already been positioned at the correct head.
template <typename Kernel_traits>
__forceinline__ __device__
int resolve_thread_kv_page_slice_offset(const int tidx, const int n_block_max, const int page_block_size,
Thanks for your great work!
I think int64 should be used here, because using int32 may overflow when the page index is multiplied by the stride. The following code raises an illegal memory access error in my test (A100 40G):
import torch
from flash_attn import flash_attn_varlen_func
torch.manual_seed(0)
num_pages = 4048
page_id = 4000
page_size = 256
num_heads = 32
head_size = 128
seq_len = 13
q = torch.randn(seq_len, 32, 128, device="cuda", dtype=torch.float16)
k_cache = torch.zeros(num_pages, page_size, num_heads, head_size, device="cuda", dtype=torch.float16)
v_cache = torch.zeros(num_pages, page_size, num_heads, head_size, device="cuda", dtype=torch.float16)
cu_seqlens_q = torch.tensor([0, seq_len], device="cuda", dtype=torch.int32)
seqlens_k = torch.tensor([seq_len], device="cuda", dtype=torch.int32)
cu_seqlens_k = torch.tensor([0, seq_len], device="cuda", dtype=torch.int32)
block_table=torch.tensor([[page_id]], device='cuda', dtype=torch.int32)
k_cache[page_id, :seq_len] = torch.randn(seq_len, num_heads, head_size, device="cuda", dtype=torch.float16)
v_cache[page_id, :seq_len] = torch.randn(seq_len, num_heads, head_size, device="cuda", dtype=torch.float16)
flash_attn_varlen_func(
    q=q,
    k=k_cache,
    v=v_cache,
    cu_seqlens_q=cu_seqlens_q,
    cu_seqlens_k=cu_seqlens_k,
    max_seqlen_q=seq_len,
    max_seqlen_k=seq_len,
    causal=True,
    block_table=block_table,
)
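For reference, a quick check of the magnitudes involved (my own arithmetic; the exact indexing expression in the kernel is an assumption here):

```python
# Quick sanity check of the suspected int32 overflow (assumes the element offset
# is computed as page_id * page_size * num_heads * head_size).
page_id, page_size, num_heads, head_size = 4000, 256, 32, 128
page_stride = page_size * num_heads * head_size   # 1_048_576 elements per page
offset = page_id * page_stride                    # 4_194_304_000 elements
print(offset > 2**31 - 1)                         # True: exceeds INT32_MAX (2_147_483_647)
```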
@skrider Hi, any updates on this PR?
Any updates on the current PR? @skrider
This PR has already been merged into this repository and is now part of the flash_attn version that vLLM depends on.
Recently, support was added for paged attention with large page sizes of 256 tokens. However, projects that use paged attention typically prefer smaller page sizes of around 16. This PR adds support for smaller page sizes by reshaping the GMEM -> SMEM copy so that, in each iteration of the mainloop, each thread fetches from only a single page. Physical page addresses therefore need to be resolved only at the beginning of each mainloop iteration, and can be resolved per-thread rather than per-CTA.
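As a rough illustration of the per-thread page resolution described above, here is a hedged Python sketch (a simplified model of my own, not the actual CUDA kernel; the layout parameters kBlockN, rows_per_thread, and row_stride are assumptions):

```python
# Hedged sketch: per-thread physical page resolution at the start of a mainloop
# iteration, assuming each thread's rows never straddle a page boundary.
def thread_kv_offset(tidx, n_block, block_table, kBlockN, page_block_size,
                     rows_per_thread, row_stride):
    first_row = n_block * kBlockN + tidx * rows_per_thread   # first KV row this thread loads
    page = block_table[first_row // page_block_size]         # single per-thread table lookup
    row_in_page = first_row % page_block_size
    # Under the assumption above, all of this thread's rows share `page`, so the
    # physical address can be resolved once per thread per iteration.
    return (page * page_block_size + row_in_page) * row_stride
```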
Preliminary benchmarking with ncu on the unit testing suite shows no degradation in performance.