add paged-attention #20

Merged
merged 5 commits into FlagOpen:main on Mar 7, 2024

Conversation

kaiyuanm (Contributor):

  1. Add the paged attention operation commonly used in large language model inference. For background on paged attention, see https://arxiv.org/pdf/2309.06180.pdf (a short sketch of the block-table idea appears after the notes below).
  2. This implementation requires Triton 2.2.0.

Here is some performance data comparing this Triton implementation with vLLM.

vllm_paged_attention-B32-G8-D128-bs16-v2:
   context_len     triton  vllm-0.3.0
0        512.0   6.618643    3.582180
1       1024.0   7.798634    3.925020
2       2048.0   8.560921    4.286430
3       4096.0  12.241113    4.460762
4       8192.0  13.589705    4.558541
5      16384.0  14.094599    4.609148

Note:

  • The input layouts differ, so the performance numbers are for reference only.
  • Some input shapes are still being optimized.
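
For readers new to the idea, here is a minimal, hedged Python sketch of the block-table indirection that paged attention relies on. The names and sizes are illustrative only, not this PR's code: the KV cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to those blocks.

import numpy as np

NUM_PHYS_BLOCKS, KV_BLOCK_SIZE, HEAD_SIZE = 16, 4, 8

# Physical KV cache: fixed-size blocks that need not be contiguous per sequence.
k_cache = np.random.randn(NUM_PHYS_BLOCKS, KV_BLOCK_SIZE, HEAD_SIZE)

# Block table for one sequence: logical block index -> physical block index.
block_table = np.array([7, 2, 11, 5])

def lookup_key(token_pos):
    # Resolve a token's cached key vector through the block table.
    logical_block = token_pos // KV_BLOCK_SIZE
    block_offset = token_pos % KV_BLOCK_SIZE
    physical_block = block_table[logical_block]
    return k_cache[physical_block, block_offset]

print(lookup_key(9).shape)  # (HEAD_SIZE,)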

elif num_splits > 1:
    partition_size = triton.cdiv(max_context_len, num_splits)
    partition_size = triton.next_power_of_2(partition_size)
    assert partition_size >= kv_block_size
Collaborator:

else assert num_splits == 1

Contributor Author:

This follows the flash_attn flash_attn_with_kvcache definition:

num_splits: int. If > 1, split the key/value into this many chunks along the sequence.
If num_splits == 1, we don't split the key/value. If num_splits == 0, we use a heuristic
to automatically determine the number of splits.

Collaborator:

Yes, but here you still need to guard num_splits against negative values.
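
To make the suggestion concrete, here is a minimal sketch of how the dispatch could be guarded, assuming the flash_attn-style semantics quoted above. num_splits, max_context_len, kv_block_size, and partition_size come from the snippet under review; the explicit 0/1 branches and the assertion message are illustrative, not the merged code.

assert num_splits >= 0, f"num_splits must be non-negative, got {num_splits}"
if num_splits == 0:
    # Heuristic path: pick the number of splits automatically (details elided).
    ...
elif num_splits == 1:
    # No split: the whole key/value sequence is handled as a single partition.
    ...
else:  # num_splits > 1
    partition_size = triton.cdiv(max_context_len, num_splits)
    partition_size = triton.next_power_of_2(partition_size)
    assert partition_size >= kv_block_size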

kv_mask = mask_offset[:, None] < context_len

# k: [KV_BLOCK_SIZE, HEAD_SIZE]
k = tl.load(k_cache_ptr + kv_block_offset, mask=kv_mask, other=0.0)
Collaborator:

Is it possible to omit kv_mask here and mask the output instead?

Contributor Author:

Looks OK to me. What do you think? @iclementine

Collaborator:

When loading, the mask is obligatory if the access would otherwise touch illegal memory. However, you can use the modulo trick to avoid masking; whether it is more efficient than masking depends on the relative cost.
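
For illustration only, a hedged toy kernel contrasting the two options. The kernel name, shapes, and launch are assumptions rather than this PR's kernel; only the tl.load usage mirrors the snippet under discussion.

import torch
import triton
import triton.language as tl

@triton.jit
def load_k_block(k_ptr, out_ptr, context_len,
                 KV_BLOCK_SIZE: tl.constexpr, HEAD_SIZE: tl.constexpr,
                 USE_MODULO: tl.constexpr):
    rows = tl.arange(0, KV_BLOCK_SIZE)
    cols = tl.arange(0, HEAD_SIZE)
    offsets = rows[:, None] * HEAD_SIZE + cols[None, :]
    if USE_MODULO:
        # Modulo trick: wrap out-of-range rows back into the valid region so every
        # address is legal; the duplicated rows must be neutralized later (e.g. when
        # the attention scores are masked), but no load-time mask is needed.
        safe_offsets = (rows % context_len)[:, None] * HEAD_SIZE + cols[None, :]
        k = tl.load(k_ptr + safe_offsets)
    else:
        # Masked load: obligatory whenever the raw offsets could touch illegal memory.
        kv_mask = rows[:, None] < context_len
        k = tl.load(k_ptr + offsets, mask=kv_mask, other=0.0)
    tl.store(out_ptr + offsets, k)

# Tiny launch to exercise the kernel (assumes a CUDA device is available).
k = torch.randn(16, 8, device="cuda")
out = torch.empty_like(k)
load_k_block[(1,)](k, out, 5, KV_BLOCK_SIZE=16, HEAD_SIZE=8, USE_MODULO=True)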

@iclementine (Collaborator) left a comment:

lgtm

@iclementine merged commit b0045fb into FlagOpen:main on Mar 7, 2024. 1 check passed.