Paged attention #1
base: main
Conversation
        
          
torchao/kv_cache.py (Outdated)
    cache: torch.Tensor,
    block_tables: torch.Tensor,
    context_lens: torch.Tensor,
):
Consider adding documentation for these args. It would also be helpful to note which tensors are owned by the object and which are shared; I guess cache is shared among multiple PagedTensors? What are the shapes of these tensors?
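For illustration, a minimal annotated sketch of what I have in mind; the shapes follow a vLLM-style paged KV cache layout and are my assumption, not something stated in this PR:

```python
import torch

class PagedTensor:
    """Hypothetical docstring sketch; shapes assume a vLLM-style paged layout.

    Args:
        cache: the physical KV block pool, shared among PagedTensors,
            assumed shape [num_blocks, num_kv_heads, block_size, head_size].
        block_tables: logical-to-physical block mapping owned by this object,
            assumed shape [num_seqs, max_num_blocks_per_seq].
        context_lens: number of valid tokens per sequence, owned by this object,
            assumed shape [num_seqs].
    """

    def __init__(
        self,
        cache: torch.Tensor,
        block_tables: torch.Tensor,
        context_lens: torch.Tensor,
    ):
        self.cache = cache
        self.block_tables = block_tables
        self.context_lens = context_lens
```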
        
          
torchao/kv_cache.py (Outdated)
):
    self.block_tables = block_tables
    self.cache = cache
    self.context_lens = context_lens
I'm not sure it is general enough to incorporate "context length" into the semantics of a PagedTensor. Context length sounds like an application-level concept, not a general tensor-level one.
        
          
torchao/kv_cache.py (Outdated)
key_cache = key_tensor.cache
value_cache = value_tensor.cache
num_kv_head = key_cache.size(1)
num_queries_per_kv = query.size(1) // num_kv_head
Should we add an assertion here to make sure query.size(1) % num_kv_head == 0?
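Something like this sketch, wrapped as a hypothetical helper so it is self-contained (names mirror the snippet above):

```python
import torch

def _num_queries_per_kv(query: torch.Tensor, key_cache: torch.Tensor) -> int:
    # Hypothetical helper: fail fast when the query head count is not an
    # integer multiple of the KV head count (grouped-query attention invariant).
    num_kv_head = key_cache.size(1)
    assert query.size(1) % num_kv_head == 0, (
        f"number of query heads ({query.size(1)}) must be divisible by "
        f"number of KV heads ({num_kv_head})"
    )
    return query.size(1) // num_kv_head
```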
        
          
torchao/kv_cache.py (Outdated)
    query,
    key_cache,
    value_cache,
    head_mapping,
Can we remove this head_mapping argument and move it into the implementation, assuming we always do the even mapping here?
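For context, with an even mapping the KV head index is just an integer division of the query head index, so the kernel could derive it internally; a small sketch of the equivalence (names are illustrative):

```python
# Even mapping: every group of num_queries_per_kv consecutive query heads
# shares one KV head, so the mapping reduces to integer division.
def kv_head_for(q_head: int, num_queries_per_kv: int) -> int:
    return q_head // num_queries_per_kv

# e.g. 8 query heads over 2 KV heads -> [0, 0, 0, 0, 1, 1, 1, 1]
mapping = [kv_head_for(h, num_queries_per_kv=4) for h in range(8)]
```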
                                       : 0;
int64_t mStrideM = has_attn_mask ? attn_mask.value().stride(2) : 0;

auto max_num_partitions =
nit: The name "partition" sounds too general. Suggest specifying that it refers to the sequence dimension, e.g., max_num_seq_partitions. Same comment for the other related names.
 * @param out Output tensor [num_seqs, 1, num_heads, head_size].
 * @param query Query tensor [num_seqs, 1, num_heads, head_size].
Please add a runtime assertion in the code to make sure the query has sequence length 1 here. BTW, can we extend the implementation to support query sequence length > 1, which would benefit chunked prefill and multi-turn conversation cases?
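A minimal sketch of the assertion, assuming the [num_seqs, 1, num_heads, head_size] query layout documented above:

```python
import torch

def _check_decode_query(query: torch.Tensor) -> None:
    # Hypothetical check: the current kernel is decode-only, so the query
    # sequence length (dim 1) must be exactly 1.
    assert query.dim() == 4 and query.size(1) == 1, (
        "paged attention currently expects a query of shape "
        f"[num_seqs, 1, num_heads, head_size], got {tuple(query.shape)}"
    )
```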
        
          
torchao/kv_cache.py (Outdated)
head_mapping = torch.repeat_interleave(
    torch.arange(num_kv_head, dtype=torch.int32, device="cpu"), num_queries_per_kv
)
Can we do this inside the paged_attention C++ kernel so that we don't need to pass the head_mapping arg to it? That would simplify the kernel interface.
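Until it moves into the C++ kernel, the construction could at least be hidden behind a small Python wrapper so callers never pass it explicitly; a sketch with a hypothetical helper name:

```python
import torch

def _default_head_mapping(num_kv_head: int, num_queries_per_kv: int) -> torch.Tensor:
    # Even query-head -> KV-head mapping, built internally instead of being
    # threaded through the public interface.
    return torch.repeat_interleave(
        torch.arange(num_kv_head, dtype=torch.int32, device="cpu"), num_queries_per_kv
    )
```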
    reshape_attn_mask_to_4d(attn_mask.value(), num_seqs, num_heads, q_len,
                            attn_mask.value().size(-1));
Does this convert the attn_mask to 4D or just view it as 4D? Since we are working on raw pointers, perhaps we don't need to expand it to a 4D view here?
if (has_attn_mask) {
    _scale_attn_mask_fusion_kernel<accum_t, accum_t>(
        logits,
        attn_mask_ptr + seq_id * mStrideB + head_id * mStrideH +
I guess we need to carefully handle the case where some dim of the mask has size 1 (i.e., it broadcasts) here.
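Agreed; one way to express it is to force a zero stride for any broadcast (size-1) mask dim before handing the strides to the kernel. A sketch at the Python level, with names loosely mirroring the mStride* variables but not taken from this PR:

```python
import torch

def _mask_strides(attn_mask: torch.Tensor) -> tuple[int, int, int]:
    # Assumes a 4D mask laid out as [batch, heads, q_len, kv_len]; any dim of
    # size 1 must broadcast, so its effective stride is 0 when indexing with
    # raw pointers, otherwise we would read past the mask data.
    m_stride_b = attn_mask.stride(0) if attn_mask.size(0) > 1 else 0
    m_stride_h = attn_mask.stride(1) if attn_mask.size(1) > 1 else 0
    m_stride_m = attn_mask.stride(2) if attn_mask.size(2) > 1 else 0
    return m_stride_b, m_stride_h, m_stride_m
```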
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
* feat: starting layout implementation
  fix: namespace of common modules
  chore: remove not needed test file
  fix: op name being registered
  chore: can compile the cuda kernel
  fix: segmentation fault
  chore: wip - paste test code just to check if everything passes
  feat: wip - adding layout. unpack not working
  fix: circular import
  feat: wip - can almost revert
  feat: can unpack. just needs cleanup
  chore: improve layout code
  chore: wip - mm needs work
  feat: wip - something seems wrong
  fix: e2e test
  feat: wip - add group param
  fix: unpack weights
  feat: marlin is implemented and correct
  chore: rebase
  chore: remove old import
  feat: use int4 instead of dequantizing
  chore: remove unused fn
  feat: add checks and validation
  feat: add new kernel and refactor code (#1)
    * feat: wip - adding new kernel
    * feat: wip - continue working on the unpack
    * feat: wip - working on unpacking
    * feat: remove old op
    * feat: more code changes
    * chore: remove old code
    * feat: more code
    * chore: more code changes
    * chore: more code changes
    * feat: add more documentation
    * fix: dataclass
    * feat: add more docs
    * feat: remove assert
  chore: block 8 bits
  chore: update comment
  feat: refactor dispatch
  chore: add validation on group size
  chore: wip - working on fixing unpack
  feat: add small readme with sources
  feat: add checks
  feat: tests pass & can execute llama2
* compile kind of working
* fix: batching and layout outputs correct results
* fix: torch.compile
* wip
* feat: wip
* chore: cleanup
* chore: review
* chore: review v2
* update benchmarks + README
---------
Co-authored-by: Jesse Cai <jcjessecai@gmail.com>
* Lint fixes
* Ruff auto-format
Revert "Lint fixes #1 torchao/dtypes (pytorch#827)" This reverts commit 144445a. Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>
No description provided.