
Conversation

@jvlunteren (Contributor) commented Jul 18, 2025

This PR introduces modifications and extensions to the Triton unified attention kernel (#16828 and #19152) that enable support for hybrid models such as Granite 4.0, which combine Mamba-2 and Transformer layers.

Key changes:

  • adopt a FlashInfer-style KV cache layout (num_blocks, 2, block_size, num_kv_heads, head_size)

  • implement a reorder_batch() function

  • support larger, non-power-of-two block sizes

In addition, prefill support was initially removed from the included split-KV attention kernel, dedicating it solely to decode operations, since prefill typically involves sufficient parallelism to utilize GPU compute resources efficiently without the split-KV optimization. That removal has since been reverted to simplify the review process, as it is orthogonal to the main focus of the PR; prefill support will be addressed in a future PR.

The FlashInfer-style KV cache layout and the reorder_batch() function have also since been removed, as they are no longer needed.

Performance

This PR extends the functionality of the Triton unified attention kernel to support hybrid models. We need to ensure that these changes do not degrade performance for conventional models. This was validated by running benchmark_serving.py using the meta-llama/Llama-3.1-8B-Instruct model on an NVIDIA H100 GPU.

Current Triton unified attention kernel:

$ python benchmarks/benchmark_serving.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
============ Serving Benchmark Result ============
Successful requests:                     984
Benchmark duration (s):                  21.93
Total input tokens:                      210752
Total generated tokens:                  195369
Request throughput (req/s):              44.87
Output token throughput (tok/s):         8909.60
Total Token throughput (tok/s):          18520.72
---------------Time to First Token----------------
Mean TTFT (ms):                          3822.98
Median TTFT (ms):                        3783.49
P99 TTFT (ms):                           6847.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          86.86
Median TPOT (ms):                        49.16
P99 TPOT (ms):                           232.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           39.06
Median ITL (ms):                         26.22
P99 ITL (ms):                            236.50
==================================================

Updated Triton unified attention kernel (this PR):

$ python benchmarks/benchmark_serving.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
============ Serving Benchmark Result ============
Successful requests:                     984
Benchmark duration (s):                  21.28
Total input tokens:                      211024
Total generated tokens:                  194422
Request throughput (req/s):              46.25
Output token throughput (tok/s):         9137.96
Total Token throughput (tok/s):          19056.22
---------------Time to First Token----------------
Mean TTFT (ms):                          3595.88
Median TTFT (ms):                        3526.50
P99 TTFT (ms):                           6669.08
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          85.63
Median TPOT (ms):                        48.81
P99 TPOT (ms):                           227.73
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.40
Median ITL (ms):                         25.91
P99 ITL (ms):                            233.44
==================================================

The above results confirm that performance remains stable, with a slight improvement observed.

Correctness

To verify correctness for conventional models, we compare lm_eval results between this PR and FlashAttention using the meta-llama/Llama-3.1-8B-Instruct model.

FlashAttention:

VLLM_USE_V1=1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.792|±  |0.0182|
|     |       |strict-match    |     5|exact_match|↑  |0.770|±  |0.0188|

Updated Triton unified attention kernel (this PR):

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.798|±  |0.0180|
|     |       |strict-match    |     5|exact_match|↑  |0.782|±  |0.0185|

The correctness for hybrid models is validated by comparing this PR with FlashInfer using the ibm-granite/granite-4.0-tiny-base-preview model.

FlashInfer:

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=ibm-granite/granite-4.0-tiny-base-preview,enable_prefix_caching=False,enforce_eager=True,tensor_parallel_size=2 --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.604|±  |0.0219|
|     |       |strict-match    |     5|exact_match|↑  |0.576|±  |0.0221|

Updated Triton unified attention kernel (this PR):

VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 lm_eval --model vllm --model_args pretrained=ibm-granite/granite-4.0-tiny-base-preview,enable_prefix_caching=False,enforce_eager=True,tensor_parallel_size=2 --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 500

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.610|±  |0.0218|
|     |       |strict-match    |     5|exact_match|↑  |0.592|±  |0.0220|

How were the changes realized?

Most of the changes targeted by this PR were implemented in a relatively straightforward manner. Support for larger, non-power-of-two block sizes was realized using a tiling strategy.

The current Triton unified kernel computes partial attention scores at the block level and merges them using the online-softmax approach. In this PR, the partial attention score calculation is instead performed at the tile level. These tiles have a power-of-two size that is independent of the block size and can be chosen to optimize GPU resource utilization.

For conventional models, the tile size is typically chosen to be equal to or larger than the block size. In contrast, for hybrid models with very large block sizes, a significantly smaller tile size is selected.
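As a rough illustration of the tile-level approach (not the actual Triton kernel), the following NumPy sketch processes the key/value sequence for a single query in power-of-two tiles and merges the partial results with the online-softmax recurrence; all names, shapes, and the tile size here are illustrative assumptions.

import numpy as np

def tiled_attention(q, k, v, tile_size=32):
    # Single-query attention computed tile by tile with online softmax.
    # q: (head_size,), k and v: (seq_len, head_size).
    seq_len, head_size = k.shape
    scale = 1.0 / np.sqrt(head_size)
    m = -np.inf                   # running maximum of the attention scores
    l = 0.0                       # running sum of exponentials
    acc = np.zeros(head_size)     # running weighted sum of values
    for start in range(0, seq_len, tile_size):
        end = min(start + tile_size, seq_len)
        s = (k[start:end] @ q) * scale     # partial scores for this tile
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)          # rescale previously accumulated results
        p = np.exp(s - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v[start:end]
        m = m_new
    return acc / l

# Sanity check against a direct computation; the tile size need not divide the
# sequence length, mirroring how the tile size is independent of the block size.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((100, 64))
v = rng.standard_normal((100, 64))
scores = (k @ q) / np.sqrt(64.0)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
assert np.allclose(tiled_attention(q, k, v, tile_size=32), weights @ v)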

@tdoublep @bringlein

Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>

@github-actions (bot) commented

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to be added to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Jul 18, 2025

@gemini-code-assist (bot) left a comment

Code Review

This PR introduces modifications to the Triton unified attention kernel to enable support for hybrid models. A critical issue was identified in vllm/v1/attention/backends/triton_attn.py regarding the KV cache layout that needs to be addressed before merging.

    key_cache, value_cache = PagedAttention.split_kv_cache(
        kv_cache, self.num_kv_heads, self.head_size)
else:
    key_cache, value_cache = kv_cache.unbind(0)

@gemini-code-assist (bot) commented (critical)

The line key_cache, value_cache = kv_cache.unbind(0) is incorrect given the new KV cache layout (num_blocks, 2, block_size, num_kv_heads, head_size). This will lead to runtime errors. It should be key_cache, value_cache = kv_cache.unbind(1) to correctly unbind the key and value caches.

key_cache, value_cache = kv_cache.unbind(1)

@jvlunteren (Contributor, Author) commented

key_cache, value_cache = kv_cache.unbind(1) is part of this PR; gemini-code-assist seems to have missed that.

@gemini-code-assist (bot) commented

You are absolutely correct, jvlunteren! My apologies for the oversight. The change to key_cache, value_cache = kv_cache.unbind(1) is indeed present in this pull request (as shown in DIFF 4 of vllm/v1/attention/backends/triton_attn.py).

This change is crucial for correctly handling the new KV cache layout (num_blocks, 2, block_size, num_kv_heads, head_size), and I'm glad to see it's already implemented. Thank you for pointing out my mistake!
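For reference, a tiny standalone PyTorch example of what unbind along dim 1 does for a cache with this layout; the shapes below are arbitrary and purely illustrative.

import torch

# KV cache laid out as (num_blocks, 2, block_size, num_kv_heads, head_size);
# unbinding along dim 1 yields separate key and value caches.
kv_cache = torch.zeros(4, 2, 1024, 8, 128)
key_cache, value_cache = kv_cache.unbind(1)
assert key_cache.shape == (4, 1024, 8, 128)
assert value_cache.shape == (4, 1024, 8, 128)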

@mergify bot commented Jul 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jvlunteren.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 21, 2025
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
@mergify mergify bot removed the needs-rebase label Jul 21, 2025
jvlunteren and others added 2 commits July 21, 2025 15:40
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>

@jvlunteren (Contributor, Author) commented

The removal of prefill support from the included split-KV attention kernel has been reverted to simplify the review process.

@mergify bot commented Aug 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @jvlunteren.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 2, 2025
Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>
@mergify mergify bot removed the needs-rebase label Aug 2, 2025

@jvlunteren (Contributor, Author) commented

The FlashInfer-style KV cache layout and the reorder_batch() function have been removed, as they are no longer needed.

# calculate the number of tiles that need to be processed to
# cover the longest sequence prefix (due to causal masking, tiles beyond
# this prefix can be skipped)
num_tiles = cdiv_fn(max_seq_prefix_len, TILE_SIZE)

Member commented

How come the 3D kernel didn't need to compute max_seq_prefix_len before this change? It looks like the 2D kernel did need to.

@jvlunteren (Contributor, Author) commented Sep 17, 2025

The value max_seq_prefix_len is the number of tokens preceding the last query token in a Q block (for prefill, a Q block can contain multiple query tokens), determined while also accounting for potential padding. Because the 3D kernel is only used for decodes, each Q block contains a single query token, so the number of tokens preceding that query token can be determined directly from the sequence length and max_seq_prefix_len does not need to be calculated.

When I added prefill support back to the 3D (split-KV) kernel to simplify the review process (c4f7d67), I also reintroduced the use of max_seq_prefix_len in the 3D kernel for the same purpose, making the 2D and 3D kernels very similar.

To synchronize the fork branch for this PR with the vLLM main branch, it was easier to temporarily remove my updates, synchronize with main, and then reapply them. Because the 3D kernel in vLLM main does not use max_seq_prefix_len, it looks as if this update only happened with the last change, but as noted above it was already included in commit c4f7d67. In a follow-up PR, I intend to simplify the 3D kernel by removing all functionality needed to support prefills, including the use of max_seq_prefix_len.
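As a small standalone illustration of the quoted snippet, assuming cdiv_fn is the usual ceiling-division helper (the sequence length below is made up):

def cdiv(a: int, b: int) -> int:
    # Ceiling division, assumed to match the kernel's cdiv_fn helper.
    return (a + b - 1) // b

TILE_SIZE = 32
max_seq_prefix_len = 1000          # illustrative: longest causal prefix any query attends to
num_tiles = cdiv(max_seq_prefix_len, TILE_SIZE)
print(num_tiles)                   # 32 tiles cover the prefix; later tiles are skipped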

Comment on lines 725 to 726
TILE_SIZE_PREFILL = 32
TILE_SIZE_DECODE = 32

Member commented

Why do we set these to 32 by default? If I understand correctly, if we want to keep the default behaviour the same as on main we should set them to 16? I'm not saying we need to do that, just would be good to justify the change.

Member commented

I'm especially surprised that the decode would benefit from using a value here bigger than the block size

@jvlunteren (Contributor, Author) commented

The tile sizes for prefill and decode were set to 32 to avoid issues with tl.dot, which imposes certain restrictions on the shapes of the tensors being multiplied; these restrictions also depend on the data type (e.g., fp8).

I ran several experiments with a few models, and it appeared that slightly larger tile sizes worked better for prefill, while a tile size of 16 or 32 worked fine for decode. However, this may be different for other models.

Ideally, these tile sizes should be tuned based on GPU and model characteristics. I will update how the default values are assigned.

Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>

@jvlunteren (Contributor, Author) commented Sep 17, 2025

A check that was previously based on the minimum block size required by the tl.dot operation for tensor multiplication in the attention kernel (also taking the data type into account) has been replaced with a check on the tile size, since the tile size now determines the shapes of the tensors involved in the computation. The corresponding test, test_triton_unified_attention.py, has also been updated.

Signed-off-by: Jan van Lunteren <jvl@zurich.ibm.com>

@jvlunteren (Contributor, Author) commented

The default tile sizes for prefill and decode are now assigned to always satisfy the shape constraints imposed by tl.dot. The check has been removed.
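As a rough sketch of the kind of assignment described above; the helper name and the concrete 16/32 thresholds are illustrative assumptions, not the exact values or logic used in this PR.

def default_tile_sizes(kv_cache_dtype: str) -> tuple[int, int]:
    # tl.dot imposes minimum tensor dimensions; the minimum is assumed here to be
    # larger for fp8 KV caches than for 16-bit types (illustrative values only).
    min_dot_size = 32 if "fp8" in kv_cache_dtype else 16
    tile_size_prefill = max(32, min_dot_size)   # slightly larger tiles tended to help prefill
    tile_size_decode = max(16, min_dot_size)    # 16 or 32 worked fine for decode
    return tile_size_prefill, tile_size_decode

print(default_tile_sizes("auto"))      # (32, 16)
print(default_tile_sizes("fp8_e4m3"))  # (32, 32)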

@tdoublep (Member) left a comment

LGTM

This can enable large block size support for hybrid models, but also makes it significantly easier to tune the tile size in the future.

@tdoublep tdoublep enabled auto-merge (squash) September 18, 2025 07:48
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 18, 2025
@tdoublep tdoublep merged commit 01a583f into vllm-project:main Sep 18, 2025
42 checks passed
