
Conversation

@LucasWilkinson (Collaborator) commented Jul 28, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Currently, using the hybrid KV-cache manager with Llama 4's chunked local attention adds roughly 2 ms of latency per output token: when the hybrid KV-cache manager is used, the chunked-local-attention layers are split into 3 ChunkedLocalAttention KV-cache spec groups. We end up with the following groups:

(FullAttention x 12) (ChunkedLocalAttention x 12) (ChunkedLocalAttention x 12) (ChunkedLocalAttention x 12)

This results in the attention metadata and local virtual batches for the local layers being constructed 3 times per step instead of once, adding latency (illustrated by the sketch and benchmark comparison below):
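To make the cost concrete, here is a toy sketch (the class and function names are invented for illustration and are not vLLM's real APIs): attention metadata is built once per KV-cache spec group, so the three ChunkedLocalAttention groups triple the local-attention metadata work on every step.

# Illustrative sketch only; invented names, not vLLM's actual classes.
from dataclasses import dataclass

@dataclass
class KVCacheGroup:
    attn_type: str
    num_layers: int

def build_attn_metadata(group: KVCacheGroup) -> dict:
    # Stand-in for the (comparatively expensive) attention-metadata and
    # local-virtual-batch construction done for chunked local attention.
    return {"attn_type": group.attn_type, "num_layers": group.num_layers}

hybrid_groups = [
    KVCacheGroup("FullAttention", 12),
    KVCacheGroup("ChunkedLocalAttention", 12),
    KVCacheGroup("ChunkedLocalAttention", 12),
    KVCacheGroup("ChunkedLocalAttention", 12),
]

# With the hybrid KV-cache manager: 4 builds per step (1 full + 3 local).
# With the hybrid manager disabled, the local layers are not split into
# multiple groups, so the chunked-local construction is not repeated.
print(len([build_attn_metadata(g) for g in hybrid_groups]))  # -> 4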

Hybrid KV-cache manager enabled:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 4 --trust-remote-code --max-model-len 16384 --port 8081  --disable-log-requests
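The benchmark client command is not stated in the PR; output in this format typically comes from vLLM's serving benchmark script, so the invocation was presumably something along these lines (the dataset and length flags here are guesses, not the actual settings; the same client command applies to both runs):

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
    --port 8081 \
    --dataset-name random \
    --random-input-len 64 \
    --random-output-len 128 \
    --num-prompts 100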

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  9.11      
Total input tokens:                      6299      
Total generated tokens:                  12509     
Request throughput (req/s):              10.97     
Output token throughput (tok/s):         1372.85   
Total Token throughput (tok/s):          2064.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          61.84     
Median TTFT (ms):                        61.53     
P99 TTFT (ms):                           106.66    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          28.46     
Median TPOT (ms):                        29.17     
P99 TPOT (ms):                           30.99     
---------------Inter-token Latency----------------
Mean ITL (ms):                           28.44     
Median ITL (ms):                         28.65     
P99 ITL (ms):                            38.05     
==================================================


Hybrid KV-cache manager disabled:

vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 4 --trust-remote-code --max-model-len 16384 --port 8081  --disable-log-requests --disable-hybrid-kv-cache-manager

============ Serving Benchmark Result ============
Successful requests:                     100       
Benchmark duration (s):                  8.84      
Total input tokens:                      6299      
Total generated tokens:                  12297     
Request throughput (req/s):              11.32     
Output token throughput (tok/s):         1391.49   
Total Token throughput (tok/s):          2104.26   
---------------Time to First Token----------------
Mean TTFT (ms):                          58.69     
Median TTFT (ms):                        59.23     
P99 TTFT (ms):                           90.65     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          26.48     
Median TPOT (ms):                        27.32     
P99 TPOT (ms):                           28.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           26.55     
Median ITL (ms):                         26.54     
P99 ITL (ms):                            39.40     
==================================================
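With this change, the hybrid KV-cache manager is disabled by default whenever chunked local attention is in use. Based on the environment variable introduced by this PR, the previous behaviour can presumably be re-enabled like so:

VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 4 --trust-remote-code --max-model-len 16384 --port 8081 --disable-log-requests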

Test Plan

see: #21707

Test Result

see: #21707

(Optional) Documentation Update

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify (bot) added the llama (Related to Llama models) label Jul 28, 2025
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request correctly addresses a performance regression by disabling chunked local attention with the hybrid KV cache manager by default, while providing an environment variable to re-enable it. The implementation is sound. My only suggestion is to update a comment to more accurately reflect that the change is a performance optimization, which will improve code clarity and maintainability.

Comment on lines +4780 to +4781
# Hybrid KV cache manager is not yet supported with chunked
# local attention.
Severity: high

This comment is slightly misleading as it suggests the feature is unsupported, whereas the PR description and warning log indicate it's a performance regression. To improve clarity for future maintenance, it would be better to state the performance-related reason for disabling it.

Suggested change
# Hybrid KV cache manager is not yet supported with chunked
# local attention.
# Disable hybrid KV cache manager with chunked local attention
# due to a performance regression.

Member
I kind of agree with Gemini here, although you say this in your log

@mgoin (Member) left a comment

LGTM for the moment


    self.scheduler_config.disable_hybrid_kv_cache_manager = True
elif \
    not envs.VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE:
    logger.warning(
Member
nit: warning_once
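A minimal sketch of the suggested nit, assuming vLLM's init_logger wrapper (which provides a warning_once helper); the message text is illustrative, not the PR's exact string:

from vllm.logger import init_logger

logger = init_logger(__name__)

# Emit the notice once per process instead of on every config check.
logger.warning_once(
    "Disabling the hybrid KV cache manager with chunked local attention "
    "due to a latency regression; set "
    "VLLM_ALLOW_CHUNKED_LOCAL_ATTN_WITH_HYBRID_KV_CACHE=1 to allow it.")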

@mgoin added the performance (Performance-related issues) and ready (ONLY add when PR is ready to merge/full CI is needed) labels Jul 28, 2025
@mgoin (Member) commented Jul 28, 2025

Merging to solve the regression since we have better solutions on the way

@mgoin merged commit 8aa1485 into vllm-project:main Jul 28, 2025
78 checks passed
@luccafong (Collaborator)
@LucasWilkinson will we reduce metadata creation with refactoring?

@LucasWilkinson (Collaborator, Author)
That's the plan; we are working towards https://vllm-dev.slack.com/archives/C07R5Q1Q2BB/p1753727605258469?thread_ts=1753202489.248869&cid=C07R5Q1Q2BB, but that will be a follow-up PR.
