[V1] Fix local chunked attention always disabled #21419

sarckk · 2025-07-23T00:00:41Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

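#21188 and #19351 made similar, conflicting changes around self.use_irope in the Attention layer. Together they cause self.use_irope to always be False in V1, because the first statement pops the key and the second then falls back to the default: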

self.use_irope = extra_impl_args.pop("use_irope", False)
...
self.use_irope = extra_impl_args.get("use_irope", False)

We should not pop use_irope in V0, as attention backends still expect use_irope as an argument (example).
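The interaction between the two assignments can be reproduced in isolation. The sketch below is a minimal illustration, not the actual vLLM code: `buggy_init` and `fixed_init` are hypothetical helper names, and the plain dict stands in for `extra_impl_args`. It assumes the fix reads the flag with `get` only, so the key stays available to downstream backends.

```python
def buggy_init(extra_impl_args: dict) -> bool:
    """Hypothetical stand-in for the two conflicting assignments."""
    # First assignment: pop() returns the value AND removes the key.
    use_irope = extra_impl_args.pop("use_irope", False)
    # Second assignment: the key is gone, so get() returns the default.
    use_irope = extra_impl_args.get("use_irope", False)
    return use_irope


def fixed_init(extra_impl_args: dict) -> bool:
    """Hypothetical fix: read the flag once, without removing it."""
    # get() leaves the key in place for any backend that still expects it.
    return extra_impl_args.get("use_irope", False)


buggy_result = buggy_init({"use_irope": True})

fixed_args = {"use_irope": True}
fixed_result = fixed_init(fixed_args)

print(buggy_result, fixed_result, "use_irope" in fixed_args)
```

With `use_irope=True` passed in, the buggy version loses the flag while the fixed version keeps both the value and the key.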

Test Plan

ruler niah_multikey_2

VLLM_USE_V1=1 lm_eval --model vllm --tasks niah_multikey_2 --model_args pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,tensor_parallel_size=4,max_model_len=256000  --metadata='{"max_seq_lengths":[4096,8192,16384,32768]}' --batch_size auto  > /tmp/test_irope_fix.log 2>&1 &

Test Result

baseline:

|     Tasks     |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|---------------|------:|------|-----:|-----:|---|----:|---|------|
|niah_multikey_2|      1|none  |     0| 16384|↑  |0.980|±  |   N/A|
|               |       |none  |     0| 32768|↑  |0.000|±  |   N/A|
|               |       |none  |     0|  4096|↑  |1.000|±  |   N/A|
|               |       |none  |     0|  8192|↑  |0.996|±  |   N/A|

This PR:

|     Tasks     |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|---------------|------:|------|-----:|-----:|---|----:|---|------|
|niah_multikey_2|      1|none  |     0| 16384|↑  |0.980|±  |   N/A|
|               |       |none  |     0| 32768|↑  |0.944|±  |   N/A|
|               |       |none  |     0|  4096|↑  |1.000|±  |   N/A|
|               |       |none  |     0|  8192|↑  |0.996|±  |   N/A|

cc: @luccafong @minosfuture @houseroad @yeqcharlotte

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

github-actions · 2025-07-23T00:00:49Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Code Review

This PR correctly fixes a bug where self.use_irope was always being set to False due to being overwritten. While the fix is logically sound for the stated problem, I've identified a potential critical issue with the underlying logic of using pop vs get for V1 and V0 code paths. This could be a latent bug related to how quantization parameters are handled. My review includes a suggestion to make the behavior consistent and likely more correct.

vllm/attention/layer.py

facebook-github-bot · 2025-07-23T00:35:14Z

@vladmihailescu has imported this pull request. If you are a Meta employee, you can view this in D78782520.

vladmihailescu · 2025-07-23T00:39:25Z

Importing this diff internally for an A/B perf test for Llama4 Maverick on H100

yeqcharlotte · 2025-07-23T02:57:01Z

Good catch @sarckk! Are we missing some unit test coverage for local attention? I would expect some test failure when we disable things.

LucasWilkinson

Good catch!! I appreciate you fixing this! Thanks for the contribution

sarckk · 2025-07-23T16:46:20Z

> Good catch @sarckk! Are we missing some unit test coverage for local attention? I would expect some test failure when we disable things.

Yes, I will add a unit test to catch this in the future (EDIT: added in #21478).

Fix use_irope not set correctly in V1

725429e

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

gemini-code-assist bot reviewed Jul 23, 2025

vllm/attention/layer.py (resolved)

sarckk marked this pull request as ready for review July 23, 2025 00:35

sarckk changed the title ~~[V1] Fix use_irope always set to False~~ [V1] Fix local chunked attention always disabled Jul 23, 2025

LucasWilkinson approved these changes Jul 23, 2025

LucasWilkinson enabled auto-merge (squash) July 23, 2025 15:23

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jul 23, 2025

sarckk mentioned this pull request Jul 23, 2025

Add interleaved RoPE test for Llama4 (Maverick) #21478

Merged

4 tasks

simon-mo disabled auto-merge July 23, 2025 22:59

simon-mo merged commit 78c13e3 into vllm-project:main Jul 23, 2025
81 of 83 checks passed

sarckk deleted the fix-irope branch July 24, 2025 00:08

x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025

[V1] Fix local chunked attention always disabled (vllm-project#21419)

c3deb55

Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: x22x22 <wadeking@qq.com>

Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025

[V1] Fix local chunked attention always disabled (vllm-project#21419)

61e3745

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025

[V1] Fix local chunked attention always disabled (vllm-project#21419)

101d542

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>

jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025

[V1] Fix local chunked attention always disabled (vllm-project#21419)

ea10bc2

Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>

paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025

[V1] Fix local chunked attention always disabled (vllm-project#21419)

a83f748

Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>

diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025

[V1] Fix local chunked attention always disabled (vllm-project#21419)

b86bf64

Signed-off-by: Yong Hoon Shin <yhshin@meta.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025

[V1] Fix local chunked attention always disabled (vllm-project#21419)

6acf47a

Signed-off-by: Yong Hoon Shin <yhshin@meta.com>


6 participants
