[BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 #27128

Daisy-Ma-coder · 2025-10-17T22:05:36Z

Bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490

I ran into an illegal memory access (IMA) error when testing some prompts with prefix caching enabled on the Flash Attention MLA backend.

The log below was generated with CUDA_LAUNCH_BLOCKING=1 and indicates that the failing kernel is in Flash Attention MLA.

INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:CUDA error (../../.deps/vllm-flash-attn-src/hopper/flash_fwd_combine_launch_template.h:60): an illegal memory access was encountered
INFO:/scripts/vllm_scripts/utils.py:(EngineCore_0 pid=481) ERROR 10-13 10:51:40 [multiproc_executor.py:146] Worker proc VllmWorker-5 died unexpectedly, shutting down executor.
...
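
For reference, a minimal sketch of how the same setting can be applied from Python, assuming it runs before any CUDA work is initialized in the process. `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so an error is reported at the launch that caused it rather than at some later call. The helper name `enable_blocking_launches` is hypothetical and not part of vLLM.

```python
import os


def enable_blocking_launches() -> None:
    """Force synchronous CUDA kernel launches for this process.

    Must run before the CUDA context is created, so in practice before
    the framework that uses CUDA is imported and initialized.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


enable_blocking_launches()
# Import and start the engine only after this point so the setting takes effect.
```

Setting the variable in the shell that launches the server has the same effect; it is a debugging aid only, since synchronous launches slow execution down.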

I then realized it has the same root cause as #25490: get_scheduler_metadata was being called with a different max_num_splits than the one passed to FlashAttnMLAMetadata.
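
The mismatch can be illustrated with a toy sketch. It is not the vLLM code: `SchedulerMetadata`, `DecodeMetadata`, `toy_get_scheduler_metadata`, and `build_decode_metadata` are hypothetical stand-ins, and using 0 when CUDA graphs are off is an assumption of the sketch. The point is only that the value is decided once and the same variable reaches both consumers.

```python
from dataclasses import dataclass


@dataclass
class SchedulerMetadata:
    # Split count the toy scheduler planned its work for.
    planned_splits: int


@dataclass
class DecodeMetadata:
    scheduler_metadata: SchedulerMetadata
    # Split count the toy kernel launch is told to use.
    max_num_splits: int


def toy_get_scheduler_metadata(max_num_splits: int) -> SchedulerMetadata:
    """Stand-in for the real scheduler call; it only records the split count."""
    return SchedulerMetadata(planned_splits=max_num_splits)


def build_decode_metadata(use_cuda_graph: bool, configured_splits: int) -> DecodeMetadata:
    # Decide the value once, before either consumer sees it.
    # Assumption: 0 is used when CUDA graphs are not in play.
    max_num_splits = configured_splits if use_cuda_graph else 0
    scheduler_metadata = toy_get_scheduler_metadata(max_num_splits)
    # Pass the same variable on instead of recomputing it here.
    return DecodeMetadata(scheduler_metadata, max_num_splits)


meta = build_decode_metadata(use_cuda_graph=True, configured_splits=32)
assert meta.scheduler_metadata.planned_splits == meta.max_num_splits
```

If the two values were computed in separate places, the scheduler could plan for one number of splits while the kernel is launched with another, which is the kind of inconsistency described above.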

gemini-code-assist

Code Review

This pull request aims to fix an illegal memory access error in Flash Attention MLA with full CUDA graph support by ensuring get_scheduler_metadata and FlashAttnMLAMetadata receive the same max_num_splits value. The changes correctly refactor the logic to calculate max_num_splits before it's used. However, I've identified a remaining logic issue where a similar discrepancy can occur when vllm_is_batch_invariant() is true, which could lead to the same bug under different conditions. I've provided a suggestion to fully resolve this.
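
The remaining concern can be read as an ordering problem: if a mode-specific override is applied after one consumer has already read the value, the two consumers diverge again. A minimal sketch of resolving every override first follows. `resolve_max_num_splits` and its parameters are hypothetical, and pinning the value to 1 in batch-invariant mode is an assumption of the sketch, not something stated in this PR.

```python
def resolve_max_num_splits(
    use_cuda_graph: bool,
    configured_splits: int,
    batch_invariant: bool,
) -> int:
    """Return the single split count that every later consumer should use."""
    # Assumption: 0 is used when CUDA graphs are not in play.
    max_num_splits = configured_splits if use_cuda_graph else 0
    if batch_invariant:
        # Assumed override: batch-invariant mode pins the split count.
        max_num_splits = 1
    return max_num_splits


# All overrides are applied before the value is handed to anything else.
splits = resolve_max_num_splits(
    use_cuda_graph=True, configured_splits=32, batch_invariant=True
)
assert splits == 1
```

Keeping the resolution in one function means a new override only has to be added in one place, and no caller can observe a half-resolved value.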

LucasWilkinson

LGTM; thanks!

Daisy-Ma-coder · 2025-10-18T00:55:44Z

It seems like the failed tests are unrelated. Is it fine to still merge it?

Daisy-Ma-coder requested a review from LucasWilkinson as a code owner October 17, 2025 22:05

mergify bot added the v1 label Oct 17, 2025

gemini-code-assist bot reviewed Oct 17, 2025

bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490

a8cdaba

Signed-off-by: qqma <qqma@amazon.com>

Daisy-Ma-coder force-pushed the flash_attn_mla_ima_fix branch from 257c4e8 to a8cdaba Compare October 17, 2025 22:08

qqma added 2 commits October 17, 2025 15:36

fix linting check

376c203

Signed-off-by: qqma <qqma@amazon.com>

fix linting check

8c25145

Signed-off-by: qqma <qqma@amazon.com>

LucasWilkinson approved these changes Oct 17, 2025

LucasWilkinson enabled auto-merge (squash) October 17, 2025 23:00

github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 17, 2025

Daisy-Ma-coder added 2 commits October 17, 2025 21:42

Merge branch 'main' into flash_attn_mla_ima_fix

f8f0132

Merge branch 'main' into flash_attn_mla_ima_fix

bc7b111

Daisy-Ma-coder requested a review from pavanimajety as a code owner October 22, 2025 16:34

LucasWilkinson merged commit 5beacce into vllm-project:main Oct 22, 2025
47 checks passed

usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025

[BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (vllm-project#27128)

861d534

Signed-off-by: qqma <qqma@amazon.com> Co-authored-by: qqma <qqma@amazon.com>

kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Oct 25, 2025

[BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (vllm-project#27128)

9976587

Signed-off-by: qqma <qqma@amazon.com> Co-authored-by: qqma <qqma@amazon.com>
