[V1] address post issues related to #20059 (part 1); cascade attention reenable by default #23046
Conversation
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
… comments; default empty splitting_ops when enable_attn_fusion Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces several improvements related to CUDA graphs and attention mechanisms. The changes include renaming a backend file for clarity, enhancing the handling of cascade attention with full CUDA graphs by providing warnings instead of disabling the feature, and updating logic for attention fusion. My review focuses on ensuring code quality and correctness. I've identified a minor but important naming issue that should be addressed.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
A few minor comments, will go look at #20059 to review tests in detail for more improvements to put here
def has_piecewise_cudagraphs(self) -> bool:
    return self.requires_piecewise_compilation()
These two seem semantically different
Yeah, but they are equivalent in practice, since we don't allow piecewise mode with empty splitting ops (it is translated to FULL in that case). So having piecewise cudagraphs means requiring piecewise compilation, and requiring piecewise compilation implies having piecewise cudagraphs.
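A minimal, self-contained sketch (not the real vLLM classes) of why the two predicates coincide under the invariant described above:

```python
# Minimal sketch, not the actual vLLM CUDAGraphMode: illustrates why
# has_piecewise_cudagraphs() and requires_piecewise_compilation() select the
# same modes once empty splitting_ops are translated to FULL before dispatch.
from enum import Enum


class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = 3
    FULL_AND_PIECEWISE = 4

    def has_piecewise_cudagraphs(self) -> bool:
        # Modes that capture piecewise cudagraphs at runtime.
        return self in (CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE)

    def requires_piecewise_compilation(self) -> bool:
        # Piecewise graphs can only be captured if the model was compiled
        # piecewise, and piecewise modes with empty splitting_ops are rewritten
        # to FULL, so under that invariant the two predicates are equivalent.
        return self.has_piecewise_cudagraphs()
```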
vllm/config/compilation.py (Outdated)

"PIECEWISE will be treated as FULL cudagraph_mode. "
"Please ensure you are using attention backends that "
"support cudagraph or set cudagraph_mode to NONE "
This is confusing; we should clarify that we disable piecewise, not that piecewise is handled as full. Also, I think we should do full_and_piecewise->full and piecewise->none, not piecewise->full.
We have two places in this function that do the piecewise->full translation. The former is for attention-op fusion (splitting_ops=[]), so it must be FULL mode in that case. The latter is when users explicitly set splitting_ops=[]. I agree that for this case it is more reasonable to do full_and_piecewise->full and piecewise->none.
Yeah for the attn fusion case let's just explicitly update cg mode there.
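For context, a hedged example of how a user might opt into attention fusion through the compilation config; the model path is a placeholder and the exact option names follow vLLM's CompilationConfig but may differ across versions. Under this PR, splitting_ops then defaults to [] and cudagraphs run in FULL mode rather than piecewise.

```python
# Hedged, illustrative example only: enabling the attention fusion pass via
# pass_config. The model path is a placeholder.
from vllm import LLM

llm = LLM(
    model="path/to/your-model",
    compilation_config={
        # Attention fusion pass; with this PR, splitting_ops defaults to []
        # in this case, so attention is not split out of the compiled graph.
        "pass_config": {"enable_attn_fusion": True, "enable_noop": True},
        # With empty splitting_ops there are no piecewise graphs, so the
        # effective cudagraph mode is FULL rather than a piecewise variant.
        "cudagraph_mode": "FULL",
    },
)
```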
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Just a few minor notes, feel free to address in follow up!
compilation_config = vllm_config.compilation_config
- if (compilation_config.cudagraph_mode != CUDAGraphMode.NONE
+ if (compilation_config.cudagraph_mode.has_piecewise_cudagraphs()
This can be a follow-up, but we should refactor this so that we use a new factory method on current_platform:
class PlatformInterface:
    ...

    def create_static_graph_wrapper(
        runnable, vllm_config, runtime_mode: CUDAGraphMode,
        partition_idx: int, num_partitions: int,
    ) -> StaticGraphWrapper:
        ...

    # alternatively:
    def get_static_graph_wrapper_factory() -> StaticGraphWrapper.Factory:
        ...
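A rough sketch of what the call site might look like under that proposal; create_static_graph_wrapper is the hypothetical method from the comment above, not an existing vLLM API.

```python
# Hypothetical call site for the proposed factory method; today the wrapper
# class is selected directly in the backend module based on the platform and
# compilation_config.
from vllm.platforms import current_platform


def wrap_piecewise_backend(runnable, vllm_config, runtime_mode,
                           partition_idx, num_partitions):
    # Delegate wrapper construction to the platform instead of branching on
    # compilation_config / platform details at the call site.
    return current_platform.create_static_graph_wrapper(
        runnable,
        vllm_config,
        runtime_mode=runtime_mode,
        partition_idx=partition_idx,
        num_partitions=num_partitions,
    )
```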
| "when cudagraph_mode piecewise cudagraphs is used, "\ | ||
| f"cudagraph_mode={self.compilation_config.cudagraph_mode}" | ||
|
|
||
| # final migrate the deprecated flags |
Can you create an issue to remove the deprecated flags?
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@fhl2000 Not sure about the timeout, but this seems not great; could you take a look? EDIT: Seems like we're dispatching before init; can you fix that so we don't warn unnecessarily?

It seems to come from the profile run. Previously, we defaulted to … But this is not related to the timeout. From the CI for gpt-oss, the run is very slow; not sure if this is triggered by cascade attention. Will temporarily disable it and re-trigger CI.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@fhl2000 can you just do a quick sanity check for E2E performance and accuracy?

I can do it tomorrow.
@ProExpertProg Looks not bad! Here is the script:

import os
import time

from vllm import LLM, SamplingParams
from tests.utils import wait_for_gpu_memory_to_clear

example_system_message = ""
with open("tests/system_messages/sonnet3.5_nov2024.txt") as f:
    example_system_message = f.read()


def test_cascade_attention(example_system_message, use_cascade,
                           attn_backend="FLASH_ATTN"):
    prompt = "\n<User>: Implement fibonacci sequence in Python.\n<Claude>:"
    os.environ.update({"VLLM_USE_V1": "1"})
    os.environ.update({"VLLM_ATTENTION_BACKEND": attn_backend})

    llm = LLM(model="/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4",
              disable_cascade_attn=not use_cascade)
    sampling_params = SamplingParams(temperature=0.0, max_tokens=100, top_p=1.0)

    # No cascade attention.
    single_prompt = [example_system_message + prompt]
    responses = llm.generate(single_prompt, sampling_params)
    ref_output = responses[0].outputs[0].text

    t1 = time.time()
    # (Probably) Use cascade attention.
    prompts = [example_system_message + prompt] * 64
    responses = llm.generate(prompts, sampling_params)
    t2 = time.time()
    print(f"Use cascade: {use_cascade} Time taken: {t2 - t1} seconds")

    with open(f"output_use_cascade_{use_cascade}.txt", "w") as f:
        for response in responses:
            f.write(response.outputs[0].text + "\n")

    del llm
    wait_for_gpu_memory_to_clear(
        devices=[0],
        threshold_ratio=0.1,
    )


if __name__ == "__main__":
    test_cascade_attention(example_system_message, use_cascade=False)
    test_cascade_attention(example_system_message, use_cascade=True)

Output: No regression observed from the generated tokens. Edit: it's on the new default mode.
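As a quick way to compare the two runs (not part of the original script; minor numerical differences are possible even without a real regression), the dumped output files can be diffed:

```python
# Compare the cascade and non-cascade outputs written by the script above.
import difflib

with open("output_use_cascade_False.txt") as f1, \
        open("output_use_cascade_True.txt") as f2:
    diff = list(difflib.unified_diff(f1.readlines(), f2.readlines(), lineterm=""))

print("outputs identical" if not diff else "".join(diff[:20]))
```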
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
…-project#23046) Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
…-project#23046) Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
This PR addresses several issues found after #20059 landed.

- Re-enable cascade attention by default: the dispatcher now falls back to the NONE runtime mode when cudagraph_mode is FULL or FULL_DECODE_ONLY, and dispatches to piecewise cudagraphs when the mode is FULL_AND_PIECEWISE (see the illustrative sketch below). (Updated: still disable it when DBO is used, since they are not compatible.)
- Default splitting_ops to [] when enable_attn_fusion is true in the pass_config. (Suggested by @elvischenv)

CC list: @ProExpertProg @LucasWilkinson
For more issues affecting spec-decode (part 2), please see #23679.
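A minimal, illustrative sketch of the fallback behavior described above; this is not the actual vLLM dispatcher code, and the mode names simply follow the PR description.

```python
# Illustrative sketch only: how the runtime mode could fall back for a batch
# that uses cascade attention, per the behavior described in this PR.
from enum import Enum


class CUDAGraphMode(Enum):  # simplified stand-in for vLLM's enum
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"


def runtime_mode_for_batch(cudagraph_mode: CUDAGraphMode,
                           use_cascade_attn: bool) -> CUDAGraphMode:
    if not use_cascade_attn:
        return cudagraph_mode
    if cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        # Keep the piecewise cudagraphs; only the full graph is skipped.
        return CUDAGraphMode.PIECEWISE
    if cudagraph_mode in (CUDAGraphMode.FULL, CUDAGraphMode.FULL_DECODE_ONLY):
        # Cascade attention is not captured in a full graph; run eagerly.
        return CUDAGraphMode.NONE
    return cudagraph_mode
```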
Test Plan
Simply test if the dispatcher can dispatch to NONE or PIECEWISE runtime mode for cascade attention.
No benchmark or correctness test is provided.
Test Result
It passed.
(Optional) Documentation Update