
Conversation

@fhl2000
Contributor

@fhl2000 fhl2000 commented Aug 17, 2025

Purpose

This PR addresses several issues that surfaced after #20059 landed.

  • rename cuda_piecewise_backend.py to piecewise_backend.py
  • clean up/fix some comments
  • reenable cascade attention when cudagraph_mode involves full cudagraphs: dispatch to an eager run when cudagraph_mode is FULL or FULL_DECODE_ONLY, or to a piecewise cudagraph when the mode is FULL_AND_PIECEWISE; see the sketch after this list. (Updated: cascade attention remains disabled with DBO, since the two are not compatible.)
  • change the default value of splitting_ops to [] when enable_attn_fusion is true in the pass_config. (Suggested by @elvischenv)
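
Below is a minimal, self-contained sketch of that dispatch behavior. The CUDAGraphMode values mirror the names used in this PR, but the function itself is illustrative rather than the actual vLLM implementation (later sketches in this thread reuse this enum):

from enum import Enum

class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = 3
    FULL_AND_PIECEWISE = 4

def runtime_mode_for_cascade_attention(cudagraph_mode: CUDAGraphMode) -> CUDAGraphMode:
    """Cascade attention cannot run inside a full cudagraph, so such batches
    fall back to eager execution, or to piecewise graphs when available."""
    if cudagraph_mode in (CUDAGraphMode.FULL, CUDAGraphMode.FULL_DECODE_ONLY):
        return CUDAGraphMode.NONE           # dispatch to an eager run
    if cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        return CUDAGraphMode.PIECEWISE      # dispatch to piecewise cudagraphs
    return cudagraph_mode                   # NONE / PIECEWISE are unaffected

With DBO enabled, cascade attention stays disabled entirely, so this fallback never applies.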

CC list: @ProExpertProg @LucasWilkinson

For more issues affecting spec-decode (part 2), see #23679.

Test Plan

Simply test whether the dispatcher dispatches cascade-attention batches to the NONE or PIECEWISE runtime mode.
No benchmark or correctness test is provided.

Test Result

It passed.

(Optional) Documentation Update


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
… comments; default empty splitting_ops when enable_attn_fusion

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger the full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Aug 17, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces several improvements related to CUDA graphs and attention mechanisms. The changes include renaming a backend file for clarity, enhancing the handling of cascade attention with full CUDA graphs by providing warnings instead of disabling the feature, and updating logic for attention fusion. My review focuses on ensuring code quality and correctness. I've identified a minor but important naming issue that should be addressed.

@fhl2000 fhl2000 marked this pull request as draft August 17, 2025 07:59
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@fhl2000 fhl2000 marked this pull request as ready for review August 17, 2025 13:11
Collaborator

@ProExpertProg ProExpertProg left a comment

A few minor comments; I will go look at #20059 to review the tests in detail for more improvements to put here.

Comment on lines +65 to +66
def has_piecewise_cudagraphs(self) -> bool:
    return self.requires_piecewise_compilation()
Collaborator

These two seem semantically different

Contributor Author

Yeah, but they are equivalent in practice, since we don't allow PIECEWISE mode with empty splitting ops (it is translated to FULL in that case). So having piecewise cudagraphs implies requiring piecewise compilation, and requiring piecewise compilation implies having piecewise cudagraphs.
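
To make that equivalence concrete, here is a small sketch; it reuses the illustrative CUDAGraphMode enum from the sketch in the PR description above, and the predicates and normalization step are stand-ins for the real config logic rather than vLLM's actual code:

def requires_piecewise_compilation(mode: CUDAGraphMode) -> bool:
    # Compilation must split the graph at attention ops for these modes.
    return mode in (CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE)

def has_piecewise_cudagraphs(mode: CUDAGraphMode, splitting_ops: list) -> bool:
    # Piecewise graphs only actually exist if there is something to split on.
    return requires_piecewise_compilation(mode) and bool(splitting_ops)

def normalize(mode: CUDAGraphMode, splitting_ops: list) -> CUDAGraphMode:
    # The config step disallows piecewise-family modes with empty
    # splitting_ops: they are translated to FULL before dispatch.
    if requires_piecewise_compilation(mode) and not splitting_ops:
        return CUDAGraphMode.FULL
    return mode

# On every reachable (normalized) configuration the two predicates agree.
for m in CUDAGraphMode:
    for ops in ([], ["attention"]):
        n = normalize(m, ops)
        assert has_piecewise_cudagraphs(n, ops) == requires_piecewise_compilation(n)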

Comment on lines 561 to 563
"PIECEWISE will be treated as FULL cudagraph_mode. "
"Please ensure you are using attention backends that "
"support cudagraph or set cudagraph_mode to NONE "
Collaborator

This is confusing; we should clarify that we disable piecewise, not that piecewise is handled as full. Also, I think we should do full_and_piecewise->full and piecewise->none, not piecewise->full.

Contributor Author

There are two places in this function doing the piecewise->full conversion. The former is the attention-op fusion case (splitting_ops=[]), so it must be FULL mode there. The latter is when users explicitly set splitting_ops=[]; I agree that for this case it is more reasonable to do full_and_piecewise->full and piecewise->none.

Collaborator

Yeah, for the attn fusion case let's just explicitly update the cudagraph mode there.
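
A sketch of the resolution this thread converges on, again reusing the illustrative CUDAGraphMode enum from the PR description; the function name, argument names, and exact placement in the config code are assumptions:

def resolve_cudagraph_mode(mode: CUDAGraphMode,
                           splitting_ops: list,
                           enable_attn_fusion: bool) -> CUDAGraphMode:
    if splitting_ops:
        return mode
    if enable_attn_fusion and mode in (CUDAGraphMode.PIECEWISE,
                                       CUDAGraphMode.FULL_AND_PIECEWISE):
        # Attention fusion needs unsplit graphs, so the mode is explicitly
        # updated to FULL right where splitting_ops is defaulted to [].
        return CUDAGraphMode.FULL
    if mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        return CUDAGraphMode.FULL    # user set splitting_ops=[]: keep full graphs
    if mode == CUDAGraphMode.PIECEWISE:
        return CUDAGraphMode.NONE    # nothing to split on: no cudagraphs at all
    return mode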

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Sep 24, 2025
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 24, 2025
Collaborator

@ProExpertProg ProExpertProg left a comment

Just a few minor notes; feel free to address them in a follow-up!


  compilation_config = vllm_config.compilation_config
- if (compilation_config.cudagraph_mode != CUDAGraphMode.NONE
+ if (compilation_config.cudagraph_mode.has_piecewise_cudagraphs()
Collaborator

This can be a follow-up, but we should refactor this so that we use a new factory method on current_platform:

class PlatformInterface:
    ...

    def create_static_graph_wrapper(self, runnable, vllm_config,
                                    runtime_mode: CUDAGraphMode,
                                    partition_idx: int,
                                    num_partitions: int) -> StaticGraphWrapper:
        ...

    # alternatively:
    def get_static_graph_wrapper_factory(self) -> StaticGraphWrapper.Factory:
        ...
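
As a purely hypothetical illustration of how such a factory could be used, here is a runnable toy stand-in. It reuses the illustrative CUDAGraphMode enum from the PR description, and none of these class or method names are guaranteed to exist in vLLM:

from typing import Any, Callable

class StaticGraphWrapper:
    def __init__(self, runnable: Callable[..., Any], runtime_mode: "CUDAGraphMode") -> None:
        self.runnable = runnable
        self.runtime_mode = runtime_mode

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # A real wrapper would capture and replay a static graph;
        # this stub simply calls through.
        return self.runnable(*args, **kwargs)

class ToyPlatform:
    def create_static_graph_wrapper(self, runnable: Callable[..., Any],
                                    vllm_config: Any,
                                    runtime_mode: "CUDAGraphMode",
                                    partition_idx: int,
                                    num_partitions: int) -> StaticGraphWrapper:
        # Each platform (CUDA, ROCm, ...) could return its own wrapper type,
        # keeping platform-specific graph capture out of the compilation backend.
        return StaticGraphWrapper(runnable, runtime_mode)

# Example call site, replacing a hard-coded wrapper construction:
wrapper = ToyPlatform().create_static_graph_wrapper(
    runnable=lambda x: x + 1, vllm_config=None,
    runtime_mode=CUDAGraphMode.PIECEWISE, partition_idx=0, num_partitions=1)
assert wrapper(41) == 42

The point of the suggested factory is that each platform can return its own StaticGraphWrapper implementation, so the compilation backend no longer needs per-platform branches.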

"when cudagraph_mode piecewise cudagraphs is used, "\
f"cudagraph_mode={self.compilation_config.cudagraph_mode}"

# final migrate the deprecated flags
Collaborator

Can you create an issue to remove the deprecated flags?

ProExpertProg and others added 2 commits September 24, 2025 11:22
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@ProExpertProg
Collaborator

ProExpertProg commented Sep 24, 2025

@fhl2000 not sure about the timeout, but this seems not great; could you take a look?

    (Worker_TP1 pid=736) WARNING 09-24 13:25:34 [cudagraph_dispatcher.py:105] cudagraph dispatching keys are not initialized. No cudagraph will be used.
    (Worker_TP0 pid=735) WARNING 09-24 13:25:34 [cudagraph_dispatcher.py:105] cudagraph dispatching keys are not initialized. No cudagraph will be used.

EDIT: seems like we're dispatching before init, can you fix that so we don't warn unnecessarily?

@fhl2000
Contributor Author

fhl2000 commented Sep 25, 2025

    (Worker_TP1 pid=736) WARNING 09-24 13:25:34 [cudagraph_dispatcher.py:105] cudagraph dispatching keys are not initialized. No cudagraph will be used.
    (Worker_TP0 pid=735) WARNING 09-24 13:25:34 [cudagraph_dispatcher.py:105] cudagraph dispatching keys are not initialized. No cudagraph will be used.

EDIT: seems like we're dispatching before init, can you fix that so we don't warn unnecessarily?

It seems to come from the profile run. Previously, we defaulted to CUDAGraphMode.NONE, so no dispatching would happen there. After #25505, it became optional (default None). The simple fix is to keep dispatching disabled when is_profile is True.

But this is not related to the timeout. In the CI for gpt-oss, the run is very slow; I'm not sure whether this is triggered by cascade attention. I will temporarily disable it and re-trigger CI.
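
For illustration, a minimal sketch of the is_profile guard described above, reusing the illustrative CUDAGraphMode enum from the PR description; the dispatcher attributes and return shape here are assumptions, not vLLM's actual API:

from typing import Any, Optional, Tuple

class ToyCudagraphDispatcher:
    def __init__(self) -> None:
        self.keys_initialized = False  # becomes True once dispatch keys are built

    def dispatch(self, batch_descriptor: Any,
                 is_profile: bool = False) -> Tuple["CUDAGraphMode", Optional[Any]]:
        # The profile run happens before any dispatch keys exist, so return
        # eager mode silently instead of logging the "not initialized" warning.
        if is_profile or not self.keys_initialized:
            return CUDAGraphMode.NONE, None
        # ... real key lookup would go here ...
        return CUDAGraphMode.PIECEWISE, batch_descriptor

dispatcher = ToyCudagraphDispatcher()
assert dispatcher.dispatch(batch_descriptor=None, is_profile=True) == (CUDAGraphMode.NONE, None)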

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@ProExpertProg
Collaborator

@fhl2000 can you just do a quick sanity check for E2E performance and accuracy?

@fhl2000
Contributor Author

fhl2000 commented Sep 25, 2025

@fhl2000 can you just do a quick sanity check for E2E performance and accuracy?

I can do it tomorrow.

@fhl2000
Contributor Author

fhl2000 commented Sep 26, 2025

@ProExpertProg Looks not bad! Here is the script:

import os
import time
from vllm import LLM, SamplingParams
from tests.utils import wait_for_gpu_memory_to_clear

example_system_message = ""
with open("tests/system_messages/sonnet3.5_nov2024.txt") as f:
    example_system_message = f.read()

def test_cascade_attention(example_system_message, use_cascade, attn_backend="FLASH_ATTN"):
    prompt = "\n<User>: Implement fibonacci sequence in Python.\n<Claude>:"

    os.environ.update({"VLLM_USE_V1": "1"})
    os.environ.update({"VLLM_ATTENTION_BACKEND": attn_backend})

    llm = LLM(model="/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4", disable_cascade_attn=not use_cascade)
    sampling_params = SamplingParams(temperature=0.0, max_tokens=100, top_p=1.0)

    # No cascade attention.
    single_prompt = [example_system_message + prompt]
    responses = llm.generate(single_prompt, sampling_params)
    ref_output = responses[0].outputs[0].text

    t1 = time.time()
    # (Probably) Use cascade attention.
    prompts = [example_system_message + prompt] * 64
    responses = llm.generate(prompts, sampling_params)
    t2 = time.time()
    print(f"Use cascade: {use_cascade} Time taken: {t2 - t1} seconds")
    with open(f"output_use_cascade_{use_cascade}.txt", "w") as f:
        for response in responses:
            f.write(response.outputs[0].text + "\n")
    del llm
    wait_for_gpu_memory_to_clear(
        devices=[0],
        threshold_ratio=0.1,
    )

if __name__ == "__main__":
    test_cascade_attention(example_system_message, use_cascade=False)
    test_cascade_attention(example_system_message, use_cascade=True)

Output:

Use cascade: False Time taken: 2.001439332962036 seconds
Use cascade: True Time taken: 1.7061328887939453 seconds

No regression observed in the generated tokens.

Edit: this was run with the new default mode, FULL_AND_PIECEWISE.

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@ProExpertProg ProExpertProg merged commit f075693 into vllm-project:main Sep 26, 2025
41 checks passed
@fhl2000 fhl2000 mentioned this pull request Sep 27, 2025
@fhl2000 fhl2000 deleted the post_issues_20059 branch September 30, 2025 15:37
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
…-project#23046)

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
@fhl2000 fhl2000 changed the title from "[V1] address post issues related to #20059 (part 1)" to "[V1] address post issues related to #20059 (part 1); cascade attention reenable by default" on Oct 3, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…-project#23046)

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…-project#23046)

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…-project#23046)

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…-project#23046)

Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1
