[V1] address post issues related to #20059 (part 1); cascade attention reenable by default #23046
Conversation
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
… comments; default empty splitting_ops when enable_attn_fusion Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces several improvements related to CUDA graphs and attention mechanisms. The changes include renaming a backend file for clarity, enhancing the handling of cascade attention with full CUDA graphs by providing warnings instead of disabling the feature, and updating logic for attention fusion. My review focuses on ensuring code quality and correctness. I've identified a minor but important naming issue that should be addressed.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
A few minor comments, will go look at #20059 to review tests in detail for more improvements to put here
def has_piecewise_cudagraphs(self) -> bool:
    return self.requires_piecewise_compilation()
These two seem semantically different
Yeah, but they are equivalent in practice, since we don't allow piecewise mode with empty splitting ops (it is translated to FULL in that case). So having piecewise cudagraphs means requiring piecewise compilation, and requiring piecewise compilation implies having piecewise cudagraphs.
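A minimal, self-contained sketch (not the real vLLM classes) of why the two predicates coincide under the invariant described above:

```python
# Minimal sketch, not the actual vLLM CUDAGraphMode: illustrates why
# has_piecewise_cudagraphs() and requires_piecewise_compilation() select the
# same modes once empty splitting_ops are translated to FULL before dispatch.
from enum import Enum


class CUDAGraphMode(Enum):
    NONE = 0
    PIECEWISE = 1
    FULL = 2
    FULL_DECODE_ONLY = 3
    FULL_AND_PIECEWISE = 4

    def has_piecewise_cudagraphs(self) -> bool:
        # Modes that capture piecewise cudagraphs at runtime.
        return self in (CUDAGraphMode.PIECEWISE, CUDAGraphMode.FULL_AND_PIECEWISE)

    def requires_piecewise_compilation(self) -> bool:
        # Piecewise graphs can only be captured if the model was compiled
        # piecewise, and piecewise modes with empty splitting_ops are rewritten
        # to FULL, so under that invariant the two predicates are equivalent.
        return self.has_piecewise_cudagraphs()
```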
vllm/config/compilation.py (Outdated)

"PIECEWISE will be treated as FULL cudagraph_mode. "
"Please ensure you are using attention backends that "
"support cudagraph or set cudagraph_mode to NONE "
This is confusing; we should clarify that we disable piecewise, not that piecewise is handled as full. Also, I think we should do full_and_piecewise->full and piecewise->none, not piecewise->full.
We have two places in this function that do the piecewise->full translation. The former is for attention-op fusion (splitting_ops=[]), so it must be FULL mode in that case. The latter is when users explicitly set splitting_ops=[]. I agree that for this case it is more reasonable to do full_and_piecewise->full and piecewise->none.
Yeah for the attn fusion case let's just explicitly update cg mode there.
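For context, a hedged example of how a user might opt into attention fusion through the compilation config; the model path is a placeholder and the exact option names follow vLLM's CompilationConfig but may differ across versions. Under this PR, splitting_ops then defaults to [] and cudagraphs run in FULL mode rather than piecewise.

```python
# Hedged, illustrative example only: enabling the attention fusion pass via
# pass_config. The model path is a placeholder.
from vllm import LLM

llm = LLM(
    model="path/to/your-model",
    compilation_config={
        # Attention fusion pass; with this PR, splitting_ops defaults to []
        # in this case, so attention is not split out of the compiled graph.
        "pass_config": {"enable_attn_fusion": True, "enable_noop": True},
        # With empty splitting_ops there are no piecewise graphs, so the
        # effective cudagraph mode is FULL rather than a piecewise variant.
        "cudagraph_mode": "FULL",
    },
)
```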
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Just a few minor notes, feel free to address in follow up!
compilation_config = vllm_config.compilation_config
- if (compilation_config.cudagraph_mode != CUDAGraphMode.NONE
+ if (compilation_config.cudagraph_mode.has_piecewise_cudagraphs()
This can be a follow-up, but we should refactor this so that we use a new factory method on current_platform:
class PlatformInterface:
    ...

    def create_static_graph_wrapper(
        runnable, vllm_config, runtime_mode: CUDAGraphMode,
        partition_idx: int, num_partitions: int,
    ) -> StaticGraphWrapper:
        ...

    # alternatively:
    def get_static_graph_wrapper_factory() -> StaticGraphWrapper.Factory:
        ...
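A rough sketch of what the call site might look like under that proposal; create_static_graph_wrapper is the hypothetical method from the comment above, not an existing vLLM API.

```python
# Hypothetical call site for the proposed factory method; today the wrapper
# class is selected directly in the backend module based on the platform and
# compilation_config.
from vllm.platforms import current_platform


def wrap_piecewise_backend(runnable, vllm_config, runtime_mode,
                           partition_idx, num_partitions):
    # Delegate wrapper construction to the platform instead of branching on
    # compilation_config / platform details at the call site.
    return current_platform.create_static_graph_wrapper(
        runnable,
        vllm_config,
        runtime_mode=runtime_mode,
        partition_idx=partition_idx,
        num_partitions=num_partitions,
    )
```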
| "when cudagraph_mode piecewise cudagraphs is used, "\ | ||
| f"cudagraph_mode={self.compilation_config.cudagraph_mode}" | ||
|
|
||
| # final migrate the deprecated flags |
Can you create an issue to remove the deprecated flags?
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@fhl2000 Not sure about the timeout, but this seems not great; could you take a look? EDIT: Seems like we're dispatching before init; can you fix that so we don't warn unnecessarily?

It seems to come from the profile run. Previously, we defaulted to … But this is not related to the timeout. From the CI for gpt-oss, the run is very slow; not sure if this is triggered by cascade attention. Will temporarily disable it and re-trigger CI.
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
@fhl2000 can you just do a quick sanity check for E2E performance and accuracy?

I can do it tomorrow.
@ProExpertProg Looks not bad! Here is the script:

import os
import time

from vllm import LLM, SamplingParams
from tests.utils import wait_for_gpu_memory_to_clear

example_system_message = ""
with open("tests/system_messages/sonnet3.5_nov2024.txt") as f:
    example_system_message = f.read()


def test_cascade_attention(example_system_message, use_cascade,
                           attn_backend="FLASH_ATTN"):
    prompt = "\n<User>: Implement fibonacci sequence in Python.\n<Claude>:"
    os.environ.update({"VLLM_USE_V1": "1"})
    os.environ.update({"VLLM_ATTENTION_BACKEND": attn_backend})

    llm = LLM(model="/root/models/Qwen2.5-7B-Instruct-GPTQ-Int4",
              disable_cascade_attn=not use_cascade)
    sampling_params = SamplingParams(temperature=0.0, max_tokens=100, top_p=1.0)

    # No cascade attention.
    single_prompt = [example_system_message + prompt]
    responses = llm.generate(single_prompt, sampling_params)
    ref_output = responses[0].outputs[0].text

    t1 = time.time()
    # (Probably) Use cascade attention.
    prompts = [example_system_message + prompt] * 64
    responses = llm.generate(prompts, sampling_params)
    t2 = time.time()
    print(f"Use cascade: {use_cascade} Time taken: {t2 - t1} seconds")

    with open(f"output_use_cascade_{use_cascade}.txt", "w") as f:
        for response in responses:
            f.write(response.outputs[0].text + "\n")

    del llm
    wait_for_gpu_memory_to_clear(
        devices=[0],
        threshold_ratio=0.1,
    )


if __name__ == "__main__":
    test_cascade_attention(example_system_message, use_cascade=False)
    test_cascade_attention(example_system_message, use_cascade=True)

Output: No regression observed from the generated tokens. Edit: it's on the new default mode.
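As a quick way to compare the two runs (not part of the original script; minor numerical differences are possible even without a real regression), the dumped output files can be diffed:

```python
# Compare the cascade and non-cascade outputs written by the script above.
import difflib

with open("output_use_cascade_False.txt") as f1, \
        open("output_use_cascade_True.txt") as f2:
    diff = list(difflib.unified_diff(f1.readlines(), f2.readlines(), lineterm=""))

print("outputs identical" if not diff else "".join(diff[:20]))
```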
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com>
…-project#23046) Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>
…-project#23046) Signed-off-by: fhl2000 <63384265+fhl2000@users.noreply.github.com> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
This PR addresses several issues found after #20059 landed.

- Re-enable cascade attention by default: the dispatcher now falls back to the NONE runtime mode when cudagraph_mode is FULL or FULL_DECODE_ONLY, and dispatches to piecewise cudagraphs when the mode is FULL_AND_PIECEWISE (see the illustrative sketch below). (Updated: still disable it when DBO is used, since they are not compatible.)
- Default splitting_ops to [] when enable_attn_fusion is true in the pass_config. (Suggested by @elvischenv)

CC list: @ProExpertProg @LucasWilkinson
For more issues affecting spec-decode (part 2), please see #23679.
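A minimal, illustrative sketch of the fallback behavior described above; this is not the actual vLLM dispatcher code, and the mode names simply follow the PR description.

```python
# Illustrative sketch only: how the runtime mode could fall back for a batch
# that uses cascade attention, per the behavior described in this PR.
from enum import Enum


class CUDAGraphMode(Enum):  # simplified stand-in for vLLM's enum
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"


def runtime_mode_for_batch(cudagraph_mode: CUDAGraphMode,
                           use_cascade_attn: bool) -> CUDAGraphMode:
    if not use_cascade_attn:
        return cudagraph_mode
    if cudagraph_mode == CUDAGraphMode.FULL_AND_PIECEWISE:
        # Keep the piecewise cudagraphs; only the full graph is skipped.
        return CUDAGraphMode.PIECEWISE
    if cudagraph_mode in (CUDAGraphMode.FULL, CUDAGraphMode.FULL_DECODE_ONLY):
        # Cascade attention is not captured in a full graph; run eagerly.
        return CUDAGraphMode.NONE
    return cudagraph_mode
```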
Test Plan
Simply test if the dispatcher can dispatch to NONE or PIECEWISE runtime mode for cascade attention.
No benchmark or correctness test is provided.
Test Result
It passed.
(Optional) Documentation Update