
Conversation

@sighingnow
Collaborator

No description provided.

… sizes.

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
@mergify bot added the qwen label (Related to Qwen models) Sep 11, 2025
@mergify bot added the v1 label Sep 11, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request attempts to fix a CUDA graph capture condition for large batch sizes in the GDNAttentionBackend. However, the added condition m.num_actual_tokens <= self.decode_cudagraph_max_bs introduces a unit mismatch by comparing a token count with a sequence limit. This makes the check overly restrictive and prevents CUDA graph usage in many valid scenarios. My review provides a detailed explanation of the issue and suggests a more accurate approach to fix the underlying problem.

     if (self.use_full_cuda_graph and num_prefills == 0 and num_decodes == 0
    -        and num_spec_decodes <= self.decode_cudagraph_max_bs):
    +        and num_spec_decodes <= self.decode_cudagraph_max_bs
    +        and m.num_actual_tokens <= self.decode_cudagraph_max_bs):
Contributor

Severity: high

This condition m.num_actual_tokens <= self.decode_cudagraph_max_bs appears to have a unit mismatch. m.num_actual_tokens is the number of tokens, while self.decode_cudagraph_max_bs is used as a limit on the number of sequences for sizing tensors like spec_state_indices_tensor and spec_sequence_masks.

Comparing tokens to sequences is likely incorrect and makes this check overly restrictive. For instance, with num_spec=7 and decode_cudagraph_max_bs=32, this change limits num_spec_decodes to 4 (since 4 * 8 <= 32), whereas the original code allowed up to 32 sequences.
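
For concreteness, the arithmetic behind that example is spelled out below (numbers taken from this comment; the variable names are just for illustration):

    num_spec = 7                      # speculative tokens per sequence
    tokens_per_seq = num_spec + 1     # 8 tokens per spec-decode sequence
    decode_cudagraph_max_bs = 32
    # Largest num_spec_decodes that still satisfies the new token-count check:
    max_spec_decodes = decode_cudagraph_max_bs // tokens_per_seq   # -> 4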

The underlying issue is that batch_size can exceed self.decode_cudagraph_max_bs due to token padding. The batch_size is calculated as self.vllm_config.pad_for_cudagraph(m.num_actual_tokens) // (self.num_spec + 1).

A more accurate check would be to compute this batch_size and compare it against self.decode_cudagraph_max_bs, while also ensuring m.num_actual_tokens does not exceed self.compilation_config.max_capture_size to prevent errors from pad_for_cudagraph.
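
A minimal sketch of that alternative check, written as a standalone function for illustration (the helper name and explicit parameters are hypothetical; in the backend these values come from self, self.vllm_config, and self.compilation_config as quoted above):

    from typing import Callable

    def spec_decode_fits_cudagraph(num_actual_tokens: int,
                                   num_prefills: int,
                                   num_decodes: int,
                                   num_spec: int,
                                   decode_cudagraph_max_bs: int,
                                   max_capture_size: int,
                                   pad_for_cudagraph: Callable[[int], int]) -> bool:
        # Only the pure spec-decode path is considered here.
        if num_prefills != 0 or num_decodes != 0:
            return False
        # pad_for_cudagraph is only meaningful up to the largest captured size.
        if num_actual_tokens > max_capture_size:
            return False
        # Batch size actually used for the captured graph: padded token count
        # divided by tokens per sequence (num_spec + 1).
        batch_size = pad_for_cudagraph(num_actual_tokens) // (num_spec + 1)
        return batch_size <= decode_cudagraph_max_bs

Gating on the derived batch size keeps the comparison in sequence units, which is what spec_state_indices_tensor and spec_sequence_masks are sized by.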

Contributor

I think keeping m.num_actual_tokens <= self.decode_cudagraph_max_bs is good. Then the num_spec_decodes <= self.decode_cudagraph_max_bs clause seems unnecessary.
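
If that simplification were applied, the gate would read roughly as below (a sketch of this suggestion, not the merged code; it assumes the token bound already implies the sequence bound because each spec-decode sequence contributes at least one token):

    if (self.use_full_cuda_graph and num_prefills == 0 and num_decodes == 0
            and m.num_actual_tokens <= self.decode_cudagraph_max_bs):
        ...  # spec-decode full-CUDA-graph path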

@pUnK2008

Around 20 lines below there is one more similar place.

Collaborator

@LucasWilkinson left a comment

This makes sense to me, assuming @vadiklyutiy confirms the fix.

Longer term, #23789 / #24002 will hopefully resolve this more broadly, but this works as a temporary fix 👍

@LucasWilkinson
Collaborator

LucasWilkinson commented Sep 12, 2025

> Around 20 lines below there is one more similar place.

Are you referring to:

        if (self.use_full_cuda_graph and num_prefills == 0
                and num_spec_decodes == 0
                and num_decodes <= self.decode_cudagraph_max_bs):

I think that's safe: since num_spec_decodes == 0, num_decodes == num_tokens in that case.
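
A tiny illustration of that equivalence (hypothetical numbers; with no speculative decodes every request contributes exactly one token, so bounding the request count also bounds the token count):

    decode_cudagraph_max_bs = 32
    num_spec_decodes = 0
    num_decodes = 24               # pure single-token decode requests
    num_tokens = num_decodes * 1   # one token per request
    assert (num_decodes <= decode_cudagraph_max_bs) == \
           (num_tokens <= decode_cudagraph_max_bs)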

@vadiklyutiy
Collaborator

> Around 20 lines below there is one more similar place.
>
> Are you referring to:
>
>     if (self.use_full_cuda_graph and num_prefills == 0
>             and num_spec_decodes == 0
>             and num_decodes <= self.decode_cudagraph_max_bs):
>
> I think that's safe: since num_spec_decodes == 0, num_decodes == num_tokens in that case.

Agree.

@LucasWilkinson enabled auto-merge (squash) September 12, 2025 19:32
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 12, 2025
@LucasWilkinson merged commit 8226dd5 into vllm-project:main Sep 12, 2025
43 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
simon-mo pushed a commit that referenced this pull request Sep 13, 2025
… sizes (#24660) (#24667)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
shyeh25 pushed a commit to shyeh25/vllm that referenced this pull request Sep 15, 2025
@sighingnow deleted the fixes-large-bs-cg branch September 15, 2025 06:08
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
bbartels pushed a commit to bbartels/vllm that referenced this pull request Sep 15, 2025
… sizes (vllm-project#24660) (vllm-project#24667)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Sep 17, 2025
shyeh25 pushed a commit to shyeh25/vllm that referenced this pull request Sep 23, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
… sizes (vllm-project#24660) (vllm-project#24667)

Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
