[perf] Use CPU tensor to reduce GPU->CPU sync #25884

lhtin · 2025-09-29T12:37:59Z

Use seq_lens_cpu instead of seq_lens to reduce GPU->CPU sync.

Purpose

Reduce unnecessary GPU->CPU synchronization, which hurts the performance of Async Scheduling + MTP.

Test Plan

Test Result


Signed-off-by: Lehua Ding <lehuading@tencent.com>

gemini-code-assist

Code Review

This pull request introduces a performance optimization to reduce GPU-to-CPU synchronization during speculative decoding. The change replaces a call to .max() on a GPU tensor (seq_lens) with its CPU counterpart (seq_lens_cpu) within a conditional check. This avoids a blocking operation, which is particularly beneficial for asynchronous scheduling. The change is correct and aligns with the stated goal of improving performance. I have no further comments.
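The blocking behavior described in the review can be illustrated with a toy model. This is only a sketch: no real GPU or PyTorch is involved, and `FakeDeviceTensor`, `FakeDeviceScalar`, `SyncCounter`, and both `needs_long_path_*` helpers are hypothetical names invented for illustration. The actual condition in `gpu_model_runner.py` is not shown in this thread. The assumption being modeled is that branching on a device-side reduction forces the host to wait for the result, while branching on a host-side copy does not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SyncCounter:
    """Counts simulated device-to-host synchronizations."""
    count: int = 0


class FakeDeviceScalar:
    """Hypothetical stand-in for a 0-dim GPU tensor."""

    def __init__(self, value: int, counter: SyncCounter) -> None:
        self._value = value
        self._counter = counter

    def __gt__(self, other: int) -> bool:
        # An `if` needs a concrete Python bool, so the host must wait
        # for the device result. This is the simulated blocking step.
        self._counter.count += 1
        return self._value > other


class FakeDeviceTensor:
    """Hypothetical stand-in for a GPU tensor of sequence lengths."""

    def __init__(self, values: List[int], counter: SyncCounter) -> None:
        self._values = list(values)
        self._counter = counter

    def max(self) -> FakeDeviceScalar:
        # The reduction is only queued here; nothing blocks yet.
        return FakeDeviceScalar(max(self._values), self._counter)


def needs_long_path_device(seq_lens: FakeDeviceTensor, limit: int) -> bool:
    # Before: branch on a device-side reduction (one sync per call).
    return seq_lens.max() > limit


def needs_long_path_cpu(seq_lens_cpu: List[int], limit: int) -> bool:
    # After: branch on the host-side copy (no sync).
    return max(seq_lens_cpu) > limit


counter = SyncCounter()
lens = [17, 256, 42]

device_result = needs_long_path_device(FakeDeviceTensor(lens, counter), 128)
syncs_after_device = counter.count

cpu_result = needs_long_path_cpu(lens, 128)
syncs_after_cpu = counter.count

print(device_result, cpu_result, syncs_after_device, syncs_after_cpu)
```

Both paths reach the same decision, but only the device path increments the sync counter. With asynchronous scheduling, each such wait stalls the host thread that would otherwise be preparing the next step.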

lhtin · 2025-09-29T12:41:38Z

This sync was introduced by PR #24662. @AlonKejzman, can you help review this too? Thanks.

benchislett

LGTM, thanks!

vllm/v1/worker/gpu_model_runner.py


[perf] Use CPU tensor to reduce GPU->CPU sync

9a6d1fc

Signed-off-by: Lehua Ding <lehuading@tencent.com>

lhtin requested review from WoosukKwon, alexm-redhat, comaniac, njhill, robertgshaw2-redhat and ywang96 as code owners September 29, 2025 12:38

mergify bot added the v1 label Sep 29, 2025

gemini-code-assist bot reviewed Sep 29, 2025


benchislett approved these changes Sep 29, 2025


benchislett reviewed Sep 29, 2025


vllm/v1/worker/gpu_model_runner.py (outdated, resolved)

benchislett added the ready label Sep 29, 2025

[perf] Use .max_seq_len replace .seq_lens_cpu.max()

0617ab6

Signed-off-by: Lehua Ding <lehuading@tencent.com>
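The follow-up commit above swaps a per-check reduction (`.seq_lens_cpu.max()`) for an already available scalar (`.max_seq_len`). A minimal sketch of that idea follows, assuming the maximum is computed once when the batch metadata is built. `AttnMetadataSketch`, `build_metadata`, and `exceeds_limit` are hypothetical names, not vLLM's actual classes or functions, and plain lists stand in for CPU tensors.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AttnMetadataSketch:
    """Hypothetical stand-in for per-batch attention metadata."""
    seq_lens_cpu: List[int]
    max_seq_len: int  # plain Python int, computed once at build time


def build_metadata(seq_lens_cpu: List[int]) -> AttnMetadataSketch:
    # Pay for the reduction once, on the host, while building the batch.
    lens = list(seq_lens_cpu)
    return AttnMetadataSketch(seq_lens_cpu=lens, max_seq_len=max(lens))


def exceeds_limit(meta: AttnMetadataSketch, limit: int) -> bool:
    # Later checks read the cached scalar instead of reducing again.
    return meta.max_seq_len > limit


meta = build_metadata([17, 256, 42])
over_128 = exceeds_limit(meta, 128)
over_512 = exceeds_limit(meta, 512)
print(meta.max_seq_len, over_128, over_512)
```

Reading a cached integer keeps the check free of any tensor operation, so it cannot trigger a device wait no matter where the tensor lives.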

DarkLight1337 merged commit e184c9c into vllm-project:main Sep 30, 2025
42 checks passed

pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync (vllm-project#25884)

1116b82

Signed-off-by: Lehua Ding <lehuading@tencent.com>

yewentao256 pushed a commit that referenced this pull request Oct 3, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync (#25884)

eea2536

Signed-off-by: Lehua Ding <lehuading@tencent.com> Signed-off-by: yewentao256 <zhyanwentao@126.com>

tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync (vllm-project#25884)

fd3f60f

Signed-off-by: Lehua Ding <lehuading@tencent.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync (vllm-project#25884)

b4546cc

Signed-off-by: Lehua Ding <lehuading@tencent.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>

lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync (vllm-project#25884)

bf48744

Signed-off-by: Lehua Ding <lehuading@tencent.com>

alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync (vllm-project#25884)

60dc418

Signed-off-by: Lehua Ding <lehuading@tencent.com>

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025

[perf] Use CPU tensor to reduce GPU->CPU sync (vllm-project#25884)

6694e88

Signed-off-by: Lehua Ding <lehuading@tencent.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
