
Conversation

@lgeiger (Contributor) commented Oct 9, 2025

Purpose

The current code performs 5 `cudaStreamSynchronize` calls (3 syncs caused by the blocking CPU-GPU weight copy and 2 syncs inside `torch.repeat_interleave`), which isn't ideal.
This PR makes the weight copy non-blocking and moves the `cu_seqlens` computation to the CPU, which is faster and doesn't require any CPU-GPU syncs.
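
A minimal sketch of both changes, with hypothetical helper names and assuming the (t, h, w) `grid_thw` layout used by the Qwen vision models (the actual edits live in the Qwen3-VL model code):

```python
import torch
import torch.nn.functional as F

def copy_weight(param: torch.Tensor, loaded: torch.Tensor) -> None:
    # Before: a plain host-to-device copy forced a cudaStreamSynchronize
    # per weight tensor. With non_blocking=True the host thread keeps going
    # while the copy runs (fully asynchronous only when the source lives in
    # pinned host memory).
    param.data.copy_(loaded, non_blocking=True)

def build_cu_seqlens(grid_thw: torch.Tensor, device: torch.device) -> torch.Tensor:
    # grid_thw: (num_items, 3) CPU tensor of (t, h, w) per image/video.
    # torch.repeat_interleave with a tensor `repeats` argument on a CUDA
    # input must sync to learn the output size; on a CPU tensor it is
    # sync-free.
    seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2],
                                      grid_thw[:, 0])
    cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
    # One small async H2D copy replaces two device-side syncs.
    return cu_seqlens.to(device, non_blocking=True)
```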

Before:

[Screenshot 2025-10-09 at 15:36:12]

After:

[Screenshot 2025-10-09 at 15:35:59]

Test Plan

On a single H100

vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 --limit-mm-per-prompt.video 0 --no-enable-prefix-caching
vllm bench serve --backend openai-chat --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 1000

Test Result

Before:

============ Serving Benchmark Result ============
Successful requests:                     984
Benchmark duration (s):                  31.49
Total input tokens:                      93876
Total generated tokens:                  118504
Request throughput (req/s):              31.25
Output token throughput (tok/s):         3763.75
Peak output token throughput (tok/s):    10155.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          6745.30
---------------Time to First Token----------------
Mean TTFT (ms):                          12682.91
Median TTFT (ms):                        11258.48
P99 TTFT (ms):                           26015.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          111.32
Median TPOT (ms):                        121.03
P99 TPOT (ms):                           257.56
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.87
Median ITL (ms):                         63.28
P99 ITL (ms):                            447.10
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     984
Benchmark duration (s):                  31.47
Total input tokens:                      93720
Total generated tokens:                  118619
Request throughput (req/s):              31.27
Output token throughput (tok/s):         3769.80
Peak output token throughput (tok/s):    10951.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          6748.29
---------------Time to First Token----------------
Mean TTFT (ms):                          12383.31
Median TTFT (ms):                        11419.97
P99 TTFT (ms):                           26081.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          112.25
Median TPOT (ms):                        121.49
P99 TPOT (ms):                           248.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           107.45
Median ITL (ms):                         61.96
P99 ITL (ms):                            486.04
==================================================

This reduces mean TTFT by roughly 2% (12682.91 ms → 12383.31 ms) at essentially unchanged throughput.

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@lgeiger lgeiger requested a review from sighingnow as a code owner October 9, 2025 14:51
@mergify mergify bot added the qwen Related to Qwen models label Oct 9, 2025
@lgeiger lgeiger force-pushed the qwen-non-blocking branch from f0bd894 to 2f5ab2a Compare October 9, 2025 15:24
@gemini-code-assist bot left a comment

Code Review

This pull request aims to improve performance by making tensor copies non-blocking and moving cu_seqlens computation to the CPU. The changes are generally well-implemented and align with the goal of reducing synchronization overhead. However, there is a critical issue where a tensor is moved to the GPU without reassigning the result to the variable, which will lead to either a runtime error or negate the intended performance benefit. I've left a specific comment with a suggested fix.
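
For context on the flagged issue: `torch.Tensor.to` is out-of-place, so dropping its return value silently leaves the tensor on the original device. A hypothetical minimal reproduction:

```python
import torch

x = torch.arange(4)
x.to("cuda")                         # bug: returns a new tensor; x stays on CPU
x = x.to("cuda", non_blocking=True)  # fix: keep the returned tensor
```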

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 9, 2025 16:03
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025
@vllm-bot vllm-bot merged commit b2155ed into vllm-project:main Oct 10, 2025
53 of 55 checks passed
@lgeiger lgeiger deleted the qwen-non-blocking branch October 10, 2025 16:48
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>