
Conversation

@lgeiger (Contributor) commented Oct 9, 2025

Purpose

The current code performs 5 `cudaStreamSynchronize` calls (3 syncs caused by the blocking CPU-GPU weight copy and 2 syncs inside `torch.repeat_interleave`), which isn't ideal.
This PR makes the weight copy non-blocking and moves the `cu_seqlens` computation to the CPU, which is faster and doesn't require any CPU-GPU syncs.
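
A minimal sketch of both changes, with hypothetical helper names and assuming the (t, h, w) `grid_thw` layout used by the Qwen vision models (the actual edits live in the Qwen3-VL model code):

```python
import torch
import torch.nn.functional as F

def copy_weight(param: torch.Tensor, loaded: torch.Tensor) -> None:
    # Before: a plain host-to-device copy forced a cudaStreamSynchronize
    # per weight tensor. With non_blocking=True the host thread keeps going
    # while the copy runs (fully asynchronous only when the source lives in
    # pinned host memory).
    param.data.copy_(loaded, non_blocking=True)

def build_cu_seqlens(grid_thw: torch.Tensor, device: torch.device) -> torch.Tensor:
    # grid_thw: (num_items, 3) CPU tensor of (t, h, w) per image/video.
    # torch.repeat_interleave with a tensor `repeats` argument on a CUDA
    # input must sync to learn the output size; on a CPU tensor it is
    # sync-free.
    seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2],
                                      grid_thw[:, 0])
    cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
    # One small async H2D copy replaces two device-side syncs.
    return cu_seqlens.to(device, non_blocking=True)
```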

Before:

[Screenshot 2025-10-09 at 15:36:12]

After:

[Screenshot 2025-10-09 at 15:35:59]

Test Plan

On a single H100

vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 --limit-mm-per-prompt.video 0 --no-enable-prefix-caching
vllm bench serve --backend openai-chat --model Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 1000

Test Result

Before:

============ Serving Benchmark Result ============
Successful requests:                     984
Benchmark duration (s):                  31.49
Total input tokens:                      93876
Total generated tokens:                  118504
Request throughput (req/s):              31.25
Output token throughput (tok/s):         3763.75
Peak output token throughput (tok/s):    10155.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          6745.30
---------------Time to First Token----------------
Mean TTFT (ms):                          12682.91
Median TTFT (ms):                        11258.48
P99 TTFT (ms):                           26015.58
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          111.32
Median TPOT (ms):                        121.03
P99 TPOT (ms):                           257.56
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.87
Median ITL (ms):                         63.28
P99 ITL (ms):                            447.10
==================================================

After:

============ Serving Benchmark Result ============
Successful requests:                     984
Benchmark duration (s):                  31.47
Total input tokens:                      93720
Total generated tokens:                  118619
Request throughput (req/s):              31.27
Output token throughput (tok/s):         3769.80
Peak output token throughput (tok/s):    10951.00
Peak concurrent requests:                984.00
Total Token throughput (tok/s):          6748.29
---------------Time to First Token----------------
Mean TTFT (ms):                          12383.31
Median TTFT (ms):                        11419.97
P99 TTFT (ms):                           26081.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          112.25
Median TPOT (ms):                        121.49
P99 TPOT (ms):                           248.47
---------------Inter-token Latency----------------
Mean ITL (ms):                           107.45
Median ITL (ms):                         61.96
P99 ITL (ms):                            486.04
==================================================

This reduces mean TTFT by roughly 2% (12682.91 ms → 12383.31 ms) at essentially unchanged throughput.

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@lgeiger lgeiger requested a review from sighingnow as a code owner October 9, 2025 14:51
@mergify mergify bot added the qwen Related to Qwen models label Oct 9, 2025
@lgeiger lgeiger force-pushed the qwen-non-blocking branch from f0bd894 to 2f5ab2a Compare October 9, 2025 15:24
@gemini-code-assist bot left a comment

Code Review

This pull request aims to improve performance by making tensor copies non-blocking and moving cu_seqlens computation to the CPU. The changes are generally well-implemented and align with the goal of reducing synchronization overhead. However, there is a critical issue where a tensor is moved to the GPU without reassigning the result to the variable, which will lead to either a runtime error or negate the intended performance benefit. I've left a specific comment with a suggested fix.
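
For context on the flagged issue: `torch.Tensor.to` is out-of-place, so dropping its return value silently leaves the tensor on the original device. A hypothetical minimal reproduction:

```python
import torch

x = torch.arange(4)
x.to("cuda")                         # bug: returns a new tensor; x stays on CPU
x = x.to("cuda", non_blocking=True)  # fix: keep the returned tensor
```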

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) October 9, 2025 16:03
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 9, 2025
@vllm-bot vllm-bot merged commit b2155ed into vllm-project:main Oct 10, 2025
53 of 55 checks passed
@lgeiger lgeiger deleted the qwen-non-blocking branch October 10, 2025 16:48
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
bbartels pushed a commit to bbartels/vllm that referenced this pull request Oct 16, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: bbartels <benjamin@bartels.dev>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…26496)

Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>