[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL #22742

tjtanaa · 2025-08-12T15:17:32Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

This PR enables DP for ViT (LLM will be in TP)

A load balancing logic has also been implemented.

Test Plan

Run lm_eval on ChartQA dataset

Evaluate the Performance Gain

Add unit test case to evaluate the function get_load_balance_assignment and run_dp_sharded_mrope_vision_model.

Test Result

lm_eval ChartQA Dataset of model Qwen/Qwen2.5VL-72B-Instruct

TP8 Baseline
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.8864,
    "anywhere_in_answer_relaxed_correctness": 0.8908
}
================================================================================

DP8+TP8 This PR
================================================================================
Metrics:
{
    "explicit_prompt_relaxed_correctness": 0.8844,
    "anywhere_in_answer_relaxed_correctness": 0.8884
}
================================================================================

Performance Gain:

Server command:

HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
MIOPEN_USER_DB_PATH=/app/vl/miopen \
MIOPEN_FIND_MODE=FAST \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
SAFETENSORS_FAST_GPU=1 \
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--tensor_parallel_size=8 \
--trust_remote_code \
--port 7899 \
--enable_multimodal_encoder_data_parallel

Client command:

python3 benchmarks/benchmark_serving.py  \
--backend openai-chat   \
--model Qwen/Qwen2.5-VL-72B-Instruct   \
--endpoint /v1/chat/completions   \
--dataset-name hf   \
--dataset-path lmarena-ai/VisionArena-Chat   \
--hf-split train   \
--num-prompts 1000 \
--max-concurrency 64 \
--port 7899

Metric	ViT TP	ViT DP	DP vs TP Improvement
Throughput
Request throughput (req/s)	1.79	2.63	+47%
Output token throughput (tok/s)	206.66	302.89	+47%
Total token throughput (tok/s)	375.53	551.31	+47%
Latency
Benchmark duration (s)	558.57	379.70	-32% (faster)
Mean TTFT (ms)	4,584.50	2,062.35	-55% (faster)
Median TTFT (ms)	2,280.17	1,513.51	-34% (faster)
P99 TTFT (ms)	25,097.62	10,313.05	-59% (faster)
Mean TPOT (ms)	285.83	198.03	-31% (faster)
Median TPOT (ms)	283.86	203.93	-28% (faster)
P99 TPOT (ms)	515.32	274.71	-47% (faster)
Mean ITL (ms)	337.49	249.15	-26% (faster)
Median ITL (ms)	33.11	131.26	+296% (slower)
P99 ITL (ms)	3,699.28	1,291.46	-65% (faster)

Most of the improvement comes from DP-ing Conv3d.

(Optional) Documentation Update

Trace of Qwen2.5VL-72B-Instruct with 16 concurrency prompts

Before enabling DP (ViT in TP mode)

After enabling DP (ViT in DP mode)

Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

github-actions · 2025-08-12T15:17:39Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

mergify · 2025-08-12T15:18:13Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tjtanaa.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist

Code Review

This pull request introduces data parallelism for the Vision Transformer in the Qwen2.5VL model, which results in significant performance improvements as shown by the benchmarks. The implementation includes a load balancing mechanism to distribute images across GPUs. The code is well-structured, but I've identified a potential issue in the load balancing logic that could lead to imbalanced workloads under certain conditions, even though it doesn't affect the current usage in this PR. I've provided a detailed comment with a suggested fix for this.

gemini-code-assist · 2025-08-12T15:21:47Z

vllm/model_executor/models/qwen2_5_vl.py

+    # Assign minimum samples to each GPU
+    # (round-robin with smallest samples first)
+    small_to_large_indices = torch.argsort(sizes, descending=False)
+
+    for gpu_id in range(num_gpus):
+        samples_assigned = 0
+        for idx in small_to_large_indices:
+            if idx.item(
+            ) not in used_indices and samples_assigned < min_samples_per_gpu:
+                gpu_assignments[gpu_id].append(idx.item())
+                gpu_loads[gpu_id] += sizes[idx]
+                used_indices.add(idx.item())
+                samples_assigned += 1
+
+                if samples_assigned >= min_samples_per_gpu:
+                    break
+


The current implementation for Phase 1 of load balancing does not perform a round-robin assignment as the comment suggests. Instead, it assigns blocks of the smallest samples to each GPU sequentially. This can lead to significant load imbalance if min_samples_per_gpu > 1.

For example, with sizes = [1, 2, 100, 101], num_gpus=2, and min_samples_per_gpu=2, GPU 0 would get samples of size 1 and 2 (total load 3), while GPU 1 would get samples of size 100 and 101 (total load 201).

While the current usage in this PR sets min_samples_per_gpu=0 (making this a non-issue for now), the function's default is 1, and it's a latent bug for other potential uses.

Here is a suggested fix that implements a proper round-robin assignment for Phase 1.

# Assign minimum samples to each GPU # (round-robin with smallest samples first) if min_samples_per_gpu > 0: small_to_large_indices = torch.argsort(sizes, descending=False) unassigned_indices_iter = iter(idx.item() for idx in small_to_large_indices) for _ in range(min_samples_per_gpu): for gpu_id in range(num_gpus): try: # Find the next available sample idx = next(unassigned_indices_iter) gpu_assignments[gpu_id].append(idx) gpu_loads[gpu_id] += sizes[idx] used_indices.add(idx) except StopIteration: # Not enough samples to satisfy min_samples_per_gpu for all GPUs break else: continue break

DarkLight1337 · 2025-08-13T00:50:42Z

vllm/model_executor/layers/linear.py

            shard_size = self.output_sizes[loaded_shard_id]

-        param[shard_offset:shard_offset + shard_size] = loaded_weight
+        param.data[shard_offset:shard_offset + shard_size] = loaded_weight


Is this change necessary?

I am getting this error if it is not assigning to the param.data directly

File "/app/tritonmrope/dp-qwen2vl/vllm/model_executor/layers/linear.py", line 446, in weight_loader param[shard_offset:shard_offset + shard_size] = loaded_weight RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.

@youkaichao @mgoin any idea how this can happen?

Looks like some params are created with required_grad=True incorrectly, but since other Linear layer's weights_loader all slice at param.data, I think this change is fine for this PR:

vllm/vllm/model_executor/layers/linear.py

Lines 353 to 356 in 653124b

assert param.size() == loaded_weight.size(), (

f"Tried to load weights of size {loaded_weight.size()}"

f"to a parameter of size {param.size()}")

param.data.copy_(loaded_weight)

vllm/model_executor/models/qwen2_5_vl.py

tjtanaa · 2025-08-13T02:40:46Z

CC. @wuhuikx

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

vllm/model_executor/models/qwen2_5_vl.py

Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

vllm/model_executor/models/qwen2_5_vl.py

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Duncan Moss <djm.moss@gmail.com>

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

Enable DP for ViT in Qwen2.5VL

bf98652

Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

mergify bot added the qwen Related to Qwen models label Aug 12, 2025

mergify bot added the needs-rebase label Aug 12, 2025

gemini-code-assist bot reviewed Aug 12, 2025

View reviewed changes

tjtanaa mentioned this pull request Aug 12, 2025

[Feature]: Generalized the DP feature for ViT and multimodal backbone for the benefit of all models #22743

Closed

1 task

DarkLight1337 reviewed Aug 13, 2025

View reviewed changes