[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL #22742
Conversation
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; it would only run a small and essential subset of CI tests to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add a ready label to the PR or enable auto-merge. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces data parallelism for the Vision Transformer in the Qwen2.5VL model, which results in significant performance improvements as shown by the benchmarks. The implementation includes a load balancing mechanism to distribute images across GPUs. The code is well-structured, but I've identified a potential issue in the load balancing logic that could lead to imbalanced workloads under certain conditions, even though it doesn't affect the current usage in this PR. I've provided a detailed comment with a suggested fix for this.
```python
# Assign minimum samples to each GPU
# (round-robin with smallest samples first)
small_to_large_indices = torch.argsort(sizes, descending=False)

for gpu_id in range(num_gpus):
    samples_assigned = 0
    for idx in small_to_large_indices:
        if idx.item() not in used_indices \
                and samples_assigned < min_samples_per_gpu:
            gpu_assignments[gpu_id].append(idx.item())
            gpu_loads[gpu_id] += sizes[idx]
            used_indices.add(idx.item())
            samples_assigned += 1

        if samples_assigned >= min_samples_per_gpu:
            break
```
The current implementation for Phase 1 of load balancing does not perform a round-robin assignment as the comment suggests. Instead, it assigns blocks of the smallest samples to each GPU sequentially, which can lead to significant load imbalance if `min_samples_per_gpu > 1`.

For example, with `sizes = [1, 2, 100, 101]`, `num_gpus=2`, and `min_samples_per_gpu=2`, GPU 0 would get the samples of size 1 and 2 (total load 3), while GPU 1 would get the samples of size 100 and 101 (total load 201).

While the current usage in this PR sets `min_samples_per_gpu=0` (making this a non-issue for now), the function's default is `1`, so this is a latent bug for other potential uses.
Here is a suggested fix that implements a proper round-robin assignment for Phase 1.
```python
# Assign minimum samples to each GPU
# (round-robin with smallest samples first)
if min_samples_per_gpu > 0:
    small_to_large_indices = torch.argsort(sizes, descending=False)
    unassigned_indices_iter = iter(idx.item() for idx in small_to_large_indices)
    for _ in range(min_samples_per_gpu):
        for gpu_id in range(num_gpus):
            try:
                # Find the next available sample
                idx = next(unassigned_indices_iter)
                gpu_assignments[gpu_id].append(idx)
                gpu_loads[gpu_id] += sizes[idx]
                used_indices.add(idx)
            except StopIteration:
                # Not enough samples to satisfy min_samples_per_gpu for all GPUs
                break
        else:
            continue
        break
```
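For illustration, here is a standalone run of this round-robin phase on the example above; the `gpu_assignments`, `gpu_loads`, and `used_indices` scaffolding is hypothetical setup mimicking what the real function initializes beforehand:

```python
import torch

sizes = torch.tensor([1, 2, 100, 101])
num_gpus, min_samples_per_gpu = 2, 2
gpu_assignments = [[] for _ in range(num_gpus)]
gpu_loads = [0] * num_gpus
used_indices = set()

# Round-robin: one pass per required sample, one sample per GPU per pass
small_to_large_indices = torch.argsort(sizes, descending=False)
unassigned_indices_iter = iter(i.item() for i in small_to_large_indices)
for _ in range(min_samples_per_gpu):
    for gpu_id in range(num_gpus):
        try:
            idx = next(unassigned_indices_iter)
        except StopIteration:
            break  # not enough samples left for every GPU
        gpu_assignments[gpu_id].append(idx)
        gpu_loads[gpu_id] += sizes[idx].item()
        used_indices.add(idx)

print(gpu_assignments)  # [[0, 2], [1, 3]]
print(gpu_loads)        # [101, 103] -- vs. [3, 201] with the block assignment
```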
```diff
 shard_size = self.output_sizes[loaded_shard_id]

-param[shard_offset:shard_offset + shard_size] = loaded_weight
+param.data[shard_offset:shard_offset + shard_size] = loaded_weight
```
Is this change necessary?
I am getting this error if the code does not assign to `param.data` directly:

```
File "/app/tritonmrope/dp-qwen2vl/vllm/model_executor/layers/linear.py", line 446, in weight_loader
    param[shard_offset:shard_offset + shard_size] = loaded_weight
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.
```
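For context, a minimal standalone repro of this autograd behavior (illustrative only, not code from the PR):

```python
import torch

w = torch.nn.Parameter(torch.zeros(4, 4))  # requires_grad=True by default

# Slicing a leaf Parameter yields a view; an in-place write through that
# view is rejected by autograd:
try:
    w[0:2] = torch.ones(2, 4)
except RuntimeError as e:
    print(e)  # a view of a leaf Variable that requires grad ...

# Writing through .data bypasses autograd tracking, so it succeeds:
w.data[0:2] = torch.ones(2, 4)
```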
@youkaichao @mgoin any idea how this can happen?
Looks like some params are created with `requires_grad=True` incorrectly, but since other Linear layers' `weight_loader`s all slice at `param.data`, I think this change is fine for this PR:

vllm/vllm/model_executor/layers/linear.py, lines 353 to 356 in 653124b
```python
assert param.size() == loaded_weight.size(), (
    f"Tried to load weights of size {loaded_weight.size()}"
    f"to a parameter of size {param.size()}")
param.data.copy_(loaded_weight)
```
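As an aside, if the root cause is parameters being created with `requires_grad=True`, a sketch of fixing it at the source instead (an assumption about an alternative, not what this PR does):

```python
import torch

# Inference-only weights need no autograd; with requires_grad=False the
# plain slice assignment is legal and .data is not needed:
w = torch.nn.Parameter(torch.empty(4, 4), requires_grad=False)
w[0:2] = torch.ones(2, 4)  # no "leaf Variable" RuntimeError
```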
CC @wuhuikx
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Head branch was pushed to by a user without write access
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Essential Elements of an Effective PR Description Checklist

Update `supported_models.md` and `examples` for a new model.

Purpose
This PR enables DP for ViT (the LLM will remain in TP).
Load-balancing logic has also been implemented; a simplified sketch follows.
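As a rough illustration of the balancing idea, a simplified greedy sketch (hypothetical: the PR's actual `get_load_balance_assignment` also handles the `min_samples_per_gpu` phase discussed in the review above):

```python
import torch

def greedy_balance(sizes: torch.Tensor, num_gpus: int) -> list[list[int]]:
    """Largest-first greedy: each sample goes to the least-loaded GPU."""
    assignments = [[] for _ in range(num_gpus)]
    loads = [0] * num_gpus
    for idx in torch.argsort(sizes, descending=True).tolist():
        gpu = min(range(num_gpus), key=lambda g: loads[g])
        assignments[gpu].append(idx)
        loads[gpu] += sizes[idx].item()
    return assignments

# Image sizes (e.g. patch counts) -> per-GPU index lists
print(greedy_balance(torch.tensor([1, 2, 100, 101]), num_gpus=2))
# [[3], [2, 1, 0]] -> loads 101 vs 103
```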
Test Plan

- Run lm_eval on the ChartQA dataset.
- Evaluate the performance gain.
- Add unit test cases for the functions `get_load_balance_assignment` and `run_dp_sharded_mrope_vision_model` (see the sketch after this list).
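A sketch of one such unit check, with a hypothetical call shape for `get_load_balance_assignment` (the real signature and return value in the PR may differ):

```python
import torch

def assert_roughly_balanced(assignments, sizes, ratio=1.5):
    """Per-GPU loads should stay within a small factor of each other."""
    loads = [sum(sizes[i].item() for i in gpu) for gpu in assignments]
    assert max(loads) <= ratio * max(min(loads), 1), loads

sizes = torch.tensor([1, 2, 100, 101])
# Hypothetical call; adjust to the function's real signature:
# assignments = get_load_balance_assignment(sizes, num_gpus=2)
assignments = [[0, 2], [1, 3]]  # stand-in result for the sketch
assert_roughly_balanced(assignments, sizes)
```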
Test Result
lm_eval on the ChartQA dataset with model Qwen/Qwen2.5VL-72B-Instruct
Performance Gain:
Server command:
Client command:
Most of the improvement comes from DP-ing Conv3d.
(Optional) Documentation Update
Trace of Qwen2.5VL-72B-Instruct with 16 concurrent prompts
Before enabling DP (ViT in TP mode)
After enabling DP (ViT in DP mode)
