Conversation

@SamitHuang (Contributor) commented Aug 19, 2025

This PR adds the Qwen2.5-VL recipe, including Qwen2.5-VL-72B + TP, as planned in #10

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @SamitHuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive guide for setting up, running, and benchmarking the Qwen2.5-VL series models using vLLM. The guide focuses on leveraging native BF16 precision on NVIDIA GPUs for optimal inference accuracy and provides detailed instructions for both large (72B) and smaller (7B) models, including parallelization strategies and expected benchmark results. This addition aims to streamline the process for users looking to deploy and evaluate Qwen2.5-VL models.

Highlights

  • vLLM Installation Guide: Provides clear steps for installing vLLM using uv venv and uv pip, ensuring users can quickly set up the necessary environment (see the setup sketch after this list).
  • Qwen2.5-VL Inference Setup: Details how to launch an online inference server for Qwen2.5-VL models with native BF16 precision, including guidance on tensor-parallel and data-parallel strategies for multi-GPU setups.
  • Memory Optimization for Inference: Offers practical tips for preserving GPU memory and maximizing KVCache utilization by using --max-model-len and --gpu-memory-utilization flags during server startup.
  • Detailed Benchmarking Instructions: Includes comprehensive commands and configurations for benchmarking Qwen2.5-VL models, covering various input/output lengths (prompt-heavy, decode-heavy, balanced) and different batch sizes.
  • Expected Performance Metrics: Presents detailed expected benchmark outputs for both Qwen2.5-VL-72B and Qwen2.5-VL-7B models, showcasing key performance metrics like request throughput, token throughput, and time-to-first-token.
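
The bullets above summarize the guide at a high level; as a quick orientation, here is a minimal sketch of that kind of setup. The environment name, context length, and memory fraction below are illustrative assumptions rather than values taken from the guide itself.

```bash
# Sketch: create an isolated environment with uv and install vLLM
uv venv vllm-env
source vllm-env/bin/activate
uv pip install vllm

# Sketch: launch an online BF16 inference server; --max-model-len and
# --gpu-memory-utilization are the memory-related flags mentioned above,
# shown here with example values
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```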

Signed-off-by: SamitHuang <285365963@qq.com>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds a usage guide for Qwen2.5-VL. The guide is well-structured and provides useful commands for running and benchmarking the model. I've provided a few suggestions to improve clarity and consistency, particularly in the benchmarking section, which could be confusing for users. I've also pointed out some minor inconsistencies in model naming and formatting.

# Qwen2.5-VL Usage Guide

This guide describes how to run Qwen2.5-VL series with native BF16 on NVIDIA GPUs.
Since BF16 is the commonly used precision type for Qwen2.5VL training or fine-tuning, using BF16 in inference ensures the best accuracy.
@gemini-code-assist bot (Contributor), severity: medium

The model name Qwen2.5VL is used here, but the title of the document uses Qwen2.5-VL. For consistency, please use Qwen2.5-VL throughout the document. This also applies to other occurrences on lines 42, 68, and 95.

Suggested change
Since BF16 is the commonly used precision type for Qwen2.5VL training or fine-tuning, using BF16 in inference ensures the best accuracy.
Since BF16 is the commonly used precision type for Qwen2.5-VL training or fine-tuning, using BF16 in inference ensures the best accuracy.


## Running Qwen2.5-VL-72B with BF16 on 4xA100

There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.
@gemini-code-assist bot (Contributor), severity: medium

For better readability, heavy-loads should be written as two words: heavy loads.

Suggested change
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy loads.


To launch the online inference server using tensor-parallel:

```
@gemini-code-assist bot (Contributor), severity: medium

Please add the bash language specifier to the code block for proper syntax highlighting.

Suggested change
```
```bash

Comment on lines 40 to 54
For benchmarking, disable prefix caching by adding `--no-enable-prefix-caching` to the server command.

### Qwen2.5VL-72B Benchmark

```bash
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 16 \
--ignore-eos
```
@gemini-code-assist bot (Contributor), severity: medium

The current instructions for benchmarking are confusing. It's not clear how to apply --no-enable-prefix-caching and whether vllm bench serve starts its own server or connects to an existing one. To improve clarity, I suggest restructuring this section to show the server launch command first, followed by the client command.

For benchmarking, you first need to launch the server with prefix caching disabled.

### Qwen2.5-VL-72B Benchmark

**1. Launch the server**

In one terminal, launch the vLLM server with the `--no-enable-prefix-caching` flag:
```bash
# Start server with BF16 model on 4 GPUs for benchmarking
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct  \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --no-enable-prefix-caching
```

**2. Run the benchmark client**

Once the server is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 16 \
  --ignore-eos
```

Signed-off-by: SamitHuang <285365963@qq.com>
@SamitHuang (Contributor, Author)

@simon-mo @WoosukKwon Please help review when available, thanks!

@ywang96 (Member) left a comment

Thanks for your contribution! I left a few suggested changes; please take a look!

Comment on lines 24 to 29
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--data-parallel-size 1 \
```
@ywang96 (Member):

We should mention the flag --limit-mm-per-prompt here, and how to use it.

For example, if the user knows beforehand that the incoming traffic will have at most 2 images per request and no videos, they should pass `--limit-mm-per-prompt.image 2 --limit-mm-per-prompt.video 0`.
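
A sketch of how this could look on the serve command from the diff above, using the flag form suggested in this comment; the 2-image / 0-video cap is just the example scenario from the comment, not a recommendation.

```bash
# Sketch: cap multimodal inputs per request (at most 2 images, no videos),
# using the flag form suggested in the review comment above
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt.image 2 \
  --limit-mm-per-prompt.video 0
```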

Comment on lines 24 to 29
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--data-parallel-size 1 \
```
@ywang96 (Member):

vllm-project/vllm#23190 and vllm-project/vllm#22742 introduced DP ViT for Qwen2.5-VL. For the 72B model I think it makes more sense to deploy the ViT in the DP fashion, so please include the `--mm-encoder-tp-mode data` flag too.
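
A sketch of the 72B launch with the language model kept tensor-parallel and the vision encoder deployed data-parallel, assuming the `--mm-encoder-tp-mode data` flag named in this comment:

```bash
# Sketch: LLM sharded with TP=4, ViT replicated data-parallel across the same GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data
```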

* vLLM conservatively uses 90% of GPU memory. You can set `--gpu-memory-utilization=0.95` to maximize KVCache.


To run a smaller model, such as Qwen2.5-VL-7B, you can simply replace the model name `Qwen/Qwen2.5-VL-72B-Instruct` with `Qwen/Qwen2.5-VL-7B-Instruct`.
@ywang96 (Member):

We should mention that for the 7B model it's better to run data parallel.
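
A sketch of what a data-parallel 7B deployment might look like; running DP=4 across 4 GPUs is an assumption based on the 4xA100 setup used elsewhere in the guide.

```bash
# Sketch: run 4 replicas of the 7B model, one per GPU, instead of sharding with TP=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 4
```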

Comment on lines 42 to 56
### Qwen2.5VL-72B Benchmark

Once the server is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 16 \
--ignore-eos
```
@ywang96 (Member):

I think it makes more sense to run an image or video benchmark against a multimodal model.
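
A sketch of what an image benchmark could look like with `vllm bench serve` pointed at a HuggingFace multimodal dataset; the dataset path `lmarena-ai/VisionArena-Chat`, the chat endpoint, the split, and the prompt count are assumptions for illustration (the thread below discusses VisionArena-Chat and random-mm as candidates).

```bash
# Sketch: benchmark against an image dataset via the chat endpoint instead of
# random text prompts; dataset path, split, and prompt count are assumptions
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 16
```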

@SamitHuang (Contributor, Author)

Thanks for the comments @ywang96
I will update the PR accordingly soon. I thought about running the benchmark on VisionArena-Chat, but the dataset size is ~80GB, which can be too large for a quick-start guideline. Do you think we should use another dataset?

@ywang96 (Member) commented Aug 21, 2025

> Thanks for the comments @ywang96 I will update the PR accordingly soon. I thought about running the benchmark on VisionArena-Chat, but the dataset size is ~80GB, which can be too large for a quick-start guideline. Do you think we should use another dataset?

Maybe you can give a shot at vllm-project/vllm#23119 and help us review too :)

@SamitHuang (Contributor, Author)

> Thanks for the comments @ywang96 I will update the PR accordingly soon. I thought about running the benchmark on VisionArena-Chat, but the dataset size is ~80GB, which can be too large for a quick-start guideline. Do you think we should use another dataset?
>
> Maybe you can give a shot at vllm-project/vllm#23119 and help us review too :)

Sure, I left some comments there. I will test on VisionArena-Chat for now, then switch to random-mm after vllm-project/vllm#23119 is merged.

Signed-off-by: SamitHuang <285365963@qq.com>
@SamitHuang (Contributor, Author)

Hi @ywang96 In the latest commit, I have 1) adjusted the --limit-mm-per-prompt and --mm-encoder-tp-mode arguments with explanation, 2) applied DP to the vision encoder, 3) set DP=4 for the 7B model rather than TP=4 with explanation, and 4) added the benchmark results on the VisionArena-Chat dataset. Please help review~ thanks

@ywang96 (Member) left a comment

Some final comments

SamitHuang and others added 5 commits August 22, 2025 18:12
Co-authored-by: Roger Wang <hey@rogerw.io>
Signed-off-by: SamitHuang <285365963@qq.com>
@SamitHuang (Contributor, Author) commented Aug 22, 2025

> Some final comments

Resolved. Thanks for the detailed revision!!

@ywang96 (Member) left a comment

LGTM - thanks for the contribution!

@ywang96 merged commit 22a4b41 into vllm-project:main on Aug 22, 2025 (2 checks passed)
@SamitHuang (Contributor, Author)

> LGTM - thanks for the contribution!

oops. Didn't realize we need to run /gemini review every time for new commits. Will do next time~ Thanks

@gemini-code-assist bot (Contributor)

That's correct, @SamitHuang! If you'd like me to perform a fresh review of your pull request after new commits have been pushed, please use the /gemini review command. I'm here to help whenever you need a review.
