Conversation

@SamitHuang (Contributor) commented Aug 19, 2025

This PR adds the Qwen2.5-VL recipe, including Qwen2.5-VL-72B + TP, as planned in #10

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @SamitHuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive guide for setting up, running, and benchmarking the Qwen2.5-VL series models using vLLM. The guide focuses on leveraging native BF16 precision on NVIDIA GPUs for optimal inference accuracy and provides detailed instructions for both large (72B) and smaller (7B) models, including parallelization strategies and expected benchmark results. This addition aims to streamline the process for users looking to deploy and evaluate Qwen2.5-VL models.

Highlights

  • vLLM Installation Guide: Provides clear steps for installing vLLM using uv venv and uv pip, ensuring users can quickly set up the necessary environment (see the setup sketch after this list).
  • Qwen2.5-VL Inference Setup: Details how to launch an online inference server for Qwen2.5-VL models with native BF16 precision, including guidance on tensor-parallel and data-parallel strategies for multi-GPU setups.
  • Memory Optimization for Inference: Offers practical tips for preserving GPU memory and maximizing KVCache utilization by using --max-model-len and --gpu-memory-utilization flags during server startup.
  • Detailed Benchmarking Instructions: Includes comprehensive commands and configurations for benchmarking Qwen2.5-VL models, covering various input/output lengths (prompt-heavy, decode-heavy, balanced) and different batch sizes.
  • Expected Performance Metrics: Presents detailed expected benchmark outputs for both Qwen2.5-VL-72B and Qwen2.5-VL-7B models, showcasing key performance metrics like request throughput, token throughput, and time-to-first-token.
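
The bullets above summarize the guide at a high level; as a quick orientation, here is a minimal sketch of that kind of setup. The environment name, context length, and memory fraction below are illustrative assumptions rather than values taken from the guide itself.

```bash
# Sketch: create an isolated environment with uv and install vLLM
uv venv vllm-env
source vllm-env/bin/activate
uv pip install vllm

# Sketch: launch an online BF16 inference server; --max-model-len and
# --gpu-memory-utilization are the memory-related flags mentioned above,
# shown here with example values
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95
```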

Signed-off-by: SamitHuang <285365963@qq.com>
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds a usage guide for Qwen2.5-VL. The guide is well-structured and provides useful commands for running and benchmarking the model. I've provided a few suggestions to improve clarity and consistency, particularly in the benchmarking section, which could be confusing for users. I've also pointed out some minor inconsistencies in model naming and formatting.

# Qwen2.5-VL Usage Guide

This guide describes how to run Qwen2.5-VL series with native BF16 on NVIDIA GPUs.
Since BF16 is the commonly used precision type for Qwen2.5VL training or fine-tuning, using BF16 in inference ensures the best accuracy.
@gemini-code-assist bot (Contributor), severity: medium

The model name Qwen2.5VL is used here, but the title of the document uses Qwen2.5-VL. For consistency, please use Qwen2.5-VL throughout the document. This also applies to other occurrences on lines 42, 68, and 95.

Suggested change
Since BF16 is the commonly used precision type for Qwen2.5VL training or fine-tuning, using BF16 in inference ensures the best accuracy.
Since BF16 is the commonly used precision type for Qwen2.5-VL training or fine-tuning, using BF16 in inference ensures the best accuracy.


## Running Qwen2.5-VL-72B with BF16 on 4xA100

There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.
@gemini-code-assist bot (Contributor), severity: medium

For better readability, heavy-loads should be written as two words: heavy loads.

Suggested change
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy loads.


To launch the online inference server using tensor-parallel:

```
@gemini-code-assist bot (Contributor), severity: medium

Please add the bash language specifier to the code block for proper syntax highlighting.

Suggested change
```
```bash

Comment on lines 40 to 54
For benchmarking, disable prefix caching by adding `--no-enable-prefix-caching` to the server command.

### Qwen2.5VL-72B Benchmark

```bash
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 16 \
--ignore-eos
```
@gemini-code-assist bot (Contributor), severity: medium

The current instructions for benchmarking are confusing. It's not clear how to apply --no-enable-prefix-caching and whether vllm bench serve starts its own server or connects to an existing one. To improve clarity, I suggest restructuring this section to show the server launch command first, followed by the client command.

For benchmarking, you first need to launch the server with prefix caching disabled.

### Qwen2.5-VL-72B Benchmark

**1. Launch the server**

In one terminal, launch the vLLM server with the `--no-enable-prefix-caching` flag:
```bash
# Start server with BF16 model on 4 GPUs for benchmarking
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct  \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --no-enable-prefix-caching
```

**2. Run the benchmark client**

Once the server is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 16 \
  --ignore-eos
```

Signed-off-by: SamitHuang <285365963@qq.com>
@SamitHuang (Contributor, Author)

@simon-mo @WoosukKwon Please help review when available, thanks!

@ywang96 (Member) left a comment

Thanks for your contribution! I left a few suggested changes; please take a look!

Comment on lines 24 to 29
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--data-parallel-size 1 \
```
@ywang96 (Member):

We should mention the flag --limit-mm-per-prompt here, and how to use it.

For example, if the user knows beforehand that the incoming traffic will have at most 2 images per request and no videos, they should pass `--limit-mm-per-prompt.image 2 --limit-mm-per-prompt.video 0`.
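
A sketch of how this could look on the serve command from the diff above, using the flag form suggested in this comment; the 2-image / 0-video cap is just the example scenario from the comment, not a recommendation.

```bash
# Sketch: cap multimodal inputs per request (at most 2 images, no videos),
# using the flag form suggested in the review comment above
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --limit-mm-per-prompt.image 2 \
  --limit-mm-per-prompt.video 0
```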

Comment on lines 24 to 29
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--data-parallel-size 1 \
```
@ywang96 (Member):

vllm-project/vllm#23190 and vllm-project/vllm#22742 introduced DP ViT for Qwen2.5-VL. For the 72B model I think it makes more sense to deploy the ViT in the DP fashion, so please include the `--mm-encoder-tp-mode data` flag too.
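
A sketch of the 72B launch with the language model kept tensor-parallel and the vision encoder deployed data-parallel, assuming the `--mm-encoder-tp-mode data` flag named in this comment:

```bash
# Sketch: LLM sharded with TP=4, ViT replicated data-parallel across the same GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --mm-encoder-tp-mode data
```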

* vLLM conservatively uses 90% of GPU memory. You can set `--gpu-memory-utilization=0.95` to maximize KVCache.


To run a smaller model, such as Qwen2.5-VL-7B, you can simply replace the model name `Qwen/Qwen2.5-VL-72B-Instruct` with `Qwen/Qwen2.5-VL-7B-Instruct`.
@ywang96 (Member):

We should mention that for the 7B model it's better to run data parallel.
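
A sketch of what a data-parallel 7B deployment might look like; running DP=4 across 4 GPUs is an assumption based on the 4xA100 setup used elsewhere in the guide.

```bash
# Sketch: run 4 replicas of the 7B model, one per GPU, instead of sharding with TP=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --data-parallel-size 4
```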

Comment on lines 42 to 56
### Qwen2.5VL-72B Benchmark

Once the server is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
--host 0.0.0.0 \
--port 8000 \
--model Qwen/Qwen2.5-VL-72B-Instruct \
--dataset-name random \
--random-input-len 8000 \
--random-output-len 1000 \
--num-prompts 16 \
--ignore-eos
```
@ywang96 (Member):

I think it makes more sense to run an image or video benchmark against a multimodal model.
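
A sketch of what an image benchmark could look like with `vllm bench serve` pointed at a HuggingFace multimodal dataset; the dataset path `lmarena-ai/VisionArena-Chat`, the chat endpoint, the split, and the prompt count are assumptions for illustration (the thread below discusses VisionArena-Chat and random-mm as candidates).

```bash
# Sketch: benchmark against an image dataset via the chat endpoint instead of
# random text prompts; dataset path, split, and prompt count are assumptions
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --backend openai-chat \
  --endpoint /v1/chat/completions \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --hf-split train \
  --num-prompts 16
```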

@SamitHuang (Contributor, Author)

Thanks for the comments @ywang96
I will update the PR accordingly soon. I thought about running the benchmark on VisionArena-Chat, but the dataset size is ~80GB, which can be too large for a quick-start guideline. Do you think we should use another dataset?

@ywang96 (Member) commented Aug 21, 2025

> Thanks for the comments @ywang96 I will update the PR accordingly soon. I thought about running the benchmark on VisionArena-Chat, but the dataset size is ~80GB, which can be too large for a quick-start guideline. Do you think we should use another dataset?

Maybe you can give a shot at vllm-project/vllm#23119 and help us review too :)

@SamitHuang (Contributor, Author)

> Thanks for the comments @ywang96 I will update the PR accordingly soon. I thought about running the benchmark on VisionArena-Chat, but the dataset size is ~80GB, which can be too large for a quick-start guideline. Do you think we should use another dataset?
>
> Maybe you can give a shot at vllm-project/vllm#23119 and help us review too :)

Sure, I left some comments there. I will test on VisionArena-Chat for now, then switch to random-mm after vllm-project/vllm#23119 is merged.

Signed-off-by: SamitHuang <285365963@qq.com>
@SamitHuang (Contributor, Author)

Hi @ywang96 In the latest commit, I have 1) adjusted the --limit-mm-per-prompt and --mm-encoder-tp-mode arguments with explanation, 2) applied DP to the vision encoder, 3) set DP=4 for the 7B model rather than TP=4 with explanation, and 4) added the benchmark results on the VisionArena-Chat dataset. Please help review~ thanks

@ywang96 (Member) left a comment

Some final comments

SamitHuang and others added 5 commits August 22, 2025 18:12
Co-authored-by: Roger Wang <hey@rogerw.io>
Signed-off-by: SamitHuang <285365963@qq.com>
@SamitHuang (Contributor, Author) commented Aug 22, 2025

> Some final comments

Resolved. Thanks for the detailed revision!!

@ywang96 (Member) left a comment

LGTM - thanks for the contribution!

@ywang96 merged commit 22a4b41 into vllm-project:main on Aug 22, 2025 (2 checks passed)
@SamitHuang (Contributor, Author)

> LGTM - thanks for the contribution!

oops. Didn't realize we need to run /gemini review every time for new commits. Will do next time~ Thanks

@gemini-code-assist bot (Contributor)

That's correct, @SamitHuang! If you'd like me to perform a fresh review of your pull request after new commits have been pushed, please use the /gemini review command. I'm here to help whenever you need a review.
