Add Qwen2.5VL Guide #30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Summary of Changes
Hello @SamitHuang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive guide for setting up, running, and benchmarking the Qwen2.5-VL series models using vLLM. The guide focuses on leveraging native BF16 precision on NVIDIA GPUs for optimal inference accuracy and provides detailed instructions for both large (72B) and smaller (7B) models, including parallelization strategies and expected benchmark results. This addition aims to streamline the process for users looking to deploy and evaluate Qwen2.5-VL models.
Highlights
- vLLM Installation Guide: Provides clear steps for installing vLLM using `uv venv` and `uv pip`, ensuring users can quickly set up the necessary environment (see the install sketch after this list).
- Qwen2.5-VL Inference Setup: Details how to launch an online inference server for Qwen2.5-VL models with native BF16 precision, including guidance on tensor-parallel and data-parallel strategies for multi-GPU setups.
- Memory Optimization for Inference: Offers practical tips for preserving GPU memory and maximizing KVCache utilization by using the `--max-model-len` and `--gpu-memory-utilization` flags during server startup.
- Detailed Benchmarking Instructions: Includes comprehensive commands and configurations for benchmarking Qwen2.5-VL models, covering various input/output lengths (prompt-heavy, decode-heavy, balanced) and different batch sizes.
- Expected Performance Metrics: Presents detailed expected benchmark outputs for both Qwen2.5-VL-72B and Qwen2.5-VL-7B models, showcasing key performance metrics like request throughput, token throughput, and time-to-first-token.
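For reference, a minimal sketch of the uv-based setup the guide describes; the Python version pin here is an assumption, not taken from the guide:

```bash
# Create and activate a fresh virtual environment with uv
uv venv --python 3.12 --seed
source .venv/bin/activate

# Install vLLM from PyPI
uv pip install vllm
```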
Signed-off-by: SamitHuang <285365963@qq.com>
Code Review
This pull request adds a usage guide for Qwen2.5-VL. The guide is well-structured and provides useful commands for running and benchmarking the model. I've provided a few suggestions to improve clarity and consistency, particularly in the benchmarking section which could be confusing for users. I've also pointed out some minor inconsistencies in model naming and formatting.
Qwen/Qwen2.5-VL.md
Outdated
# Qwen2.5-VL Usage Guide

This guide describes how to run Qwen2.5-VL series with native BF16 on NVIDIA GPUs.
Since BF16 is the commonly used precision type for Qwen2.5VL training or fine-tuning, using BF16 in inference ensures the best accuracy.
The model name `Qwen2.5VL` is used here, but the title of the document uses `Qwen2.5-VL`. For consistency, please use `Qwen2.5-VL` throughout the document. This also applies to other occurrences on lines 42, 68, and 95.

Suggested change:
Since BF16 is the commonly used precision type for Qwen2.5VL training or fine-tuning, using BF16 in inference ensures the best accuracy.
Since BF16 is the commonly used precision type for Qwen2.5-VL training or fine-tuning, using BF16 in inference ensures the best accuracy.
Qwen/Qwen2.5-VL.md
Outdated
## Running Qwen2.5-VL-72B with BF16 on 4xA100

There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.
For better readability, `heavy-loads` should be written as two words: `heavy loads`.

Suggested change:
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy-loads.
There are two ways to parallelize the model over multiple GPUs: (1) Tensor-parallel or (2) Data-parallel. Each one has its own advantages, where tensor-parallel is usually more beneficial for low-latency / low-load scenarios and data-parallel works better for cases where there is a lot of data with heavy loads.
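To make the trade-off concrete, here is a hedged sketch of the two launch styles; the flag values are illustrative, and the data-parallel example uses the 7B model only because a 72B BF16 checkpoint will not fit on a single A100:

```bash
# Tensor-parallel: one engine sharded across 4 GPUs (better for low-latency, low-load serving)
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --tensor-parallel-size 4 \
  --data-parallel-size 1

# Data-parallel: 4 independent replicas, one per GPU (better aggregate throughput under heavy load)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --tensor-parallel-size 1 \
  --data-parallel-size 4
```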
Qwen/Qwen2.5-VL.md
Outdated
To launch the online inference server using tensor-parallel:
Qwen/Qwen2.5-VL.md
Outdated
For benchmarking, disable prefix caching by adding `--no-enable-prefix-caching` to the server command.

### Qwen2.5VL-72B Benchmark

```bash
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 16 \
  --ignore-eos
```
The current instructions for benchmarking are confusing. It's not clear how to apply `--no-enable-prefix-caching` and whether `vllm bench serve` starts its own server or connects to an existing one. To improve clarity, I suggest restructuring this section to show the server launch command first, followed by the client command.

For benchmarking, you first need to launch the server with prefix caching disabled.

### Qwen2.5-VL-72B Benchmark

**1. Launch the server**

In one terminal, launch the vLLM server with the `--no-enable-prefix-caching` flag:

```bash
# Start server with BF16 model on 4 GPUs for benchmarking
export CUDA_VISIBLE_DEVICES=0,1,2,3
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --no-enable-prefix-caching
```

**2. Run the benchmark client**

Once the server is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 16 \
  --ignore-eos
```
Signed-off-by: SamitHuang <285365963@qq.com>
@simon-mo @WoosukKwon Please help review when available, thanks!
Thanks for your contribution! I left a few suggested changes; please take a look!
```bash
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
```
We should mention the flag `--limit-mm-per-prompt` here, and how to use it.
For example, if the user knows beforehand that the incoming traffic will have at most 2 images per request and no videos, they should do `--limit-mm-per-prompt.image 2 --limit-mm-per-prompt.video 0`.
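A hedged sketch of how that could look when threaded into the serve command from the guide; the dotted flag form is taken from the comment above, and the limits are the reviewer's example rather than a recommendation:

```bash
# Cap multimodal inputs per request: at most 2 images, no videos
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --limit-mm-per-prompt.image 2 \
  --limit-mm-per-prompt.video 0
```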
```bash
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
```
vllm-project/vllm#23190 and vllm-project/vllm#22742 introduced DP ViT for Qwen2.5VL. For the 72B model I think it makes more sense to deploy the ViT in the DP fashion, so please include the `--mm-encoder-tp-mode data` flag too.
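As a hedged illustration only (the flag comes from the linked PRs, so check the vLLM version in use before relying on it):

```bash
# 72B: shard the language model with tensor parallelism, but run the
# vision encoder (ViT) data-parallel across the same GPUs
vllm serve Qwen/Qwen2.5-VL-72B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4 \
  --data-parallel-size 1 \
  --mm-encoder-tp-mode data
```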
Qwen/Qwen2.5-VL.md
Outdated
* vLLM conservatively uses 90% of GPU memory. You can set `--gpu-memory-utilization=0.95` to maximize KVCache.

To run a smaller model, such as Qwen2.5-VL-7B, you can simply replace the model name `Qwen/Qwen2.5-VL-72B-Instruct` with `Qwen/Qwen2.5-VL-7B-Instruct`.
We should mention that for the 7B model it's better to run data parallel.
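A hedged sketch of what that suggestion could look like on the same 4-GPU node; the GPU count and memory setting are assumptions, not part of the review comment:

```bash
# Qwen2.5-VL-7B fits on a single A100, so run one replica per GPU with data parallelism
vllm serve Qwen/Qwen2.5-VL-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --data-parallel-size 4 \
  --gpu-memory-utilization 0.95
```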
Qwen/Qwen2.5-VL.md
Outdated
### Qwen2.5VL-72B Benchmark

Once the server is running, open another terminal and run the benchmark client:

```bash
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name random \
  --random-input-len 8000 \
  --random-output-len 1000 \
  --num-prompts 16 \
  --ignore-eos
```
I think it makes more sense to run an image or video benchmark against a multimodal model.
Thanks for the comments @ywang96!
Maybe you can give vllm-project/vllm#23119 a shot and help us review too :)
Sure, I left some comments there. I will test again on VisionArena-Chat for now, then switch to …
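For readers following along, a purely hypothetical sketch of what a VisionArena-Chat benchmark invocation might look like; the `--dataset-name hf` / `--dataset-path` flags and the dataset identifier are assumptions and should be checked against the vLLM benchmarking docs:

```bash
# Hypothetical multimodal benchmark against the already-running server
vllm bench serve \
  --host 0.0.0.0 \
  --port 8000 \
  --model Qwen/Qwen2.5-VL-72B-Instruct \
  --dataset-name hf \
  --dataset-path lmarena-ai/VisionArena-Chat \
  --num-prompts 100
```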
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Signed-off-by: SamitHuang <285365963@qq.com>
Hi @ywang96, in the latest commit I have 1) adjusted the …
Some final comments
Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: SamitHuang <285365963@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: SamitHuang <285365963@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: SamitHuang <285365963@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: SamitHuang <285365963@qq.com>
Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: SamitHuang <285365963@qq.com>
Resolved. Thanks for the detailed revision!!
LGTM - thanks for the contribution!
Oops, didn't realize we need to run …
That's correct, @SamitHuang! If you'd like me to perform a fresh review of your pull request after new commits have been pushed, please use the `/gemini review` command.
This PR adds the Qwen2.5-VL recipe, including Qwen2.5-VL-72B + TP, as planned in #10