[Multimodal] Improve max video embedding length estimation in V1 #24312
Conversation
Code Review
This pull request correctly improves the estimation of the maximum video embedding length for V1 models. By removing the subtraction of image token counts from the available sequence length, the calculation now accurately reflects the capabilities of the V1 architecture with chunked prefill, where video and image data do not need to fit into the context window simultaneously. The changes in llava_onevision.py and qwen2_vl.py are consistent and simplify the logic as intended. The code is cleaner and more aligned with the V1 processing pipeline. Overall, this is a good improvement.
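The shape of the change described above can be illustrated with a minimal sketch. This is not the actual vLLM code; the function and parameter names (`max_video_frames_v0`, `seq_len`, `max_image_tokens`, `tokens_per_frame`) are hypothetical stand-ins for the logic touched in `llava_onevision.py` and `qwen2_vl.py`.

```python
# Hypothetical sketch of the estimation change; names are illustrative,
# not the actual vLLM functions.

def max_video_frames_v0(seq_len: int, max_image_tokens: int,
                        tokens_per_frame: int) -> int:
    # V0: images and videos were assumed to share the context window,
    # so video frames only got the space left after fitting all images.
    available = seq_len - max_image_tokens
    return available // tokens_per_frame

def max_video_frames_v1(seq_len: int, tokens_per_frame: int) -> int:
    # V1: with chunked prefill, video embeddings need not fit alongside
    # image tokens, so the full sequence length is available.
    return seq_len // tokens_per_frame

if __name__ == "__main__":
    seq_len, max_image_tokens, tokens_per_frame = 32768, 16384, 256
    print(max_video_frames_v0(seq_len, max_image_tokens, tokens_per_frame))  # 64
    print(max_video_frames_v1(seq_len, tokens_per_frame))                    # 128
```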
Purpose
In V0, multimodal profiling assumed that the batch (by default, the model context window) had to fit the maximum number of modality items at their largest sizes, so the maximum number of video frames was computed from whatever space remained after fitting all images.
In V1, chunked prefill removes this assumption, so there is no need to account for image tokens when calculating the maximum number of video frames.
A follow-up is to further simplify dummy data generation, since in V1 we only need to know whether a given modality is allowed at all, not how many modality items must fit.
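A minimal sketch of that follow-up simplification, assuming a hypothetical profiling helper (this is not existing vLLM code):

```python
# Hypothetical sketch: V1 profiling only needs a per-modality allow flag,
# not the count of items that must fit simultaneously.

def allowed_modalities_v1(limits: dict[str, int]) -> set[str]:
    # limits maps modality name -> max item count, e.g. {"image": 5, "video": 1};
    # in V1, a nonzero limit simply means the modality is enabled.
    return {modality for modality, limit in limits.items() if limit > 0}

print(allowed_modalities_v1({"image": 5, "video": 0}))  # {'image'}
```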
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.