[Multimodal] Improve max video embedding length estimation in V1 #24312
Conversation
Code Review
This pull request correctly improves the estimation of the maximum video embedding length for V1 models. By removing the subtraction of image token counts from the available sequence length, the calculation now accurately reflects the capabilities of the V1 architecture with chunked prefill, where video and image data do not need to fit into the context window simultaneously. The changes in llava_onevision.py and qwen2_vl.py are consistent and simplify the logic as intended. The code is cleaner and more aligned with the V1 processing pipeline. Overall, this is a good improvement.
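The shape of the change described above can be illustrated with a minimal sketch. This is not the actual vLLM code; the function and parameter names (`max_video_frames_v0`, `seq_len`, `max_image_tokens`, `tokens_per_frame`) are hypothetical stand-ins for the logic touched in `llava_onevision.py` and `qwen2_vl.py`.

```python
# Hypothetical sketch of the estimation change; names are illustrative,
# not the actual vLLM functions.

def max_video_frames_v0(seq_len: int, max_image_tokens: int,
                        tokens_per_frame: int) -> int:
    # V0: images and videos were assumed to share the context window,
    # so video frames only got the space left after fitting all images.
    available = seq_len - max_image_tokens
    return available // tokens_per_frame

def max_video_frames_v1(seq_len: int, tokens_per_frame: int) -> int:
    # V1: with chunked prefill, video embeddings need not fit alongside
    # image tokens, so the full sequence length is available.
    return seq_len // tokens_per_frame

if __name__ == "__main__":
    seq_len, max_image_tokens, tokens_per_frame = 32768, 16384, 256
    print(max_video_frames_v0(seq_len, max_image_tokens, tokens_per_frame))  # 64
    print(max_video_frames_v1(seq_len, tokens_per_frame))                    # 128
```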
Purpose
In V0, multimodal profiling assumed that the batch (by default, the model context window) had to fit the maximum number of modality items at their largest sizes, so the maximum number of video frames was computed from whatever space remained after fitting all images.
In V1, chunked prefill removes this assumption, so there is no need to account for image tokens when calculating the maximum number of video frames.
A follow-up is to further simplify dummy data generation, since in V1 we only need to know whether a given modality is allowed at all, not how many modality items must fit.
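A minimal sketch of that follow-up simplification, assuming a hypothetical profiling helper (this is not existing vLLM code):

```python
# Hypothetical sketch: V1 profiling only needs a per-modality allow flag,
# not the count of items that must fit simultaneously.

def allowed_modalities_v1(limits: dict[str, int]) -> set[str]:
    # limits maps modality name -> max item count, e.g. {"image": 5, "video": 1};
    # in V1, a nonzero limit simply means the modality is enabled.
    return {modality for modality, limit in limits.items() if limit > 0}

print(allowed_modalities_v1({"image": 5, "video": 0}))  # {'image'}
```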
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.