[Model] EVS support for nano_nemotron_vl #26269
Conversation
Code Review
This pull request adds support for Efficient Video Sampling (EVS) to the Nano Nemotron VL model. The implementation introduces a clever workaround to handle this model's specific use of text separators between video frames, which poses a challenge for the standard EVS workflow. The core of the solution involves manually merging video and text embeddings within the model's get_multimodal_embeddings method. Additionally, the EVS utility functions in vllm/multimodal/evs.py have been refactored for better generality, which is a positive change. The overall approach is sound, but I've identified a performance issue in the new code that could lead to unnecessary CPU-GPU synchronization.
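For context, and not taken from this PR's diff, the sketch below shows the kind of pattern such a comment typically flags: reading per-frame values back to the host inside a Python loop versus keeping the reduction on the device. The tensor names are hypothetical.

```python
import torch

def retained_counts_with_sync(retained_mask: torch.Tensor) -> list[int]:
    # retained_mask: (num_frames, tokens_per_frame) boolean tensor on the GPU.
    # Each .item() call copies a scalar to the host, forcing a CPU-GPU
    # synchronization on every loop iteration.
    return [retained_mask[i].sum().item() for i in range(retained_mask.shape[0])]

def retained_counts_on_device(retained_mask: torch.Tensor) -> torch.Tensor:
    # A single reduction keeps the per-frame counts on the device; they are
    # transferred to the host at most once, if needed at all.
    return retained_mask.sum(dim=-1)
```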
DarkLight1337 left a comment:
LGTM, thanks
Purpose
Add support for EVS (Efficient Video Sampling, introduced in #22980) to the Nano Nemotron VL model.
Unlike other multimodal models (for example, Qwen2.5-VL), Nano Nemotron VL uses text frame separators between video frames. These separators are not placeholders and should not be replaced by video embeddings. For example, if a video has 2 frames, each with 5 tokens, the video text replacement consists of each frame's 5 placeholder tokens with separator text between the frames, as sketched below.
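A minimal sketch of this layout, assuming hypothetical placeholder and separator strings (the real strings are defined by the Nano Nemotron VL processor):

```python
# Hypothetical strings used only for illustration; the actual placeholder and
# separator text is defined by the model's processor.
VIDEO_PLACEHOLDER = "<video_token>"
FRAME_SEPARATOR = "<frame_break>"

def build_video_replacement(tokens_per_frame: list[int]) -> str:
    # One run of placeholder tokens per frame, joined by separator *text*.
    # The separators are ordinary text and must not be overwritten when the
    # placeholders are merged with video embeddings.
    frames = [VIDEO_PLACEHOLDER * n for n in tokens_per_frame]
    return FRAME_SEPARATOR.join(frames)

# Two frames with 5 tokens each, as in the example above:
print(build_video_replacement([5, 5]))
```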
This poses a challenge for EVS, because it means we need to know the number of retained tokens per frame when creating the text replacement. In the standard vLLM flow this is impossible: the replacement is created before the forward pass through the vision encoder, while EVS selects which video tokens to prune based on the vision encoder outputs, which are only known after that forward pass.
We worked around this issue by signaling `get_input_embeddings` of the `SupportsMultiModal` protocol not to replace any part of the video text replacement (done by setting `embed_text=seq` in `get_video_repl` when creating `PromptUpdateDetails`), and by handling the embedding merge ourselves in `NemotronH_Nano_VL_V2.get_multimodal_embeddings`, where we already know how many tokens are retained per frame. This also means the number of tokens per frame set in `_get_prompt_updates`'s call to `get_video_repl` doesn't matter, since that text replacement only exists to set the correct total number of tokens. Hence, all that matters in that call is the total number of retained video tokens, which is known at that stage from `video_pruning_rate`.

Joint work with @BloodAxe
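A simplified sketch of the merging step described above, assuming hypothetical helper names and shapes (not the actual `get_multimodal_embeddings` implementation):

```python
import torch

def num_retained_video_tokens(total_video_tokens: int, video_pruning_rate: float) -> int:
    # The total number of retained tokens depends only on the pruning rate,
    # so it is known before the vision encoder runs (rounding is illustrative).
    return int(total_video_tokens * (1.0 - video_pruning_rate))

def merge_video_and_separator_embeddings(
    frame_embeddings: list[torch.Tensor],  # per-frame embeddings after EVS pruning
    separator_embedding: torch.Tensor,     # text embeddings of the frame separator
) -> torch.Tensor:
    # Interleave each frame's retained token embeddings with the separator
    # text embeddings, mirroring the layout of the text replacement. At this
    # point the vision encoder has already run, so the number of retained
    # tokens per frame is known.
    pieces: list[torch.Tensor] = []
    for i, frame in enumerate(frame_embeddings):
        pieces.append(frame)
        if i < len(frame_embeddings) - 1:
            pieces.append(separator_embedding)
    return torch.cat(pieces, dim=0)
```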
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.