[Model] EVS support for nano_nemotron_vl #26267
Conversation
Code Review
This pull request adds support for Efficient Video Sampling (EVS) to the nano_nemotron_vl model. The implementation introduces a clever workaround to accommodate this model's unique video frame processing, which involves text separators, by handling embedding merging within get_multimodal_embeddings. The changes also include a beneficial refactoring of the EVS utility functions, improving their reusability and fixing a potential bug. The overall approach is sound and the implementation appears correct. I have one suggestion to improve robustness by replacing an assert statement with a more explicit ValueError for input validation in a critical path.
```python
# TODO: Maybe this can be optimized to avoid the loop?
for i, single_video_embeddings in enumerate(video_embeddings):
    num_frames = video_input["num_patches"][i].item()
    assert single_video_embeddings.shape[0] % num_frames == 0
```
Using `assert` for input validation in production code can be risky. Assertions can be disabled with Python's `-O` flag, and they raise a generic `AssertionError`. It's better to use an explicit `if` check and raise a `ValueError` with a descriptive message. This makes the code more robust against unexpected or malformed inputs and prevents potential server crashes.
```python
if single_video_embeddings.shape[0] % num_frames != 0:
    raise ValueError(
        f"The number of video embeddings ({single_video_embeddings.shape[0]}) "
        f"is not divisible by the number of frames ({num_frames})."
    )
```
…M mechanism:
1. `get_video_repl` now doesn't mask the indicator tokens; it signals vLLM to replace all placeholder embeddings with the video embeddings returned by `get_multimodal_embeddings`.
2. `get_multimodal_embeddings` handles interleaving video embeddings with text embeddings for indicator tokens.
3. This is done by creating the video replacement text again in `get_multimodal_embeddings`, tokenizing it, and masking the indicator tokens. Indicator-token embeddings are calculated by calling `self.language_model.get_input_embeddings()` directly.
4. The tokenizer was added to `NemotronH_Nano_VL_V2` to allow tokenizing in `get_multimodal_embeddings()`.
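A minimal sketch of that interleaving step, assuming hypothetical names (`repl_text`, `indicator_token_ids`) and Hugging Face-style tokenizer and embedding APIs; it is not the actual `NemotronH_Nano_VL_V2` code:

```python
import torch

def interleave_video_and_text_embeddings(
    tokenizer,
    language_model,
    video_embeddings: torch.Tensor,  # (num_retained_video_tokens, hidden_size)
    repl_text: str,                  # recreated video replacement text
    indicator_token_ids: set[int],   # ids of the video placeholder tokens
) -> torch.Tensor:
    repl_ids = tokenizer.encode(repl_text, add_special_tokens=False)
    is_video = torch.tensor([tid in indicator_token_ids for tid in repl_ids])

    # Separator (text) embeddings come directly from the language model.
    embeds = language_model.get_input_embeddings()(torch.tensor(repl_ids))

    if int(is_video.sum()) != video_embeddings.shape[0]:
        raise ValueError("Mismatch between indicator tokens and video embeddings")

    # Scatter the (EVS-pruned) video embeddings into the indicator positions,
    # leaving the text-separator embeddings untouched.
    embeds = embeds.clone()
    embeds[is_video] = video_embeddings
    return embeds
```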
This pull request has merge conflicts that must be resolved before it can be merged.
Closed due to rebase havoc. Replaced by #26269.
Purpose
Add support for EVS (Efficient Video Sampling, introduced in #22980) to the Nano Nemotron VL model.
Contrary to other multimodal models (for example, Qwen2.5-VL), Nano Nemotron VL uses text frame separators. These are not placeholders and should not be replaced by video embeddings. For example, if a video has 2 frames, each with 5 tokens, the video text replacement interleaves separator text with each frame's placeholder tokens, as sketched below.
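A schematic illustration (the `Frame N:` separator strings and the `<video>` placeholder are stand-ins; the model's actual strings differ):

```
Frame 1: <video><video><video><video><video>Frame 2: <video><video><video><video><video>
```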
This poses a challenge for EVS, since it means we need to know the number of retained tokens per frame when creating the text replacement. In the standard vLLM flow this is impossible: the replacement is created before the forward pass through the vision encoder, while EVS selects which video tokens to prune based on the vision encoder outputs, which are known only after that forward pass.
We worked around this issue by signaling `get_input_embeddings` of the `SupportsMultiModal` protocol not to replace any of the video text replacement (done by setting `embed_text=seq` in `get_video_repl` when creating `PromptUpdateDetails`), and by handling embedding merging ourselves in `NemotronH_Nano_VL_V2.get_multimodal_embeddings`, where we already know how many tokens are retained per frame. This also means the number of tokens per frame set in `_get_prompt_updates`'s call to `get_video_repl` doesn't matter, as that text replacement is created only to reserve the correct number of tokens. Hence, all that matters in that call is the total number of retained video tokens, which is known at that stage from `video_pruning_rate` (see the sketch below).

Joint work with @BloodAxe
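A minimal sketch of why that total is computable before the vision encoder runs, assuming a hypothetical helper and simple rounding (the actual EVS code may round differently and handle per-video frame counts):

```python
import math

def total_retained_video_tokens(num_frames: int,
                                tokens_per_frame: int,
                                video_pruning_rate: float) -> int:
    # EVS decides *which* tokens survive only after the vision encoder runs,
    # but *how many* survive is fixed up front by the pruning rate.
    total = num_frames * tokens_per_frame
    return math.ceil(total * (1.0 - video_pruning_rate))
```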
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.