
Conversation

@tomeras91 (Contributor) commented on Oct 6, 2025

Purpose

Add support for EVS (Efficient Video Sampling, introduced in #22980) for the Nano Nemotron VL model.

Unlike other multimodal models (for example, Qwen2.5-VL), Nano Nemotron VL uses text frame separators. These are not placeholders and should not be replaced by video embeddings. For example, if a video has 2 frames, each with 5 tokens, the video text replacement will be:

Frame1: <img><image><image><image><image><image></img>Frame2: <img><image><image><image><image><image></img>

This poses a challenge for EVS, since it means we need to know the number of tokens per frame when creating the text replacement. In the standard vLLM flow this is impossible: the replacement is created before the forward pass through the vision encoder, while EVS selects the video tokens to prune based on the vision encoder outputs, which are only known after that forward pass.
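
To illustrate the problem, here is a minimal sketch of how such a frame-separated replacement string could be built (the helper build_video_replacement is hypothetical, not the actual vLLM code); with EVS, tokens_per_frame varies by frame and is only known after the vision encoder runs:

    # Hypothetical sketch: per-frame token counts are needed to build the string.
    def build_video_replacement(tokens_per_frame: list[int]) -> str:
        parts = []
        for i, num_tokens in enumerate(tokens_per_frame):
            # "FrameN: " is a text separator; only <image> tokens are placeholders.
            parts.append(f"Frame{i + 1}: <img>{'<image>' * num_tokens}</img>")
        return "".join(parts)

    # Without EVS: 2 frames, 5 tokens each -> the replacement shown above.
    # With EVS, the retained counts differ per frame (e.g. [5, 3]) and are
    # unknown at the time the replacement must be created.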

We worked around this issue by signaling get_input_embeddings of the SupportsMultiModal protocol not to replace any of the video text replacement (done by setting embed_text=seq in get_video_repl when creating PromptUpdateDetails), and by handling embedding merging ourselves in NemotronH_Nano_VL_V2.get_multimodal_embeddings, where the number of tokens retained per frame is already known. This also means the number of tokens per frame passed in _get_prompt_updates's call to get_video_repl doesn't matter: that text replacement is created only to reserve the correct number of tokens. All that matters in that call is the total number of retained video tokens, which is known at that stage from video_pruning_rate.
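
For illustration, a hedged sketch of that merge step (the tensor names and the is_video_embed mask are assumptions for this sketch, not the exact implementation in get_multimodal_embeddings):

    import torch

    def merge_video_embeddings(
        repl_token_ids: torch.Tensor,    # (seq_len,) re-tokenized replacement text
        is_video_embed: torch.Tensor,    # (seq_len,) bool; True at <image> positions
        video_embeddings: torch.Tensor,  # (num_retained_tokens, hidden_size)
        get_input_embeddings,            # e.g. language_model.get_input_embeddings()
    ) -> torch.Tensor:
        # Sanity check: retained video tokens must exactly fill the placeholder slots.
        if int(is_video_embed.sum()) != video_embeddings.shape[0]:
            raise ValueError("Mismatch between placeholders and video embeddings")
        merged = get_input_embeddings(repl_token_ids)  # text embeddings everywhere
        merged[is_video_embed] = video_embeddings      # overwrite placeholder slots
        return merged

The total retained count itself follows directly from the pruning rate, e.g. something like int(total_video_tokens * (1 - video_pruning_rate)); the exact rounding is an assumption here.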

Joint work with @BloodAxe


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify bot added the multi-modality (Related to multi-modality, #4194) and qwen (Related to Qwen models) labels on Oct 6, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request adds support for Efficient Video Sampling (EVS) to the nano_nemotron_vl model. The implementation introduces a clever workaround to accommodate this model's unique video frame processing, which involves text separators, by handling embedding merging within get_multimodal_embeddings. The changes also include a beneficial refactoring of the EVS utility functions, improving their reusability and fixing a potential bug. The overall approach is sound and the implementation appears correct. I have one suggestion to improve robustness by replacing an assert statement with a more explicit ValueError for input validation in a critical path.

# TODO: Maybe this can be optimized to avoid the loop?
for i, single_video_embeddings in enumerate(video_embeddings):
    num_frames = video_input["num_patches"][i].item()
    assert single_video_embeddings.shape[0] % num_frames == 0
Severity: high

Using assert for input validation in production code can be risky. Assertions can be disabled with Python's -O flag, and they raise a generic AssertionError. It's better to use an explicit if check and raise a ValueError with a descriptive message. This makes the code more robust against unexpected or malformed inputs and prevents potential server crashes.

            if single_video_embeddings.shape[0] % num_frames != 0:
                raise ValueError(
                    f"The number of video embeddings ({single_video_embeddings.shape[0]}) "
                    f"is not divisible by the number of frames ({num_frames})."
                )

Isotr0py and others added 23 commits October 6, 2025 04:06
…rge (vllm-project#25331)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: isotr0py <2037008807@qq.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…okens_count so code can be reused in nano_nemotron_vl which doesn't have thw

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
This reverts commit c5dad7e.

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…M mechanism:

1. get_video_repl now doesn't mask the indicator tokens - it signals
   vLLM to replace all placeholder embeddings with the video embeddings
   returned by get_multimodal_embeddings
2. get_multimodal_embeddings handles interleaving video embeddings with
   text embeddings for indicator tokens
3. This is done by creating the video replacement text again in
   get_multimodal_embeddings, tokenizing it, and masking the indicator
   tokens. Indicator token embeddings are computed by calling
   self.language_model.get_input_embeddings() directly
4. The tokenizer was added to NemotronH_Nano_VL_V2 to allow tokenizing
   in get_multimodal_embeddings()

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…f NemotronH_Nano_VL_V2

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…#25827)

Signed-off-by: yingjun-mou <renzomou@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…d models (vllm-project#25854)

Signed-off-by: zhoukz <me@zhoukz.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…value (vllm-project#25868)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Chenxi Yang <cxyang@fb.com>
Co-authored-by: Chenxi Yang <cxyang@fb.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…llm-project#25883)

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…test (vllm-project#25885)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
vllm-project#25706)

Signed-off-by: Lee Nau <lnau@nvidia.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
…tion (vllm-project#25513)

Signed-off-by: adabeyta <aabeyta@redhat.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
vllm-project#25819)

Signed-off-by: Naman Lalit <nl2688@nyu.edu>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@mergify bot added the documentation (Improvements or additions to documentation), ci/build, deepseek (Related to DeepSeek models), frontend, llama (Related to Llama models), new-model (Requests to new models), performance (Performance-related issues), gpt-oss (Related to GPT-OSS models), and rocm (Related to AMD ROCm) labels on Oct 6, 2025
@mergify bot added the tpu (Related to Google TPUs) and tool-calling labels on Oct 6, 2025
@mergify bot commented on Oct 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tomeras91.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@mergify bot removed the tpu (Related to Google TPUs) and needs-rebase labels on Oct 6, 2025
@tomeras91 (Contributor, Author) commented:

Closed due to rebase havoc. Replaced by #26269
