Nemotron Nano V2 VL + EVS Video Support #27107
Conversation
Co-authored-by: nvnbagrov <nbagrov@nvidia.com> Co-authored-by: Tomer Asida <tomera@ai21.com> Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Code Review
This pull request adds support for Efficient Video Sampling (EVS) to Nemotron Nano 2 VL. The changes primarily involve enhancing the video processing pipeline to handle and utilize video metadata, such as frame indices and timestamps, for more robust tokenization. A notable improvement is the refactoring of input normalization from the vision model to the processor, which is a good design choice. My review identifies a couple of areas for improvement: a code duplication that could impact maintainability and a performance issue in the new video frame sampling logic. Addressing these points will enhance the overall quality and efficiency of the implementation.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if (
    self.is_multimodal_pruning_enabled
    and modality == "video"
    and num_items > 1
):
    # Re-group each video into its own micro-batch so that EVS pruning
    # processes exactly one video at a time.
    for video_mm_kwargs_item in filter(
        lambda item: item.modality == "video", mm_kwargs
    ):
        _, _, micro_batch_mm_inputs = next(
            group_mm_kwargs_by_modality(
                [video_mm_kwargs_item],
                device=self.device,
                pin_memory=self.pin_memory,
                merge_by_field_config=model.merge_by_field_config,
            )
        )
```
Is this change necessary?
I honestly also feel like it's probably just cleaner if we can have this under a separate code path when pruning is enabled, rather than hacking it into the normal batching loop.
@tomeras91 leaving that to you ^_^
Hey @ywang96 :)
A few notes on this:
- This workaround was introduced in a previous PR. This one just makes it more robust by extracting a single video from `mm_kwargs` instead of relying on the fact that the first dimension of items in `mm_kwargs_group` is the video batch dim. Changes in `group_mm_kwargs_by_modality()`, which now has `merge_by_field_config=True` for almost all relevant models, mean that the video batch dimension no longer exists and frames from different videos are gathered in the same dim. That's why we need a more robust way to process just a single video.
- The reason the workaround is needed in the first place is that the scheduler uses the output number of tokens to compute how many videos can be batched together. With EVS, this differs from the input number of tokens. For example, with a 75% pruning rate, the scheduler sees that the output number of vision tokens is small (1/4 of the input number of vision tokens) and increases the number of videos 4x (effectively trying to "squeeze" multiple videos into a batch), which causes OOM in the vision encoder; see the sketch after this list. A better solution would be to change the heuristic that computes the batch size, but we feel this is out of scope for this PR.
- I agree it would be better if this logic were moved to a more isolated code path that only runs when EVS is enabled. Can you advise on such a place? We don't want to put it in modeling code, since EVS is a general feature that can work with basically any video model (currently implemented in vLLM for Qwen3-VL and nano-v2-vl).
Thanks for the information - I remember the context behind this hack now.
I added a note for it - the real problem here is that during profiling we don't take EVS into account, and therefore drastically underestimate how much memory it takes to execute the encoder (which will result in OOM at actual inference time). For now let's leave this as is.
On a related note - did you actually get this to work with Qwen3-VL? Qwen3-VL has special logic around timestamps so I wonder how that would work with EVS.
@ywang96 I looked into the Qwen3 code, and it seems very similar to what we have in Nano V2. In Nano V2, each frame is rendered as `"Frame {i} sampled at {t:.2f}:"` + `vision_start_token` + `[vision_placeholder_token] * num_tokens_in_frame_i` + `vision_end_token`.
In Qwen3 it is `"<{timestamp} seconds>"` (on average 9.5 tokens) + `vision_start_token` + `video_token` + `vision_end_token`.
We solved how to make EVS work with this interleaved image-token packing for Nano V2 (although it was NOT an easy task), so it should be possible to employ EVS for Qwen3 as well.
A key method is `get_video_repl` (https://github.com/vllm-project/vllm/pull/27107/files#diff-b947b52004ad7e873c11916a4194f0b6825892eb3d51a6c57b0e3d64fc79449bR642), which does all the heavy lifting to build the "fake" prompt replacement for the preprocessing step and the "real" prompt replacement once we have the final vision embeddings.
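For intuition, here is a minimal sketch of what a `get_video_repl`-style helper has to produce per frame. The token strings, function signature, and counts are illustrative assumptions, not the actual Nano V2 implementation:

```python
# Illustrative sketch of building a per-frame prompt replacement in the
# Nano V2 style described above. Token names and counts are assumptions.

def build_video_repl(
    timestamps: list[float],
    tokens_per_frame: list[int],
    vision_start: str = "<vision_start>",        # hypothetical token name
    vision_end: str = "<vision_end>",            # hypothetical token name
    placeholder: str = "<vision_placeholder>",   # hypothetical token name
) -> str:
    """Interleave per-frame text with vision placeholder tokens.

    For preprocessing, tokens_per_frame holds the pre-pruning counts
    (the "fake" replacement); once EVS has run, it holds the per-frame
    counts of surviving tokens (the "real" replacement).
    """
    parts = []
    for i, (t, n) in enumerate(zip(timestamps, tokens_per_frame)):
        parts.append(
            f"Frame {i} sampled at {t:.2f}:"
            + vision_start
            + placeholder * n
            + vision_end
        )
    return "".join(parts)

# Example: 3 frames at 0.0/0.5/1.0s; EVS kept 64/16/8 tokens per frame.
repl = build_video_repl([0.0, 0.5, 1.0], [64, 16, 8])
```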
Amazing - thanks for the explanation
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com>
Force-pushed from a76602c to ee0804e
Isotr0py left a comment
LGTM now!
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com> Signed-off-by: Natan Bagrov <nbagrov@nvidia.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Natan Bagrov <nbagrov@nvidia.com> Co-authored-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com> Signed-off-by: Natan Bagrov <nbagrov@nvidia.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Natan Bagrov <nbagrov@nvidia.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com> Signed-off-by: Natan Bagrov <nbagrov@nvidia.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Natan Bagrov <nbagrov@nvidia.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
This PR adds support for EVS (Efficient Video Sampling) to Nemotron Nano 2 VL.
Co-authored with @tomeras91 and @nvnbagrov
Test Plan
Tested internally