Conversation

@BloodAxe (Contributor) commented Oct 17, 2025

Purpose

This PR adds support for Efficient Video Sampling (EVS) to Nemotron Nano 2 VL.
Co-authored with @tomeras91 and @nvnbagrov

Test Plan

Tested internally

Co-authored-by: nvnbagrov <nbagrov@nvidia.com>
Co-authored-by: Tomer Asida <tomera@ai21.com>

Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
@mergify bot added the multi-modality (Related to multi-modality (#4194)) and v1 labels Oct 17, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request adds support for Efficient Video Sampling (EVS) to Nemotron Nano 2 VL. The changes primarily enhance the video processing pipeline to handle and use video metadata, such as frame indices and timestamps, for more robust tokenization. A notable improvement is the refactoring of input normalization from the vision model into the processor, which is a good design choice. My review identifies two areas for improvement: duplicated code that could hurt maintainability, and a performance issue in the new video frame sampling logic.

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment on lines +1743 to +1757
if (
    self.is_multimodal_pruning_enabled
    and modality == "video"
    and num_items > 1
):
    for video_mm_kwargs_item in filter(
        lambda item: item.modality == "video", mm_kwargs
    ):
        _, _, micro_batch_mm_inputs = next(
            group_mm_kwargs_by_modality(
                [video_mm_kwargs_item],
                device=self.device,
                pin_memory=self.pin_memory,
                merge_by_field_config=model.merge_by_field_config,
            )
@ywang96 (Member) commented:

Is this change necessary?

I honestly also feel like it's probably just cleaner if we can have this under a separate code path when pruning is enabled rather than hacking this under the normal batching logic loop

@BloodAxe (Contributor, Author) commented:

@tomeras91 leaving that to you ^_^

@tomeras91 (Contributor) commented:

Hey @ywang96 :)

A few notes on this:

  1. This workaround was introduced in a previous PR; this one just makes it more robust by extracting a single video from mm_kwargs instead of relying on the first dimension of items in mm_kwargs_group being the video batch dim. Changes in group_mm_kwargs_by_modality(), which now has merge_by_field_config=True for almost all relevant models, mean that the video batch dimension no longer exists and frames from different videos are gathered in the same dim. That's why we need a more robust way to process just a single video.
  2. The reason the workaround is needed in the first place is that the scheduler uses the output number of tokens to compute how many videos can be batched together. With EVS, this differs from the input number of tokens. For example, with a 75% pruning rate, the scheduler sees a small output number of vision tokens (e.g. 1/4 of the input number) and batches 4x as many videos (effectively trying to "squeeze" multiple videos into a batch), which causes OOM in the vision encoder — see the sketch after this list. A better solution is to change the heuristic that computes the batch size, but we feel this is out of scope for this PR.
  3. I agree it's better if this logic can be moved somewhere more isolated, on a separate code path used only when EVS is enabled. Can you advise on such a place? We don't want to put it in modeling code, since EVS is a general feature that can work with basically any video model (currently implemented in vLLM for Qwen3-VL and nano-v2-vl).
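To make the arithmetic in point 2 concrete, here is a minimal illustrative sketch; the numbers and the token budget are made up for illustration, and this is not the actual vLLM scheduler heuristic:

vision_tokens_per_video = 4096   # tokens the vision encoder must actually process
evs_pruning_rate = 0.75          # 75% of vision tokens pruned by EVS
tokens_after_evs = int(vision_tokens_per_video * (1 - evs_pruning_rate))  # 1024

token_budget = 8192              # hypothetical per-batch vision-token budget

# The scheduler budgets on *output* (post-EVS) tokens...
videos_scheduled = token_budget // tokens_after_evs            # 8 videos
# ...but the encoder still processes *input* tokens, blowing the budget 4x.
encoder_tokens = videos_scheduled * vision_tokens_per_video    # 32768 >> 8192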

@ywang96 (Member) commented Oct 20, 2025

Thanks for the information - I remember the context behind this hack now.

I added a note for it - the real problem here is that during profiling we don't take EVS into account, and therefore drastically underestimate how much memory it takes to execute the encoder (which will result in OOM at actual inference time). For now let's leave this as is.

On a related note - did you actually get this to work with Qwen3-VL? Qwen3-VL has special logic around timestamps so I wonder how that would work with EVS.

@BloodAxe (Contributor, Author) commented Oct 20, 2025

@ywang96 I looked into the Qwen3 code, and it seems very similar to what we have in Nano V2. In Nano V2, each frame contributes "Frame {i} sampled at {t:.2f}:" + vision_start_token + [vision_placeholder_token] * num_tokens_in_frame_i + vision_end_token.

In Qwen3 there is "<{timestamp} seconds>" (on average 9.5 tokens) + vision_start_token + video_token + vision_end_token.

We worked out how to make EVS work with this interleaved image-token packing for Nano V2 (although it was NOT an easy task), so it should be possible to employ EVS for Qwen3 as well.

A key method is get_video_repl (https://github.com/vllm-project/vllm/pull/27107/files#diff-b947b52004ad7e873c11916a4194f0b6825892eb3d51a6c57b0e3d64fc79449bR642), which does all the heavy lifting to build the "fake" prompt replacement for the preprocessing step and the "real" prompt replacement once we get the final vision embeddings.
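For illustration, here is a rough sketch of the per-frame replacement structure described above; the token strings and the helper name are hypothetical placeholders, not the actual Nano V2 / vLLM implementation (see get_video_repl in the PR diff for the real logic):

# Rough sketch of the Nano V2 per-frame prompt layout described above.
# Token strings and helper names are illustrative placeholders only.
VISION_START = "<vision_start>"
VISION_END = "<vision_end>"
VIDEO_PLACEHOLDER = "<video>"

def build_video_repl(timestamps: list[float], tokens_per_frame: list[int]) -> str:
    parts = []
    for i, (t, n) in enumerate(zip(timestamps, tokens_per_frame)):
        # After EVS, n varies per frame -- pruned frames keep fewer tokens,
        # so the replacement must be rebuilt with per-frame counts.
        parts.append(
            f"Frame {i} sampled at {t:.2f}:"
            + VISION_START
            + VIDEO_PLACEHOLDER * n
            + VISION_END
        )
    return "".join(parts)

# Before pruning every frame has the same token count; after EVS the "real"
# replacement uses the counts that survived pruning, e.g.:
print(build_video_repl([0.0, 0.5], [16, 7]))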

@ywang96 (Member) commented:

Amazing - thanks for the explanation

BloodAxe and others added 5 commits October 17, 2025 21:52
@nvnbagrov force-pushed the feature/nemotron_nano_v2_vl_video_support branch from a76602c to ee0804e on October 19, 2025 at 11:34
nvnbagrov and others added 3 commits October 19, 2025 04:52
@Isotr0py (Member) left a comment:

LGTM now!

@Isotr0py added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 20, 2025
@DarkLight1337 merged commit e93ff6c into vllm-project:main Oct 20, 2025
55 checks passed
adabeyta pushed a commit to adabeyta/vllm that referenced this pull request Oct 20, 2025
faaany pushed a commit to faaany/vllm that referenced this pull request Oct 21, 2025
Zhuul pushed a commit to Zhuul/vllm that referenced this pull request Oct 21, 2025
Ther-LF pushed a commit to Ther-LF/vllm that referenced this pull request Oct 22, 2025
@BloodAxe deleted the feature/nemotron_nano_v2_vl_video_support branch October 22, 2025 13:36
@tomeras91 mentioned this pull request Oct 23, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
@hl475 mentioned this pull request Oct 31, 2025
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025

Labels

multi-modality (Related to multi-modality (#4194)), ready (ONLY add when PR is ready to merge/full CI is needed), v1

6 participants