Nemotron Nano V2 VL + EVS Video Support #27107
Conversation
Co-authored-by: nvnbagrov <nbagrov@nvidia.com> Co-authored-by: Tomer Asida <tomera@ai21.com> Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Code Review
This pull request adds support for Efficient Video Sampling (EVS) to Nemotron Nano 2 VL. The changes primarily involve enhancing the video processing pipeline to handle and utilize video metadata, such as frame indices and timestamps, for more robust tokenization. A notable improvement is the refactoring of input normalization from the vision model to the processor, which is a good design choice. My review identifies a couple of areas for improvement: a code duplication that could impact maintainability and a performance issue in the new video frame sampling logic. Addressing these points will enhance the overall quality and efficiency of the implementation.
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if (
    self.is_multimodal_pruning_enabled
    and modality == "video"
    and num_items > 1
):
    # Re-group each video into its own micro-batch so that EVS pruning
    # processes exactly one video at a time.
    for video_mm_kwargs_item in filter(
        lambda item: item.modality == "video", mm_kwargs
    ):
        _, _, micro_batch_mm_inputs = next(
            group_mm_kwargs_by_modality(
                [video_mm_kwargs_item],
                device=self.device,
                pin_memory=self.pin_memory,
                merge_by_field_config=model.merge_by_field_config,
            )
        )
```
Is this change necessary?
I honestly also feel like it's probably just cleaner if we can have this under a separate code path when pruning is enabled, rather than hacking it into the normal batching loop.
@tomeras91 leaving that to you ^_^
Hey @ywang96 :)
A few notes on this:
- This workaround was introduced in a previous PR. This one just makes it more robust by extracting a single video from `mm_kwargs` instead of relying on the fact that the first dimension of items in `mm_kwargs_group` is the video batch dim. Changes in `group_mm_kwargs_by_modality()`, which now has `merge_by_field_config=True` for almost all relevant models, mean that the video batch dimension no longer exists and frames from different videos are gathered in the same dim. That's why we need a more robust way to process just a single video.
- The reason the workaround is needed in the first place is that the scheduler uses the output number of tokens to compute how many videos can be batched together. With EVS, this differs from the input number of tokens. For example, with a 75% pruning rate, the scheduler sees that the output number of vision tokens is small (1/4 of the input number of vision tokens) and increases the number of videos 4x (effectively trying to "squeeze" multiple videos into a batch), which causes OOM in the vision encoder; see the sketch after this list. A better solution would be to change the heuristic that computes the batch size, but we feel this is out of scope for this PR.
- I agree it would be better if this logic were moved to a more isolated code path that only runs when EVS is enabled. Can you advise on such a place? We don't want to put it in modeling code, since EVS is a general feature that can work with basically any video model (currently implemented in vLLM for Qwen3-VL and nano-v2-vl).
Thanks for the information - I remember the context behind this hack now.
I added a note for it - the real problem here is that during profiling we don't take EVS into account, and therefore drastically underestimate how much memory it takes to execute the encoder (which will result in OOM at actual inference time). For now let's leave this as is.
On a related note - did you actually get this to work with Qwen3-VL? Qwen3-VL has special logic around timestamps so I wonder how that would work with EVS.
@ywang96 I looked into the Qwen3 code, and it seems very similar to what we have in Nano V2. In Nano V2, each frame is rendered as `"Frame {i} sampled at {t:.2f}:"` + `vision_start_token` + `[vision_placeholder_token] * num_tokens_in_frame_i` + `vision_end_token`.
In Qwen3 it is `"<{timestamp} seconds>"` (on average 9.5 tokens) + `vision_start_token` + `video_token` + `vision_end_token`.
We solved how to make EVS work with this interleaved image-token packing for Nano V2 (although it was NOT an easy task), so it should be possible to employ EVS for Qwen3 as well.
A key method is `get_video_repl` (https://github.com/vllm-project/vllm/pull/27107/files#diff-b947b52004ad7e873c11916a4194f0b6825892eb3d51a6c57b0e3d64fc79449bR642), which does all the heavy lifting to build the "fake" prompt replacement for the preprocessing step and the "real" prompt replacement once we have the final vision embeddings.
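For intuition, here is a minimal sketch of what a `get_video_repl`-style helper has to produce per frame. The token strings, function signature, and counts are illustrative assumptions, not the actual Nano V2 implementation:

```python
# Illustrative sketch of building a per-frame prompt replacement in the
# Nano V2 style described above. Token names and counts are assumptions.

def build_video_repl(
    timestamps: list[float],
    tokens_per_frame: list[int],
    vision_start: str = "<vision_start>",        # hypothetical token name
    vision_end: str = "<vision_end>",            # hypothetical token name
    placeholder: str = "<vision_placeholder>",   # hypothetical token name
) -> str:
    """Interleave per-frame text with vision placeholder tokens.

    For preprocessing, tokens_per_frame holds the pre-pruning counts
    (the "fake" replacement); once EVS has run, it holds the per-frame
    counts of surviving tokens (the "real" replacement).
    """
    parts = []
    for i, (t, n) in enumerate(zip(timestamps, tokens_per_frame)):
        parts.append(
            f"Frame {i} sampled at {t:.2f}:"
            + vision_start
            + placeholder * n
            + vision_end
        )
    return "".join(parts)

# Example: 3 frames at 0.0/0.5/1.0s; EVS kept 64/16/8 tokens per frame.
repl = build_video_repl([0.0, 0.5, 1.0], [64, 16, 8])
```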
Amazing - thanks for the explanation
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com>
Signed-off-by: Natan Bagrov <nbagrov@nvidia.com>
Force-pushed from a76602c to ee0804e
Isotr0py left a comment
LGTM now!
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com> Signed-off-by: Natan Bagrov <nbagrov@nvidia.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Natan Bagrov <nbagrov@nvidia.com> Co-authored-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com> Signed-off-by: Natan Bagrov <nbagrov@nvidia.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Natan Bagrov <nbagrov@nvidia.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
Signed-off-by: Eugene Khvedchenia <ekhvedchenia@nvidia.com> Signed-off-by: Natan Bagrov <nbagrov@nvidia.com> Signed-off-by: Roger Wang <hey@rogerw.io> Co-authored-by: Natan Bagrov <nbagrov@nvidia.com> Co-authored-by: Roger Wang <hey@rogerw.io> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Purpose
This PR adds support for EVS (Efficient Video Sampling) to Nemotron Nano 2 VL.
Co-authored with @tomeras91 and @nvnbagrov
Test Plan
Tested internally