Add RADIO Vision Encoder Support to vLLM #24595
Code Review
This pull request adds support for the RADIO vision encoder, enabling its use in multimodal models like Nano Nemotron VL. The changes include a new RadioModel implementation, integration into NanoNemotronVL, and corresponding tests. While the implementation is comprehensive, there are a few critical issues that need to be addressed. A potential crash due to unsafe dictionary access in the configuration helper needs to be fixed. The vLLM implementation of RadioInternVisionModel is missing a final normalization layer present in the original model, which will lead to incorrect outputs. Additionally, a bug in the test file could lead to incorrect or inefficient test execution. There are also opportunities to make the weight loading logic more robust by handling unexpected weights.
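The "unsafe dictionary access" point can be illustrated with a minimal sketch. The helper name `get_vit_dims`, the `model_name` config key, and the variant keys below are hypothetical and not taken from the PR; the dimensions are the standard ViT-B/L/H sizes. The idea is only that a lookup should fail with a descriptive error rather than a bare `KeyError`:

```python
from typing import Dict, NamedTuple


class VitDims(NamedTuple):
    embed_dim: int
    depth: int
    num_heads: int


# Standard ViT sizes. The key names are illustrative, not the PR's.
_VIT_DIMS: Dict[str, VitDims] = {
    "vit_base": VitDims(768, 12, 12),
    "vit_large": VitDims(1024, 24, 16),
    "vit_huge": VitDims(1280, 32, 16),
}


def get_vit_dims(config: dict) -> VitDims:
    """Look up ViT dimensions without risking an unhandled KeyError."""
    # "model_name" is an assumed config key for this sketch.
    name = config.get("model_name")
    if name is None:
        raise ValueError("config is missing 'model_name'")
    dims = _VIT_DIMS.get(name)
    if dims is None:
        raise ValueError(
            f"unsupported ViT variant {name!r}; "
            f"expected one of {sorted(_VIT_DIMS)}"
        )
    return dims
```

An unknown or missing variant then surfaces as a clear configuration error at model construction time instead of a crash deep inside the helper.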
vllm/model_executor/models/radio.py
The RadioInternVisionModel implementation is missing the final normalization layer that is present in the original HuggingFace RadioInternVisionModel. The original model applies a norm layer after the encoder. This omission will lead to incorrect model outputs.
Additionally, the load_weights method in RadioModel silently ignores weights that it doesn't recognize, including the weights for this missing normalization layer (model.norm.weight and model.norm.bias). This makes the issue harder to detect.
You should add the final normalization layer to RadioInternVisionModel and update RadioModel.load_weights to handle its weights.
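One way to make such omissions visible is a loader that reports unrecognized names instead of dropping them. The following is a minimal, framework-free sketch of that idea, not the PR's actual `load_weights`; the function name `load_weights_strict` and the `allowed_skips` parameter are assumptions made for illustration:

```python
from typing import Any, Dict, FrozenSet, Iterable, List, Set, Tuple


def load_weights_strict(
    params: Dict[str, Any],
    weights: Iterable[Tuple[str, Any]],
    allowed_skips: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Copy checkpoint values into `params`, failing on unknown names.

    `params` stands in for the model's named parameters. Names listed in
    `allowed_skips` are ignored on purpose; any other name that is not a
    key of `params` is collected and reported.
    """
    loaded: Set[str] = set()
    unexpected: List[str] = []
    for name, value in weights:
        if name in allowed_skips:
            continue  # deliberately unused by this model
        if name not in params:
            unexpected.append(name)  # e.g. a layer the model forgot to define
            continue
        params[name] = value
        loaded.add(name)
    if unexpected:
        raise ValueError(f"unexpected weights: {sorted(unexpected)}")
    return loaded
```

With a loader of this shape, a checkpoint containing model.norm.weight and model.norm.bias would raise until the model defines the final normalization layer, rather than loading "successfully" with a missing layer.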
@DarkLight1337 Fixed the comments.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 2d7ea3b to ae5b38d.
DarkLight1337 left a comment
LGTM now, thanks
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com> Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster>
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com> Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster> Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: Daniel Afrimi <danielafrimi8@gmail.com> Co-authored-by: root <root@cw-dfw-h100-001-305-026.cm.cluster> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
This PR implements support for the C-RADIO vision encoder in vLLM (RADIO stands for "Reduce All Domains Into One"), enabling its use with multimodal models like Nano Nemotron VL.
Changes
- New Radio Model Implementation (vllm/model_executor/models/radio.py)
  - RadioInternVisionModel: Core vision model using the InternVision encoder architecture
- Integration Updates (vllm/model_executor/models/nano_nemotron_vl.py)
- Testing (tests/models/multimodal/pooling/test_radio.py)
  - nvidia/C-RADIOv2-H
Technical Notes
- Hardcoded Values: The implementation preserves hardcoded values from the original timm package implementation, including the OpenAI CLIP normalization constants and predefined ViT model dimensions, for compatibility and reproducibility.
- Configuration: Introduces a new configuration approach that instantiates the Radio model on top of the InternVision model architecture, with dynamic parameter mapping for different ViT variants.
- Weight Loading: A custom weight loader maps HuggingFace parameter names to vLLM parameter names, supports checkpoints whose names carry the radio_model. prefix, and skips unused parameters.
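The name mapping described under Weight Loading can be sketched roughly as follows. This is a simplified illustration and not the PR's loader: the function name `remap_weights` and the `unused_prefixes` parameter are hypothetical, and only the radio_model. prefix comes from the description above.

```python
from typing import Any, Iterable, Iterator, Tuple

RADIO_PREFIX = "radio_model."


def remap_weights(
    weights: Iterable[Tuple[str, Any]],
    unused_prefixes: Tuple[str, ...] = (),
) -> Iterator[Tuple[str, Any]]:
    """Yield (name, value) pairs with checkpoint names normalized.

    Names carrying the `radio_model.` prefix have it stripped, so prefixed
    and unprefixed checkpoints map to the same parameter names. Names that
    start with any entry of `unused_prefixes` (checked after stripping)
    are skipped.
    """
    for name, value in weights:
        if name.startswith(RADIO_PREFIX):
            name = name[len(RADIO_PREFIX):]
        # An empty tuple matches nothing, so nothing is skipped by default.
        if name.startswith(unused_prefixes):
            continue
        yield name, value
```

Keeping the skip list explicit means that a parameter is only dropped when someone has decided it is unused; everything else reaches the loader under its normalized name.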