Fix(models/siglip): Add compatibility for Gemma models quantized by llm-compressor #19643
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
vllm/model_executor/models/siglip.py
Outdated
Can we add some comment to explain this change?
Hi @houseroad, thank you so much for the quick and insightful review!
- Absolutely, that's a great suggestion. I've added a detailed comment to the code to explain the purpose of the remapping logic.
- You've pointed out the core issue perfectly regarding the double prefix. My current fix in load_weights is indeed a workaround at the loading stage. A more fundamental fix at the model's __init__ level would certainly be cleaner. I'm happy to explore that path. Could you provide any pointers on the preferred way to handle this prefixing logic within the vLLM architecture? Or would you prefer to merge the current loader-side fix first to resolve the user-facing bug, and then create a separate issue for the architectural refactoring?
I'll push the commit with the added comments shortly. Thanks again!
houseroad
left a comment
Wondering if we can fix the weights loading logic to avoid the vision_model.vision_model double prefix?
Hi @houseroad , thank you so much for the quick and insightful review! I have changed my approach, following the implementation of other models, and have just finished verifying the new method; it passes successfully.
Hi @houseroad , following up on our discussion about the vision_model.vision_model double prefix: I've just pushed a new version of the fix that addresses the issue at a more fundamental level, as you suggested. Instead of patching the load_weights function in siglip.py, the weight names are now remapped at the model level. I have verified locally that this new approach also successfully resolves the KeyError. The PR should be ready for review and the final CI run whenever you have a moment. Thanks again for your guidance!
Hi @DarkLight1337 , can you merge this code? I have updated the PR with a more specific description of these changes. 😄
vllm/model_executor/models/gemma3.py
Outdated
Hmmm, but Gemma3ForCausalLM is a text-only model, so why would it have a ViT in its checkpoint?
Hi @Isotr0py , thank you again for the very insightful question. It helped me clarify the core of the issue, and my apologies for not making the context clearer in the initial PR description.
You are absolutely correct that gemma3 is a text-only LLM.
This fix is specifically designed to support its multi-modal variants, where gemma-3-4b-it is used as the language backbone and is combined with a separate vision encoder (like SigLIP, which is a ViT). In these common VLM (Vision-Language-Model) architectures, the vision encoder is loaded into a component conventionally named vision_tower.
This leads to the exact problem this PR solves: the model checkpoint contains weight names with the prefix vision_tower.vision_model., while vLLM's internal SiglipVisionModel loader expects the prefix to be just vision_model.. This naming mismatch results in the KeyError.
The WeightsMapper I've implemented in gemma3_mm.py directly and cleanly resolves this by providing a rule to remap vision_tower.vision_model. to vision_model., thus allowing the weights to be loaded correctly. This addresses the bug originally reported in vllm-project/llm-compressor#1546.
I hope this provides the necessary context for the change! Please let me know if you have any further questions.
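To make the naming mismatch and the remapping rule concrete, here is a minimal, runnable sketch in plain Python. It is not the vLLM implementation; the function name and the example checkpoint key below are illustrative.

```python
# Minimal sketch of the prefix remapping described above.
# The example key name is illustrative, not taken from a real checkpoint.

# Remapping rule: checkpoint prefix -> prefix expected by the vision loader.
ORIG_TO_NEW_PREFIX = {
    "vision_tower.vision_model.": "vision_model.",
}


def remap_weight_name(name: str) -> str:
    """Rewrite a checkpoint weight name to the loader's expected naming."""
    for orig, new in ORIG_TO_NEW_PREFIX.items():
        if name.startswith(orig):
            return new + name[len(orig):]
    return name


# A key as it appears in the multi-modal checkpoint ...
ckpt_key = "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight"
# ... becomes the key the SigLIP loader expects.
assert remap_weight_name(ckpt_key) == (
    "vision_model.encoder.layers.0.self_attn.q_proj.weight"
)
```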
Can you move this mapping from Gemma3ForCausalLM to Gemma3ForConditionalGeneration? This mapping in Gemma3ForCausalLM here can be confusing.
Hi @Isotr0py ,
Thank you for the excellent and clear guidance. Your feedback was incredibly helpful.
I have just pushed an updated commit that implements your suggestion precisely.
Specifically, I have reverted the previous changes to gemma3.py and have now placed the fix within gemma3_mm.py. I located the existing WeightsMapper in the Gemma3ForConditionalGeneration class and simply added the necessary rule ("vision_tower.vision_model.": "vision_model.") to its orig_to_new_prefix dictionary.
This is a much cleaner solution that correctly places the vision-related logic within the multi-modal class, fully addressing your concern about keeping Gemma3ForCausalLM clean.
The PR should now be ready for another look. Thanks again for your help!
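For reference, here is a rough sketch of what such a rule could look like inside the class's mapper, based only on the WeightsMapper and orig_to_new_prefix names mentioned in this thread. The import path is an assumption, and the block is not the actual contents of gemma3_mm.py.

```python
# Rough sketch only -- not the actual contents of gemma3_mm.py.
# The import path below is an assumption about where WeightsMapper lives.
from vllm.model_executor.models.utils import WeightsMapper

hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={
        # ... existing prefix rules of Gemma3ForConditionalGeneration ...
        # Added rule: strip the extra "vision_tower." so checkpoint keys
        # match the SiglipVisionModel's expected "vision_model." prefix.
        "vision_tower.vision_model.": "vision_model.",
    },
)
```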
Hi @Isotr0py , thank you for triggering the final CI run! It seems like one of the jobs failed, and this looks like it might be a transient CI environment or network issue. Would it be possible to re-run the failed job? Please let me know if there's anything I need to do on my end. Thank you!
Description
This PR resolves a KeyError that occurs when attempting to serve a Gemma-3 model that has been quantized by the vllm-project/llm-compressor library. It makes the SigLIP vision model loader more robust to variations in weight naming conventions. This directly addresses the root cause of the bug reported in vllm-project/llm-compressor#1546.
Root Cause
The KeyError arises from a mismatch in weight-naming conventions between vLLM's internal SiglipVisionModel architecture and the standard model artifacts produced by tools like llm-compressor.
Specifically, the SiglipVisionModel in vLLM prepends an additional vision_model. prefix to its parameters (e.g., expecting vision_model.vision_model.encoder...), while the quantized model artifact uses the standard Hugging Face naming convention (e.g., vision_model.encoder...).
The existing load_weights function in siglip.py strictly expected the internal vLLM format, leading to a KeyError when it encountered a weight name from the quantized model.
Solution
This patch introduces flexible remapping logic within the load_weights function in vllm/model_executor/models/siglip.py: checkpoint weight names that do not match the internal parameter names are retried with the additional vision_model. prefix. This makes the loader robust to the naming discrepancy, resolving the KeyError.
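To illustrate the kind of loader-side remapping described here, the following is a simplified, self-contained sketch. It is not the actual siglip.py code: the function name, the params_dict shape, and the log message are illustrative assumptions.

```python
import logging
from collections.abc import Iterable

import torch

logger = logging.getLogger(__name__)


def load_weights_with_remap(
    params_dict: dict[str, torch.Tensor],
    weights: Iterable[tuple[str, torch.Tensor]],
) -> None:
    """Copy checkpoint weights into params_dict, tolerating checkpoint names
    that lack the internal vision_model. prefix described above."""
    for name, loaded_weight in weights:
        if name not in params_dict:
            # Checkpoint uses the plain Hugging Face naming
            # ("vision_model.encoder..."); try the internal
            # double-prefixed form before giving up.
            candidate = "vision_model." + name
            if candidate in params_dict:
                logger.info("Remapping weight %s -> %s", name, candidate)
                name = candidate
        # Raises KeyError only if no remapping was possible.
        params_dict[name].copy_(loaded_weight)
```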
Testing
The fix was verified in a cloud GPU environment (Colab A100) using the following methodology:
- The bug was first reproduced, hitting the KeyError with the original vLLM code.
- A standard vllm package was installed.
- The siglip.py file within the installation was "hot-patched" with the code from this PR.
- The vllm serve command was re-run against the quantized model.

Result:
- INFO: Remapping weight... logs were observed, confirming the patch was being correctly executed.
- The KeyError was successfully resolved.
- The vllm server launched successfully and was able to serve the quantized model.

Thank you for considering this contribution!