[Quant] [Bugfix] Fix quantization config matching with hf_to_vllm_mapper
#20046
Conversation
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Summary of Changes
Hello @kylesayrs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly improves the resolution of quantization schemes for models whose Hugging Face structure differs from vLLM's internal module structure. It achieves this by introducing a new mechanism to adapt quantization configurations using a WeightsMapper, integrating this capability directly into the SupportsQuant interface, and providing concrete implementations for various quantization methods. This ensures more accurate and robust quantization across a broader range of models.
Highlights
- Quantization Config Adaptation: Introduced an `apply_vllm_mapper` method to the `QuantizationConfig` base class and implemented it in `CompressedTensorsConfig` and `Fp8Config`. This method allows quantization configurations to dynamically adjust module names (e.g., for ignore lists or scheme maps) to align with vLLM's internal model structure, using a `WeightsMapper` (a minimal sketch follows this list).
- Standardized Quantization Interface: The `SupportsQuant` mixin was enhanced to include an optional `hf_to_vllm_mapper` class variable. During model initialization, if a `quant_config` is present, the `SupportsQuant` mixin now automatically calls `apply_vllm_mapper` on the `quant_config` using the model's `hf_to_vllm_mapper`, standardizing how models inform their quantization configurations about module name differences.
- Transition and Backward Compatibility: The `configure_quant_config` utility function was updated to apply the `hf_to_vllm_mapper` to quantization configs for models that do not yet inherit from the `SupportsQuant` interface, ensuring a smooth transition and continued support for existing models.
- Enhanced WeightsMapper Utility: The `WeightsMapper` utility now includes `apply_list` and `apply_dict` methods, making it more versatile for transforming lists of module names (like ignore lists) and dictionaries (like scheme maps) within quantization configurations.
- Qwen2.5-VL Model Integration: The `Qwen2_5_VLForConditionalGeneration` model was updated to inherit from the `SupportsQuant` mixin, demonstrating the practical application of these changes for multi-modal models and ensuring its quantization configuration correctly aligns with vLLM's structure.
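To make the pattern concrete, here is a minimal, self-contained sketch of how a mapper with `apply_list`/`apply_dict` helpers could rewrite an ignore list. These are stand-in classes, not the actual vLLM implementations, and the prefixes are made up for illustration:

```python
# Simplified sketch of the mapper pattern described above; stand-in classes only.
from typing import Optional


class SketchWeightsMapper:
    """Maps Hugging Face module prefixes to vLLM module prefixes."""

    def __init__(self, orig_to_new_prefix: dict):
        self.orig_to_new_prefix = orig_to_new_prefix

    def _map_name(self, name: str) -> str:
        for old, new in self.orig_to_new_prefix.items():
            if name.startswith(old):
                return new + name[len(old):]
        return name

    def apply_list(self, names: list) -> list:
        # Rewrite every entry of, e.g., an ignore list.
        return [self._map_name(name) for name in names]

    def apply_dict(self, mapping: dict) -> dict:
        # Rewrite the keys of, e.g., a target -> scheme map, keeping the values.
        return {self._map_name(key): value for key, value in mapping.items()}


class SketchQuantConfig:
    """Stand-in for a quantization config holding an HF-style ignore list."""

    def __init__(self, ignore: list):
        self.ignore = ignore

    def apply_vllm_mapper(self, hf_to_vllm_mapper: Optional[SketchWeightsMapper]):
        if hf_to_vllm_mapper is not None:
            self.ignore = hf_to_vllm_mapper.apply_list(self.ignore)


# Hypothetical prefixes, chosen only to show the renaming.
mapper = SketchWeightsMapper({"model.": "language_model.model.",
                              "lm_head": "language_model.lm_head"})
config = SketchQuantConfig(ignore=["lm_head", "model.visual_proj"])
config.apply_vllm_mapper(mapper)
print(config.ignore)
# ['language_model.lm_head', 'language_model.model.visual_proj']
```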
Code Review
This pull request introduces a robust mechanism to handle quantization configuration mismatches between Hugging Face and vLLM model structures by adding an `apply_vllm_mapper` method to quantization configs. The changes are well-structured, introducing a `SupportsQuant` mixin for new models and providing backward compatibility for older models. The implementation is clean and consistent across the modified files.
I have one suggestion regarding the new `apply_vllm_mapper` interface to ensure its adoption across all relevant quantization configurations in the future.
I think the `apply_vllm_mapper` method provides a good abstraction. A unit test to lock in some expected behavior from this mapper would be nice to have.
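In the spirit of that suggestion, a test along these lines could pin down the list/dict behavior. Note that the import path, the `orig_to_new_prefix` keyword, and the pass-through behavior for unmapped names are assumptions about the new helpers rather than confirmed API details:

```python
# Hedged sketch of a unit test for the new mapper helpers; adjust the import
# path and constructor arguments if the actual WeightsMapper differs.
from vllm.model_executor.models.utils import WeightsMapper


def test_weights_mapper_rewrites_quant_config_collections():
    mapper = WeightsMapper(orig_to_new_prefix={"model.": "language_model.model."})

    # Ignore lists (lists of module names) should be rewritten entry by entry,
    # with names that match no prefix passing through unchanged.
    ignore = ["model.layers.0.mlp.down_proj", "visual.merger"]
    assert mapper.apply_list(ignore) == [
        "language_model.model.layers.0.mlp.down_proj",
        "visual.merger",
    ]

    # Scheme maps (dicts keyed by module name) should have their keys rewritten
    # while the mapped values are preserved.
    schemes = {"model.layers.0.self_attn.qkv_proj": "W8A8"}
    assert mapper.apply_dict(schemes) == {
        "language_model.model.layers.0.self_attn.qkv_proj": "W8A8",
    }
```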
@kylesayrs it looks like there is a related failure in the quantization test.
… as nullable, workaround TransformersForCausalLM Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Head branch was pushed to by a user without write access
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
@mgoin This is good to go, I needed to fix some edge cases with QuantConfigs not calling super().__init__() and TransformersForCausalLM.
…onfig-with-mappings
This pull request has merge conflicts that must be resolved before it can be merged.
…onfig-with-mappings
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
LGTM! FYI @jeejeelee @Isotr0py
…pper` (vllm-project#20046) Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Purpose
Background
When the quantization config is produced, it has an ignore list that matches the HF model structure. However, the HF model structure is not guaranteed to match the vLLM model structure, which can lead to mismatched mappings.
This PR provides an interface for the `hf_to_vllm_mapper` to update the mappings in the quantization config.
Changes
- Add an `apply_vllm_mapper` method on quantization configs
- From `configure_quant_config` or the `SupportsQuant` mixin, use the `hf_to_vllm_mapper` to update quantization config attributes such as the ignore list in order to correctly match against vLLM module prefixes
- Add `SupportsQuant` to qwen_2_5_vl (see the sketch after this list)
- Implement `apply_vllm_mapper` for compressed tensors as well as fp8 formats
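For reference, the model-side opt-in described in the list above looks roughly like the following. This is a simplified sketch: the import paths are assumptions, the class is not a working model, and the prefixes are illustrative rather than the actual Qwen2.5-VL mapping.

```python
# Simplified sketch of a model opting into the mechanism described above.
# Import paths, prefixes, and the class body are illustrative only.
from vllm.model_executor.models.interfaces import SupportsQuant
from vllm.model_executor.models.utils import WeightsMapper


class SketchVLForConditionalGeneration(SupportsQuant):
    # Declared as a class variable so that, during model initialization, the
    # SupportsQuant mixin can call quant_config.apply_vllm_mapper(...) with it,
    # remapping ignore lists and scheme maps to vLLM module prefixes before
    # quantized layers are constructed.
    hf_to_vllm_mapper = WeightsMapper(
        orig_to_new_prefix={
            "model.": "language_model.model.",
            "lm_head.": "language_model.lm_head.",
        })
```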
Testing
Run the `examples/offline_inference/vision_language.py` example with a truncated tokenizer model. The above script fails on main due to quantization being applied to the vision tower, but succeeds with these changes.
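For a quick end-to-end check in the same spirit as the example script, an offline-inference snippet like the following can be used. The checkpoint name is a placeholder for a quantized Qwen2.5-VL model whose vision tower appears in the quantization ignore list:

```python
# Minimal offline-inference smoke test; the checkpoint name is a placeholder.
from vllm import LLM, SamplingParams

# Loading fails on main when the ignore list still uses HF prefixes and the
# vision tower gets quantized; with this PR the prefixes are remapped first.
llm = LLM(model="path/to/quantized-qwen2.5-vl-checkpoint", max_model_len=4096)

outputs = llm.generate(
    ["Describe the purpose of a weights mapper in one sentence."],
    SamplingParams(max_tokens=64),
)
for output in outputs:
    print(output.outputs[0].text)
```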