Remove incorrect GLM-4 quantization code #21435
Conversation
Code Review
This pull request aims to fix an issue with quantization for the GLM-4.5 MoE model by removing the quant_config parameter from the VocabParallelEmbedding layer. While this may resolve issues for certain quantization methods, I've identified a critical issue where this change will break support for GGUF-quantized models. My review includes a detailed explanation of the problem and suggests a more robust, conditional approach to ensure all supported quantization formats continue to work correctly.
```diff
 self.embed_tokens = VocabParallelEmbedding(
     config.vocab_size,
     config.hidden_size,
-    quant_config=quant_config,
     prefix=f"{prefix}.embed_tokens")
```
This change unconditionally removes quant_config from the VocabParallelEmbedding layer, which will disable quantization for token embeddings. While this might be the intended fix for certain quantization methods (e.g., GPTQ, AWQ), it will break support for others that rely on quantizing the embedding layer, such as GGUF.
When quant_config is not provided, VocabParallelEmbedding defaults to UnquantizedEmbeddingMethod. This method does not create the necessary parameters (like qweight) for GGUF, which will lead to failures during weight loading for GGUF-quantized GLM-4.5 models.
This is a critical issue as it silently disables a supported quantization format for this model.
A more robust solution would be to conditionally pass quant_config based on the quantization method. For instance, you could check if quant_config.get_name() == 'gguf' and only pass the config in that case, preserving the fix for other methods while maintaining GGUF compatibility.
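A minimal sketch of that conditional approach, assuming the surrounding model constructor and that `quant_config` follows vLLM's `QuantizationConfig` interface (exact attribute and variable names here are illustrative, not the final patch):

```python
# Sketch only: pass quant_config to the embedding layer solely for GGUF,
# which needs quantized embedding parameters (e.g. qweight) at load time.
# Other quantization methods fall back to the unquantized embedding path.
use_quant_for_embedding = (
    quant_config is not None and quant_config.get_name() == "gguf"
)

self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config if use_quant_for_embedding else None,
    prefix=f"{prefix}.embed_tokens",
)
```

This keeps the fix for methods such as GPTQ and AWQ (which should not quantize the embedding) while preserving weight loading for GGUF-quantized GLM-4.5 checkpoints.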
Next time please sign off your commits using `git commit -s`.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge. 🚀
Head branch was pushed to by a user without write access
This PR does not need to be merged first, as we have found some potential issues in the quantized model.
This PR can be merged first; we haven't found any FC issues yet. If we find and fix them, we will submit a new PR.
Signed-off-by: x22x22 <wadeking@qq.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
need to remove this line