Fix GGUF loader for Qwen3 MoE. #22785
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
Code Review
This pull request correctly adds support for Qwen2 and Qwen3 MoE models in the GGUF loader and fixes an embed_tokens layer initialization issue. The changes appear to be correct and well-tested. My main feedback is regarding code duplication in the GGUF loader, which could be refactored to improve long-term maintainability.
```python
if model_type in ("qwen2_moe", "qwen3_moe"):
    model_type = model_type.replace("_", "")
    # GGUF layer map assumes that we will have merged expert weights,
    # so we need to map them manually
    for idx in range(config.num_hidden_layers):
        gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
            f"model.layers.{idx}.mlp.experts.0.down_proj.weight"
        gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
            f"model.layers.{idx}.mlp.experts.0.gate_proj.weight"
        gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
            f"model.layers.{idx}.mlp.experts.0.up_proj.weight"
```
This block for handling Qwen MoE models introduces significant code duplication with the existing logic for DeepSeek MoE models just above. While functionally correct, this pattern makes the code harder to maintain. If a change is needed for expert weight mapping, it would have to be applied in multiple places. Please consider refactoring to share the common logic between these model types.
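One possible shape for such a refactor (a sketch only; the helper and dict names below are hypothetical, not part of the PR):

```python
# Hypothetical helper shared by the DeepSeek and Qwen MoE branches of the
# GGUF loader; the names below are illustrative, not from the PR.
_MERGED_EXPERT_PROJS = {
    "ffn_down_exps": "down_proj",
    "ffn_gate_exps": "gate_proj",
    "ffn_up_exps": "up_proj",
}

def _add_merged_expert_mappings(gguf_to_hf_name_map: dict[str, str],
                                num_hidden_layers: int) -> None:
    """Map GGUF merged-expert tensors to the first expert's HF names."""
    for idx in range(num_hidden_layers):
        for gguf_suffix, hf_proj in _MERGED_EXPERT_PROJS.items():
            gguf_to_hf_name_map[f"blk.{idx}.{gguf_suffix}.weight"] = (
                f"model.layers.{idx}.mlp.experts.0.{hf_proj}.weight")
```

Both the DeepSeek and Qwen branches could then call the helper with their respective layer counts, keeping the name pattern in one place.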
Isotr0py left a comment
Thanks!
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
May I ask which version you are using? We tried but could not load it properly.
Try using the nightly wheel; this PR hasn't been included in the newest release yet.
We were using the latest main-branch code (pip install -e .). Just now I tried "pip uninstall vllm" and then "pip install -U vllm", but still no luck loading Qwen3 MoE GGUF. Both the 30B and the 480B versions terminate with this error, which appears at the very beginning: "ERROR 08-18 05:01:32 [config.py:133] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/dev/shm/Qwen3-30B-A3B-Q4_K_M.gguf'. Use"
Perhaps you forgot to pull the main branch? This bug was fixed by #21579 three weeks ago...
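For anyone hitting the error above, a minimal loading sketch, assuming an up-to-date main-branch install that includes this PR and #21579 (the tokenizer repo below is an assumption, since GGUF files do not bundle a Hugging Face tokenizer):

```python
from vllm import LLM, SamplingParams

# Sketch only: the local path is the one from the error message above, and the
# tokenizer repo (Qwen/Qwen3-30B-A3B) is an assumption; the GGUF file itself
# ships no Hugging Face tokenizer, so the original repo is passed explicitly.
llm = LLM(
    model="/dev/shm/Qwen3-30B-A3B-Q4_K_M.gguf",
    tokenizer="Qwen/Qwen3-30B-A3B",
)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```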
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
One more question: I also got the GGUF format running, but the inference output is garbled. What could be the cause? Is your inference output normal? I have tried many GGUF versions and it still does not work.
The output is garbled, just meaningless content, or it keeps printing iiiiiiiiiiiiiii without stopping. The server side does not exit with an error.
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
Purpose
Despite upstream repositories (gguf-py, transformers) having added support for Qwen3 MoE GGUF quantization, vLLM GGUF loading is still broken.
This PR aims to fix the GGUF loader and a mismatch in the Qwen3 MoE model's embed_tokens layer.
Test Plan
Test Result
Successfully loaded the GGUF model.
References