
Conversation

Contributor

@Gh0u1L5 commented Aug 13, 2025

Purpose

Despite upstream repositories (gguf-py, transformers) having added support for Qwen3 MoE GGUF quantization, vLLM GGUF loading is still broken.
This PR aims to fix the GGUF loader and a mismatch in the Qwen3 MoE model's embed_tokens layer.

Test Plan

wget https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF/resolve/main/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf
vllm serve ./Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf

Test Result

Successfully loaded the GGUF model.
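For completeness, a quick way to sanity-check the served model beyond load time (a sketch only, assuming the default localhost:8000 endpoint of vllm serve and that the server exposes the GGUF path as the model name):

import requests

# Send a tiny completion request to the OpenAI-compatible server started by
# `vllm serve` above; the "model" field must match the served model name.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "./Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf",
        "prompt": "Hello, my name is",
        "max_tokens": 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])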

References

@Gh0u1L5 Gh0u1L5 requested a review from sighingnow as a code owner August 13, 2025 05:35
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the qwen (Related to Qwen models) label Aug 13, 2025
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly adds support for Qwen2 and Qwen3 MoE models in the GGUF loader and fixes an embed_tokens layer initialization issue. The changes appear to be correct and well-tested. My main feedback is regarding code duplication in the GGUF loader, which could be refactored to improve long-term maintainability.

Comment on lines +77 to +87
if model_type in ("qwen2_moe", "qwen3_moe"):
    model_type = model_type.replace("_", "")
    # GGUF layer map assumes that we will have a merged expert weights
    # so we need to map them manually
    for idx in range(config.num_hidden_layers):
        gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
            f"model.layers.{idx}.mlp.experts.0.down_proj.weight"
        gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
            f"model.layers.{idx}.mlp.experts.0.gate_proj.weight"
        gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
            f"model.layers.{idx}.mlp.experts.0.up_proj.weight"
Contributor

high

This block for handling Qwen MoE models introduces significant code duplication with the existing logic for DeepSeek MoE models just above. While functionally correct, this pattern makes the code harder to maintain. If a change is needed for expert weight mapping, it would have to be applied in multiple places. Please consider refactoring to share the common logic between these model types.
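A possible shape for such a refactor (a sketch only; the helper name _map_merged_expert_weights is hypothetical, not part of the loader):

# Hypothetical shared helper: both the DeepSeek and Qwen MoE branches could
# call this instead of repeating the per-layer mapping loop.
def _map_merged_expert_weights(gguf_to_hf_name_map, num_hidden_layers):
    for idx in range(num_hidden_layers):
        for gguf_proj, hf_proj in (
            ("ffn_down_exps", "down_proj"),
            ("ffn_gate_exps", "gate_proj"),
            ("ffn_up_exps", "up_proj"),
        ):
            gguf_to_hf_name_map[f"blk.{idx}.{gguf_proj}.weight"] = \
                f"model.layers.{idx}.mlp.experts.0.{hf_proj}.weight"

# The Qwen MoE branch above would then reduce to:
# if model_type in ("qwen2_moe", "qwen3_moe"):
#     model_type = model_type.replace("_", "")
#     _map_merged_expert_weights(gguf_to_hf_name_map, config.num_hidden_layers)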

@jeejeelee jeejeelee requested a review from Isotr0py August 13, 2025 05:45
Member

@Isotr0py Isotr0py left a comment


Thanks!

@Isotr0py Isotr0py enabled auto-merge (squash) August 13, 2025 11:07
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 13, 2025
@vllm-bot vllm-bot merged commit b159c0a into vllm-project:main Aug 13, 2025
48 of 55 checks passed
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
@yanmindi

Which version are you using? We've tried but still can't load it properly.

@Isotr0py
Member

Which version are you using? We've tried but still can't load it properly.

Try using the nightly wheel; this PR hasn't been included in the newest release yet.

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

@adonishong

Which version are you using? We've tried but still can't load it properly.

Try using the nightly wheel; this PR hasn't been included in the newest release yet.

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

We were using the latest main branch code (pip install -e .). Just now I uninstalled vllm and reinstalled with:

pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly

Still no luck loading the Qwen3 MoE GGUF, for both the 30B and the 480B versions.

Both terminate with this error:

  File "vllm/transformers_utils/config.py", line 623, in get_sentence_transformer_tokenizer_config
    if not encoder_dict and not model.startswith("/"):
                                ^^^^^^^^^^^^^^^^
AttributeError: 'PosixPath' object has no attribute 'startswith'

This error can also be observed at the very beginning:

ERROR 08-18 05:01:32 [config.py:133] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/dev/shm/Qwen3-30B-A3B-Q4_K_M.gguf'. Use repo_type argument if needed., retrying 1 of 2
ERROR 08-18 05:01:34 [config.py:131] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/dev/shm/Qwen3-30B-A3B-Q4_K_M.gguf'. Use repo_type argument if needed.

@Isotr0py
Member

We were using the latest main branch code

AttributeError: 'PosixPath' object has no attribute 'startswith'

Perhaps you forgot to pull the main branch? This bug was fixed by #21579 three weeks ago...
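For reference, the traceback above comes from calling a str-only method on a pathlib.Path object; an illustration of the failure mode and the usual remedy (not necessarily the exact change applied in #21579):

from pathlib import Path

model = Path("/dev/shm/Qwen3-30B-A3B-Q4_K_M.gguf")

# model.startswith("/")            # raises AttributeError: 'PosixPath' object has no attribute 'startswith'
print(str(model).startswith("/"))  # True: convert to str before calling string methods
print(model.is_absolute())         # True: or use the pathlib equivalent directly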

yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
divakar-amd pushed a commit to divakar-amd/vllm_upstream that referenced this pull request Aug 20, 2025
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
@yanmindi

One more question: I also got the GGUF running, but the inference output is garbled. What could be the reason? Is your inference output normal? I've tried many GGUF versions and it still doesn't work.

@yanmindi

The output is garbled, just meaningless content, or it keeps emitting "iiiiiiiiiiiiiii" endlessly; the server side doesn't report an error or exit.

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Gh0u1L5 <Gh0u1L5@outlook.com>