[Model] Deepseek GGUF support #13167
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
# The GGUF layer map assumes that we will have merged expert weights,
# so we need to map them manually.
for idx in range(config.num_hidden_layers):
    gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = \
        f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.down_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.gate_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.up_proj.weight"
I think we can try to avoid this manual mapping for each weight in MoE; perhaps you can refer to how transformers handles GGUF MoE weight name mapping:
https://github.com/huggingface/transformers/blob/847854b023a637caa18e6860dc2bdd47f7c05eb5/src/transformers/modeling_gguf_pytorch_utils.py#L314-L317
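For illustration only, a minimal sketch of that pattern-based idea (this is not the transformers or vLLM implementation; the helper name and table are made up), which replaces the per-layer loop with a single regex rewrite:

```python
import re
from typing import Optional

# Hypothetical pattern table: GGUF expert tensors -> HF-style names.
# "$EXP_ID$" is kept as a placeholder, matching the convention used in this PR.
_GGUF_MOE_PATTERNS = [
    (re.compile(r"blk\.(\d+)\.ffn_down_exps\.weight"),
     r"model.layers.\1.mlp.experts.$EXP_ID$.down_proj.weight"),
    (re.compile(r"blk\.(\d+)\.ffn_gate_exps\.weight"),
     r"model.layers.\1.mlp.experts.$EXP_ID$.gate_proj.weight"),
    (re.compile(r"blk\.(\d+)\.ffn_up_exps\.weight"),
     r"model.layers.\1.mlp.experts.$EXP_ID$.up_proj.weight"),
    (re.compile(r"blk\.(\d+)\.exp_probs_b\.bias"),
     r"model.layers.\1.mlp.gate.e_score_correction_bias"),
]

def map_gguf_moe_name(gguf_name: str) -> Optional[str]:
    """Translate a GGUF expert-tensor name to its HF-style counterpart."""
    for pattern, template in _GGUF_MOE_PATTERNS:
        if pattern.fullmatch(gguf_name):
            return pattern.sub(template, gguf_name)
    return None  # not an MoE expert tensor
```

For example, `map_gguf_moe_name("blk.3.ffn_up_exps.weight")` would return `"model.layers.3.mlp.experts.$EXP_ID$.up_proj.weight"`.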
Yeah the problem is that the weight loader expects the experts to be passed in one by one, trying to overcome it atm
Okay, managed to add an option to load full expert weights at once to fused MoE. Still using the experts.0 mapping because this is what deepseek_v2::load_weights expects, not sure if that's an issue.
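A rough sketch of that option, with hypothetical names and a simplified signature (the real FusedMoE weight loader in vLLM differs): the merged GGUF tensor carries all experts stacked along dim 0 and can be copied in a single call, instead of one call per expert.

```python
import torch

def fused_moe_weight_loader(param: torch.nn.Parameter,
                            loaded_weight: torch.Tensor,
                            expert_id: int,
                            load_full_experts: bool = False) -> None:
    """Hypothetical sketch of a fused-MoE weight loader.

    With load_full_experts=True, loaded_weight carries every expert stacked
    along dim 0 (as in a merged GGUF expert tensor) and is copied in one call;
    otherwise it carries a single expert's weight for slot `expert_id`.
    """
    if load_full_experts:
        # loaded_weight: [num_experts, out_dim, in_dim]
        assert loaded_weight.shape == param.data.shape, "shape mismatch"
        param.data.copy_(loaded_weight)
    else:
        # loaded_weight: [out_dim, in_dim] for a single expert.
        param.data[expert_id].copy_(loaded_weight)
```

With something like this in place, load_weights can keep keying the merged tensor under the experts.0 name while passing the whole stack through in one call.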
I followed your instructions but I got an error.
Which of the quantized models are you trying to load?
Just tested and it works for me in a freshly checked out repo. Are you sure that you merged the GGUF weights into one file? Could you share the scripts you are testing with?
Met an error:
(base) ubuntu@localhost:/media/data/xgp/scripts$ pip show vllm
base_dir="/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE"
/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/
@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:
Could you show your detailed environment?
Is this faster than llama.cpp for the unsloth quants? The llama.cpp version is also very unoptimized; the GPUs sit mostly idle. Very eager to see it running on vLLM.
When will this be merged?
You are right, it has something to do with the vllm version and this PR's environment. Thank you.
I tried to reproduce this PR and got the same error as @seven1122:
[rank0]: File "/home/X/new-vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/X/new-vllm/vllm/worker/model_runner.py", line 1112, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/loader.py", line 1320, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/X/new-vllm/vllm/model_executor/models/deepseek_v2.py", line 808, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.embed_tokens.qweight_type'

The checkpoint I used is DeepSeek-R1-UD-IQ1_S. I merged the multiple .gguf files into a single one with:

./llama-gguf-split --merge ~/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf single.gguf

The path ~/DeepSeek-R1-UD-IQ1_S includes: DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf
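When debugging loads like this, it can help to confirm what actually ended up in the merged file. A small sketch using the `gguf` Python package (the gguf-py package that ships with llama.cpp); the file name is the one from the merge command above and is otherwise an assumption:

```python
from gguf import GGUFReader  # pip install gguf

# Open the merged single-file checkpoint (adjust the path to your setup).
reader = GGUFReader("single.gguf")

# Print a few tensor names, quantization types, and shapes to verify that the
# merge produced the tensors the loader expects.
for tensor in reader.tensors[:10]:
    print(tensor.name, tensor.tensor_type, tensor.shape)
```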
I got the same error as @seven1122.
@leolmj @seven1122 @zh-jp I'm having trouble reproducing the issue, could you share:
Hello @SzymonOzog.
@zh-jp
I want to run this, but unfortunately I only have 14x 3090 GPUs, so for tensor parallelism I need another 2 GPUs to get to 16. It would be great to see any kind of benchmarks on this compared to llama.cpp. Thank you!
@SzymonOzog thanks for your valuable suggestions. I built vllm from
Do you have any performance benchmarks?
@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at a minimum?
INFO 02-19 22:08:59 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.

Based on 8 x A100 GPUs, it's showing around 7 tokens/s.
@joshuakoh1
@SzymonOzog hello, I encountered some issues while loading DeepSeek-R1-UD-IQ1_S:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto
Me too. Have you solved this problem?
I think this is happening because I-quants use a very slow MoE implementation, and with a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.
@SzymonOzog Is there a solution to this problem?
Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue.
Thank you for your reply, looking forward to the code being merged.
After loading, the following problem occurred; I saw someone report this issue before. If I don't change the I-quants, how should I deal with this problem?
For now you can run with
Thank you. Will the next version of vllm solve this problem?
That depends on when the PR gets merged into main.
Hello, could you provide a Docker image URL? The network here is not good and docker build always fails.
When I set max_model_len to 8192, the service crashes when it starts. Error log:
When I set max_model_len to 8192, the specific parameter information for the command that reports the error is as follows:
@SzymonOzog
@SzymonOzog The new DeepSeek-R1-0528-UD-Q2_K_XL gguf files have removed blk.0.attn_kv_b.weight and added blk.0.attn_k_b.weight and blk.0.attn_v_b.weight. This change prevents us from loading the model correctly. How can we address this issue?
@ChuanhongLi I think you should be able to get around it by modifying
Thanks for your reply, but the problem may come from kv_b_proj_weight = get_and_maybe_dequant_weights(self.kv_b_proj).T (line 703, vllm/v1/attention/backends/mla/common.py); see #19050 (comment).
@SzymonOzog Hi, do you have any idea how to fix it? Should we load attn_k_b and attn_v_b to produce the kv_b weight? Appreciate your help, thanks!
I’m facing the same problem. I made a few changes and finally got vLLM to start, but the output is gibberish. Has anyone figured out a solution?
As of now I don't have a ready solution for this. I'll try to find some time to debug the issue over the weekend.
Any progress here? We are also stuck.
I'm hitting the same problem.
This adds support for quantized deepseek versions from Unsloth:
Currently Hugging Face does not support DeepSeek GGUF, so I added an option to specify an override path from which we can read the correct config.
To run at the moment one needs to:
When initializing our DeepSeek model we need to pass the paths to our Hugging Face config and tokenizer:
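As a sketch, an offline-inference equivalent of the serve command used earlier in this thread might look like the following; the paths are placeholders, and the `hf_config_path` keyword is an assumption based on the `--hf-config-path` flag introduced by this PR:

```python
from vllm import LLM, SamplingParams

# Hypothetical paths: the merged single-file GGUF plus a directory holding the
# original DeepSeek Hugging Face config and tokenizer files.
llm = LLM(
    model="/models/DeepSeek-R1-UD-IQ1_S/merged_file.gguf",
    tokenizer="/models/DeepSeek-R1/config_file",
    hf_config_path="/models/DeepSeek-R1/config_file",  # override added by this PR (assumed kwarg name)
    tensor_parallel_size=8,
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32)))
```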
Current issues:
- Model loading is very slow as we load experts one by one (Fixed)
- GGUF MoE is a very naive implementation and is very slow
I plan to continue working on solving the aforementioned issues; I can do this in this PR or in future ones. Sharing already because there seems to be demand for running this.
Closes #12436