[Model] Deepseek GGUF support #13167
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
# The GGUF layer map assumes that we will have merged expert weights,
# so we need to map them manually.
for idx in range(config.num_hidden_layers):
    gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = \
        f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.down_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.gate_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.up_proj.weight"
I think we can try to avoid this manual mapping for each weight in MoE; perhaps you can refer to how transformers handles GGUF MoE weight name mapping:
https://github.com/huggingface/transformers/blob/847854b023a637caa18e6860dc2bdd47f7c05eb5/src/transformers/modeling_gguf_pytorch_utils.py#L314-L317
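For illustration only, a minimal sketch of that pattern-based idea (this is not the transformers or vLLM implementation; the helper name and table are made up), which replaces the per-layer loop with a single regex rewrite:

```python
import re
from typing import Optional

# Hypothetical pattern table: GGUF expert tensors -> HF-style names.
# "$EXP_ID$" is kept as a placeholder, matching the convention used in this PR.
_GGUF_MOE_PATTERNS = [
    (re.compile(r"blk\.(\d+)\.ffn_down_exps\.weight"),
     r"model.layers.\1.mlp.experts.$EXP_ID$.down_proj.weight"),
    (re.compile(r"blk\.(\d+)\.ffn_gate_exps\.weight"),
     r"model.layers.\1.mlp.experts.$EXP_ID$.gate_proj.weight"),
    (re.compile(r"blk\.(\d+)\.ffn_up_exps\.weight"),
     r"model.layers.\1.mlp.experts.$EXP_ID$.up_proj.weight"),
    (re.compile(r"blk\.(\d+)\.exp_probs_b\.bias"),
     r"model.layers.\1.mlp.gate.e_score_correction_bias"),
]

def map_gguf_moe_name(gguf_name: str) -> Optional[str]:
    """Translate a GGUF expert-tensor name to its HF-style counterpart."""
    for pattern, template in _GGUF_MOE_PATTERNS:
        if pattern.fullmatch(gguf_name):
            return pattern.sub(template, gguf_name)
    return None  # not an MoE expert tensor
```

For example, `map_gguf_moe_name("blk.3.ffn_up_exps.weight")` would return `"model.layers.3.mlp.experts.$EXP_ID$.up_proj.weight"`.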
Yeah the problem is that the weight loader expects the experts to be passed in one by one, trying to overcome it atm
Okay, managed to add an option to load full expert weights at once to fused MoE. Still using the experts.0 mapping because this is what deepseek_v2::load_weights expects, not sure if that's an issue.
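A rough sketch of that option, with hypothetical names and a simplified signature (the real FusedMoE weight loader in vLLM differs): the merged GGUF tensor carries all experts stacked along dim 0 and can be copied in a single call, instead of one call per expert.

```python
import torch

def fused_moe_weight_loader(param: torch.nn.Parameter,
                            loaded_weight: torch.Tensor,
                            expert_id: int,
                            load_full_experts: bool = False) -> None:
    """Hypothetical sketch of a fused-MoE weight loader.

    With load_full_experts=True, loaded_weight carries every expert stacked
    along dim 0 (as in a merged GGUF expert tensor) and is copied in one call;
    otherwise it carries a single expert's weight for slot `expert_id`.
    """
    if load_full_experts:
        # loaded_weight: [num_experts, out_dim, in_dim]
        assert loaded_weight.shape == param.data.shape, "shape mismatch"
        param.data.copy_(loaded_weight)
    else:
        # loaded_weight: [out_dim, in_dim] for a single expert.
        param.data[expert_id].copy_(loaded_weight)
```

With something like this in place, load_weights can keep keying the merged tensor under the experts.0 name while passing the whole stack through in one call.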
I followed your instructions but I got an error.
Which of the quantized models are you trying to load?
Just tested and it works for me in a freshly checked out repo. Are you sure that you merged the GGUF weights into one file? Could you share the scripts you are testing with?
Met an error:
(base) ubuntu@localhost:/media/data/xgp/scripts$ pip show vllm
base_dir="/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE"
/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/
@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:
Could you show your detailed environment?
Is this faster than llama.cpp for the unsloth quants? The llama.cpp version is also very unoptimized; the GPUs sit mostly idle. Very eager to see it running on vLLM.
When will this be merged?
You are right, it has something to do with the vllm version and this PR's environment. Thank you.
I tried to reproduce this PR and got the same error as @seven1122:
[rank0]: File "/home/X/new-vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/X/new-vllm/vllm/worker/model_runner.py", line 1112, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/loader.py", line 1320, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/X/new-vllm/vllm/model_executor/models/deepseek_v2.py", line 808, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.embed_tokens.qweight_type'

The checkpoint I used is DeepSeek-R1-UD-IQ1_S. I merged the multiple .gguf files into a single one with:

./llama-gguf-split --merge ~/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf single.gguf

The path ~/DeepSeek-R1-UD-IQ1_S includes: DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf
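When debugging loads like this, it can help to confirm what actually ended up in the merged file. A small sketch using the `gguf` Python package (the gguf-py package that ships with llama.cpp); the file name is the one from the merge command above and is otherwise an assumption:

```python
from gguf import GGUFReader  # pip install gguf

# Open the merged single-file checkpoint (adjust the path to your setup).
reader = GGUFReader("single.gguf")

# Print a few tensor names, quantization types, and shapes to verify that the
# merge produced the tensors the loader expects.
for tensor in reader.tensors[:10]:
    print(tensor.name, tensor.tensor_type, tensor.shape)
```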
I got the same error as @seven1122.
@leolmj @seven1122 @zh-jp I'm having trouble reproducing the issue, could you share:
Hello @SzymonOzog.
@zh-jp
I want to run this, but unfortunately I only have 14x 3090 GPUs, so for tensor parallelism I need another 2 GPUs to get to 16. It would be great to see any kind of benchmarks on this compared to llama.cpp. Thank you!
@SzymonOzog thanks for your valuable suggestions. I built vllm from
Do you have any performance benchmarks?
@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at a minimum?
INFO 02-19 22:08:59 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.

Based on 8 x A100 GPUs, it's showing around 7 tokens/s.
@joshuakoh1
@SzymonOzog hello, I encountered some issues while loading DeepSeek-R1-UD-IQ1_S:

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto
Me too. Have you solved this problem?
I think this is happening because I-quants use a very slow MoE implementation, and with a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.
@SzymonOzog Is there a solution to this problem?
Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue.
Thank you for your reply, looking forward to the code being merged.
After loading, the following problem occurred; I saw someone report this issue before. If I don't change the I-quants, how should I deal with this problem?
For now you can run with
Thank you. Will the next version of vllm solve this problem?
That depends on when the PR gets merged into main.
Hello, could you provide a Docker image URL? The network here is not good and docker build always fails.
When I set max_model_len to 8192, the service crashes when it starts. Error log:
When I set max_model_len to 8192, the specific parameter information for the command that reports the error is as follows:
@SzymonOzog
@SzymonOzog The new DeepSeek-R1-0528-UD-Q2_K_XL gguf files have removed blk.0.attn_kv_b.weight and added blk.0.attn_k_b.weight and blk.0.attn_v_b.weight. This change prevents us from loading the model correctly. How can we address this issue?
@ChuanhongLi I think you should be able to get around it by modifying
Thanks for your reply, but the problem may come from kv_b_proj_weight = get_and_maybe_dequant_weights(self.kv_b_proj).T (line 703, vllm/v1/attention/backends/mla/common.py); see #19050 (comment).
@SzymonOzog Hi, do you have any idea how to fix it? Should we load attn_k_b and attn_v_b to produce the kv_b weight? Appreciate your help, thanks!
I’m facing the same problem. I made a few changes and finally got vLLM to start, but the output is gibberish. Has anyone figured out a solution?
As of now I don't have a ready solution for this. I'll try to find some time to debug the issue over the weekend.
Any progress here? We are also stuck.
I'm hitting the same problem.
This adds support for quantized deepseek versions from Unsloth:
Currently Hugging Face does not support DeepSeek GGUF, so I added an option to specify an override path from which we can read the correct config.
To run at the moment one needs to:
When initializing our DeepSeek model we need to pass the paths to our Hugging Face config and tokenizer:
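As a sketch, an offline-inference equivalent of the serve command used earlier in this thread might look like the following; the paths are placeholders, and the `hf_config_path` keyword is an assumption based on the `--hf-config-path` flag introduced by this PR:

```python
from vllm import LLM, SamplingParams

# Hypothetical paths: the merged single-file GGUF plus a directory holding the
# original DeepSeek Hugging Face config and tokenizer files.
llm = LLM(
    model="/models/DeepSeek-R1-UD-IQ1_S/merged_file.gguf",
    tokenizer="/models/DeepSeek-R1/config_file",
    hf_config_path="/models/DeepSeek-R1/config_file",  # override added by this PR (assumed kwarg name)
    tensor_parallel_size=8,
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32)))
```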
Current issues:
- Model loading is very slow as we load experts one by one (Fixed)
- GGUF MoE is a very naive implementation and is very slow
I plan to continue working on solving the aforementioned issues; I can do this in this PR or in future ones. Sharing already because there seems to be demand for running this.
Closes #12436