Conversation

SzymonOzog
Contributor

@SzymonOzog SzymonOzog commented Feb 12, 2025

This adds support for quantized DeepSeek versions from Unsloth:

Currently Hugging Face does not support DeepSeek GGUF, so I added an option to specify an override path from which we can read the correct config.

To run this at the moment, one needs to pass the paths to the Hugging Face config and tokenizer when initializing the DeepSeek model:

    from vllm import LLM, SamplingParams
    llm = LLM(model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf",
              tokenizer="/YOUR_PATH/DeepSeek_Unsloth",
              hf_config_path="/YOUR_PATH/DeepSeek_Unsloth",
              enforce_eager=True, tensor_parallel_size=4, trust_remote_code=True,
              max_model_len=10000)
    sampling_params = SamplingParams(temperature=0.5, max_tokens=2000)


    def print_outputs(outputs):
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print(f"Prompt: {prompt!r}, Generated text\n: {generated_text}")
        print("-" * 80)
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Why did the Roman Empire fall?",
        },
    ]
    outputs = llm.chat(conversation,
                       sampling_params=sampling_params,
                       use_tqdm=False)
    print_outputs(outputs)
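
The same setup can also be launched from the CLI; a sketch of the equivalent vllm serve invocation, using the --hf-config-path option this PR adds (paths are the same placeholders as above, flags mirror the Python arguments):

    vllm serve /YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf \
        --tokenizer /YOUR_PATH/DeepSeek_Unsloth \
        --hf-config-path /YOUR_PATH/DeepSeek_Unsloth \
        --enforce-eager \
        --tensor-parallel-size 4 \
        --trust-remote-code \
        --max-model-len 10000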

Current issues:
  • Model loading is very slow as we load experts one by one (Fixed)
  • GGUF MoE is a very naive implementation and is very slow

I plan to continue working on the aforementioned issues; I can do this in this PR or in future ones. I'm sharing already because there seems to be demand for running this.

Closes #12436

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which starts a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Comment on lines 1266 to 1276
# GGUF layer map assumes that we will have a merged expert weights
# so we need to map them manually
for idx in range(config.num_hidden_layers):
gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = \
f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
f"model.layers.{idx}.mlp.experts.$EXP_ID$.down_proj.weight"
gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
f"model.layers.{idx}.mlp.experts.$EXP_ID$.gate_proj.weight"
gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
f"model.layers.{idx}.mlp.experts.$EXP_ID$.up_proj.weight"
Member

I think we can try to avoid this manual mapping for each weight in the MoE; perhaps you can refer to how transformers handles GGUF MoE weight name mapping:
https://github.com/huggingface/transformers/blob/847854b023a637caa18e6860dc2bdd47f7c05eb5/src/transformers/modeling_gguf_pytorch_utils.py#L314-L317
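
For illustration only (this is neither the transformers implementation nor this PR's final code): a minimal sketch of expanding one merged GGUF expert tensor into the per-expert HF names produced by the $EXP_ID$ templates above. The function name and the assumption that experts are stacked along dim 0 are assumptions, not code from either project:

    def expand_merged_experts(gguf_name, tensor, gguf_to_hf_name_map, num_experts):
        """Yield (hf_name, weight) pairs for a single GGUF tensor."""
        hf_template = gguf_to_hf_name_map[gguf_name]
        if "$EXP_ID$" not in hf_template:
            # Ordinary tensor: plain one-to-one mapping.
            yield hf_template, tensor
            return
        # Merged expert tensor: assumed to hold one slice per expert along dim 0.
        for expert_id in range(num_experts):
            yield hf_template.replace("$EXP_ID$", str(expert_id)), tensor[expert_id]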

Contributor Author

Yeah, the problem is that the weight loader expects the experts to be passed in one by one; I'm trying to overcome that at the moment.

Contributor Author

Okay, I managed to add an option to load full expert weights at once into the fused MoE; still using the experts.0 mapping because this is what deepseek_v2::load_weights expects, not sure if that's an issue.
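
A rough illustrative sketch of the idea, not the PR's actual implementation (the function name and shapes are assumptions): when the fused-MoE parameter already holds all experts and the merged GGUF tensor has the same stacked shape, a single copy replaces one load call per expert.

    import torch

    def load_full_experts_at_once(param: torch.nn.Parameter,
                                  loaded_weight: torch.Tensor) -> None:
        # param: all experts stacked, e.g. [num_experts, out_dim, in_dim];
        # loaded_weight: one merged GGUF tensor (e.g. blk.N.ffn_down_exps.weight)
        # of the same shape, so it can be copied in one pass.
        assert param.data.shape == loaded_weight.shape
        param.data.copy_(loaded_weight)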

@junuMoon

I followed your instructions but got an error.
Though I'm still reading your PR, and it's great work 👍

(VllmWorkerProcess pid=379357) /home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/parameter.py:167: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3683.)
(VllmWorkerProcess pid=379357)   return super().__torch_function__(func, types, args, kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242] Traceback (most recent call last):
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/executor/multiproc_worker_utils.py", line 236, in _run_worker_process
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/utils.py", line 2224, in run_method
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     self.model_runner.profile_run()
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/model_runner.py", line 1234, in profile_run
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/model_runner.py", line 1345, in _dummy_run
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/model_runner.py", line 1718, in execute_model
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 677, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 633, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 560, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 159, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     shared_output = self.shared_experts(hidden_states)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 90, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     gate_up, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                  ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/layers/linear.py", line 400, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/layers/quantization/gguf.py", line 185, in apply
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     out = _fuse_mul_mat(x, qweight, qweight_type)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/layers/quantization/gguf.py", line 98, in _fuse_mul_mat
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return x @ qweight.T
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ~~^~~~~~~~~~~
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242] RuntimeError: size mismatch, got input (10000), mat (10000x7168), vec (0)

@SzymonOzog
Contributor Author

I followed your instructions but got an error. Though I'm still reading your PR, and it's great work 👍

Which of the quantized models are you trying to load?

@junuMoon

@SzymonOzog
Contributor Author

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S this one

Just tested and it works for me in a freshly checked-out repo. Are you sure that you merged the GGUF weights into one file? Could you share the scripts you are testing with?

@chuangzhidan

chuangzhidan commented Feb 14, 2025

I met an error:
(base) ubuntu@localhost:/media/data/scripts$ python start_gguf.py
INFO 02-14 10:44:20 init.py:190] Automatically detected platform cuda.
Traceback (most recent call last):
File "/media/data/xgp/scripts/start_gguf.py", line 3, in <module>
llm = LLM(
^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/vllm/utils.py", line 1051, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 212, in __init__
engine_args = EngineArgs(
^^^^^^^^^^^
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'hf_config_path'
Not sure where it went wrong.

(base) ubuntu@localhost:/media/data/xgp/scripts$ pip show vllm
Name: vllm
Version: 0.7.2

base_dir="/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE"
from vllm import LLM, SamplingParams
llm = LLM(
# model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf",
model="/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/DeepSeek-R1-UD-IQ1_S-merge.gguf",
tokenizer=base_dir,
hf_config_path=base_dir,
enforce_eager=True,
tensor_parallel_size=2,
trust_remote_code=True,
max_model_len=10000
)

/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/
-rw-rw-r-- 1 ubuntu ubuntu 1.6K Feb 14 10:38 config.json
-rw-rw-r-- 1 ubuntu ubuntu 11K Feb 14 10:37 configuration_deepseek.py
-rwxrwxrwx 1 root root 131G Feb 12 17:47 DeepSeek-R1-UD-IQ1_S-merge.gguf*
-rw-rw-r-- 1 ubuntu ubuntu 171 Feb 14 10:37 generation_config.json
-rw-rw-r-- 1 ubuntu ubuntu 74K Feb 14 10:37 modeling_deepseek.py
-rw-rw-r-- 1 ubuntu ubuntu 3.6K Feb 14 10:37 tokenizer_config.json
-rw-rw-r-- 1 ubuntu ubuntu 7.5M Feb 14 10:37 tokenizer.json

@SzymonOzog
Contributor Author

@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:

root@8b74d742fc51:~/vllm# pip show vllm
Name: vllm
Version: 0.7.3.dev4+ge152f295.precompiled

@zlh1992

zlh1992 commented Feb 15, 2025

@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:

root@8b74d742fc51:~/vllm# pip show vllm
Name: vllm
Version: 0.7.3.dev4+ge152f295.precompiled

Could you show your detailed environment?

@accupham

Is this faster than llama.cpp for the Unsloth quants? The llama.cpp version is also very unoptimized; the GPUs sit mostly idle. Very eager to see it running on vLLM.

@seven1122

When will this be merged?

@chuangzhidan

@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:

root@8b74d742fc51:~/vllm# pip show vllm
Name: vllm
Version: 0.7.3.dev4+ge152f295.precompiled

Could you show your detailed environment?

You are right, it has something to do with the vLLM version and this PR's environment. Thank you.

@seven1122

seven1122 commented Feb 18, 2025

I met a KeyError: 'model.embed_tokens.qweight_type'

Exception in worker VllmWorkerProcess while processing method load_model.
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line
output = run_method(worker, method, args, kwargs)

File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method
return func(*args, **kwargs)

File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in
self.model = get_model(vllm_config=self.vllm_config)

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py",
return loader.load_model(vllm_config=vllm_config)

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py",
model.load_weights()
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py",
param = params_dict[name]

KeyError: 'model.embed_tokens.qweight_type'

@7d1-z

7d1-z commented Feb 18, 2025

I tried to reproduce this PR and hit the same error as @seven1122.

[rank0]:   File "/home/X/new-vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/X/new-vllm/vllm/worker/model_runner.py", line 1112, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/X/new-vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/X/new-vllm/vllm/model_executor/model_loader/loader.py", line 1320, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/home/X/new-vllm/vllm/model_executor/models/deepseek_v2.py", line 808, in load_weights
[rank0]:     param = params_dict[name]
[rank0]:             ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.embed_tokens.qweight_type'

The checkpoint I used is DeepSeek-R1-UD-IQ1_S.

I merged the multiple .gguf files into a single one with:

./llama-gguf-split --merge ~/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf single.gguf

The path ~/DeepSeek-R1-UD-IQ1_S includes:

DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf  DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf  DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf

@leolmj

leolmj commented Feb 18, 2025

I hit the same error as @seven1122.

@SzymonOzog
Contributor Author

@leolmj @seven1122 @zh-jp

I'm having trouble reproducing the issue, could you share:

  • your config.json
  • your vllm version + the commit hash that you checked out
  • the script that you are running

@7d1-z

7d1-z commented Feb 18, 2025

Hello @SzymonOzog .

  • config.json
  • The version of vllm is 0.7.2, and I have replaced the files that were specified in this PR.
  • Regarding the script, I used the demo you provided and only changed the parameter model that passed in the .gguf file path.

@SzymonOzog
Contributor Author

SzymonOzog commented Feb 18, 2025

@zh-jp
You also need to change your dtype in the config from bfloat16 to float16. Also, could you check out this PR and run it through that? There have been changes in vLLM since 0.7.2, and I cannot promise backwards compatibility.
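
For reference, a minimal sketch of that change, assuming the dtype lives in the standard torch_dtype field of config.json (change it from "bfloat16"; everything else stays as is):

    {
        "torch_dtype": "float16"
    }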

@davidsyoung

I want to run this, but unfortunately I only have 14x 3090 GPUs, so for tensor parallelism I need another 2 GPUs to get to 16. It would be great to see any kind of benchmarks on this compared to llama.cpp. Thank you!

@7d1-z

7d1-z commented Feb 19, 2025

@SzymonOzog thanks for your valuable suggestions. I built vLLM from the deepseek-gguf branch of your vllm repo and successfully ran DeepSeek-R1-UD-IQ1_S on 4x NVIDIA A800-SXM4-80GB.

@davidsyoung

@SzymonOzog thanks for your valuable suggestions. I built vLLM from the deepseek-gguf branch of your vllm repo and successfully ran DeepSeek-R1-UD-IQ1_S on 4x NVIDIA A800-SXM4-80GB.

Do you have a benchmark of performance?

@slr1997

slr1997 commented Feb 19, 2025

@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at least?

@junuMoon

@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at least?

INFO 02-19 22:08:59 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.

Based on 8 x A100 GPUs, it's showing around 7 tokens/s

@SzymonOzog
Contributor Author

@joshuakoh1
How are you passing the hf_config_path variable? It's set to none in your logs

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
@lv03

lv03 commented Apr 16, 2025

@SzymonOzog hello, I encountered some issues while loading DeepSeek-R1-UD-IQ1_S

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto

It seems to be stuck here

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.04 Driver Version: 570.124.04 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:17:00.0 Off | Off |
| 49% 37C P0 78W / 450W | 33746MiB / 49140MiB | 36% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:3D:00.0 Off | Off |
| 62% 36C P0 84W / 450W | 33746MiB / 49140MiB | 36% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:63:00.0 Off | Off |
| 64% 36C P0 81W / 450W | 33746MiB / 49140MiB | 34% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 Off | 00000000:99:00.0 Off | Off |
| 62% 38C P0 94W / 450W | 33746MiB / 49140MiB | 33% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100 80GB PCIe Off | 00000000:AB:00.0 Off | 0 |
| N/A 43C P0 79W / 300W | 33795MiB / 81920MiB | 48% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 Off | 00000000:BD:00.0 Off | Off |
| 62% 36C P0 89W / 450W | 33746MiB / 49140MiB | 37% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 Off | 00000000:CF:00.0 Off | Off |
| 65% 40C P0 79W / 450W | 33746MiB / 49140MiB | 37% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100 80GB PCIe Off | 00000000:E1:00.0 Off | 0 |
| N/A 44C P0 85W / 300W | 33795MiB / 81920MiB | 49% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 124582 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 1 N/A N/A 124955 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 2 N/A N/A 124956 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 3 N/A N/A 124957 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 4 N/A N/A 124958 C ...niconda3/envs/vLLM/bin/python 33786MiB |
| 5 N/A N/A 124959 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 6 N/A N/A 124960 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 7 N/A N/A 124961 C ...niconda3/envs/vLLM/bin/python 33786MiB |
+-----------------------------------------------------------------------------------------+

@zhaotyer
Contributor

@SzymonOzog hello, I encountered some issues while loading DeepSeek-R1-UD-IQ1_S

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto

It seems to be stuck here.

Me too. Have you solved this problem?

@zhaotyer
Contributor

@joshuakoh1 How are you passing the hf_config_path variable? It's set to none in your logs

vLLM is stuck at "There is no support for fast MoE kernel for current quantization method. Falling back to slow implementation."
The vLLM version is 0.8.4.

@SzymonOzog
Contributor Author

I think this is happening because I-quants use a very slow implementation of MoE, and for a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.

@zhaotyer
Contributor

@joshuakoh1 How are you passing the hf_config_path variable? It's set to none in your logs

vLLM is stuck at "There is no support for fast MoE kernel for current quantization method. Falling back to slow implementation." The vLLM version is 0.8.4.

@SzymonOzog Is there a solution to this problem?

@SzymonOzog
Contributor Author

Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue.

@zhaotyer
Contributor

Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue.

Thank you for your reply; looking forward to the code being merged.

@lv03

lv03 commented Apr 17, 2025

I think this is happening because I-quants use a very slow implementation of MoE, and for a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.

After loading, the following problem occurred (see screenshot).

I saw someone report this issue before. If I don't change the I-quants, how should I deal with this problem?

@SzymonOzog
Contributor Author

For now you can run with enforce_eager=True, although it will be slow; the PR I mentioned above should also fix this issue.
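
In the vllm serve form from the earlier comment, that corresponds to adding the --enforce-eager flag (paths and other flags kept as in that command):

    vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ \
        --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 \
        --port 8000 --dtype auto --enforce-eager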

@lv03

lv03 commented Apr 17, 2025

For now you can run with enforce_eager=True, although it will be slow; the PR I mentioned above should also fix this issue.

Thank you. Will the next version of vLLM solve this problem?

@SzymonOzog
Contributor Author

That depends on when the PR gets merged into main.

@zhaotyer
Contributor

Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue.

Hello, could you provide a Docker image URL? The network here is not good and docker build always fails.

@zhaotyer
Contributor

I think this is happening because I-quants use a very slow implementation of MoE, and for a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.

When I set max_model_len to 8192, the service crashes when it starts.
I tested it on 2x A100 80GB and 8x L40S 45GB, and both showed errors.

vllm serve /models/DeepSeek-R1-UD-IQ1_S/merged_file.gguf -tp 2 --trust-remote-code --enforce-eager --trust-remote-code --tokenizer /models/DeepSeek-R1-UD-IQ1_S/ --hf-config-path /models/DeepSeek-R1-UD-IQ1_S/ --dtype bfloat16 --max-model-len 8192 --served-model-name atom --port 8160 --gpu-memory-utilization 0.95

Error log:

INFO 04-22 04:13:35 [model_runner.py:1146] Model loading took 66.5477 GiB and 216.481170 seconds
ERROR 04-22 04:13:40 [engine.py:448] CUDA error: invalid configuration argument
ERROR 04-22 04:13:40 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 04:13:40 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 04:13:40 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 04-22 04:13:40 [engine.py:448] Traceback (most recent call last):
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-22 04:13:40 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 04-22 04:13:40 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-22 04:13:40 [engine.py:448]     return cls(
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-22 04:13:40 [engine.py:448]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-22 04:13:40 [engine.py:448]     self._initialize_kv_caches()
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-22 04:13:40 [engine.py:448]     self.model_executor.determine_num_available_blocks())
ERROR 04-22 04:13:40 [engine.py:448]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-22 04:13:40 [engine.py:448]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-22 04:13:40 [engine.py:448]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-22 04:13:40 [engine.py:448]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-22 04:13:40 [engine.py:448]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-22 04:13:40 [engine.py:448]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2428, in run_method
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-22 04:13:40 [engine.py:448]     self.model_runner.profile_run()
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-22 04:13:40 [engine.py:448]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-22 04:13:40 [engine.py:448]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-22 04:13:40 [engine.py:448]     hidden_or_intermediate_states = model_executable(
ERROR 04-22 04:13:40 [engine.py:448]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 703, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-22 04:13:40 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-22 04:13:40 [engine.py:448]     return self.forward(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 660, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-22 04:13:40 [engine.py:448]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 580, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states = self.mlp(hidden_states)
ERROR 04-22 04:13:40 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 159, in forward
ERROR 04-22 04:13:40 [engine.py:448]     final_hidden_states = self.experts(
ERROR 04-22 04:13:40 [engine.py:448]                           ^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 842, in forward
ERROR 04-22 04:13:40 [engine.py:448]     return self.forward_impl(hidden_states, router_logits)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 861, in forward_impl
ERROR 04-22 04:13:40 [engine.py:448]     final_hidden_states = self.quant_method.apply(
ERROR 04-22 04:13:40 [engine.py:448]                           ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/gguf.py", line 377, in apply
ERROR 04-22 04:13:40 [engine.py:448]     return _fused_moe_gguf(x, layer.w13_qweight, layer.w2_qweight,
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/gguf.py", line 176, in _fused_moe_gguf
ERROR 04-22 04:13:40 [engine.py:448]     out = ops.ggml_moe_a8_vec(out, w2, topk_ids, 1, qweight_type2,
ERROR 04-22 04:13:40 [engine.py:448]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 1179, in ggml_moe_a8_vec
ERROR 04-22 04:13:40 [engine.py:448]     return torch.ops._C.ggml_moe_a8_vec(X, W, topk_ids, top_k, quant_type, row,
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
ERROR 04-22 04:13:40 [engine.py:448]     return self._op(*args, **(kwargs or {}))
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448] RuntimeError: CUDA error: invalid configuration argument
ERROR 04-22 04:13:40 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 04:13:40 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 04:13:40 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@zhaotyer
Contributor

I think this is happening because I-quants use a very slow MoE implementation, so a sequence length of 102400 takes a very long time to process; #16780 should add better support for MoE I-quants

When I set max_model_len to 8192, the service still crashes on startup. I tested it on 2xA100x80GB and 8xL40Sx45GB, and both showed errors.

vllm serve /models/DeepSeek-R1-UD-IQ1_S/merged_file.gguf -tp 2 --trust-remote-code --enforce-eager --trust-remote-code --tokenizer /models/DeepSeek-R1-UD-IQ1_S/ --hf-config-path /models/DeepSeek-R1-UD-IQ1_S/ --dtype bfloat16 --max-model-len 8192 --served-model-name atom --port 8160 --gpu-memory-utilization 0.95

error log

INFO 04-22 04:13:35 [model_runner.py:1146] Model loading took 66.5477 GiB and 216.481170 seconds
ERROR 04-22 04:13:40 [engine.py:448] CUDA error: invalid configuration argument
ERROR 04-22 04:13:40 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 04:13:40 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 04:13:40 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 04-22 04:13:40 [engine.py:448] Traceback (most recent call last):
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-22 04:13:40 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 04-22 04:13:40 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-22 04:13:40 [engine.py:448]     return cls(
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-22 04:13:40 [engine.py:448]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-22 04:13:40 [engine.py:448]     self._initialize_kv_caches()
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-22 04:13:40 [engine.py:448]     self.model_executor.determine_num_available_blocks())
ERROR 04-22 04:13:40 [engine.py:448]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-22 04:13:40 [engine.py:448]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-22 04:13:40 [engine.py:448]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-22 04:13:40 [engine.py:448]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-22 04:13:40 [engine.py:448]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-22 04:13:40 [engine.py:448]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2428, in run_method
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-22 04:13:40 [engine.py:448]     self.model_runner.profile_run()
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-22 04:13:40 [engine.py:448]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-22 04:13:40 [engine.py:448]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-22 04:13:40 [engine.py:448]     hidden_or_intermediate_states = model_executable(
ERROR 04-22 04:13:40 [engine.py:448]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 703, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-22 04:13:40 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-22 04:13:40 [engine.py:448]     return self.forward(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 660, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-22 04:13:40 [engine.py:448]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 580, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states = self.mlp(hidden_states)
ERROR 04-22 04:13:40 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 159, in forward
ERROR 04-22 04:13:40 [engine.py:448]     final_hidden_states = self.experts(
ERROR 04-22 04:13:40 [engine.py:448]                           ^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 842, in forward
ERROR 04-22 04:13:40 [engine.py:448]     return self.forward_impl(hidden_states, router_logits)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 861, in forward_impl
ERROR 04-22 04:13:40 [engine.py:448]     final_hidden_states = self.quant_method.apply(
ERROR 04-22 04:13:40 [engine.py:448]                           ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/gguf.py", line 377, in apply
ERROR 04-22 04:13:40 [engine.py:448]     return _fused_moe_gguf(x, layer.w13_qweight, layer.w2_qweight,
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/gguf.py", line 176, in _fused_moe_gguf
ERROR 04-22 04:13:40 [engine.py:448]     out = ops.ggml_moe_a8_vec(out, w2, topk_ids, 1, qweight_type2,
ERROR 04-22 04:13:40 [engine.py:448]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 1179, in ggml_moe_a8_vec
ERROR 04-22 04:13:40 [engine.py:448]     return torch.ops._C.ggml_moe_a8_vec(X, W, topk_ids, top_k, quant_type, row,
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
ERROR 04-22 04:13:40 [engine.py:448]     return self._op(*args, **(kwargs or {}))
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448] RuntimeError: CUDA error: invalid configuration argument
ERROR 04-22 04:13:40 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 04:13:40 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 04:13:40 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

With max_model_len set to 8192, the specific parameter values at the failing call are as follows:

    return torch.ops._C.ggml_moe_a8_vec(X, W, topk_ids, top_k, quant_type, row,
                                        tokens)
(VllmWorkerProcess pid=3264) INFO 04-22 05:19:17 [model_runner.py:1146] Model loading took 66.5477 GiB and 241.765538 seconds
INFO 04-22 05:19:32 [model_runner.py:1146] Model loading took 66.5477 GiB and 257.481310 seconds
ERROR 04-22 05:19:37 [_custom_ops.py:1179] x is:torch.Size([8192, 7168]), w is:torch.Size([256, 2048, 1400]), topk_ids is:torch.Size([8192, 8]),top_k is:8, quant_type is:19, row is:<class 'torch.SymInt'>, tokens is:8192
ERROR 04-22 05:19:37 [_custom_ops.py:1179] x is:torch.Size([65536, 1024]), w is:torch.Size([256, 7168, 264]), topk_ids is:torch.Size([8192, 8]),top_k is:1, quant_type is:16, row is:<class 'torch.SymInt'>, tokens is:65536
ERROR 04-22 05:19:37 [engine.py:448] CUDA error: invalid configuration argument
ERROR 04-22 05:19:37 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 05:19:37 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 05:19:37 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 04-22 05:19:37 [engine.py:448] Traceback (most recent call last):
ERROR 04-22 0

@SzymonOzog

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
@hahmad2008

@SzymonOzog
Can we serve this model with vLLM?
NoelJacob/Meta-Llama-3-8B-Instruct-Q4_K_M-GGUF
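
For what it's worth, an untested sketch of how this would typically be tried with the existing GGUF support, following the same pattern as the DeepSeek example in this PR (the model path and tokenizer repo below are placeholders):

from vllm import LLM

# Untested sketch: single-file Llama-3 GGUF, with the tokenizer taken from the
# original HF model the quant was produced from.
llm = LLM(model="/YOUR_PATH/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
          tokenizer="meta-llama/Meta-Llama-3-8B-Instruct")
print(llm.generate("Hello, my name is")[0].outputs[0].text)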

@ChuanhongLi

@SzymonOzog The new DeepSeek-R1-0528-UD-Q2_K_XL GGUF files have removed blk.0.attn_kv_b.weight and added blk.0.attn_k_b.weight and blk.0.attn_v_b.weight. This change prevents us from loading the model correctly. How can we address this issue?

@SzymonOzog
Contributor Author

@ChuanhongLi I think you should be able to get around it by modifying

def _get_gguf_weights_map(self, model_config: ModelConfig):
to point the layers to the correct places
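
Roughly, a hypothetical sketch of the extra name-map entries I have in mind (the HF-side target names here are just placeholders, and the current DeepSeek implementation only exposes the merged kv_b_proj, so a name remap alone may not be enough):

# Hypothetical helper (not in vLLM): extra GGUF -> HF name-map entries for
# checkpoints that ship split blk.{i}.attn_k_b / blk.{i}.attn_v_b tensors
# instead of the merged blk.{i}.attn_kv_b tensor.
def split_kv_b_name_map(num_hidden_layers: int) -> dict[str, str]:
    name_map: dict[str, str] = {}
    for idx in range(num_hidden_layers):
        # Assumed target names; the split weights would still have to be merged
        # into kv_b_proj somewhere during loading.
        name_map[f"blk.{idx}.attn_k_b.weight"] = \
            f"model.layers.{idx}.self_attn.k_b_proj.weight"
        name_map[f"blk.{idx}.attn_v_b.weight"] = \
            f"model.layers.{idx}.self_attn.v_b_proj.weight"
    return name_map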

@ChuanhongLi

@ChuanhongLi I think you should be able to get around it by modifying

def _get_gguf_weights_map(self, model_config: ModelConfig):

to point the layers to correct places

Thanks for your reply, but the problem may come from kv_b_proj_weight = get_and_maybe_dequant_weights(self.kv_b_proj).T (line 703, vllm/v1/attention/backends/mla/common.py) #19050 (comment)

@ChuanhongLi

ChuanhongLi commented Jun 10, 2025


@SzymonOzog Hi, do you have any idea how to fix it? Should we load attn_k_b and attn_v_b and use them to produce the kv_b weight? Appreciate your help, thanks!
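
For reference, a purely hypothetical sketch of what I mean by producing the kv_b weight from the split tensors; the shapes and per-head layout below are guesses and would need to be checked against how kv_b_proj is actually laid out:

import torch

# Hypothetical sketch: rebuild the merged kv_b weight from split k_b / v_b tensors.
# Assumes k_b has shape (num_heads * qk_nope_head_dim, kv_lora_rank) and v_b has
# shape (num_heads * v_head_dim, kv_lora_rank); the GGUF tensors may be stored
# transposed or per-head, so verify before relying on this.
def merge_kv_b(k_b: torch.Tensor, v_b: torch.Tensor, num_heads: int,
               qk_nope_head_dim: int, v_head_dim: int) -> torch.Tensor:
    kv_lora_rank = k_b.shape[-1]
    k_b = k_b.view(num_heads, qk_nope_head_dim, kv_lora_rank)
    v_b = v_b.view(num_heads, v_head_dim, kv_lora_rank)
    # Concatenate the K and V halves per head, matching the
    # num_heads * (qk_nope_head_dim + v_head_dim) output layout of kv_b_proj.
    kv_b = torch.cat([k_b, v_b], dim=1)
    return kv_b.reshape(num_heads * (qk_nope_head_dim + v_head_dim), kv_lora_rank)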

@maekawatoshiki

maekawatoshiki commented Jun 12, 2025

I’m facing the same problem. I made a few changes and finally got vLLM to start, but the output is gibberish. Has anyone figured out a solution?

@SzymonOzog
Contributor Author

As of now I don't have a ready solution for this. I'll try to find some time to debug the issue over the weekend.

@hyunwen

hyunwen commented Jun 23, 2025

Any progress here? We are also stuck here.

@kechengcode

[quotes @zhaotyer's earlier comment and error log above]

@SzymonOzog

I'm hitting the same problem.

Successfully merging this pull request may close these issues.

[Feature]: Deepseek R1 GGUF 4bit(Q4KM) support