
RuntimeError: Failed: CUDA runtime error csrc/jit/kernel_runtime.hpp:108 '98' #156


Description

@krishung5

Seeing a CUDA runtime error coming from deep_gemm.py when running the vLLM WideEP multi-node example with DeepSeek-R1 using commits after #112.

The same issue was reported here. It occurs during model loading:

2025-08-01T08:01:22.575940Z ERROR core.run_engine_core: EngineCore failed to start.
Traceback (most recent call last):
  File "/opt/vllm/vllm/v1/engine/core.py", line 621, in run_engine_core
    engine_core = DPEngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/v1/engine/core.py", line 881, in __init__
    super().__init__(vllm_config, local_client, handshake_address,
  File "/opt/vllm/vllm/v1/engine/core.py", line 441, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/opt/vllm/vllm/v1/engine/core.py", line 77, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/opt/vllm/vllm/executor/uniproc_executor.py", line 49, in _init_executor
    self.collective_rpc("load_model")
  File "/opt/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/v1/worker/gpu_worker.py", line 201, in load_model
    self.model_runner.load_model(eep_scale_up=eep_scale_up)
  File "/opt/vllm/vllm/v1/worker/gpu_model_runner.py", line 1876, in load_model
    self.model = model_loader.load_model(
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
    process_weights_after_loading(model, model_config, target_device)
  File "/opt/vllm/vllm/model_executor/model_loader/utils.py", line 126, in process_weights_after_loading
    module.process_weights_after_loading(model_config.dtype)
  File "/opt/vllm/vllm/attention/layer.py", line 310, in process_weights_after_loading
    self.impl.process_weights_after_loading(act_dtype)
  File "/opt/vllm/vllm/v1/attention/backends/mla/common.py", line 994, in process_weights_after_loading
    kv_b_proj_weight = get_and_maybe_dequant_weights(self.kv_b_proj).T
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/v1/attention/backends/mla/common.py", line 983, in get_and_maybe_dequant_weights
    dequant_weights = layer.quant_method.apply(layer,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/layers/quantization/fp8.py", line 451, in apply
    return torch.ops.vllm.apply_w8a8_block_fp8_linear(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/layers/quantization/utils/fp8_utils.py", line 147, in apply_w8a8_block_fp8_linear
    output = torch.ops.vllm.w8a8_block_fp8_matmul_deepgemm(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/vllm/model_executor/layers/quantization/deepgemm.py", line 58, in w8a8_block_fp8_matmul_deepgemm
    fp8_gemm_nt((A, As), (B, Bs), C)
  File "/opt/vllm/vllm/utils/deep_gemm.py", line 92, in fp8_gemm_nt
    return _fp8_gemm_nt_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Failed: CUDA runtime error csrc/jit/kernel_runtime.hpp:108 '98'
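
CUDA runtime error code 98 corresponds to cudaErrorInvalidDeviceFunction ("invalid device function"), which typically indicates that a JIT-compiled kernel does not match the GPU architecture or CUDA toolkit it is launched against. As a quick sanity check, the numeric code can be decoded on the affected machine without involving vLLM at all by asking the CUDA runtime directly via ctypes (the libcudart.so name below is an assumption; use whichever runtime library the container ships):

import ctypes

# Load the CUDA runtime library shipped in the container (name/path may differ).
cudart = ctypes.CDLL("libcudart.so")
cudart.cudaGetErrorName.restype = ctypes.c_char_p
cudart.cudaGetErrorName.argtypes = [ctypes.c_int]
cudart.cudaGetErrorString.restype = ctypes.c_char_p
cudart.cudaGetErrorString.argtypes = [ctypes.c_int]

# Decode the numeric code reported in the exception message above.
print(cudart.cudaGetErrorName(98).decode())    # expected: cudaErrorInvalidDeviceFunction
print(cudart.cudaGetErrorString(98).decode())  # expected: "invalid device function"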

To reproduce, build the container using the Dockerfile provided in the comment, and run the vLLM WideEP multi-node example:

# node 1
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
    vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address <node1 ip address> \
    --data-parallel-rpc-port 13345 --api-server-count=8 --gpu-memory-utilization 0.95 --max-model-len 10240 --enforce-eager

# node 2
VLLM_ALL2ALL_BACKEND=deepep_low_latency VLLM_USE_DEEP_GEMM=1 \
    vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 --enable-expert-parallel --data-parallel-size 16 --data-parallel-size-local 8 --data-parallel-address <node1 ip address> \
    --data-parallel-rpc-port 13345 --data-parallel-start-rank 8 --gpu-memory-utilization 0.95 --max-model-len 10240 --enforce-eager --headless
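
Since error 98 usually points at a mismatch between what the DeepGEMM JIT kernels were built for and the GPU/toolkit actually present, a minimal environment check inside the same container can help narrow things down before re-running the full multi-node example. This is only a diagnostic sketch using standard torch introspection and the installed deep_gemm package; it assumes DeepGEMM is importable under the name deep_gemm and makes no claims about vLLM internals:

import importlib.metadata
import torch
import deep_gemm

# GPU and toolkit that the JIT-compiled FP8 kernels must target.
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("torch CUDA version:", torch.version.cuda)

# Where DeepGEMM is coming from (useful when multiple installs exist).
print("deep_gemm location:", deep_gemm.__file__)
try:
    print("deep_gemm version:", importlib.metadata.version("deep_gemm"))
except importlib.metadata.PackageNotFoundError:
    print("deep_gemm version: not found via importlib.metadata")

DeepGEMM's FP8 kernels primarily target Hopper-class GPUs (sm_90a); if the reported compute capability or the CUDA toolkit baked into the container does not match what the JIT expects, this error is the usual symptom.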
