[Bug]: VLLM Docker v0.9.0 produces Runtime Error: Cuda Error on Blackwell using Qwen0.6B #18916

@Rezzemy

Your current environment

The output of python collect_env.py as well as nvidia-smi
# ls            
benchmarks  collect_env.py  examples  requirements
# python3 collect_env.py
INFO 05-29 09:14:46 [__init__.py:243] Automatically detected platform cuda.
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.0.2
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.10 (main, Apr  9 2025, 08:55:05) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5090
Nvidia driver version        : 576.40
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               16
On-line CPU(s) list:                  0-15
Vendor ID:                            AuthenticAMD
Model name:                           AMD Ryzen 7 9800X3D 8-Core Processor
CPU family:                           26
Model:                                68
Thread(s) per core:                   2
Core(s) per socket:                   8
Socket(s):                            1
Stepping:                             0
BogoMIPS:                             9399.77
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm avx512_vp2intersect
Virtualization:                       AMD-V
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            384 KiB (8 instances)
L1i cache:                            256 KiB (8 instances)
L2 cache:                             8 MiB (8 instances)
L3 cache:                             96 MiB (1 instance)
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.52.3
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X                              N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda/lib64
CUDA_VERSION=12.8.1
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 
brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
NVIDIA_VISIBLE_DEVICES=all
NCCL_VERSION=2.25.1-1
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

# nvidia-smi
Thu May 29 09:15:20 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.55.01              Driver Version: 576.40         CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0  On |                  N/A |
| 45%   36C    P0             75W /  480W |   30394MiB /  32607MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A              68      C   /python3.12                           N/A      |
+-----------------------------------------------------------------------------------------+

🐛 Describe the bug

When a request is sent to /v1/chat/completions with Qwen3-0.6B in bf16, the vLLM instance crashes with:

RuntimeError: CUDA error: no kernel image is available for execution on the device
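For context on what this error generally means: a compiled CUDA extension only runs on a GPU if it ships either native code (`sm_XY`) for the same major compute capability at an equal or lower minor version, or PTX (`compute_XY`) for an equal or lower capability that the driver can JIT-compile. The sketch below encodes that rule as a small check. The helper names (`parse_arch`, `has_kernel_image`, `check_local_torch`) are hypothetical and not part of vLLM or PyTorch. It also assumes the RTX 5090 reports compute capability 12.0, and it simply ignores architecture-specific entries such as `sm_90a`. Note that `torch.cuda.get_arch_list()` only describes PyTorch's own kernels; the failing op here (`_vllm_fa2_C.varlen_fwd`) lives in vLLM's separately compiled flash-attention extension, whose architecture list would need to be inspected on its own.

```python
import re
from typing import Iterable, Optional, Tuple


def parse_arch(arch: str) -> Optional[Tuple[str, int]]:
    """Split 'sm_90' or 'compute_90' into ('sm', 90) / ('compute', 90).

    Returns None for anything else, including suffixed entries such as
    'sm_90a', which this simplified sketch deliberately skips.
    """
    match = re.fullmatch(r"(sm|compute)_(\d+)", arch)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def has_kernel_image(capability: Tuple[int, int], arch_list: Iterable[str]) -> bool:
    """Return True if some entry in arch_list can run on the given device."""
    major, minor = capability
    device = major * 10 + minor
    for arch in arch_list:
        parsed = parse_arch(arch)
        if parsed is None:
            continue
        kind, value = parsed
        # Native code: same major version, built for an equal or lower minor.
        if kind == "sm" and value // 10 == major and value <= device:
            return True
        # PTX: can be JIT-compiled for any equal or newer capability.
        if kind == "compute" and value <= device:
            return True
    return False


def check_local_torch() -> bool:
    """Apply the check to the local PyTorch build (requires torch and a GPU)."""
    import torch  # imported lazily so the helpers above work without torch

    capability = torch.cuda.get_device_capability()
    return has_kernel_image(capability, torch.cuda.get_arch_list())


# Assumed capability for the RTX 5090: (12, 0).
print(has_kernel_image((12, 0), ["sm_80", "sm_90"]))
print(has_kernel_image((12, 0), ["sm_80", "sm_120"]))
```

Running `check_local_torch()` inside the container would show whether PyTorch itself covers the device; a `True` result there does not rule out a custom extension built without a matching architecture.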
The .bat file used to launch the Docker image:
@echo off
echo Starting vLLM server with Docker...
docker run --runtime nvidia --gpus all ^
    -v A:\spare:/models ^
    -p 8000:8000 ^
    --ipc=host ^
    vllm/vllm-openai:v0.9.0 ^
    --model /models/Qwen3-0.6B ^
    --host 0.0.0.0 ^
    --port 8000
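To trigger the crash against a server started this way, any chat completion request appears to be enough. The following is a minimal sketch of such a request using only the Python standard library. The original client is not shown in this report, so the payload is an assumption reconstructed from the server log (model name `/models/Qwen3-0.6B`, a single user message `Hello`), and `http://localhost:8000` is assumed from the `-p 8000:8000` mapping. The function names `build_chat_request` and `send` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Values assumed from the launch script and server log in this report.
request = build_chat_request("http://localhost:8000", "/models/Qwen3-0.6B", "Hello")
print(request.full_url)
```

Calling `send(request)` once the server logs "Application startup complete." should reproduce the failing code path, assuming the server is reachable on that address.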
The full Docker startup log and traceback:
Starting vLLM server with Docker...
INFO 05-29 09:03:31 [__init__.py:243] Automatically detected platform cuda.
INFO 05-29 09:03:32 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 05-29 09:03:32 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 05-29 09:03:32 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-29 09:03:33 [api_server.py:1289] vLLM API server version 0.9.0
INFO 05-29 09:03:33 [cli_args.py:300] non-default args: {'host': '0.0.0.0', 'model': '/models/Qwen3-0.6B'}
INFO 05-29 09:03:38 [config.py:793] This model supports multiple tasks: {'embed', 'reward', 'classify', 'score', 'generate'}. Defaulting to 'generate'.
INFO 05-29 09:03:38 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 05-29 09:03:40 [__init__.py:243] Automatically detected platform cuda.
INFO 05-29 09:03:42 [core.py:438] Waiting for init message from front-end.
INFO 05-29 09:03:42 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 05-29 09:03:42 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 05-29 09:03:42 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-29 09:03:42 [core.py:65] Initializing a V1 LLM engine (v0.9.0) with config: model='/models/Qwen3-0.6B', speculative_config=None, tokenizer='/models/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/models/Qwen3-0.6B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["none"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output"], "compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512}
WARNING 05-29 09:03:43 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7fbdabc20fb0>
[W529 09:03:53.605944217 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W529 09:04:03.616419669 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3
INFO 05-29 09:04:03 [parallel_state.py:1064] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 05-29 09:04:03 [interface.py:344] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 05-29 09:04:03 [topk_topp_sampler.py:48] Using FlashInfer for top-p & top-k sampling.
INFO 05-29 09:04:03 [gpu_model_runner.py:1531] Starting to load model /models/Qwen3-0.6B...
INFO 05-29 09:04:03 [cuda.py:217] Using Flash Attention backend on V1 engine.
INFO 05-29 09:04:03 [backends.py:35] Using InductorAdaptor
INFO 05-29 09:04:03 [backends.py:35] Using InductorAdaptor
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:35<00:00, 35.24s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:35<00:00, 35.24s/it]

INFO 05-29 09:04:39 [default_loader.py:280] Loading weights took 35.29 seconds
INFO 05-29 09:04:39 [gpu_model_runner.py:1549] Model loading took 1.1201 GiB and 35.478868 seconds
INFO 05-29 09:04:43 [backends.py:459] Using cache directory: /root/.cache/vllm/torch_compile_cache/ba063de529/rank_0_0 for vLLM's torch.compile
INFO 05-29 09:04:43 [backends.py:469] Dynamo bytecode transform time: 3.94 s
INFO 05-29 09:04:45 [backends.py:158] Cache the graph of shape None for later use
INFO 05-29 09:05:00 [backends.py:170] Compiling a graph for general shape takes 16.71 s
INFO 05-29 09:05:11 [monitor.py:33] torch.compile takes 20.66 s in total
INFO 05-29 09:05:19 [kv_cache_utils.py:637] GPU KV cache size: 235,104 tokens
INFO 05-29 09:05:19 [kv_cache_utils.py:640] Maximum concurrency for 40,960 tokens per request: 5.74x
INFO 05-29 09:05:33 [gpu_model_runner.py:1933] Graph capturing finished in 14 secs, took 0.86 GiB
INFO 05-29 09:05:33 [core.py:167] init engine (profile, create kv cache, warmup model) took 54.30 seconds
INFO 05-29 09:05:33 [loggers.py:134] vllm cache_config_info with initialization after num_gpu_blocks is: 14694
WARNING 05-29 09:05:33 [config.py:1339] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
INFO 05-29 09:05:33 [serving_chat.py:117] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-29 09:05:33 [serving_completion.py:65] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
INFO 05-29 09:05:33 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:8000
INFO 05-29 09:05:33 [launcher.py:28] Available routes are:
INFO 05-29 09:05:33 [launcher.py:36] Route: /openapi.json, Methods: HEAD, GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /docs, Methods: HEAD, GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /redoc, Methods: HEAD, GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /health, Methods: GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /load, Methods: GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /ping, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /ping, Methods: GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /tokenize, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /detokenize, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /v1/models, Methods: GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /version, Methods: GET
INFO 05-29 09:05:33 [launcher.py:36] Route: /v1/chat/completions, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /v1/completions, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /v1/embeddings, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /pooling, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /classify, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /score, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /v1/score, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /v1/audio/transcriptions, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /rerank, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /v1/rerank, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /v2/rerank, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /invocations, Methods: POST
INFO 05-29 09:05:33 [launcher.py:36] Route: /metrics, Methods: GET
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     172.17.0.1:42292 - "OPTIONS /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:42292 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:42292 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:42292 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:44106 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:44106 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:44106 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:44106 - "GET /v1/models HTTP/1.1" 200 OK
INFO:     172.17.0.1:44106 - "OPTIONS /v1/chat/completions HTTP/1.1" 200 OK
INFO 05-29 09:06:04 [chat_utils.py:419] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 05-29 09:06:04 [logger.py:42] Received request chatcmpl-64a973494b0c4794a8d172e5b01b74fa: prompt: '<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40951, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO:     172.17.0.1:44106 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 05-29 09:06:04 [async_llm.py:261] Added request chatcmpl-64a973494b0c4794a8d172e5b01b74fa.
ERROR 05-29 09:06:04 [dump_input.py:68] Dumping input data
--- Logging error ---
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 207, in execute_model
    return self.model_executor.execute_model(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1196, in execute_model
    model_output = self.model(
                   ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 300, in forward
    hidden_states = self.model(input_ids, positions, intermediate_tensors,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 245, in __call__
    model_output = self.forward(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
    def forward(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.58", line 240, in forward
    submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.2", line 5, in forward
    unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, 'model.layers.0.self_attn.attn');  query_2 = key_2 = value = output_1 = unified_attention_with_output = None
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 425, in unified_attention_with_output
    self.impl.forward(self,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 622, in forward
    flash_attn_varlen_func(
  File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 227, in flash_attn_varlen_func
    out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.12/logging/__init__.py", line 1160, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 999, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/logging_utils/formatter.py", line 13, in format
    msg = logging.Formatter.format(self, record)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 703, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/logging/__init__.py", line 392, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 4506, in __str__
    f"compilation_config={self.compilation_config!r}")
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/config.py", line 3905, in __repr__
    return json.dumps(include)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 200, in encode
    chunks = self.iterencode(o, _one_shot=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 258, in iterencode
    return _iterencode(o, 0)
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/encoder.py", line 180, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type PostGradPassManager is not JSON serializable
Call stack:
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 135, in _main
    return self._bootstrap(parent_sentinel)
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
    engine_core.run_busy_loop()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
    self._process_engine_step()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
    outputs = self.step_fn()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 226, in step
    model_output = self.execute_model(scheduler_output)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 210, in execute_model
    dump_engine_exception(self.vllm_config, scheduler_output,
  File "/usr/local/lib/python3.12/dist-packages/vllm/logging_utils/dump_input.py", line 62, in dump_engine_exception
    _dump_engine_exception(config, scheduler_output, scheduler_stats)
  File "/usr/local/lib/python3.12/dist-packages/vllm/logging_utils/dump_input.py", line 70, in _dump_engine_exception
    logger.error(
Unable to print the message and arguments - possible formatting error.
Use the traceback above to help find the error.
ERROR 05-29 09:06:13 [dump_input.py:78] Dumping scheduler output for model execution:
ERROR 05-29 09:06:13 [dump_input.py:79] SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-64a973494b0c4794a8d172e5b01b74fa,prompt_token_ids_len=9,mm_inputs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[151643], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=40951, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=[[1]],num_computed_tokens=0,lora_request=None)], scheduled_cached_reqs=[], num_scheduled_tokens={chatcmpl-64a973494b0c4794a8d172e5b01b74fa: 9}, total_num_scheduled_tokens=9, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[1], finished_req_ids=[], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
ERROR 05-29 09:06:13 [dump_input.py:81] SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, gpu_cache_usage=0.00013610997686130943, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=9, hits=0), spec_decoding_stats=None)
ERROR 05-29 09:06:13 [core.py:502] EngineCore encountered a fatal error.
ERROR 05-29 09:06:13 [core.py:502] Traceback (most recent call last):
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
ERROR 05-29 09:06:13 [core.py:502]     engine_core.run_busy_loop()
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
ERROR 05-29 09:06:13 [core.py:502]     self._process_engine_step()
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
ERROR 05-29 09:06:13 [core.py:502]     outputs = self.step_fn()
ERROR 05-29 09:06:13 [core.py:502]               ^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 226, in step
ERROR 05-29 09:06:13 [core.py:502]     model_output = self.execute_model(scheduler_output)
ERROR 05-29 09:06:13 [core.py:502]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 213, in execute_model
ERROR 05-29 09:06:13 [core.py:502]     raise err
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 207, in execute_model
ERROR 05-29 09:06:13 [core.py:502]     return self.model_executor.execute_model(scheduler_output)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
ERROR 05-29 09:06:13 [core.py:502]     output = self.collective_rpc("execute_model",
ERROR 05-29 09:06:13 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 05-29 09:06:13 [core.py:502]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 05-29 09:06:13 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
ERROR 05-29 09:06:13 [core.py:502]     return func(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-29 09:06:13 [core.py:502]     return func(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
ERROR 05-29 09:06:13 [core.py:502]     output = self.model_runner.execute_model(scheduler_output,
ERROR 05-29 09:06:13 [core.py:502]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 05-29 09:06:13 [core.py:502]     return func(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1196, in execute_model
ERROR 05-29 09:06:13 [core.py:502]     model_output = self.model(
ERROR 05-29 09:06:13 [core.py:502]                    ^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 05-29 09:06:13 [core.py:502]     return self._call_impl(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 05-29 09:06:13 [core.py:502]     return forward_call(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 300, in forward
ERROR 05-29 09:06:13 [core.py:502]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 05-29 09:06:13 [core.py:502]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 245, in __call__
ERROR 05-29 09:06:13 [core.py:502]     model_output = self.forward(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
ERROR 05-29 09:06:13 [core.py:502]     def forward(
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 05-29 09:06:13 [core.py:502]     return self._call_impl(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 05-29 09:06:13 [core.py:502]     return forward_call(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
ERROR 05-29 09:06:13 [core.py:502]     return fn(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 05-29 09:06:13 [core.py:502]     return self._wrapped_call(self, *args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 05-29 09:06:13 [core.py:502]     raise e
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 05-29 09:06:13 [core.py:502]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 05-29 09:06:13 [core.py:502]     return self._call_impl(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 05-29 09:06:13 [core.py:502]     return forward_call(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "<eval_with_key>.58", line 240, in forward
ERROR 05-29 09:06:13 [core.py:502]     submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
ERROR 05-29 09:06:13 [core.py:502]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
ERROR 05-29 09:06:13 [core.py:502]     return self._wrapped_call(self, *args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
ERROR 05-29 09:06:13 [core.py:502]     raise e
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
ERROR 05-29 09:06:13 [core.py:502]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 05-29 09:06:13 [core.py:502]     return self._call_impl(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 05-29 09:06:13 [core.py:502]     return forward_call(*args, **kwargs)
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "<eval_with_key>.2", line 5, in forward
ERROR 05-29 09:06:13 [core.py:502]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, 'model.layers.0.self_attn.attn');  query_2 = key_2 = value = output_1 = unified_attention_with_output = None
ERROR 05-29 09:06:13 [core.py:502]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
ERROR 05-29 09:06:13 [core.py:502]     return self._op(*args, **(kwargs or {}))
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 425, in unified_attention_with_output
ERROR 05-29 09:06:13 [core.py:502]     self.impl.forward(self,
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 622, in forward
ERROR 05-29 09:06:13 [core.py:502]     flash_attn_varlen_func(
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 227, in flash_attn_varlen_func
ERROR 05-29 09:06:13 [core.py:502]     out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
ERROR 05-29 09:06:13 [core.py:502]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
ERROR 05-29 09:06:13 [core.py:502]     return self._op(*args, **(kwargs or {}))
ERROR 05-29 09:06:13 [core.py:502]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [core.py:502] RuntimeError: CUDA error: no kernel image is available for execution on the device
ERROR 05-29 09:06:13 [core.py:502] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 05-29 09:06:13 [core.py:502] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 05-29 09:06:13 [core.py:502] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 05-29 09:06:13 [core.py:502]
Process EngineCore_0:
ERROR 05-29 09:06:13 [async_llm.py:408] AsyncLLM output_handler failed.
ERROR 05-29 09:06:13 [async_llm.py:408] Traceback (most recent call last):
ERROR 05-29 09:06:13 [async_llm.py:408]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 366, in output_handler
ERROR 05-29 09:06:13 [async_llm.py:408]     outputs = await engine_core.get_output_async()
ERROR 05-29 09:06:13 [async_llm.py:408]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [async_llm.py:408]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 806, in get_output_async
ERROR 05-29 09:06:13 [async_llm.py:408]     raise self._format_exception(outputs) from None
ERROR 05-29 09:06:13 [async_llm.py:408] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
Traceback (most recent call last):
INFO 05-29 09:06:13 [async_llm.py:333] Request chatcmpl-64a973494b0c4794a8d172e5b01b74fa failed (engine dead).
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 504, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 493, in run_engine_core
    engine_core.run_busy_loop()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 520, in run_busy_loop
    self._process_engine_step()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 545, in _process_engine_step
    outputs = self.step_fn()
              ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 226, in step
    model_output = self.execute_model(scheduler_output)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 213, in execute_model
    raise err
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 207, in execute_model
    return self.model_executor.execute_model(scheduler_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 86, in execute_model
    output = self.collective_rpc("execute_model",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2605, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
    output = self.model_runner.execute_model(scheduler_output,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1196, in execute_model
    model_output = self.model(
                   ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3.py", line 300, in forward
    hidden_states = self.model(input_ids, positions, intermediate_tensors,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 245, in __call__
    model_output = self.forward(*args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
    def forward(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.58", line 240, in forward
    submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
    return self._wrapped_call(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<eval_with_key>.2", line 5, in forward
    unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, 'model.layers.0.self_attn.attn');  query_2 = key_2 = value = output_1 = unified_attention_with_output = None
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 425, in unified_attention_with_output
    self.impl.forward(self,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/flash_attn.py", line 622, in forward
    flash_attn_varlen_func(
  File "/usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/flash_attn_interface.py", line 227, in flash_attn_varlen_func
    out, softmax_lse = torch.ops._vllm_fa2_C.varlen_fwd(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 05-29 09:06:13 [serving_chat.py:884] Error in chat completion stream generator.
ERROR 05-29 09:06:13 [serving_chat.py:884] Traceback (most recent call last):
ERROR 05-29 09:06:13 [serving_chat.py:884]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 476, in chat_completion_stream_generator
ERROR 05-29 09:06:13 [serving_chat.py:884]     async for res in result_generator:
ERROR 05-29 09:06:13 [serving_chat.py:884]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 315, in generate
ERROR 05-29 09:06:13 [serving_chat.py:884]     out = q.get_nowait() or await q.get()
ERROR 05-29 09:06:13 [serving_chat.py:884]                             ^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [serving_chat.py:884]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/output_processor.py", line 51, in get
ERROR 05-29 09:06:13 [serving_chat.py:884]     raise output
ERROR 05-29 09:06:13 [serving_chat.py:884]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 366, in output_handler
ERROR 05-29 09:06:13 [serving_chat.py:884]     outputs = await engine_core.get_output_async()
ERROR 05-29 09:06:13 [serving_chat.py:884]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-29 09:06:13 [serving_chat.py:884]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 806, in get_output_async
ERROR 05-29 09:06:13 [serving_chat.py:884]     raise self._format_exception(outputs) from None
ERROR 05-29 09:06:13 [serving_chat.py:884] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO:     172.17.0.1:44106 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     172.17.0.1:44106 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]
Press any key to continue . . .

I've read the other issues. Some of them report the same error, but they occur on completely different systems, with a different GPU topology and a different traceback leading to the error. That is why I'm creating an independent issue.

If this is problematic, please feel free to close it.
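
For anyone triaging this: "no kernel image is available for execution on the device" generally means the compiled extension that raised it (here `_vllm_fa2_C`, per the traceback) contains no code for the GPU's compute capability. Below is a minimal sketch of that compatibility rule. The helper names `parse_arch`, `can_run` and `has_kernel_image` are made up for illustration, and treating suffixed targets such as `sm_90a` as exact-match only is a simplifying assumption. Note that `torch.cuda.get_arch_list()` reports what PyTorch itself was built for, not what vLLM's flash-attention extension was built for, so a passing check here does not rule out the extension as the culprit.

```python
import re


def parse_arch(arch):
    """Split an arch string such as 'sm_120' or 'compute_90' into its parts."""
    m = re.fullmatch(r"(sm|compute)_(\d+)([af]?)", arch)
    if m is None:
        raise ValueError(f"unrecognised arch string: {arch!r}")
    kind, digits, suffix = m.groups()
    number = int(digits)
    # 'sm_86' -> (8, 6), 'sm_120' -> (12, 0)
    return kind, (number // 10, number % 10), suffix


def can_run(device_cc, arch):
    """Return True if code built for `arch` can execute on a device with capability `device_cc`."""
    kind, cc, suffix = parse_arch(arch)
    if suffix:
        # Assumption: suffixed targets are treated as exact-match only.
        return cc == device_cc
    if kind == "sm":
        # Native binaries run within the same major version, on an equal or newer minor.
        return cc[0] == device_cc[0] and cc[1] <= device_cc[1]
    # PTX ('compute_XX') can be JIT-compiled for any equal or newer capability.
    return cc <= device_cc


def has_kernel_image(device_cc, arch_list):
    """Return True if at least one entry of `arch_list` is usable on the device."""
    return any(can_run(device_cc, arch) for arch in arch_list)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cc = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cc)
        print("PyTorch arch list:", archs)
        print("PyTorch has a usable kernel image:", has_kernel_image(cc, archs))
```

On an RTX 5090 the first line should report capability `(12, 0)`. If PyTorch passes this check and the error still appears, that would be consistent with the vLLM flash-attention extension in the image lacking a build for that capability, though this sketch cannot confirm it.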
