
[Bug]: Tensor dimension mismatch when loading Qwen3-Reranker-4B with tensor parallel > 1 #20670

@yurhett

Description

🐛 Describe the bug

When trying to load the Qwen3-Reranker-4B model with tensor parallelism enabled (tensor_parallel_size=2), the model initialization fails due to a tensor dimension mismatch error.

Environment

  • vLLM version: 0.9.2
  • Model: Qwen/Qwen3-Reranker-4B
  • GPU configuration: 2 GPUs with tensor parallelism
  • CUDA version: 12.9

Steps to reproduce

  1. Run vLLM with the following configuration:
--model Qwen/Qwen3-Reranker-4B --task score --enforce_eager True --served_model_name Qwen/Qwen3-Reranker-4B-30k --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' --tensor_parallel_size 2 --gpu_memory_utilization 0.97 
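
For reference, the same configuration can also be expressed through the offline Python API. This is an untested sketch that assumes vllm.LLM accepts the same options as the CLI flags above:

```python
# Untested sketch: offline equivalent of the CLI invocation above.
# Argument names mirror the CLI flags and are assumed to be accepted by vllm.LLM.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-Reranker-4B",
    task="score",
    enforce_eager=True,
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
    tensor_parallel_size=2,       # the failure happens during weight loading
    gpu_memory_utilization=0.97,
)
```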

Expected behavior

The model should load successfully with tensor parallelism across 2 GPUs.

Actual behavior

The model fails to load with the following error:

RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1 

The full stack trace shows that the error occurs in the load_weights_using_from_2_way_softmax function when it attempts to copy weights into the score layer:

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax model.score.weight.data.copy_(weight) RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1 

Possible workaround

The model loads successfully with tensor_parallel_size=1 (no tensor parallelism).
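
If the sharding assumption above is right, one possible direction for a fix (a sketch only, not a patch against vLLM internals; copy_row_shard is a hypothetical helper) would be to narrow the assembled weight to the slice owned by the current TP rank before copying:

```python
import torch

def copy_row_shard(score_weight: torch.Tensor,
                   full_weight: torch.Tensor,
                   tp_rank: int,
                   tp_size: int) -> None:
    """Copy the slice of a full-width classifier weight that belongs to this
    tensor-parallel rank, assuming the score layer shards its input dimension
    evenly across ranks."""
    shard = full_weight.shape[1] // tp_size              # 2560 // 2 = 1280
    start = tp_rank * shard
    score_weight.data.copy_(full_weight[:, start:start + shard])

# With the shapes from the log:
full = torch.randn(1, 2560)
local = torch.empty(1, 1280)
copy_row_shard(local, full, tp_rank=1, tp_size=2)        # succeeds
```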

Full Log:

INFO 07-09 00:56:59 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:02 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-09 00:57:02 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': 'Qwen/Qwen3-Reranker-4B', 'task': 'score', 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen3-Reranker-4B-30k'], 'hf_overrides': {'architectures': ['Qwen3ForSequenceClassification'], 'classifier_from_token': ['no', 'yes'], 'is_original_qwen3_reranker': True}, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.97}
INFO 07-09 00:57:09 [config.py:1472] Using max model len 40960
INFO 07-09 00:57:09 [arg_utils.py:1596] (Disabling) chunked prefill by default
INFO 07-09 00:57:09 [arg_utils.py:1599] (Disabling) prefix caching by default
WARNING 07-09 00:57:09 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 07-09 00:57:09 [config.py:4601] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 07-09 00:57:14 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:17 [core.py:526] Waiting for init message from front-end.
INFO 07-09 00:57:17 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='Qwen/Qwen3-Reranker-4B', speculative_config=None, tokenizer='Qwen/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Reranker-4B-30k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 07-09 00:57:17 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-09 00:57:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_86991475'), local_subscribe_addr='ipc:///tmp/d4a143f0-8d21-4c46-a4fd-667a62dba5dd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b4da7d2'), local_subscribe_addr='ipc:///tmp/03a0090e-ee69-42d1-a8d5-d900afccb3b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f40c3804'), local_subscribe_addr='ipc:///tmp/1fe57a13-1ef5-480f-a578-f67b4f015124', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=213) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_468dc0a1'), local_subscribe_addr='ipc:///tmp/0cdef032-c307-4c3a-b4df-1c75d7eca008', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.33it/s]
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.38it/s]
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.37it/s]
(VllmWorker rank=0 pid=212) 
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.worker.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.model_runner.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.model = model_loader.load_model(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.load_weights(model, model_config)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     loaded_weights = model.load_weights(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]                      ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 256, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     return seq_cls_model_loader(self, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 375, in seq_cls_model_loader
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     return SEQ_CLS_LOAD_METHODS[method](model, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     model.score.weight.data.copy_(weight)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
84487e05a96e:213:213 [1] NCCL INFO cudaDriverVersion 12090
84487e05a96e:213:213 [1] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:213:213 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : No device found.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
84487e05a96e:213:213 [1] NCCL INFO Using network Socket
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init START
84487e05a96e:213:213 [1] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:213:213 [1] NCCL INFO Bootstrap timings total 0.001015 (create 0.000035, send 0.000131, recv 0.000371, ring 0.000024, delay 0.000000)
84487e05a96e:213:213 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.

[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order

[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:213:213 [1] NCCL INFO comm 0xe54d5d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
84487e05a96e:213:213 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
84487e05a96e:213:213 [1] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:213:262 [1] NCCL INFO [Proxy Service] Device 1 CPU core 7
84487e05a96e:213:265 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 9
84487e05a96e:213:213 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:213:213 [1] NCCL INFO Connected all trees
84487e05a96e:213:266 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 12
84487e05a96e:213:213 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:213:213 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:213:213 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:213:213 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
84487e05a96e:212:212 [0] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO cudaDriverVersion 12090
84487e05a96e:212:212 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:212:212 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : No device found.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
84487e05a96e:212:212 [0] NCCL INFO Using network Socket
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init START
84487e05a96e:212:212 [0] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:212:212 [0] NCCL INFO Bootstrap timings total 0.001006 (create 0.000029, send 0.000126, recv 0.000313, ring 0.000026, delay 0.000000)
84487e05a96e:212:212 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.

[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order

[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:212:212 [0] NCCL INFO comm 0xe64fff0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
84487e05a96e:212:212 [0] NCCL INFO Channel 00/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Channel 01/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
84487e05a96e:212:212 [0] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:212:212 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
84487e05a96e:212:264 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 9
84487e05a96e:212:263 [0] NCCL INFO [Proxy Service] Device 0 CPU core 7
84487e05a96e:212:212 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:212:212 [0] NCCL INFO Connected all trees
84487e05a96e:212:267 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 13
84487e05a96e:212:212 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:212:212 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:212:212 [0] NCCL INFO CC Off, workFifoBytes 1048576
84487e05a96e:212:212 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:212:212 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
[rank0]:[W709 00:57:29.669899852 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 07-09 00:57:30 [core.py:586] EngineCore failed to start.
ERROR 07-09 00:57:30 [core.py:586] Traceback (most recent call last):
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
ERROR 07-09 00:57:30 [core.py:586]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-09 00:57:30 [core.py:586]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
ERROR 07-09 00:57:30 [core.py:586]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
ERROR 07-09 00:57:30 [core.py:586]     self.model_executor = executor_class(vllm_config)
ERROR 07-09 00:57:30 [core.py:586]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-09 00:57:30 [core.py:586]     self._init_executor()
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
ERROR 07-09 00:57:30 [core.py:586]     self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-09 00:57:30 [core.py:586]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
ERROR 07-09 00:57:30 [core.py:586]     raise e from None
ERROR 07-09 00:57:30 [core.py:586] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
    self.workers = WorkerProc.wait_for_ready(unready_workers)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
    raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1495, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
    return AsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 666, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 403, in __init__
    with launch_core_engines(vllm_config, executor_class,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.0.3
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-43-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA GeForce RTX 3080
GPU 1: NVIDIA GeForce RTX 3080

Nvidia driver version        : 575.64.03
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          32
On-line CPU(s) list:             0-31
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU family:                      6
Model:                           85
Thread(s) per core:              1
Core(s) per socket:              16
Socket(s):                       2
Stepping:                        7
BogoMIPS:                        6000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
L1d cache:                       1 MiB (32 instances)
L1i cache:                       1 MiB (32 instances)
L2 cache:                        128 MiB (32 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-15
NUMA node1 CPU(s):               16-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.2.6.post1+cu128torch2.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.53.1
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
    GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PHB 0-31    0-1     N/A
GPU1    PHB  X  0-31    0-1     N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NCCL_VERSION=2.25.1-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.8.1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
LD_LIBRARY_PATH=/usr/local/cuda/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Metadata

Labels: bug