🐛 Describe the bug
When trying to load the Qwen3-Reranker-4B model with tensor parallelism enabled (tensor_parallel_size=2), the model initialization fails due to a tensor dimension mismatch error.
Environment
- vLLM version: 0.9.2
- Model: Qwen/Qwen3-Reranker-4B
- GPU configuration: 2 GPUs with tensor parallelism
- CUDA version: 12.9
Steps to reproduce
- Run vLLM with the following configuration:
--model Qwen/Qwen3-Reranker-4B --task score --enforce_eager True --served_model_name Qwen/Qwen3-Reranker-4B-30k --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' --tensor_parallel_size 2 --gpu_memory_utilization 0.97
Expected behavior
The model should load successfully with tensor parallelism across 2 GPUs.
Actual behavior
The model fails to load with the following error:
RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
The full stack trace shows that the error occurs in the load_weights_using_from_2_way_softmax function when copying weights into the score layer:
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax model.score.weight.data.copy_(weight) RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
Possible workaround
The model loads successfully when using tensor_parallel_size=1 (no tensor parallelism).
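For reference, the same configuration expressed through the offline Python API also hits the failure; this is a sketch of the CLI flags from the steps to reproduce, with keyword names as they appear in vLLM 0.9.x (they may differ in other versions):

```python
from vllm import LLM

# Same settings as the failing server command above.
llm = LLM(
    model="Qwen/Qwen3-Reranker-4B",
    task="score",
    enforce_eager=True,
    tensor_parallel_size=2,          # fails with 2, loads fine with 1
    gpu_memory_utilization=0.97,
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
)
```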
Full Log:
INFO 07-09 00:56:59 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:02 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-09 00:57:02 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': 'Qwen/Qwen3-Reranker-4B', 'task': 'score', 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen3-Reranker-4B-30k'], 'hf_overrides': {'architectures': ['Qwen3ForSequenceClassification'], 'classifier_from_token': ['no', 'yes'], 'is_original_qwen3_reranker': True}, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.97}
INFO 07-09 00:57:09 [config.py:1472] Using max model len 40960
INFO 07-09 00:57:09 [arg_utils.py:1596] (Disabling) chunked prefill by default
INFO 07-09 00:57:09 [arg_utils.py:1599] (Disabling) prefix caching by default
WARNING 07-09 00:57:09 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 07-09 00:57:09 [config.py:4601] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 07-09 00:57:14 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:17 [core.py:526] Waiting for init message from front-end.
INFO 07-09 00:57:17 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='Qwen/Qwen3-Reranker-4B', speculative_config=None, tokenizer='Qwen/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Reranker-4B-30k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 07-09 00:57:17 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-09 00:57:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_86991475'), local_subscribe_addr='ipc:///tmp/d4a143f0-8d21-4c46-a4fd-667a62dba5dd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b4da7d2'), local_subscribe_addr='ipc:///tmp/03a0090e-ee69-42d1-a8d5-d900afccb3b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f40c3804'), local_subscribe_addr='ipc:///tmp/1fe57a13-1ef5-480f-a578-f67b4f015124', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=213) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_468dc0a1'), local_subscribe_addr='ipc:///tmp/0cdef032-c307-4c3a-b4df-1c75d7eca008', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212)
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 1.33it/s]
(VllmWorker rank=0 pid=212)
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.38it/s]
(VllmWorker rank=0 pid=212)
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.37it/s]
(VllmWorker rank=0 pid=212)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.worker.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.model_runner.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.model = model_loader.load_model(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] self.load_weights(model, model_config)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] loaded_weights = model.load_weights(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 256, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] return seq_cls_model_loader(self, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 375, in seq_cls_model_loader
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] return SEQ_CLS_LOAD_METHODS[method](model, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] model.score.weight.data.copy_(weight)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
84487e05a96e:213:213 [1] NCCL INFO cudaDriverVersion 12090
84487e05a96e:213:213 [1] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:213:213 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : No device found.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
84487e05a96e:213:213 [1] NCCL INFO Using network Socket
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init START
84487e05a96e:213:213 [1] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:213:213 [1] NCCL INFO Bootstrap timings total 0.001015 (create 0.000035, send 0.000131, recv 0.000371, ring 0.000024, delay 0.000000)
84487e05a96e:213:213 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order
[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:213:213 [1] NCCL INFO comm 0xe54d5d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
84487e05a96e:213:213 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
84487e05a96e:213:213 [1] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:213:262 [1] NCCL INFO [Proxy Service] Device 1 CPU core 7
84487e05a96e:213:265 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 9
84487e05a96e:213:213 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:213:213 [1] NCCL INFO Connected all trees
84487e05a96e:213:266 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 12
84487e05a96e:213:213 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:213:213 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:213:213 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:213:213 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
84487e05a96e:212:212 [0] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO cudaDriverVersion 12090
84487e05a96e:212:212 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:212:212 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : No device found.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
84487e05a96e:212:212 [0] NCCL INFO Using network Socket
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init START
84487e05a96e:212:212 [0] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:212:212 [0] NCCL INFO Bootstrap timings total 0.001006 (create 0.000029, send 0.000126, recv 0.000313, ring 0.000026, delay 0.000000)
84487e05a96e:212:212 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order
[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:212:212 [0] NCCL INFO comm 0xe64fff0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
84487e05a96e:212:212 [0] NCCL INFO Channel 00/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Channel 01/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
84487e05a96e:212:212 [0] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:212:212 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
84487e05a96e:212:264 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 9
84487e05a96e:212:263 [0] NCCL INFO [Proxy Service] Device 0 CPU core 7
84487e05a96e:212:212 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:212:212 [0] NCCL INFO Connected all trees
84487e05a96e:212:267 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 13
84487e05a96e:212:212 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:212:212 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:212:212 [0] NCCL INFO CC Off, workFifoBytes 1048576
84487e05a96e:212:212 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:212:212 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
[rank0]:[W709 00:57:29.669899852 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 07-09 00:57:30 [core.py:586] EngineCore failed to start.
ERROR 07-09 00:57:30 [core.py:586] Traceback (most recent call last):
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
ERROR 07-09 00:57:30 [core.py:586] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
ERROR 07-09 00:57:30 [core.py:586] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
ERROR 07-09 00:57:30 [core.py:586] self.model_executor = executor_class(vllm_config)
ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-09 00:57:30 [core.py:586] self._init_executor()
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
ERROR 07-09 00:57:30 [core.py:586] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-09 00:57:30 [core.py:586] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
ERROR 07-09 00:57:30 [core.py:586] raise e from None
ERROR 07-09 00:57:30 [core.py:586] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1495, in <module>
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
return AsyncMPClient(*client_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 666, in __init__
super().__init__(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 403, in __init__
with launch_core_engines(vllm_config, executor_class,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
next(self.gen)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
wait_for_engine_startup(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 4.0.3
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.7.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform : Linux-5.15.0-43-generic-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.8.93
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA GeForce RTX 3080
GPU 1: NVIDIA GeForce RTX 3080
Nvidia driver version : 575.64.03
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 2
Stepping: 7
BogoMIPS: 6000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
L1d cache: 1 MiB (32 instances)
L1i cache: 1 MiB (32 instances)
L2 cache: 128 MiB (32 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-15
NUMA node1 CPU(s): 16-31
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.2.6.post1+cu128torch2.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.53.1
[pip3] triton==3.3.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.9.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB 0-31 0-1 N/A
GPU1 PHB X 0-31 0-1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NCCL_VERSION=2.25.1-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.8.1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
LD_LIBRARY_PATH=/usr/local/cuda/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.