
[Bug]: Tensor dimension mismatch when loading Qwen3-Reranker-4B with tensor parallel > 1 #20670

@yurhett

Description

🐛 Describe the bug

When trying to load the Qwen3-Reranker-4B model with tensor parallelism enabled (tensor_parallel_size=2), the model initialization fails due to a tensor dimension mismatch error.

Environment

  • vLLM version: 0.9.2
  • Model: Qwen/Qwen3-Reranker-4B
  • GPU configuration: 2 GPUs with tensor parallelism
  • CUDA version: 12.9

Steps to reproduce

  1. Run vLLM with the following configuration:
--model Qwen/Qwen3-Reranker-4B --task score --enforce_eager True --served_model_name Qwen/Qwen3-Reranker-4B-30k --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' --tensor_parallel_size 2 --gpu_memory_utilization 0.97 
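
For reference, the same configuration can also be expressed through the offline Python API. This is an untested sketch that assumes vllm.LLM accepts the same options as the CLI flags above:

```python
# Untested sketch: offline equivalent of the CLI invocation above.
# Argument names mirror the CLI flags and are assumed to be accepted by vllm.LLM.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-Reranker-4B",
    task="score",
    enforce_eager=True,
    hf_overrides={
        "architectures": ["Qwen3ForSequenceClassification"],
        "classifier_from_token": ["no", "yes"],
        "is_original_qwen3_reranker": True,
    },
    tensor_parallel_size=2,       # the failure happens during weight loading
    gpu_memory_utilization=0.97,
)
```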

Expected behavior

The model should load successfully with tensor parallelism across 2 GPUs.

Actual behavior

The model fails to load with the following error:

RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1 

The full stack trace shows that the error occurs in the load_weights_using_from_2_way_softmax function when it attempts to copy weights into the score layer:

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax model.score.weight.data.copy_(weight) RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1 

Possible workaround

The model loads successfully with tensor_parallel_size=1 (no tensor parallelism).
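
If the sharding assumption above is right, one possible direction for a fix (a sketch only, not a patch against vLLM internals; copy_row_shard is a hypothetical helper) would be to narrow the assembled weight to the slice owned by the current TP rank before copying:

```python
import torch

def copy_row_shard(score_weight: torch.Tensor,
                   full_weight: torch.Tensor,
                   tp_rank: int,
                   tp_size: int) -> None:
    """Copy the slice of a full-width classifier weight that belongs to this
    tensor-parallel rank, assuming the score layer shards its input dimension
    evenly across ranks."""
    shard = full_weight.shape[1] // tp_size              # 2560 // 2 = 1280
    start = tp_rank * shard
    score_weight.data.copy_(full_weight[:, start:start + shard])

# With the shapes from the log:
full = torch.randn(1, 2560)
local = torch.empty(1, 1280)
copy_row_shard(local, full, tp_rank=1, tp_size=2)        # succeeds
```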

Full Log:

INFO 07-09 00:56:59 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:02 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-09 00:57:02 [cli_args.py:325] non-default args: {'host': '0.0.0.0', 'model': 'Qwen/Qwen3-Reranker-4B', 'task': 'score', 'enforce_eager': True, 'served_model_name': ['Qwen/Qwen3-Reranker-4B-30k'], 'hf_overrides': {'architectures': ['Qwen3ForSequenceClassification'], 'classifier_from_token': ['no', 'yes'], 'is_original_qwen3_reranker': True}, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.97}
INFO 07-09 00:57:09 [config.py:1472] Using max model len 40960
INFO 07-09 00:57:09 [arg_utils.py:1596] (Disabling) chunked prefill by default
INFO 07-09 00:57:09 [arg_utils.py:1599] (Disabling) prefix caching by default
WARNING 07-09 00:57:09 [cuda.py:102] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 07-09 00:57:09 [config.py:4601] Only "last" pooling supports chunked prefill and prefix caching; disabling both.
INFO 07-09 00:57:14 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:17 [core.py:526] Waiting for init message from front-end.
INFO 07-09 00:57:17 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='Qwen/Qwen3-Reranker-4B', speculative_config=None, tokenizer='Qwen/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Reranker-4B-30k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type=None, normalize=None, softmax=None, step_tag_id=None, returned_token_ids=None), compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":0,"local_cache_dir":null}
WARNING 07-09 00:57:17 [multiproc_worker_utils.py:307] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 07-09 00:57:17 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_86991475'), local_subscribe_addr='ipc:///tmp/d4a143f0-8d21-4c46-a4fd-667a62dba5dd', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
INFO 07-09 00:57:21 [__init__.py:244] Automatically detected platform cuda.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8b4da7d2'), local_subscribe_addr='ipc:///tmp/03a0090e-ee69-42d1-a8d5-d900afccb3b0', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:24 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_f40c3804'), local_subscribe_addr='ipc:///tmp/1fe57a13-1ef5-480f-a578-f67b4f015124', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [__init__.py:1152] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [custom_all_reduce_utils.py:246] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=213) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) WARNING 07-09 00:57:25 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_468dc0a1'), local_subscribe_addr='ipc:///tmp/0cdef032-c307-4c3a-b4df-1c75d7eca008', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [parallel_state.py:1076] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1770] Starting to load model Qwen/Qwen3-Reranker-4B...
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [gpu_model_runner.py:1775] Loading model from scratch...
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:25 [cuda.py:284] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=212) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
(VllmWorker rank=1 pid=213) INFO 07-09 00:57:26 [weight_utils.py:292] Using model weights format ['*.safetensors', '*.bin', '*.pt']
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.33it/s]
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.38it/s]
(VllmWorker rank=0 pid=212) 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.37it/s]
(VllmWorker rank=0 pid=212) 
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] WorkerProc failed to start.
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] Traceback (most recent call last):
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 461, in worker_main
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     worker = WorkerProc(*args, **kwargs)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 358, in __init__
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.worker.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 185, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.model_runner.load_model()
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1776, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.model = model_loader.load_model(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 41, in load_model
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     self.load_weights(model, model_config)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 269, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     loaded_weights = model.load_weights(
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]                      ^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 256, in load_weights
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     return seq_cls_model_loader(self, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 375, in seq_cls_model_loader
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     return SEQ_CLS_LOAD_METHODS[method](model, weights)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/adapters.py", line 351, in load_weights_using_from_2_way_softmax
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487]     model.score.weight.data.copy_(weight)
(VllmWorker rank=1 pid=213) ERROR 07-09 00:57:28 [multiproc_executor.py:487] RuntimeError: The size of tensor a (1280) must match the size of tensor b (2560) at non-singleton dimension 1
84487e05a96e:213:213 [1] NCCL INFO cudaDriverVersion 12090
84487e05a96e:213:213 [1] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:213:213 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : No device found.
84487e05a96e:213:213 [1] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:213:213 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
84487e05a96e:213:213 [1] NCCL INFO Using network Socket
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init START
84487e05a96e:213:213 [1] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:213:213 [1] NCCL INFO Bootstrap timings total 0.001015 (create 0.000035, send 0.000131, recv 0.000371, ring 0.000024, delay 0.000000)
84487e05a96e:213:213 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.

[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order

[2025-07-09 00:57:25] 84487e05a96e:213:213 [1] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:213:213 [1] NCCL INFO comm 0xe54d5d0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
84487e05a96e:213:213 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
84487e05a96e:213:213 [1] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:213:262 [1] NCCL INFO [Proxy Service] Device 1 CPU core 7
84487e05a96e:213:265 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 9
84487e05a96e:213:213 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
84487e05a96e:213:213 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:213:213 [1] NCCL INFO Connected all trees
84487e05a96e:213:266 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 12
84487e05a96e:213:213 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:213:213 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:213:213 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:213:213 [1] NCCL INFO ncclCommInitRank comm 0xe54d5d0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 6000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:213:213 [1] NCCL INFO Init timings - ncclCommInitRank: rank 1 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
84487e05a96e:212:212 [0] NCCL INFO Bootstrap: Using eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO cudaDriverVersion 12090
84487e05a96e:212:212 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
84487e05a96e:212:212 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : No device found.
84487e05a96e:212:212 [0] NCCL INFO NET/IB : Using [RO]; OOB eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO NET/Socket : Using [0]eth0:172.21.0.3<0>
84487e05a96e:212:212 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
84487e05a96e:212:212 [0] NCCL INFO Using network Socket
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init START
84487e05a96e:212:212 [0] NCCL INFO RAS client listening socket at ::1<28028>
84487e05a96e:212:212 [0] NCCL INFO Bootstrap timings total 0.001006 (create 0.000029, send 0.000126, recv 0.000313, ring 0.000026, delay 0.000000)
84487e05a96e:212:212 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.

[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 4, falling back to simple order

[2025-07-09 00:57:25] 84487e05a96e:212:212 [0] graph/search.cc:1127 NCCL WARN Could not find a path for pattern 1, falling back to simple order
84487e05a96e:212:212 [0] NCCL INFO comm 0xe64fff0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
84487e05a96e:212:212 [0] NCCL INFO Channel 00/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Channel 01/02 : 0 1
84487e05a96e:212:212 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
84487e05a96e:212:212 [0] NCCL INFO P2P Chunksize set to 131072
84487e05a96e:212:212 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
84487e05a96e:212:264 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 9
84487e05a96e:212:263 [0] NCCL INFO [Proxy Service] Device 0 CPU core 7
84487e05a96e:212:212 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
84487e05a96e:212:212 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
84487e05a96e:212:212 [0] NCCL INFO Connected all trees
84487e05a96e:212:267 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 13
84487e05a96e:212:212 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
84487e05a96e:212:212 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
84487e05a96e:212:212 [0] NCCL INFO CC Off, workFifoBytes 1048576
84487e05a96e:212:212 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
84487e05a96e:212:212 [0] NCCL INFO ncclCommInitRank comm 0xe64fff0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 5000 commId 0x1760359f24154150 - Init COMPLETE
84487e05a96e:212:212 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 2 total 0.22 (kernels 0.15, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.05, rest 0.00)
[rank0]:[W709 00:57:29.669899852 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
ERROR 07-09 00:57:30 [core.py:586] EngineCore failed to start.
ERROR 07-09 00:57:30 [core.py:586] Traceback (most recent call last):
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
ERROR 07-09 00:57:30 [core.py:586]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-09 00:57:30 [core.py:586]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
ERROR 07-09 00:57:30 [core.py:586]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
ERROR 07-09 00:57:30 [core.py:586]     self.model_executor = executor_class(vllm_config)
ERROR 07-09 00:57:30 [core.py:586]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
ERROR 07-09 00:57:30 [core.py:586]     self._init_executor()
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
ERROR 07-09 00:57:30 [core.py:586]     self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-09 00:57:30 [core.py:586]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-09 00:57:30 [core.py:586]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
ERROR 07-09 00:57:30 [core.py:586]     raise e from None
ERROR 07-09 00:57:30 [core.py:586] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 590, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 577, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 404, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 75, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 53, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 93, in _init_executor
    self.workers = WorkerProc.wait_for_ready(unready_workers)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 422, in wait_for_ready
    raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1495, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
    return AsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 666, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 403, in __init__
    with launch_core_engines(vllm_config, executor_class,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Your current environment

The output of python collect_env.py
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 4.0.3
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-43-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.8.93
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA GeForce RTX 3080
GPU 1: NVIDIA GeForce RTX 3080

Nvidia driver version        : 575.64.03
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          32
On-line CPU(s) list:             0-31
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz
CPU family:                      6
Model:                           85
Thread(s) per core:              1
Core(s) per socket:              16
Socket(s):                       2
Stepping:                        7
BogoMIPS:                        6000.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities
L1d cache:                       1 MiB (32 instances)
L1i cache:                       1 MiB (32 instances)
L2 cache:                        128 MiB (32 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-15
NUMA node1 CPU(s):               16-31
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Mitigation; TSX disabled

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.2.6.post1+cu128torch2.7
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==27.0.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.53.1
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.2
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
    GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  PHB 0-31    0-1     N/A
GPU1    PHB  X  0-31    0-1     N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.8 brand=unknown,driver>=470,driver<471 brand=grid,driver>=470,driver<471 brand=tesla,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=vapps,driver>=470,driver<471 brand=vpc,driver>=470,driver<471 brand=vcs,driver>=470,driver<471 brand=vws,driver>=470,driver<471 brand=cloudgaming,driver>=470,driver<471 brand=unknown,driver>=535,driver<536 brand=grid,driver>=535,driver<536 brand=tesla,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=vapps,driver>=535,driver<536 brand=vpc,driver>=535,driver<536 brand=vcs,driver>=535,driver<536 brand=vws,driver>=535,driver<536 brand=cloudgaming,driver>=535,driver<536 brand=unknown,driver>=550,driver<551 brand=grid,driver>=550,driver<551 brand=tesla,driver>=550,driver<551 brand=nvidia,driver>=550,driver<551 brand=quadro,driver>=550,driver<551 brand=quadrortx,driver>=550,driver<551 brand=nvidiartx,driver>=550,driver<551 brand=vapps,driver>=550,driver<551 brand=vpc,driver>=550,driver<551 brand=vcs,driver>=550,driver<551 brand=vws,driver>=550,driver<551 brand=cloudgaming,driver>=550,driver<551 brand=unknown,driver>=560,driver<561 brand=grid,driver>=560,driver<561 brand=tesla,driver>=560,driver<561 brand=nvidia,driver>=560,driver<561 brand=quadro,driver>=560,driver<561 brand=quadrortx,driver>=560,driver<561 brand=nvidiartx,driver>=560,driver<561 brand=vapps,driver>=560,driver<561 brand=vpc,driver>=560,driver<561 brand=vcs,driver>=560,driver<561 brand=vws,driver>=560,driver<561 brand=cloudgaming,driver>=560,driver<561 brand=unknown,driver>=565,driver<566 brand=grid,driver>=565,driver<566 brand=tesla,driver>=565,driver<566 brand=nvidia,driver>=565,driver<566 brand=quadro,driver>=565,driver<566 brand=quadrortx,driver>=565,driver<566 brand=nvidiartx,driver>=565,driver<566 brand=vapps,driver>=565,driver<566 brand=vpc,driver>=565,driver<566 brand=vcs,driver>=565,driver<566 brand=vws,driver>=565,driver<566 brand=cloudgaming,driver>=565,driver<566
NCCL_VERSION=2.25.1-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NVIDIA_PRODUCT_NAME=CUDA
VLLM_USAGE_SOURCE=production-docker-image
CUDA_VERSION=12.8.1
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
LD_LIBRARY_PATH=/usr/local/cuda/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Metadata

Labels: bug