[Bug]: EADDRINUSE (-98) error when setting up NCCL communicator

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
INFO 04-02 23:52:28 [__init__.py:239] Automatically detected platform rocm.
Collecting environment information...
PyTorch version: 2.8.0.dev20250327+rocm6.3
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42131-fa1d09cbd

OS: SUSE Linux Enterprise Server 15 SP6 (x86_64)
GCC version: (SUSE Linux) 7.5.0
Clang version: Could not collect
CMake version: version 3.28.3
Libc version: glibc-2.38

Python version: 3.12.0 | packaged by Anaconda, Inc. | (main, Oct  2 2023, 17:29:18) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.4.0-150600.23.17_14.0.63-cray_shasta_c-x86_64-with-glibc2.38
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI210 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42131
MIOpen runtime version: 3.3.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7763 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   2
Core(s) per socket:                   64
Socket(s):                            1
Stepping:                             1
BogoMIPS:                             4890.70
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
L1d cache:                            2 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             32 MiB (64 instances)
L3 cache:                             256 MiB (8 instances)
NUMA node(s):                         4
NUMA node0 CPU(s):                    0-15,64-79
NUMA node1 CPU(s):                    16-31,80-95
NUMA node2 CPU(s):                    32-47,96-111
NUMA node3 CPU(s):                    48-63,112-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] numpy==1.26.4
[pip3] pytorch-triton-rocm==3.3.0+git96316ce5
[pip3] pyzmq==26.2.1
[pip3] torch==2.8.0.dev20250327+rocm6.3
[pip3] torchaudio==2.6.0.dev20250331+rocm6.3
[pip3] torchvision==0.22.0.dev20250331+rocm6.3
[pip3] transformers==4.50.3
[pip3] triton==3.2.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] pytorch-triton-rocm       3.3.0+git96316ce5          pypi_0    pypi
[conda] pyzmq                     26.2.1                   pypi_0    pypi
[conda] torch                     2.8.0.dev20250327+rocm6.3          pypi_0    pypi
[conda] torchaudio                2.6.0.dev20250331+rocm6.3          pypi_0    pypi
[conda] torchvision               0.22.0.dev20250331+rocm6.3          pypi_0    pypi
[conda] transformers              4.50.3                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi
ROCM Version: 6.3.42133-1b9c17779
Neuron SDK Version: N/A
vLLM Version: 0.8.3.dev151+ge6e3c55ef.d20250401
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         
GPU0   0            

================================= Hops between two GPUs ==================================
       GPU0         
GPU0   0            

=============================== Link Type between two GPUs ===============================
       GPU0         
GPU0   0            

======================================= Numa Nodes =======================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 0
================================== End of ROCm SMI Log ===================================

LD_LIBRARY_PATH=/opt/rocm-6.3.1/lib/roctracer:/opt/rocm-6.3.1/lib/rocprofiler:/opt/rocm-6.3.1/lib:/lib:/sw/frontier/spack-envs/cpe24.11-cpu/opt/cce-18.0.1/darshan-runtime-3.4.6-ymnx2jlqwdsmjgdiu6ldpyxmcenq2nks/lib:/opt/cray/pe/papi/7.1.0.4/lib64:/opt/cray/libfabric/1.22.0/lib64
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
```
</details>


### 🐛 Describe the bug

When launching a node-local inference server using trl's `vllm_serve`, I often get EADDRINUSE errors from the NCCL communicator setup:

```
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/uvicorn-0.34.0-py3.12.egg/uvicorn/protocols/http/httptools_impl.py", line 409, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/uvicorn-0.34.0-py3.12.egg/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/fastapi-0.115.8-py3.12.egg/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/routing.py", line 74, in app
    await response(scope, receive, send)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/responses.py", line 159, in __call__
    await self.background()
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/background.py", line 41, in __call__
    await task()
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/background.py", line 28, in __call__
    await run_in_threadpool(self.func, *self.args, **self.kwargs)
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/starlette/concurrency.py", line 37, in run_in_threadpool
    return await anyio.to_thread.run_sync(func)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 2461, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 962, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 496, in collective_rpc
    return self.llm_engine.collective_rpc(method, timeout, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 2132, in collective_rpc
    return self.model_executor.collective_rpc(method, timeout, args,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/vllm/utils.py", line 2329, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/trl/scripts/vllm_serve.py", line 111, in init_communicator
    pg = StatelessProcessGroup.create(host=host, port=port, rank=rank, world_size=world_size)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lustre/orion/stf006/world-shared/glaser/miniconda3/envs/grpo/lib/python3.12/site-packages/vllm/distributed/utils.py", line 236, in create
    store = TCPStore(
            ^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server could not be initialized on any address for port=51216, family=10 The server socket has failed to bind to su-aliases.head-bmc.cm.frontier.olcf.ornl.gov:51216 (errno: 98 - Address already in use). The server could not be initialized on any address for port=51216, family=2
```
This happens even though the server port is verifiably not in use by any other process on that node.

Launch command:
```
VLLM_PORT=29501 ROCR_VISIBLE_DEVICES=7 trl vllm-serve --model ${MODEL} --host=127.0.0.1
```

The suspected cause is that torch's TCPStore internally listens on the wildcard address 0.0.0.0 for incoming connections. It therefore binds the port on *all* network interfaces of the node, not just the one associated with the provided host address, so the bind can fail when the port is taken on any other interface.

https://github.com/vllm-project/vllm/blob/37bfee92bf4159f5839d9bed6b2fb2b96db4e741/vllm/distributed/utils.py#L236-L242

and 

https://github.com/pytorch/pytorch/blob/2e5d95a0828060f816251671e8e59f2680f9f9be/torch/csrc/distributed/c10d/TCPStoreLibUvBackend.cpp#L269-L273
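
To illustrate the suspected mechanism without involving torch, here is a minimal sketch using only the Python standard library. It is not the TCPStore code path; `try_bind` is a hypothetical helper introduced for this example. One socket holds a port on a single specific address, and a second socket then tries to bind the same port on the wildcard address, as a server listening on 0.0.0.0 would. On Linux the wildcard bind is expected to be rejected with EADDRINUSE.

```python
import errno
import socket


def try_bind(host: str, port: int) -> int:
    """Return 0 if a TCP socket can bind to (host, port), else the errno.

    Hypothetical helper for this illustration; not part of torch or vLLM.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError as exc:
            return exc.errno
        return 0


# Hold a port on one specific address only (loopback in this sketch).
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
holder.listen()
port = holder.getsockname()[1]

# A wildcard bind needs the port to be free on every local address,
# so it conflicts with the socket already bound to a specific address.
wildcard_errno = try_bind("0.0.0.0", port)
print(port, errno.errorcode.get(wildcard_errno, "OK"))

holder.close()
```

The same reasoning would apply if the port were held on a different interface than the one passed as `--host`: a check against the host address alone would show the port as free, while the wildcard bind would still fail.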

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Code referenced by the vLLM link above (`vllm/distributed/utils.py`, lines 236-242):

    store = TCPStore(
        host_name=host,
        port=port,
        world_size=world_size,
        is_master=(rank == 0),
        timeout=datetime.timedelta(seconds=store_timeout),
    )
