[Serve.LLM] Port collisions with TP/PP when using NIXL/LMCache KV transfer backends
What happened + What you expected to happen
Reference: PR #55802 partially addressed port collisions for Data Parallelism by setting NIXL side-channel to base_port + data_parallel_rank.
Issue: With TP/PP and multi-replica deployments, multiple vLLM processes on the same node concurrently probe for open ports using get_open_port() and can select the same one. When more than one of them then tries to bind that port, the later binds fail.
This manifests as:
- Flaky startups and deployment failures
- "Address already in use" errors from ZMQ
- NIXL_ERR_BACKEND errors
- Stuck initialization when scaling replicas or increasing TP/PP on shared nodes
Expected: Each worker should receive a unique port to avoid collisions across all parallelism strategies (DP/TP/PP).
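The probe-then-bind race described above can be shown with the standard library alone. This is a minimal sketch: `probe_open_port` is a hypothetical stand-in for a get_open_port()-style helper (not vLLM's actual implementation), and the two "workers" are simulated in one process by binding the same probed port twice.

```python
import errno
import socket


def probe_open_port() -> int:
    """Ask the OS for a free TCP port, then release it immediately.

    Hypothetical stand-in for a get_open_port()-style helper: the port is
    only known to be free at the instant of the probe.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def try_bind(port: int) -> socket.socket:
    """Bind a TCP socket to the port and keep it open, as a worker would."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
    except OSError:
        s.close()
        raise
    return s


# Two workers that probe at the same moment can both be handed this port.
port = probe_open_port()

worker_a = try_bind(port)  # the first worker wins the bind
collision_errno = None
try:
    worker_b = try_bind(port)  # the second worker attempts the same port
    worker_b.close()
except OSError as exc:
    collision_errno = exc.errno
finally:
    worker_a.close()

print(collision_errno == errno.EADDRINUSE)
```

Nothing reserves the port between the probe and the bind, so any scheme that relies on probing alone stays racy when several processes start together.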
Symptoms
- Port collisions: Multiple workers use identical ports
  Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910 [repeated 3x across cluster]
- NIXL backend errors:
  nixl._bindings.nixlBackendError: NIXL_ERR_BACKEND nixl_agent.cpp:481] registerMem: registration failed
- Deployment failures: Replicas fail to initialize and continuously restart
Root Cause
PR #55802 partially fixed port collisions for DP by adding logic to use data_parallel_rank:
dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank
However, this only works when data_parallel_rank is explicitly set by DPServer, which only occurs in DP deployments.
For TP/PP deployments:
- data_parallel_rank is not set (or defaults to 0)
- All workers use offset 0 → same port for all workers
- Port collision occurs when multiple workers initialize on the same node
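The TP/PP case above can be checked in isolation. The sketch below mirrors the `base_port + dp_rank` logic with plain dictionaries, then contrasts it with one possible disambiguation. `current_port`, `rank_aware_port`, the rank-flattening formula, and the base port 5600 are hypothetical illustrations, not the Ray or vLLM implementation.

```python
BASE_PORT = 5600  # arbitrary value chosen for illustration


def current_port(engine_kwargs: dict, base_port: int = BASE_PORT) -> int:
    """Mirror of the DP-only logic: the offset comes from data_parallel_rank."""
    dp_rank = engine_kwargs.get("data_parallel_rank", 0)
    return base_port + dp_rank


def rank_aware_port(dp_rank: int, pp_rank: int, tp_rank: int,
                    pp_size: int, tp_size: int,
                    base_port: int = BASE_PORT) -> int:
    """Hypothetical offset: flatten (dp, pp, tp) ranks into one unique index."""
    return base_port + (dp_rank * pp_size + pp_rank) * tp_size + tp_rank


# A PP=2, TP=1 deployment: no worker sets data_parallel_rank.
pp_size, tp_size = 2, 1
engine_kwargs = {"pipeline_parallel_size": pp_size, "tensor_parallel_size": tp_size}

current = [current_port(engine_kwargs) for _ in range(pp_size * tp_size)]
proposed = [
    rank_aware_port(0, pp, tp, pp_size, tp_size)
    for pp in range(pp_size)
    for tp in range(tp_size)
]

print(current)   # [5600, 5600]: every worker gets the same port
print(proposed)  # [5600, 5601]: one port per worker
```

A flattened rank only separates workers inside one engine. Independent replicas that share a node would still overlap, so a complete fix would presumably also need a per-replica component.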
Current code (nixl_connector.py):
dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank  # Always 0 for TP/PP!
Reproduction
Environment
- Ray: 3.0.0.dev0 (nightly)
- Python: 3.11.11
- Cluster: 8 GPU head node
Minimal Config (serve_config.yaml)
applications:
- name: test-pp2-nixl
  import_path: ray.serve.llm:build_openai_app
  route_prefix: /
  args:
    llm_configs:
    - model_loading_config:
        model_id: facebook/opt-125m
      engine_kwargs:
        pipeline_parallel_size: 2
        tensor_parallel_size: 1
        max_num_seqs: 4
        enforce_eager: true
        kv_transfer_config:
          kv_connector: NixlConnector
          kv_role: kv_both
      deployment_config:
        autoscaling_config:
          min_replicas: 1
          max_replicas: 1
        ray_actor_options:
          num_cpus: 4
          num_gpus: 0
Steps to Reproduce
- Deploy the application:
  serve run serve_config.yaml
- Check logs for port collision:
  ray logs --grep "Creating v1 connector"
- Observe that all workers use the same port:
  Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910
  Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910 [repeated 3x]
- Check for NIXL errors:
  ray logs --grep "NIXL_ERR_BACKEND"
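Scanning the log output by eye gets tedious on larger clusters. The sketch below counts engine_id occurrences in captured log text. The regular expressions are based only on the log lines quoted in this issue, and treating "[repeated Nx ...]" as N occurrences is an assumption about Ray's log-deduplication suffix. `count_engine_ids` is a hypothetical helper.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this issue.
ENGINE_ID_RE = re.compile(r"Creating v1 connector with engine_id: (\S+)")
REPEAT_RE = re.compile(r"\[repeated (\d+)x")


def count_engine_ids(log_text: str) -> Counter:
    """Count occurrences of each engine_id in the captured log text.

    Assumption: a "[repeated Nx ...]" suffix stands for N occurrences.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = ENGINE_ID_RE.search(line)
        if not match:
            continue
        repeat = REPEAT_RE.search(line)
        counts[match.group(1)] += int(repeat.group(1)) if repeat else 1
    return counts


sample = """\
Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910
Creating v1 connector with engine_id: Pkjr3b-10.0.235.241-52910 [repeated 3x across cluster]
"""

# Any engine_id seen more than once is a candidate collision.
duplicates = {eid: n for eid, n in count_engine_ids(sample).items() if n > 1}
print(duplicates)
```

An engine_id that appears more than once is the collision signal used in this report. An empty `duplicates` dict suggests each worker registered its own ID.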
Related Issues
- #55775: Failed to launch disaggregated prefiller & decoder with pipeline_parallel_size=2
- PR #55802: Fixed DP DSV3 issues (DP-only fix)
Versions / Dependencies
- Ray: 2.47+ (nightly 3.0.0.dev0 tested)
- vLLM: 0.10+
- Python: 3.11+
Ways To Reproduce Issue
Option 1 - NIXL with Pipeline Parallelism:
Set num_replicas=2, pipeline_parallel_size>=2 (or tensor_parallel_size>=2) with kv_transfer_config={'kv_connector': 'NixlConnector', 'kv_role': 'kv_both'}. Without port disambiguation, bind conflicts occur.
Option 2 - LMCache with numeric port:
Set kv_connector_extra_config={'lmcache_rpc_port': 5555} and num_replicas>=2. Without port disambiguation, ZMQ reports EADDRINUSE.
Impact
Without fix:
- TP/PP deployments with NIXL/LMCache fail to start
- Flaky deployments with intermittent port collisions
- Impossible to scale replicas reliably
With fix:
- All parallelism strategies (DP/TP/PP) work correctly with unique ports per worker
- Reliable scaling and deployment
- No manual port management required
Workaround
Currently, users must manually specify unique ports per worker using NIXL_SIDE_CHANNEL_PORT_BASE in experimental_configs, which is cumbersome and error-prone for multi-worker deployments.
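As a rough illustration of why this workaround is tedious, the sketch below generates one config dict per deployment, each with its own base port. Only the NIXL_SIDE_CHANNEL_PORT_BASE key and its placement under experimental_configs come from this issue. The helper `with_port_base`, the starting port 5600, the stride of 100, and the idea of emitting one config per replica are assumptions.

```python
import copy

PORT_STRIDE = 100  # assumed gap between configs; must exceed the workers per replica


def with_port_base(llm_config: dict, replica_index: int, first_base: int = 5600) -> dict:
    """Return a copy of the config dict with its own NIXL side-channel base port."""
    config = copy.deepcopy(llm_config)
    experimental = config.setdefault("experimental_configs", {})
    experimental["NIXL_SIDE_CHANNEL_PORT_BASE"] = first_base + replica_index * PORT_STRIDE
    return config


template = {
    "model_loading_config": {"model_id": "facebook/opt-125m"},
    "engine_kwargs": {
        "pipeline_parallel_size": 2,
        "kv_transfer_config": {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
    },
}

configs = [with_port_base(template, i) for i in range(3)]
bases = [c["experimental_configs"]["NIXL_SIDE_CHANNEL_PORT_BASE"] for c in configs]
print(bases)  # [5600, 5700, 5800]
```

Every base port has to be picked by hand and kept clear of other services on the node, which is the bookkeeping a built-in fix would remove.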
Issue Severity
High - Blocks TP/PP deployments with KV transfer backends