[bugfix][serve][llm] Fix port collisions for TP/PP with NIXL/LMCache #57771

Conversation
Extends port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. Previous fix (PR ray-project#55802) only addressed Data Parallelism by using explicit data_parallel_rank.

Changes:
- base.py: Added _compute_port_offset() method with fallback logic
  * Priority 1: Use data_parallel_rank if set (DP case)
  * Priority 2: Hash replica_tag for deterministic offset (TP/PP case)
  * Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to the same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction: Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: 'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'. After the fix, each worker receives a unique port via replica tag hashing, eliminating collisions.

Related: ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
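For reference, a minimal sketch of the replica-tag hashing fallback this commit describes (the helper name, the MAX_PORT_OFFSET bound, and the exact Serve context fields are assumptions, not code from the PR):

```python
import hashlib

from ray import serve

MAX_PORT_OFFSET = 1024  # assumed bound just to keep derived ports in a sane range


def port_offset_from_replica_tag() -> int:
    """Derive a deterministic per-replica port offset by hashing the replica tag."""
    try:
        replica_tag = serve.get_replica_context().replica_tag
    except Exception:
        # Not running inside a Serve replica; fall back to no offset.
        return 0
    # Use a stable hash (unlike Python's randomized built-in hash()) so every
    # worker process in the same replica derives the same offset.
    digest = hashlib.sha256(replica_tag.encode()).hexdigest()
    return int(digest, 16) % MAX_PORT_OFFSET
```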
The llm_serve_vllm_integration_tests::test_deepseek_model release test is failing with
KeyError: Deployment(name='LLMServer:deepseek-ai--DeepSeek-V2-Lite', app='default')
but it passes locally on the feature branch:
PASSED
============================================================================ 1 passed in 57.16s ============================================================================
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:05 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/ray/anaconda3/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=704,device_name=NVIDIA_L4.json'] [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [custom_all_reduce.py:154] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2602] Starting to load model deepseek-ai/DeepSeek-V2-Lite...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2634] Loading model from scratch...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [utils.py:125] Hidden layers were unevenly partitioned: [14,13]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [cuda.py:297] Using Triton MLA backend on V1 engine.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:00 [weight_utils.py:392] Using model weights format ['*.safetensors']
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:06 [gpu_worker.py:298] Available KV cache memory: 12.54 GiB [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:07 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used. [repeated 2x across cluster]
(base) ray@ip-10-0-167-142:~/default/work/ray$
(base) ray@ip-10-0-167-142:~/default/work/ray$ git status
On branch nrghosh/pp-tp-kv-port-offset
nothing to commit, working tree clean
@@ -35,6 +35,38 @@ def _get_unique_suffix(self, len: int = 6) -> str:
        """
        return "".join(random.choices(string.ascii_letters + string.digits, k=len))

    def _compute_port_offset(self) -> int:
we should just use the replica rank to do this I feel like.
yep
now _compute_port_offset() uses replica_rank from the replica context instead of the hashing approach
so now the logic is (sketched below):
- Use data_parallel_rank if explicitly set (DP deployments via DPServer)
- Fall back to replica_rank from the Serve context (TP/PP deployments)
- Return 0 as the final fallback
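A rough sketch of that fallback order, assuming a data_parallel_rank value and a replica_rank field on the Serve replica context (the merged code in base.py may differ in details):

```python
from typing import Optional

from ray import serve


def compute_port_offset(data_parallel_rank: Optional[int] = None) -> int:
    # Priority 1: explicit data_parallel_rank (DP deployments via DPServer).
    if data_parallel_rank is not None:
        return data_parallel_rank

    # Priority 2: replica_rank from the Serve replica context (TP/PP deployments).
    try:
        ctx = serve.get_replica_context()
        rank = getattr(ctx, "replica_rank", None)  # field name assumed
        if rank is not None:
            return rank
    except Exception:
        # Not running inside a Serve replica.
        pass

    # Final fallback: no offset.
    return 0
```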
- Use replica_rank API instead of hashing approach
- Simplify LMCache connector by just keeping the string approach
- Update comments / lint

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
kouroshHakha
left a comment
Better. Thanks
Multiplies replica_rank by tensor_parallel_size to prevent port collisions when scaling to 2+ replicas with TP≥2.

Problem: PR ray-project#57771 fixed inter-replica port collisions by using replica_rank instead of defaulting to 0. However, it didn't account for the port space needed by TP workers within each replica. vLLM workers add their tp_rank (0, 1, ..., tp_size-1) to the base port at bind time (vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:790). Without proper spacing, consecutive replicas have overlapping port ranges:

  Replica 0, TP worker 1: base + 0 + 1 = 50001
  Replica 1, TP worker 0: base + 1 + 0 = 50001  ← collision

Solution: Space replicas by tp_size ports to reserve room for all TP workers:

  Replica 0 uses ports: [base, base+1, ..., base+(tp_size-1)]
  Replica 1 uses ports: [base+tp_size, base+tp_size+1, ...]

Impact:
- Fixes port collisions when autoscaling to 2+ replicas with TP≥2
- Backward compatible: TP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single-replica deployments unchanged: no other replica to collide with

Related: PR ray-project#57771, ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
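A small, self-contained illustration of that spacing arithmetic (the function and constants here are illustrative, not the connector's actual code):

```python
def worker_port(base_port: int, replica_rank: int, tp_size: int, tp_rank: int) -> int:
    # Each replica reserves a block of tp_size consecutive ports; the vLLM
    # worker then adds its own tp_rank within that block at bind time.
    return base_port + replica_rank * tp_size + tp_rank


if __name__ == "__main__":
    base, tp_size = 50000, 2
    ports = {
        (replica, tp_rank): worker_port(base, replica, tp_size, tp_rank)
        for replica in range(2)        # two Serve replicas
        for tp_rank in range(tp_size)  # TP workers within each replica
    }
    # With spacing: replica 0 -> {50000, 50001}, replica 1 -> {50002, 50003}.
    assert len(set(ports.values())) == len(ports), "no port collisions expected"
    print(ports)
```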
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
Extends port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. Previous fix (PR #55802) only addressed Data Parallelism by using explicit data_parallel_rank.

Changes:

- base.py: Added _compute_port_offset() method with fallback logic
  - Use data_parallel_rank if set (DP case)
  - Use replica_rank for deterministic offset (TP/PP case)
- nixl_connector.py: Use _compute_port_offset() instead of direct dp_rank access
- lmcache_connector_v1.py: Simplified to use string-based port naming with random suffix

Fixes port collision errors in TP/PP deployments:

- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables deployment with PP size > 1

Reproduction:

Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: Creating v1 connector with engine_id: ...-52910 [repeated 3x]. After the fix, each worker receives a unique port via replica_rank, eliminating collisions. A deployment config sketch for this setup is included under Additional context below.

Related issues
Addresses #55775
Addresses vllm-project/vllm#20980
Types of change

Checklist

Does this PR introduce breaking changes?

Testing:

Code Quality:

- Commits are signed off (git commit -s)

Documentation:

- Changes to doc/source/ (if applicable)

Additional context
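As additional context, a hedged sketch of the reproduction setup described above. It follows the public Ray Serve LLM API; the exact release-test config is not shown in this PR, and the kv_transfer_config values are assumptions:

```python
# Hedged reproduction sketch (not the PR's test code). Assumes the public
# Ray Serve LLM API and vLLM's NixlConnector; adjust names to your versions.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(model_id="deepseek-ai/DeepSeek-V2-Lite"),
    engine_kwargs=dict(
        pipeline_parallel_size=2,
        kv_transfer_config=dict(kv_connector="NixlConnector", kv_role="kv_both"),
    ),
    deployment_config=dict(autoscaling_config=dict(min_replicas=1, max_replicas=1)),
)

# Before the fix, the PP workers of this deployment could all bind the same
# NIXL side-channel port and fail with NIXL_ERR_BACKEND.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```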
Code Changes

NIXL Connector - Before:

NIXL Connector - After:

_compute_port_offset() Implementation:

LMCache Connector - Simplified approach:
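A rough sketch of the shapes these headings describe (attribute and helper names are assumptions, not the connectors' actual code):

```python
import random
import string


# NIXL connector, before: only the DP rank offsets the side-channel port, so
# every TP/PP replica (where dp_rank is 0) computed the same port.
def nixl_port_before(base_port: int, dp_rank: int) -> int:
    return base_port + dp_rank


# NIXL connector, after: the shared _compute_port_offset() result (DP rank,
# else Serve replica_rank, else 0) gives each replica a distinct offset.
def nixl_port_after(base_port: int, port_offset: int) -> int:
    return base_port + port_offset


# LMCache connector, simplified approach: keep string-valued identifiers and
# make them unique with a random suffix rather than doing numeric offset math.
def lmcache_unique_name(prefix: str, suffix_len: int = 6) -> str:
    suffix = "".join(random.choices(string.ascii_letters + string.digits, k=suffix_len))
    return f"{prefix}_{suffix}"
```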
Backward Compatibility

- DP deployments: continue to use data_parallel_rank (priority 1)
- TP/PP deployments: use replica_rank from Ray Serve (priority 2)