
Conversation

Contributor

@nrghosh nrghosh commented Oct 16, 2025

Description

Extends the port-collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. The previous fix (PR #55802) only addressed Data Parallelism, which uses an explicit data_parallel_rank.

Changes:

  • base.py: Added _compute_port_offset() method with fallback logic
    • Priority 1: Use data_parallel_rank if set (DP case)
    • Priority 2: Use Ray Serve replica_rank for deterministic offset (TP/PP case)
    • Fallback: Return 0
  • nixl_connector.py: Use _compute_port_offset() instead of direct dp_rank access
  • lmcache_connector_v1.py: Simplified to use string-based port naming with random suffix

Fixes port collision errors in TP/PP deployments:

  • Multiple workers no longer bind to the same port
  • Prevents NIXL_ERR_BACKEND and ZMQ errors
  • Unblocks deployments with pipeline_parallel_size > 1

Reproduction:
Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed:
Creating v1 connector with engine_id: ...-52910 [repeated 3x]
After the fix, each worker receives a unique port via replica_rank, eliminating collisions.
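
For reference, a minimal config along the lines of the deployment described above. This is a sketch only: the model id, the NIXL kv_transfer_config keys, and the assumption that engine_kwargs accepts kv_transfer_config as a plain dict are illustrative, not copied from the actual setup.

from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Sketch: a 2-stage PP deployment with the NIXL KV connector enabled.
llm_config = LLMConfig(
    model_loading_config=dict(model_id="example-model"),  # placeholder model id
    engine_kwargs=dict(
        pipeline_parallel_size=2,  # PP=2, as in the repro
        # Key names follow vLLM's KVTransferConfig; passing a dict here is an assumption.
        kv_transfer_config=dict(kv_connector="NixlConnector", kv_role="kv_both"),
    ),
)
serve.run(build_openai_app({"llm_configs": [llm_config]}))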

Related issues

Addresses #55775
Addresses vllm-project/vllm#20980

Types of change

  • Bug fix 🐛
  • New feature ✨
  • Enhancement 🚀
  • Code refactoring 🔧
  • Documentation update 📖
  • Chore 🧹
  • Style 🎨

Checklist

Does this PR introduce breaking changes?

  • Yes ⚠️
  • No

Testing:

  • Added/updated tests for my changes
  • Tested the changes manually
  • This PR is not tested ❌ (please explain why)

Code Quality:

  • Signed off every commit (git commit -s)
  • Ran pre-commit hooks (setup guide)

Documentation:

  • Updated documentation (if applicable) (contribution guide)
  • Added new APIs to doc/source/ (if applicable)

Additional context

Code Changes

NIXL Connector - Before:

dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank  # Always 0 for TP/PP!

NIXL Connector - After:

port = base_port + self._compute_port_offset()  # Works for DP/TP/PP

_compute_port_offset() Implementation:

def _compute_port_offset(self) -> int:
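    # NOTE: `serve` below refers to `ray.serve` (assumed to be imported at module level in base.py).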
    # Priority 1: Use explicit DP rank when available
    dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank")
    if isinstance(dp_rank, int) and dp_rank >= 0:
        return dp_rank
    
    # Priority 2: Fall back to Serve replica rank for TP/PP cases
    try:
        rc = serve.get_replica_context()
        if rc and hasattr(rc, "rank"):
            return rc.rank
    except Exception:
        pass
    
    return 0

LMCache Connector - Simplified approach:

# Always use string-based naming with random suffix for uniqueness
lmcache_rpc_port_value = str(base_value) + self._get_unique_suffix()
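
Combined with the _get_unique_suffix() helper shown in the diff below (6 random alphanumeric characters by default), two replicas starting from the same base value end up with distinct names. The base value shown here is an assumption for illustration:

# e.g. replica A: "lmcache_rpc_port" + "k3Fq9Z" -> "lmcache_rpc_portk3Fq9Z"
#      replica B: "lmcache_rpc_port" + "Ab7xQ2" -> "lmcache_rpc_portAb7xQ2"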

Backward Compatibility

  • DP deployments continue using explicit data_parallel_rank (priority 1)
  • TP/PP deployments now use replica_rank from Ray Serve (priority 2)
  • LMCache uses string-based port naming with random suffix (unchanged behavior)
  • Zero fallback maintains current behavior when neither rank is available
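
For illustration, with a hypothetical base port of 50000 the three paths resolve as follows (numbers are examples only, not taken from the PR):

# Hypothetical base_port = 50000:
#   DP deployment,    data_parallel_rank = 1   -> offset 1 -> port 50001
#   TP/PP deployment, Serve replica rank = 2   -> offset 2 -> port 50002
#   Neither rank available                     -> offset 0 -> port 50000 (previous behavior)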

Extends port collision fix to Tensor Parallelism (TP) and Pipeline
Parallelism (PP) scenarios. Previous fix (PR ray-project#55802) only addressed
Data Parallelism by using explicit data_parallel_rank.

Changes:
- base.py: Added _compute_port_offset() method with fallback logic
  * Priority 1: Use data_parallel_rank if set (DP case)
  * Priority 2: Hash replica_tag for deterministic offset (TP/PP case)
  * Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction:
Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray
3.0.0.dev0 (8 x L4 GPU cluster). Before fix, all workers used identical
port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed:
  'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'
After fix, each worker receives unique port via replica tag hashing,
eliminating collisions.

Related: ray-project#55775
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh added the serve (Ray Serve Related Issue), llm, and go (add ONLY when ready to merge, run all tests) labels Oct 16, 2025
Contributor Author

@nrghosh nrghosh left a comment


Failing llm_serve_vllm_integration_tests::test_deepseek_model release test with

KeyError: Deployment(name='LLMServer:deepseek-ai--DeepSeek-V2-Lite', app='default')

But it passes locally on the feature branch:

PASSED

============================================================================ 1 passed in 57.16s ============================================================================
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:05 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/ray/anaconda3/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=704,device_name=NVIDIA_L4.json'] [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [custom_all_reduce.py:154] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2602] Starting to load model deepseek-ai/DeepSeek-V2-Lite...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2634] Loading model from scratch...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [utils.py:125] Hidden layers were unevenly partitioned: [14,13]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [cuda.py:297] Using Triton MLA backend on V1 engine.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:00 [weight_utils.py:392] Using model weights format ['*.safetensors']
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:06 [gpu_worker.py:298] Available KV cache memory: 12.54 GiB [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:07 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used. [repeated 2x across cluster]
(base) ray@ip-10-0-167-142:~/default/work/ray$ 
(base) ray@ip-10-0-167-142:~/default/work/ray$ git status
On branch nrghosh/pp-tp-kv-port-offset
nothing to commit, working tree clean

@nrghosh nrghosh marked this pull request as ready for review October 18, 2025 00:04
@nrghosh nrghosh requested a review from a team as a code owner October 18, 2025 00:04
@nrghosh nrghosh self-assigned this Oct 18, 2025
@kouroshHakha kouroshHakha changed the title [serve.llm] Fix port collisions for TP/PP with NIXL/LMCache [bugfix][serve][llm] Fix port collisions for TP/PP with NIXL/LMCache Oct 18, 2025
@@ -35,6 +35,38 @@ def _get_unique_suffix(self, len: int = 6) -> str:
"""
return "".join(random.choices(string.ascii_letters + string.digits, k=len))

def _compute_port_offset(self) -> int:
Contributor

we should just use the replica rank to do this I feel like.

Contributor Author

Yep.

_compute_port_offset() now uses replica_rank from the replica context instead of the hashing approach.

So the logic is now:

  1. Use data_parallel_rank if explicitly set (DP deployments via DPServer)
  2. Fall back to replica_rank from serve context (TP/PP deployments)
  3. Return 0 as final fallback

- Use replica_rank API instead of hashing approach
- Simplify LMCache connector by just keeping string approach
- Update comments / lint

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Contributor

@kouroshHakha kouroshHakha left a comment

Better. Thanks

@kouroshHakha kouroshHakha merged commit d9b0a85 into ray-project:master Oct 22, 2025
6 checks passed
nrghosh added a commit to nrghosh/ray that referenced this pull request Oct 24, 2025
Multiplies replica_rank by tensor_parallel_size to prevent port collisions
when scaling to 2+ replicas with TP≥2.

Problem:
PR ray-project#57771 fixed inter-replica port collisions by using replica_rank instead
of defaulting to 0. However, it didn't account for the port space needed by
TP workers within each replica.

vLLM workers add their tp_rank (0, 1, ..., tp_size-1) to the base port at
bind time (vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:790).
Without proper spacing, consecutive replicas have overlapping port ranges:
  Replica 0 TP Worker 1: base + 0 + 1 = 50001
  Replica 1 TP Worker 0: base + 1 + 0 = 50001  ← Collision

Solution:
Space replicas by tp_size ports to reserve room for all TP workers:
  Replica 0 uses ports: [base, base+1, ..., base+(tp_size-1)]
  Replica 1 uses ports: [base+tp_size, base+tp_size+1, ...]

Impact:
- Fixes port collisions when autoscaling to 2+ replicas with TP≥2
- Backward compatible: TP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single replica deployments unchanged: no other replica to collide with

Related: PR ray-project#57771, ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
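
A rough sketch of the revised offset computation that commit describes (the helper name below is illustrative, not the exact diff):

# Sketch only: space replicas tp_size ports apart so intra-replica tp_rank offsets never overlap.
def _compute_port_offset(self) -> int:
    tp_size = self.llm_config.engine_kwargs.get("tensor_parallel_size", 1)
    replica_rank = self._get_replica_rank()  # hypothetical helper: DP rank or Serve replica rank, else 0
    return replica_rank * tp_size

# vLLM's NIXL connector adds each worker's tp_rank at bind time, so the effective port is:
#   worker_port = base_port + replica_rank * tp_size + tp_rank   (unique across replicas and workers)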
@nrghosh nrghosh deleted the nrghosh/pp-tp-kv-port-offset branch October 28, 2025 17:30
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ay-project#57771)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>