[bugfix][serve][llm] Fix port collisions for TP/PP with NIXL/LMCache #57771

Conversation
Extends port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. Previous fix (PR ray-project#55802) only addressed Data Parallelism by using explicit data_parallel_rank.

Changes:
- base.py: Added _compute_port_offset() method with fallback logic
  * Priority 1: Use data_parallel_rank if set (DP case)
  * Priority 2: Hash replica_tag for deterministic offset (TP/PP case)
  * Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to the same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction: Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: 'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'. After the fix, each worker receives a unique port via replica tag hashing, eliminating collisions.

Related: ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
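For reference, a minimal sketch of the replica-tag hashing fallback this commit describes (the helper name, the MAX_PORT_OFFSET bound, and the exact Serve context fields are assumptions, not code from the PR):

```python
import hashlib

from ray import serve

MAX_PORT_OFFSET = 1024  # assumed bound just to keep derived ports in a sane range


def port_offset_from_replica_tag() -> int:
    """Derive a deterministic per-replica port offset by hashing the replica tag."""
    try:
        replica_tag = serve.get_replica_context().replica_tag
    except Exception:
        # Not running inside a Serve replica; fall back to no offset.
        return 0
    # Use a stable hash (unlike Python's randomized built-in hash()) so every
    # worker process in the same replica derives the same offset.
    digest = hashlib.sha256(replica_tag.encode()).hexdigest()
    return int(digest, 16) % MAX_PORT_OFFSET
```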
The llm_serve_vllm_integration_tests::test_deepseek_model release test is failing with
KeyError: Deployment(name='LLMServer:deepseek-ai--DeepSeek-V2-Lite', app='default')
but it passes locally on the feature branch:
PASSED
============================================================================ 1 passed in 57.16s ============================================================================
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:05 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/ray/anaconda3/lib/python3.11/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=704,device_name=NVIDIA_L4.json'] [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.9 not supported, communicator is not available.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [custom_all_reduce.py:154] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [parallel_state.py:1208] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) WARNING 10-16 17:44:59 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2602] Starting to load model deepseek-ai/DeepSeek-V2-Lite...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [gpu_model_runner.py:2634] Loading model from scratch...
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [utils.py:125] Hidden layers were unevenly partitioned: [14,13]. This can be manually overridden using the VLLM_PP_LAYER_PARTITION environment variable
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:44:59 [cuda.py:297] Using Triton MLA backend on V1 engine.
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:00 [weight_utils.py:392] Using model weights format ['*.safetensors']
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225271) INFO 10-16 17:45:06 [gpu_worker.py:298] Available KV cache memory: 12.54 GiB [repeated 4x across cluster]
(ServeReplica:default:LLMServer:deepseek-ai--DeepSeek-V2-Lite pid=224669) (EngineCore_DP0 pid=225159) (RayWorkerWrapper pid=225270) WARNING 10-16 17:45:07 [cudagraph_dispatcher.py:106] cudagraph dispatching keys are not initialized. No cudagraph will be used. [repeated 2x across cluster]
(base) ray@ip-10-0-167-142:~/default/work/ray$
(base) ray@ip-10-0-167-142:~/default/work/ray$ git status
On branch nrghosh/pp-tp-kv-port-offset
nothing to commit, working tree clean
@@ -35,6 +35,38 @@ def _get_unique_suffix(self, len: int = 6) -> str:
        """
        return "".join(random.choices(string.ascii_letters + string.digits, k=len))

    def _compute_port_offset(self) -> int:
we should just use the replica rank to do this I feel like.
yep
now _compute_port_offset() uses replica_rank from the replica context instead of the hashing approach
so now the logic is (sketched below):
- Use data_parallel_rank if explicitly set (DP deployments via DPServer)
- Fall back to replica_rank from the Serve context (TP/PP deployments)
- Return 0 as the final fallback
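A rough sketch of that fallback order, assuming a data_parallel_rank value and a replica_rank field on the Serve replica context (the merged code in base.py may differ in details):

```python
from typing import Optional

from ray import serve


def compute_port_offset(data_parallel_rank: Optional[int] = None) -> int:
    # Priority 1: explicit data_parallel_rank (DP deployments via DPServer).
    if data_parallel_rank is not None:
        return data_parallel_rank

    # Priority 2: replica_rank from the Serve replica context (TP/PP deployments).
    try:
        ctx = serve.get_replica_context()
        rank = getattr(ctx, "replica_rank", None)  # field name assumed
        if rank is not None:
            return rank
    except Exception:
        # Not running inside a Serve replica.
        pass

    # Final fallback: no offset.
    return 0
```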
- Use replica_rank API instead of hashing approach
- Simplify LMCache connector by just keeping the string approach
- Update comments / lint

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
kouroshHakha
left a comment
Better. Thanks
Multiplies replica_rank by tensor_parallel_size to prevent port collisions when scaling to 2+ replicas with TP≥2.

Problem: PR ray-project#57771 fixed inter-replica port collisions by using replica_rank instead of defaulting to 0. However, it didn't account for the port space needed by TP workers within each replica. vLLM workers add their tp_rank (0, 1, ..., tp_size-1) to the base port at bind time (vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:790). Without proper spacing, consecutive replicas have overlapping port ranges:

  Replica 0, TP worker 1: base + 0 + 1 = 50001
  Replica 1, TP worker 0: base + 1 + 0 = 50001  ← collision

Solution: Space replicas by tp_size ports to reserve room for all TP workers:

  Replica 0 uses ports: [base, base+1, ..., base+(tp_size-1)]
  Replica 1 uses ports: [base+tp_size, base+tp_size+1, ...]

Impact:
- Fixes port collisions when autoscaling to 2+ replicas with TP≥2
- Backward compatible: TP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single-replica deployments unchanged: no other replica to collide with

Related: PR ray-project#57771, ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
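A small, self-contained illustration of that spacing arithmetic (the function and constants here are illustrative, not the connector's actual code):

```python
def worker_port(base_port: int, replica_rank: int, tp_size: int, tp_rank: int) -> int:
    # Each replica reserves a block of tp_size consecutive ports; the vLLM
    # worker then adds its own tp_rank within that block at bind time.
    return base_port + replica_rank * tp_size + tp_rank


if __name__ == "__main__":
    base, tp_size = 50000, 2
    ports = {
        (replica, tp_rank): worker_port(base, replica, tp_size, tp_rank)
        for replica in range(2)        # two Serve replicas
        for tp_rank in range(tp_size)  # TP workers within each replica
    }
    # With spacing: replica 0 -> {50000, 50001}, replica 1 -> {50002, 50003}.
    assert len(set(ports.values())) == len(ports), "no port collisions expected"
    print(ports)
```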
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…ay-project#57771) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
Extends port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. Previous fix (PR #55802) only addressed Data Parallelism by using explicit data_parallel_rank.

Changes:

- base.py: Added _compute_port_offset() method with fallback logic
  - Use data_parallel_rank if set (DP case)
  - Use replica_rank for deterministic offset (TP/PP case)
- nixl_connector.py: Use _compute_port_offset() instead of direct dp_rank access
- lmcache_connector_v1.py: Simplified to use string-based port naming with random suffix

Fixes port collision errors in TP/PP deployments:

- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables deployment with PP size > 1

Reproduction:

Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used an identical port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: Creating v1 connector with engine_id: ...-52910 [repeated 3x]. After the fix, each worker receives a unique port via replica_rank, eliminating collisions. A deployment config sketch for this setup is included under Additional context below.

Related issues
Addresses #55775
Addresses vllm-project/vllm#20980
Types of change

Checklist

Does this PR introduce breaking changes?

Testing:

Code Quality:

- Commits are signed off (git commit -s)

Documentation:

- Changes to doc/source/ (if applicable)

Additional context
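As additional context, a hedged sketch of the reproduction setup described above. It follows the public Ray Serve LLM API; the exact release-test config is not shown in this PR, and the kv_transfer_config values are assumptions:

```python
# Hedged reproduction sketch (not the PR's test code). Assumes the public
# Ray Serve LLM API and vLLM's NixlConnector; adjust names to your versions.
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(model_id="deepseek-ai/DeepSeek-V2-Lite"),
    engine_kwargs=dict(
        pipeline_parallel_size=2,
        kv_transfer_config=dict(kv_connector="NixlConnector", kv_role="kv_both"),
    ),
    deployment_config=dict(autoscaling_config=dict(min_replicas=1, max_replicas=1)),
)

# Before the fix, the PP workers of this deployment could all bind the same
# NIXL side-channel port and fail with NIXL_ERR_BACKEND.
app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```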
Code Changes

NIXL Connector - Before:

NIXL Connector - After:

_compute_port_offset() Implementation:

LMCache Connector - Simplified approach:
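A rough sketch of the shapes these headings describe (attribute and helper names are assumptions, not the connectors' actual code):

```python
import random
import string


# NIXL connector, before: only the DP rank offsets the side-channel port, so
# every TP/PP replica (where dp_rank is 0) computed the same port.
def nixl_port_before(base_port: int, dp_rank: int) -> int:
    return base_port + dp_rank


# NIXL connector, after: the shared _compute_port_offset() result (DP rank,
# else Serve replica_rank, else 0) gives each replica a distinct offset.
def nixl_port_after(base_port: int, port_offset: int) -> int:
    return base_port + port_offset


# LMCache connector, simplified approach: keep string-valued identifiers and
# make them unique with a random suffix rather than doing numeric offset math.
def lmcache_unique_name(prefix: str, suffix_len: int = 6) -> str:
    suffix = "".join(random.choices(string.ascii_letters + string.digits, k=suffix_len))
    return f"{prefix}_{suffix}"
```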
Backward Compatibility

- DP deployments: continue to use data_parallel_rank (priority 1)
- TP/PP deployments: use replica_rank from Ray Serve (priority 2)