[serve][llm] Add TP*PP spacing to port offset for multi-replica deployments #58073
Conversation
Multiplies `replica_rank` by `tensor_parallel_size` to prevent port collisions when scaling to 2+ replicas with TP≥2.

**Problem:** PR ray-project#57771 fixed inter-replica port collisions by using `replica_rank` instead of defaulting to 0. However, it didn't account for the port space needed by TP workers within each replica. vLLM workers add their `tp_rank` (0, 1, ..., tp_size-1) to the base port at bind time (`vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:790`). Without proper spacing, consecutive replicas have overlapping port ranges:

- Replica 0, TP worker 1: base + 0 + 1 = 50001
- Replica 1, TP worker 0: base + 1 + 0 = 50001 ← collision

**Solution:** Space replicas by `tp_size` ports to reserve room for all TP workers:

- Replica 0 uses ports: [base, base+1, ..., base+(tp_size-1)]
- Replica 1 uses ports: [base+tp_size, base+tp_size+1, ...]

**Impact:**
- Fixes port collisions when autoscaling to 2+ replicas with TP≥2
- Backward compatible: TP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single-replica deployments unchanged: no other replica to collide with

Related: PR ray-project#57771, ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
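A sketch of the before/after offset arithmetic described above; the base port value and helper names here are illustrative, not Ray's actual API:

```python
# Sketch of the collision, assuming a base port of 50000; function names
# are illustrative, not Ray's actual API.
BASE_PORT = 50000
TP_SIZE = 2

def port_before_fix(replica_rank: int, tp_rank: int) -> int:
    # PR #57771: offset = replica_rank, which ignores TP workers.
    return BASE_PORT + replica_rank + tp_rank

def port_after_fix(replica_rank: int, tp_rank: int) -> int:
    # This PR: offset = replica_rank * tp_size, reserving tp_size ports per replica.
    return BASE_PORT + replica_rank * TP_SIZE + tp_rank

# Before: replica 0 / worker 1 and replica 1 / worker 0 both bind 50001.
assert port_before_fix(0, 1) == port_before_fix(1, 0) == 50001
# After: all four (replica, worker) pairs bind distinct ports.
assert len({port_after_fix(r, t) for r in range(2) for t in range(TP_SIZE)}) == 4
```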
kouroshHakha left a comment:
This fix would still have a problem when we have TP2PP2, because it doesn't consider PP at all. You should use the generic `num_devices` API, which already exists in llm_config --> engine_config.
```diff
-    return rc.rank
+    # Multiply by tp_size to reserve ports for all TP workers
+    # Each TP worker will add its tp_rank (0, 1, ..., tp_size-1)
+    return rc.rank * tp_size
```
You need to offset by tp * pp. Effectively, you should use `llm_config.get_engine_config().num_devices`.
done
The previous fix didn't quite get it right for the TPxPPy scenario. Use `llm_config.get_engine_config().num_devices` instead of manually calculating from tp_size, ensuring proper port spacing for both TP and PP workers. This fixes the case where PP workers also bind NIXL ports and need spacing in addition to the TP workers. Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
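A minimal sketch of the resulting offset logic; the enclosing function name and the `rc` replica-context parameter are assumptions based on the diff above:

```python
def get_replica_port_offset(rc, llm_config) -> int:
    # num_devices == tensor_parallel_size * pipeline_parallel_size, so each
    # replica reserves one port per TP/PP worker it spawns.
    num_devices = llm_config.get_engine_config().num_devices
    return rc.rank * num_devices
```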
…yments (ray-project#58073) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…yments (ray-project#58073) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Multiply `replica_rank` by `num_devices` (tp × pp) to prevent port collisions when scaling to 2+ replicas with TP≥2 or PP≥2.

**Root Cause**

PR #57771 fixed port collisions in `python/ray/llm/_internal/serve/engines/vllm/kv_transfer/base.py` for TP/PP by using Ray Serve's `replica_rank` for port offsets instead of defaulting to 0. However, the implementation doesn't account for the port spacing needed when each replica spawns multiple workers, so it could still lead to overlap.

**Main issue:** Consecutive replicas get consecutive port offsets (0, 1, 2, ...), but each replica actually needs `num_devices` (tp × pp) consecutive ports for its workers. This causes port ranges to overlap between replicas.

**Example: 2 replicas, TP=2**
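Worked arithmetic for this case, assuming a base port of 50000 and the pre-fix offset of `replica_rank`:

```
Replica 0, worker 0: 50000 + 0 + 0 = 50000
Replica 0, worker 1: 50000 + 0 + 1 = 50001
Replica 1, worker 0: 50000 + 1 + 0 = 50001  ← collides with replica 0, worker 1
Replica 1, worker 1: 50000 + 1 + 1 = 50002
```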
**Example: 2 replicas, TP=2, PP=2**
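With PP=2 as well, each replica has num_devices = 4 workers (ranks 0-3), so the pre-fix overlap grows (same assumed base port of 50000):

```
Replica 0, workers 0-3: 50000, 50001, 50002, 50003
Replica 1, workers 0-3: 50001, 50002, 50003, 50004  ← 50001-50003 collide
```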
**Solution:**

Space replicas by `num_devices` (tp × pp) ports to reserve room for all workers, as sketched below:

- Replica 0 uses ports: [base, base+1, ..., base+(num_devices-1)]
- Replica 1 uses ports: [base+num_devices, base+num_devices+1, ...]
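A short sketch of the reserved per-replica port ranges; the base port and `num_devices` values are illustrative:

```python
# Sketch of the port range each replica reserves after the fix
# (base and num_devices values are illustrative).
base, num_devices = 50000, 4  # e.g. TP=2 x PP=2

def replica_ports(replica_rank: int) -> range:
    start = base + replica_rank * num_devices
    return range(start, start + num_devices)

assert list(replica_ports(0)) == [50000, 50001, 50002, 50003]
assert list(replica_ports(1)) == [50004, 50005, 50006, 50007]
```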
The fix uses `llm_config.get_engine_config().num_devices`, which correctly accounts for both TP and PP workers.

**Impact:**
- Fixes port collisions when autoscaling to 2+ replicas with TP≥2 or PP≥2
- Backward compatible: TP=1, PP=1 multiplies by 1 (no-op)
- DP deployments unchanged: vLLM handles spacing
- Single-replica deployments unchanged: no other replica to collide with
**Note (about Data Parallel)**

DP deployments don't need this fix because vLLM already multiplies `data_parallel_rank` by `tp_size` for the offset internally:
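A paraphrase of that internal spacing, with illustrative names rather than vLLM's verbatim source:

```python
# Paraphrase of vLLM's internal DP spacing, not the verbatim vLLM source:
# each DP rank is spaced by tp_size, so its TP workers cannot overlap with
# another DP rank's workers.
def dp_worker_port(base: int, dp_rank: int, tp_size: int, tp_rank: int) -> int:
    return base + dp_rank * tp_size + tp_rank
```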
So for DP, the spacing is automatic; for `replica_rank`, we do the offset multiplication ourselves, since vLLM doesn't know about Ray Serve's replica concept. The fix uses `num_devices` instead of just `tp_size` to ensure PP workers also get unique ports.

Related: PR #57771, #55775, #58072