
Conversation


@kouroshHakha kouroshHakha commented Aug 21, 2025

Fix Data Parallel Resource Allocation and KV Transfer for DSv3

Summary

Fixes resource allocation conflicts and KV transfer backend configuration for data parallel deployments in DSv3.

Key Changes

  • Resource bundling: Added logic to merge the replica actor's resource requirements into the first child actor bundle so that the replica is collocated with one of its workers (see the sketch after this list).
  • Port management: Fixed NIXL connector side-channel port conflicts in the data parallel case by using base_port + dp_rank.
  • Backend configuration: KV transfer backends now receive the full LLMConfig instead of just the transfer config, giving them the context for more expressive setup methods (for example, the port-collision handling above).
  • Deployment options: Added an options_override parameter for runtime configuration flexibility.
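
For illustration, a minimal sketch of the bundle-merging logic described in the first bullet; the function name and exact behavior are assumptions (the PR's actual helper, _merge_replica_actor_and_child_actor_bundles, comes up in the review below):

from typing import Dict, List

def merge_replica_and_child_bundles(
    replica_actor_bundle: Dict[str, float],
    child_actor_bundles: List[Dict[str, float]],
) -> List[Dict[str, float]]:
    # Fold the replica actor's resources into the first child bundle so the
    # replica lands on the same node as one of its workers.
    if not child_actor_bundles:
        return [dict(replica_actor_bundle)]
    merged_first = dict(child_actor_bundles[0])
    for resource, amount in replica_actor_bundle.items():
        merged_first[resource] = merged_first.get(resource, 0) + amount
    return [merged_first] + [dict(b) for b in child_actor_bundles[1:]]

# Example: {"CPU": 1} merged into [{"GPU": 1}] * 2 -> [{"CPU": 1, "GPU": 1}, {"GPU": 1}]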

Release tests passed: https://buildkite.com/ray-project/release/builds/54545

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha added the "go" (add ONLY when ready to merge, run all tests) label Aug 21, 2025
@kouroshHakha kouroshHakha marked this pull request as ready for review August 21, 2025 16:57
@kouroshHakha kouroshHakha requested a review from a team as a code owner August 21, 2025 16:57
@kouroshHakha kouroshHakha changed the title from "[wip][serve.llm] Fixed DP DSV3 issues" to "[serve.llm] Fixed DP DSV3 issues" Aug 21, 2025

@ruisearch42 ruisearch42 left a comment


Overall LGTM

Comment on lines 11 to 15
def __init__(self, llm_config: "LLMConfig"):
    """Base class for connector backends.
    Args:
        kv_transfer_config: Configuration for the KV transfer.

update args doc
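
For concreteness, one way the docstring could be brought in line with the new signature (a sketch only; it assumes the constructor simply stores the config):

def __init__(self, llm_config: "LLMConfig"):
    """Base class for connector backends.

    Args:
        llm_config: The full LLM config; the KV transfer configuration is
            read from its engine kwargs when the backend is set up.
    """
    self.llm_config = llm_config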

Comment on lines +23 to +25
assert (
    kv_transfer_config is not None
), "In Connector backend, kv_transfer_config is not set"

better to validate it early in the constructor, and validate only once?
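
A sketch of what that early validation could look like, assuming the backend reads kv_transfer_config from the engine kwargs as the other snippets in this thread suggest:

def __init__(self, llm_config: "LLMConfig"):
    self.llm_config = llm_config
    # Validate once, up front, instead of asserting later at use time.
    kv_transfer_config = llm_config.engine_kwargs.get("kv_transfer_config")
    if kv_transfer_config is None:
        raise ValueError("In Connector backend, kv_transfer_config is not set")
    self.kv_transfer_config = kv_transfer_config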

Comment on lines 15 to 21
"NIXL_SIDE_CHANNEL_PORT_BASE", vllm_utils.get_open_port()
)
)
# If dp_rank is set, we should use the
# base port + dp_rank as the side channel port
dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank

IIUC this is to avoid race conditions? If get_open_port() works perfectly we don't need to add the dp_rank? Maybe add a comment to make it explicit.
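
For what it's worth, a sketch of the kind of comment being asked for here; it assumes the race is that different DP ranks can end up with the same base port:

# Different DP ranks can resolve the same base port, either because they all
# read the same NIXL_SIDE_CHANNEL_PORT_BASE or because independent
# get_open_port() calls race against each other. Offsetting by dp_rank keeps
# each rank's NIXL side-channel port unique.
dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0)
port = base_port + dp_rank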

    llm_config: LLMConfig,
    *,
    name_prefix: Optional[str] = None,
    options_override: Optional[dict] = None,

QQ: what do you have in mind to use this for?

@kouroshHakha (Author) replied:

placement groups / deployment name (full name) etc.
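
Not shown in this thread, but as a sketch of the general pattern an options_override parameter enables (the helper name and option keys here are illustrative, chosen to match the reply above):

from typing import Optional

def apply_options_override(
    default_options: dict,
    options_override: Optional[dict] = None,
) -> dict:
    # Runtime overrides (placement group bundles, full deployment name, ...)
    # take precedence over the defaults derived from the LLMConfig.
    return {**default_options, **(options_override or {})}

serve_options = apply_options_override(
    {"name": "LLMDeployment", "num_replicas": 1},
    {"name": "DSv3:LLMDeployment", "placement_group_strategy": "STRICT_PACK"},
)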

    child_actor_bundles: List[Dict[str, float]],
    replica_actor_bundle: Dict[str, float],
) -> List[Dict[str, float]]:
    """Sum up the bundles from replica actor bundles with the first bundle from child actor bundles.

Not fully getting the intention here: the placement strategy is STRICT_PACK (at least for TP only), why do we need this?

@kouroshHakha (Author) replied:

It was hanging when the deployment was

[{CPU: 1, GPU:0}] + [{GPU: 1}] * tp

Also, in the PACK case, replicas are not constrained to be scheduled on the same node as their child RayWorkers, which was always confusing. Merging the bundles this way ensures the replica actor is scheduled on the same node as one of its own RayWorkers.
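
For concreteness, the bundle shapes before and after the merge (tp = 2 chosen arbitrarily for the example):

tp = 2
# Old shape: replica in its own CPU-only bundle; with PACK the replica may
# land on a different node than its RayWorkers.
before = [{"CPU": 1, "GPU": 0}] + [{"GPU": 1}] * tp
# New shape: replica resources folded into the first worker bundle, so the
# replica is always collocated with one of its workers.
after = [{"CPU": 1, "GPU": 1}] + [{"GPU": 1}] * (tp - 1)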

Comment on lines +429 to +430
    child_actor_bundles: List[Dict[str, float]],
    replica_actor_bundle: Dict[str, float],

nit: switch the order

@kouroshHakha (Author) replied:

it's fine.


@nrghosh nrghosh left a comment


  1. Premerge tests: all permutations of python/ray/llm/tests/serve/cpu/configs/test_models.py::TestModelConfig are failing, which makes sense because this PR changes the shape of the PG bundles. The tests expect the old two-bundle form (CPU-only head + GPU worker) and fail since we're merging them.

Would it make sense to gate this logic behind a flag for the DP path? And/or add new copies of the tests that check the new shape (a sketch of such a check appears after this review comment).

In server_models.py this could look like:

   collocate = self.experimental_configs.get(
       "collocate_replica_and_child", False
   )
   if collocate:
       pg_bundles = self._merge_replica_actor_and_child_actor_bundles(
           child_actor_bundles, replica_actor_resources
       )
   else:
       pg_bundles = [replica_actor_resources] + child_actor_bundles
  2. Linting
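
As a rough sketch of what such a check could look like for the TP=1 case; the accessor name get_serve_options and the placement_group_bundles key are inferred from the test name and the failure log below, so treat them as assumptions:

def test_get_serve_options_with_merged_bundles(llm_config):
    # With the replica bundle merged into the first worker bundle, a TP=1
    # deployment should request a single combined bundle.
    serve_options = llm_config.get_serve_options()
    assert serve_options["placement_group_bundles"] == [{"CPU": 1, "GPU": 1}]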

@ray-gardener ray-gardener bot added the "serve" (Ray Serve Related Issue) and "llm" labels Aug 21, 2025
@kouroshHakha (Author) replied:

> Would it make sense to gate this logic behind a flag for the DP path? And/or add new copies of the tests that check the new shape.

I actually think collocating replica and child is always desired. Isn't it?

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

nrghosh commented Aug 21, 2025

Premerge assertion failure: test_models.py::TestModelConfig::test_get_serve_options_without_accelerator_type


[2025-08-21T22:20:55Z] E       AssertionError: assert [{'CPU': 1, 'GPU': 1}] == [{'CPU': 1, '...}, {'GPU': 1}]
[2025-08-21T22:20:55Z] E         At index 0 diff: {'GPU': 1, 'CPU': 1} != {'CPU': 1, 'GPU': 0}
[2025-08-21T22:20:55Z] E         Right contains one more item: {'GPU': 1}
[2025-08-21T22:20:55Z] E         Full diff:
[2025-08-21T22:20:55Z] E         - [{'CPU': 1, 'GPU': 0}, {'GPU': 1}]
[2025-08-21T22:20:55Z] E         ?                    ------------
[2025-08-21T22:20:55Z] E         + [{'CPU': 1, 'GPU': 1}]
[2025-08-21T22:20:55Z]
[2025-08-21T22:20:55Z] python/ray/llm/tests/serve/cpu/configs/test_models.py:216: AssertionError
[2025-08-21T22:20:55Z] =========================== short test summary info ============================
[2025-08-21T22:20:55Z] FAILED python/ray/llm/tests/serve/cpu/configs/test_models.py::TestModelConfig::test_get_serve_options_without_accelerator_type


Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
@kouroshHakha kouroshHakha merged commit 413f359 into ray-project:master Aug 22, 2025
5 checks passed
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
nrghosh added a commit to nrghosh/ray that referenced this pull request Oct 15, 2025
Extends port collision fix to Tensor Parallelism (TP) and Pipeline
Parallelism (PP) scenarios. Previous fix (PR ray-project#55802) only addressed
Data Parallelism by using explicit data_parallel_rank.

Changes:
- base.py: Added _compute_port_offset() method with fallback logic
  * Priority 1: Use data_parallel_rank if set (DP case)
  * Priority 2: Hash replica_tag for deterministic offset (TP/PP case)
  * Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction:
Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray
3.0.0.dev0 (8 x L4 GPU cluster). Before fix, all workers used identical
port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed:
  'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'
After fix, each worker receives unique port via replica tag hashing,
eliminating collisions.

Related: ray-project#55775
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
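
Based on the priority list in this commit message, a sketch of what _compute_port_offset might look like (the replica_tag attribute and the modulus bound are assumptions, not the actual implementation):

import hashlib

def _compute_port_offset(self, max_offset: int = 1024) -> int:
    # Priority 1: explicit data_parallel_rank (DP case).
    dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank")
    if dp_rank is not None:
        return int(dp_rank)
    # Priority 2: deterministic hash of the replica tag (TP/PP case), so each
    # replica of the same deployment gets a stable, distinct offset.
    replica_tag = getattr(self, "replica_tag", None)
    if replica_tag:
        digest = hashlib.sha256(str(replica_tag).encode()).hexdigest()
        return int(digest, 16) % max_offset
    # Fallback: no offset.
    return 0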
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>