[serve.llm] Fixed DP DSV3 issues #55802
Conversation
ruisearch42 left a comment:
Overall LGTM
    def __init__(self, llm_config: "LLMConfig"):
        """Base class for connector backends.

        Args:
            kv_transfer_config: Configuration for the KV transfer.
update args doc
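A minimal sketch of the suggested docstring fix. The class name and the attribute assignment are assumptions; only the `__init__` signature and the stale `kv_transfer_config` entry come from the quoted snippet.

```python
# Hypothetical base class name; the quoted diff only shows __init__.
class ConnectorBackend:
    def __init__(self, llm_config: "LLMConfig"):
        """Base class for connector backends.

        Args:
            llm_config: LLM config whose engine_kwargs carry the
                kv_transfer_config used by this connector backend.
        """
        self.llm_config = llm_config
```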
    assert (
        kv_transfer_config is not None
    ), "In Connector backend, kv_transfer_config is not set"
better to validate it early in the constructor, and validate only once?
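A sketch of the early, one-time validation the reviewer suggests. Pulling the config from `engine_kwargs` and the class name are assumptions based on the quoted code, not the PR's final implementation.

```python
class ConnectorBackend:  # hypothetical base class name
    def __init__(self, llm_config: "LLMConfig"):
        self.llm_config = llm_config
        # Validate once, up front, instead of asserting at every call site.
        kv_transfer_config = llm_config.engine_kwargs.get("kv_transfer_config")
        if kv_transfer_config is None:
            raise ValueError(
                "Connector backend requires engine_kwargs['kv_transfer_config'] to be set."
            )
        self.kv_transfer_config = kv_transfer_config
```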
| "NIXL_SIDE_CHANNEL_PORT_BASE", vllm_utils.get_open_port() | ||
| ) | ||
| ) | ||
| # If dp_rank is set, we should use the | ||
| # base port + dp_rank as the side channel port | ||
| dp_rank = self.llm_config.engine_kwargs.get("data_parallel_rank", 0) | ||
| port = base_port + dp_rank |
IIUC this is to avoid race conditions? If get_open_port() works perfectly we don't need to add the dp_rank? Maybe add a comment to make it explicit.
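A sketch of the offset logic under discussion, with the rationale spelled out in a comment. The env-var name, `get_open_port()`, and the `engine_kwargs` lookup follow the quoted snippet; the helper function itself is illustrative.

```python
import os


def pick_side_channel_port(llm_config, get_open_port) -> int:
    # When the base port is shared across DP ranks (e.g. exported once via
    # NIXL_SIDE_CHANNEL_PORT_BASE), every rank would otherwise try to bind the
    # same port. Adding dp_rank gives each rank a distinct, deterministic
    # side-channel port instead of relying on get_open_port() to win a race.
    base_port = int(
        os.environ.get("NIXL_SIDE_CHANNEL_PORT_BASE", get_open_port())
    )
    dp_rank = llm_config.engine_kwargs.get("data_parallel_rank", 0)
    return base_port + dp_rank
```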
    llm_config: LLMConfig,
    *,
    name_prefix: Optional[str] = None,
    options_override: Optional[dict] = None,
QQ: what do you have in mind for using this?
Reply:
placement groups / deployment name (full name) etc.
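A hypothetical sketch of how an `options_override` could be applied when building deployment options. The helper name, the default options, and the override keys shown here are assumptions for illustration, not the PR's actual implementation.

```python
from typing import Optional


def build_deployment_options(
    name_prefix: Optional[str] = None,
    options_override: Optional[dict] = None,
) -> dict:
    # Start from defaults, then let callers override things like the full
    # deployment name or placement-group settings at deploy time.
    options: dict = {"name": f"{name_prefix or 'LLMDeployment'}:model"}
    if options_override:
        options.update(options_override)
    return options


# Example: pin the full deployment name at runtime.
opts = build_deployment_options(
    name_prefix="my_app",
    options_override={"name": "my_app:qwen-0.5b"},
)
```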
    child_actor_bundles: List[Dict[str, float]],
    replica_actor_bundle: Dict[str, float],
) -> List[Dict[str, float]]:
    """Sum up the bundles from replica actor bundles with the first bundle from child actor bundles.
Not fully getting the intention here: the placement strategy is STRICT_PACK (at least for TP-only), so why do we need this?
Reply:
It was hanging when the deployment was
[{CPU: 1, GPU: 0}] + [{GPU: 1}] * tp
Also, in the case of PACK, replicas are not guaranteed to be scheduled on the same node as their child RayWorkers, which always confused me. Merging the bundles this way ensures the replica actor is scheduled on the same node as its own RayWorker.
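A minimal sketch of the bundle merge being discussed. The function name mirrors the quoted signature; the body is an illustration under that interpretation, not necessarily the PR's exact code.

```python
from typing import Dict, List


def merge_replica_actor_and_child_actor_bundles(
    child_actor_bundles: List[Dict[str, float]],
    replica_actor_bundle: Dict[str, float],
) -> List[Dict[str, float]]:
    # Fold the replica actor's resources into the first child bundle, e.g.
    # [{"GPU": 1}] * tp plus {"CPU": 1} -> [{"CPU": 1, "GPU": 1}, {"GPU": 1}, ...],
    # so the replica actor lands on the same node as its first RayWorker.
    merged = [dict(bundle) for bundle in child_actor_bundles]
    for resource, amount in replica_actor_bundle.items():
        merged[0][resource] = merged[0].get(resource, 0) + amount
    return merged


assert merge_replica_actor_and_child_actor_bundles(
    [{"GPU": 1}, {"GPU": 1}], {"CPU": 1}
) == [{"CPU": 1, "GPU": 1}, {"GPU": 1}]
```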
    child_actor_bundles: List[Dict[str, float]],
    replica_actor_bundle: Dict[str, float],
nit: switch the order
Reply:
it's fine.
- Premerge tests - failing all permutations of python/ray/llm/tests/serve/cpu/configs/test_models.py::TestModelConfig, which makes sense because this PR changes the shape of the PG bundles. The tests expect the old two-bundle form (CPU-only head + GPU worker) and fail since we're merging them.
  Would it make sense to gate this logic behind a flag for the DP path? And/or add new copies of the tests that check the new shape. In server_models.py this could look like:

      collocate = self.experimental_configs.get(
          "collocate_replica_and_child", False
      )
      if collocate:
          pg_bundles = self._merge_replica_actor_and_child_actor_bundles(
              child_actor_bundles, replica_actor_resources
          )
      else:
          pg_bundles = [replica_actor_resources] + child_actor_bundles

- linting
I actually think collocating replica and child is always desired. Isn't it?

Premerge assertion failure -
Extends the port collision fix to Tensor Parallelism (TP) and Pipeline Parallelism (PP) scenarios. The previous fix (PR ray-project#55802) only addressed Data Parallelism by using an explicit data_parallel_rank.

Changes:
- base.py: Added _compute_port_offset() method with fallback logic
  * Priority 1: Use data_parallel_rank if set (DP case)
  * Priority 2: Hash replica_tag for a deterministic offset (TP/PP case)
  * Fallback: Return 0
- nixl_connector.py: Use _compute_port_offset() instead of dp_rank
- lmcache_connector_v1.py: Add numeric port support with offset logic

Fixes port collision errors in TP/PP deployments:
- Multiple workers no longer bind to the same port
- Prevents NIXL_ERR_BACKEND and ZMQ errors
- Enables successful deployment with pipeline_parallel_size > 1

Reproduction: Deployed Ray Serve with pipeline_parallel_size=2 and NIXL on Ray 3.0.0.dev0 (8 x L4 GPU cluster). Before the fix, all workers used the same port (e.g., 52910), causing NIXL_ERR_BACKEND. Logs showed: 'Creating v1 connector with engine_id: ...-52910 [repeated 3x]'. After the fix, each worker receives a unique port via replica tag hashing, eliminating collisions.

Related: ray-project#55775

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
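A sketch of the fallback order this commit message describes. Only the priority order comes from the message; the hashing scheme, modulo, and function signature are assumptions.

```python
import hashlib
from typing import Optional


def compute_port_offset(
    data_parallel_rank: Optional[int] = None,
    replica_tag: Optional[str] = None,
    modulo: int = 1000,
) -> int:
    # Priority 1: explicit data_parallel_rank (DP case).
    if data_parallel_rank is not None:
        return data_parallel_rank
    # Priority 2: hash the replica tag for a deterministic offset (TP/PP case).
    if replica_tag:
        digest = hashlib.sha256(replica_tag.encode()).hexdigest()
        return int(digest, 16) % modulo
    # Fallback: no offset.
    return 0
```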
Fix Data Parallel Resource Allocation and KV Transfer for DSv3
Summary
Fixes resource allocation conflicts and KV transfer backend configuration for data parallel deployments in DSv3.
Key Changes
- Use base_port + dp_rank for the data parallel case
- Pass LLMConfig instead of just the transfer config for better context. This allows more expressive setup methods similar to what is needed for port collision handling.
- Added an options_override parameter for runtime configuration flexibility

Release tests passed: https://buildkite.com/ray-project/release/builds/54545