-
Notifications
You must be signed in to change notification settings - Fork 7k
[1/n] expose outbound deployment ids from replica actor #58345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
| _deployment_config: DeploymentConfig, | ||
| rank: int, | ||
| world_size: int, | ||
| handle_registration_callback: Optional[Callable[[str, str], None]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Type mismatch in callback for replica context
The handle_registration_callback parameter in _set_internal_replica_context has a type annotation mismatch. It's currently Callable[[str, str], None], but the ReplicaContext field and its actual invocation expect Callable[[DeploymentID], None]. This difference could lead to a runtime type error.
|
in theory, could the recorded dynamic handles be different per replica process based on business logic? how would we compile deployment dag based on this? |
This is possible, but I would assume it's rare. Since we enrich the DAG over a period of time, I expect the DAG to be a good representation of reality. But ultimately, this is the best effort |
|
Generally speaking, you are right, if there are two replicas for a deployment and if we keep switching between them, then the DAG will change over time. Not the best experience, but probably okay? Another design choice we can make is, ensure that DAG is the same across all replicas; if they are different, then don't construct the DAG. Only downside to this is seeking information from all replicas can be expensive. |
|
could it be unified at some higher level (like deployment level instead of replicas)? |
|
Yeah, we can do that here , instead of poking one replica, we can query all and union them. |
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
| scanner = _PyObjScanner(source_type=DeploymentHandle) | ||
| try: | ||
| handles = scanner.find_nodes((init_args, init_kwargs)) | ||
|
|
||
| for handle in handles: | ||
| deployment_id = handle.deployment_id | ||
| seen_deployment_ids.add(deployment_id) | ||
| finally: | ||
| scanner.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be cached, but not super important because list_outbound_deployments will be called infrequently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the frequency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exponential backoff starting from 1s then capped at 10mins
| scanner = _PyObjScanner(source_type=DeploymentHandle) | ||
| try: | ||
| handles = scanner.find_nodes((init_args, init_kwargs)) | ||
|
|
||
| for handle in handles: | ||
| deployment_id = handle.deployment_id | ||
| seen_deployment_ids.add(deployment_id) | ||
| finally: | ||
| scanner.clear() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the frequency?
…58345) ## Summary Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction. ## Motivation Deployments call downstream deployments via handles in two ways: 1. **Stored handles**: Passed to `__init__()` and stored as attributes → `self.model.func.remote()` 2. **Dynamic handles**: Obtained at runtime via `serve.get_deployment_handle()` → `model.func.remote()` Previously, there was no way to programmatically discover these dependencies from a running replica. ## Implementation ### Core Changes - **`ReplicaActor.list_outbound_deployments()`**: Returns `List[DeploymentID]` of all downstream deployments - Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists) - Tracks dynamic handles created via `get_deployment_handle()` at runtime using a callback mechanism - **Runtime tracking**: Modified `get_deployment_handle()` to register handles when called from within a replica via `ReplicaContext._handle_registration_callback` Next PR: ray-project#58350 --------- Signed-off-by: abrar <abrar@anyscale.com>
Fixes ray-project#58475 and ray-project#58474 Root cause - addition of `_handle_registration_callback` in ray-project#58345 Fix: Remove unused replica_ctx parameter from DPRankAssigner.register() to fix serialization error when passing ReplicaContext across actor boundaries. Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Remove unused replica_ctx parameter from DPRankAssigner.register() to fix serialization error when passing ReplicaContext across actor boundaries. The ReplicaContext contains non-serializable callback closures that cause Ray serialization to fail. Root cause: The addition of _handle_registration_callback field to ReplicaContext in PR ray-project#58345 introduced non-serializable callback closures that cannot be passed across actor boundaries. Fixes ray-project#58474 Fixes ray-project#58475 Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Remove unused replica_ctx parameter from DPRankAssigner.register() to fix serialization error when passing ReplicaContext across actor boundaries. Root cause: The addition of _handle_registration_callback field to ReplicaContext in PR ray-project#58345 Fixes ray-project#58474 and ray-project#58475 Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
…58345) ## Summary Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction. ## Motivation Deployments call downstream deployments via handles in two ways: 1. **Stored handles**: Passed to `__init__()` and stored as attributes → `self.model.func.remote()` 2. **Dynamic handles**: Obtained at runtime via `serve.get_deployment_handle()` → `model.func.remote()` Previously, there was no way to programmatically discover these dependencies from a running replica. ## Implementation ### Core Changes - **`ReplicaActor.list_outbound_deployments()`**: Returns `List[DeploymentID]` of all downstream deployments - Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists) - Tracks dynamic handles created via `get_deployment_handle()` at runtime using a callback mechanism - **Runtime tracking**: Modified `get_deployment_handle()` to register handles when called from within a replica via `ReplicaContext._handle_registration_callback` Next PR: ray-project#58350 --------- Signed-off-by: abrar <abrar@anyscale.com>
…58345) ## Summary Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction. ## Motivation Deployments call downstream deployments via handles in two ways: 1. **Stored handles**: Passed to `__init__()` and stored as attributes → `self.model.func.remote()` 2. **Dynamic handles**: Obtained at runtime via `serve.get_deployment_handle()` → `model.func.remote()` Previously, there was no way to programmatically discover these dependencies from a running replica. ## Implementation ### Core Changes - **`ReplicaActor.list_outbound_deployments()`**: Returns `List[DeploymentID]` of all downstream deployments - Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists) - Tracks dynamic handles created via `get_deployment_handle()` at runtime using a callback mechanism - **Runtime tracking**: Modified `get_deployment_handle()` to register handles when called from within a replica via `ReplicaContext._handle_registration_callback` Next PR: ray-project#58350 --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
…58345) ## Summary Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction. ## Motivation Deployments call downstream deployments via handles in two ways: 1. **Stored handles**: Passed to `__init__()` and stored as attributes → `self.model.func.remote()` 2. **Dynamic handles**: Obtained at runtime via `serve.get_deployment_handle()` → `model.func.remote()` Previously, there was no way to programmatically discover these dependencies from a running replica. ## Implementation ### Core Changes - **`ReplicaActor.list_outbound_deployments()`**: Returns `List[DeploymentID]` of all downstream deployments - Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists) - Tracks dynamic handles created via `get_deployment_handle()` at runtime using a callback mechanism - **Runtime tracking**: Modified `get_deployment_handle()` to register handles when called from within a replica via `ReplicaContext._handle_registration_callback` Next PR: ray-project#58350 --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
…58345) ## Summary Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction. ## Motivation Deployments call downstream deployments via handles in two ways: 1. **Stored handles**: Passed to `__init__()` and stored as attributes → `self.model.func.remote()` 2. **Dynamic handles**: Obtained at runtime via `serve.get_deployment_handle()` → `model.func.remote()` Previously, there was no way to programmatically discover these dependencies from a running replica. ## Implementation ### Core Changes - **`ReplicaActor.list_outbound_deployments()`**: Returns `List[DeploymentID]` of all downstream deployments - Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists) - Tracks dynamic handles created via `get_deployment_handle()` at runtime using a callback mechanism - **Runtime tracking**: Modified `get_deployment_handle()` to register handles when called from within a replica via `ReplicaContext._handle_registration_callback` Next PR: ray-project#58350 --------- Signed-off-by: abrar <abrar@anyscale.com>
Summary
Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction.
Motivation
Deployments call downstream deployments via handles in two ways:
__init__()and stored as attributes →self.model.func.remote()serve.get_deployment_handle()→model.func.remote()Previously, there was no way to programmatically discover these dependencies from a running replica.
Implementation
Core Changes
ReplicaActor.list_outbound_deployments(): ReturnsList[DeploymentID]of all downstream deploymentsget_deployment_handle()at runtime using a callback mechanismRuntime tracking: Modified
get_deployment_handle()to register handles when called from within a replica viaReplicaContext._handle_registration_callbackNext PR: #58350