
[serve] immediately send ping in router when receiving new replica set #47053

Merged
merged 12 commits into ray-project:master from zcin:router-initial-ping-replica on Aug 14, 2024

Conversation

Contributor

@zcin commented Aug 9, 2024

Why are these changes needed?

Context:
When a new set of RunningReplicaInfos is broadcast to a router, the nested actor handles are "empty": they don't hold the actor info (e.g., the actor address) needed to send a request to that replica. On the first request, the handle fetches that info from the GCS.

This can cause fault tolerance issues: if the GCS goes down immediately after a replica set change is broadcast to a router, that router is unable to send requests to any replicas; they will all be blocked until the GCS recovers.

Fix:

  • Upon receiving a new replica set, the router actively probes the queue lengths of the replicas. This sends an initial "ping" through each actor handle, which both populates the actor info from the GCS and updates the queue length cache.
  • Since the proxy sends its "self" actor handle to replicas so that they can call receive_asgi_messages, also push this actor handle to replicas on a replica set change; otherwise, proxy requests to new replicas will hang when the GCS is down. (A rough sketch of both steps follows below.)
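
For reference, here is a rough sketch of the router-side handling described above, pieced together from the snippets discussed in the review below. The method name update_running_replicas and the attribute self._self_actor_handle are illustrative assumptions, not the exact Serve implementation:

def update_running_replicas(self, running_replicas):
    # Store the latest replica set broadcast by the controller.
    # (Assumed bookkeeping; the real router wraps these infos first.)
    self._replicas = {r.replica_id: r for r in running_replicas}

    if self._handle_source == DeploymentHandleSource.PROXY:
        for r in self._replicas.values():
            # Eagerly push the proxy's own handle so the replica can later
            # call receive_asgi_messages on it without touching the GCS.
            r._actor_handle.push_proxy_handle.remote(self._self_actor_handle)

    # Probing queue lengths doubles as an initial "ping": it forces the
    # nested actor handles to resolve their actor info from the GCS while
    # it is still reachable, and it warms the queue length cache.
    self._loop.create_task(
        self._probe_queue_lens(list(self._replicas.values()), 0)
    )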

Related issue number

closes #47036

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin marked this pull request as ready for review August 13, 2024 00:03
@zcin requested review from GeneDer and edoakes August 13, 2024 00:03
Comment on lines 288 to 289
# Populate cache for all replicas
self._loop.create_task(self._probe_queue_lens(list(self._replicas.values()), 0))
Contributor

hm can we do this only for the replicas that were added instead of all?

Contributor Author

Yes! For some reason I thought it would mess with fault tolerance, but it seems the actor info is stored per-process, not per actor handle. Changed to only ping new replicas.
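
For illustration, restricting the probe to the replicas that were just added could look roughly like this; the previous_ids bookkeeping is an assumption, and only the _probe_queue_lens call comes from the snippet above:

previous_ids = set(self._replicas)
self._replicas = {r.replica_id: r for r in running_replicas}
new_replicas = [
    r for rid, r in self._replicas.items() if rid not in previous_ids
]

# Populate the cache (and resolve actor info) only for the new replicas.
self._loop.create_task(self._probe_queue_lens(new_replicas, 0))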

# `receive_asgi_messages` which can be blocked when GCS is down.
# To prevent that from happening, push proxy handle eagerly
if self._handle_source == DeploymentHandleSource.PROXY:
    r._actor_handle.push_proxy_handle.remote(
Contributor

let's add a method to the interface; we shouldn't be accessing the private _actor_handle attribute

Contributor

that way it can be tested as well
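
For example, a wrapper method along these lines would keep the router from touching _actor_handle directly and give tests something to mock; the ReplicaWrapper name and constructor here are assumptions for illustration:

class ReplicaWrapper:
    def __init__(self, actor_handle):
        self._actor_handle = actor_handle

    def push_proxy_handle(self, proxy_handle):
        # Fire-and-forget push of the proxy's handle to the replica actor;
        # the router only sees this public method.
        self._actor_handle.push_proxy_handle.remote(proxy_handle)

The router would then call r.push_proxy_handle(proxy_handle) instead of r._actor_handle.push_proxy_handle.remote(proxy_handle).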

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@@ -321,6 +321,9 @@ def _configure_logger_and_profilers(
component_id=self._component_id,
)

def push_proxy_handle(self, handle):
Contributor

Should we do something with the handle? Also maybe add a type hint if it's required 🙃

Contributor Author

Doing something with the handle seems unnecessary for now. If you pass any actor handle as an argument in a Ray remote call like:

x.remote(actor_handle)

then Ray Core does some processing under the hood that requires a call to the GCS, so if this actor handle was never "pushed" to the actor beforehand, the call hangs. "Pushing" it once ahead of time is enough to keep the call from blocking when the GCS goes down.
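
A minimal, self-contained illustration of that behavior (not Serve code; the actor and method names here are made up):

import ray

ray.init()

@ray.remote
class Proxy:
    def ping(self):
        return "pong"

@ray.remote
class Replica:
    def push_proxy_handle(self, proxy_handle):
        # Receiving the handle is all that's needed: deserializing it on
        # this worker process resolves the actor info (via the GCS) once,
        # so later uses of the same handle on this process don't depend
        # on the GCS being reachable.
        pass

proxy = Proxy.remote()
replica = Replica.remote()

# Passing `proxy` as an argument triggers the GCS lookup on the receiving
# side; doing it eagerly means later calls that pass the same handle won't
# hang if the GCS becomes unavailable.
ray.get(replica.push_proxy_handle.remote(proxy))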

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@edoakes
Contributor

edoakes commented Aug 14, 2024

fill in the "good comment"s before merging please :)

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin force-pushed the router-initial-ping-replica branch from a98a4e7 to d1c41a1 on August 14, 2024 17:56
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin added the "go" label (add ONLY when ready to merge, run all tests) Aug 14, 2024
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin merged commit 048190e into ray-project:master Aug 14, 2024
5 checks passed
@zcin deleted the router-initial-ping-replica branch August 14, 2024 21:16
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Aug 15, 2024
[serve] immediately send ping in router when receiving new replica set (ray-project#47053)

When a new set of `RunningReplicaInfos` is broadcast to a router, the
nested actor handles are "empty" and don't hold the necessary actor info
(e.g., the actor address) to send a request to that replica. Upon first
request, the handle fetches that info from the GCS. If the GCS goes down
immediately after a replica set change is broadcast to a router, requests
will all be blocked until the GCS recovers.

Fix:
- Upon receiving a new replica set, the router actively probes the queue
lengths for each replica.
- On proxies, also push the proxy's own actor handle to replicas upon a
replica set change; otherwise proxy requests to new replicas will hang
when the GCS is down.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Labels
go add ONLY when ready to merge, run all tests

Successfully merging this pull request may close these issues.

[serve] proxy should ping replica immediately after receiving new actor handle
3 participants