[serve] immediately send ping in router when receiving new replica set #47053
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
# Populate cache for all replicas
self._loop.create_task(self._probe_queue_lens(list(self._replicas.values()), 0))
hm can we do this only for the replicas that were added instead of all?
yes! for some reason I thought it would mess with the fault tolerance, but seems like the actor info is stored per-process not per actor handle. changed to only ping new replicas.
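A minimal sketch of "only ping new replicas", written against plain asyncio rather than Ray Serve. `FakeReplica`, `ToyRouter`, and `update_replicas` are hypothetical names for illustration; only `_probe_queue_lens` and `_replicas` echo identifiers from the diff, and their real signatures may differ.

```python
import asyncio
from typing import Dict, List, Tuple


class FakeReplica:
    """Hypothetical stand-in for a replica wrapper; not a Ray Serve class."""

    def __init__(self, replica_id: str, queue_len: int) -> None:
        self.replica_id = replica_id
        self._queue_len = queue_len

    async def get_queue_len(self) -> int:
        # A real probe would be a remote call; here we only yield to the loop.
        await asyncio.sleep(0)
        return self._queue_len


class ToyRouter:
    """Toy router that probes only replicas it has not seen before."""

    def __init__(self) -> None:
        self._replicas: Dict[str, FakeReplica] = {}
        self.queue_len_cache: Dict[str, int] = {}
        self.probed: List[str] = []

    async def _probe_queue_lens(self, replicas: List[FakeReplica]) -> None:
        for replica in replicas:
            self.queue_len_cache[replica.replica_id] = await replica.get_queue_len()
            self.probed.append(replica.replica_id)

    def update_replicas(self, replicas: List[FakeReplica]) -> "asyncio.Task":
        # Replicas present in the new set but absent from the old one.
        added = [r for r in replicas if r.replica_id not in self._replicas]
        self._replicas = {r.replica_id: r for r in replicas}
        # Drop cached entries for replicas that are no longer running.
        for replica_id in list(self.queue_len_cache):
            if replica_id not in self._replicas:
                del self.queue_len_cache[replica_id]
        # Eagerly probe the added replicas so their handles get used once.
        return asyncio.get_running_loop().create_task(self._probe_queue_lens(added))


async def demo() -> Tuple[List[str], Dict[str, int]]:
    router = ToyRouter()
    await router.update_replicas([FakeReplica("a", 1), FakeReplica("b", 2)])
    await router.update_replicas([FakeReplica("b", 2), FakeReplica("c", 0)])
    return router.probed, router.queue_len_cache
```

In this sketch replica "b" is probed once, on the first update, and is skipped on the second because it was already known.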
# `receive_asgi_messages` which can be blocked when GCS is down.
# To prevent that from happening, push proxy handle eagerly
if self._handle_source == DeploymentHandleSource.PROXY:
r._actor_handle.push_proxy_handle.remote(
let's add a method to the interface, shouldn't be accessing the _actor_handle private attribute
that way it can be tested as well
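One way the suggestion could look, as a sketch under assumptions: the class names `ReplicaWrapper`, `ActorReplicaWrapper`, and `RecordingReplica`, the helper `push_to_new_replicas`, and the boolean `is_proxy` (standing in for the `DeploymentHandleSource.PROXY` check) are all hypothetical, and the argument passed to `.remote(...)` is assumed since the diff line is truncated.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class ReplicaWrapper(ABC):
    """Hypothetical interface the router talks to."""

    @abstractmethod
    def push_proxy_handle(self, handle: Any) -> None:
        """Send the proxy's actor handle to the replica."""


class ActorReplicaWrapper(ReplicaWrapper):
    """Keeps the actor handle private and exposes a public method instead."""

    def __init__(self, actor_handle: Any) -> None:
        self._actor_handle = actor_handle

    def push_proxy_handle(self, handle: Any) -> None:
        # Assumed call shape; with a real actor handle this is fire-and-forget.
        self._actor_handle.push_proxy_handle.remote(handle)


class RecordingReplica(ReplicaWrapper):
    """Test double that records pushed handles instead of calling an actor."""

    def __init__(self) -> None:
        self.pushed: List[Any] = []

    def push_proxy_handle(self, handle: Any) -> None:
        self.pushed.append(handle)


def push_to_new_replicas(
    new_replicas: List[ReplicaWrapper], proxy_handle: Any, is_proxy: bool
) -> None:
    # Only proxies push their own handle; other routers skip this step.
    if is_proxy:
        for replica in new_replicas:
            replica.push_proxy_handle(proxy_handle)
```

Because the router only depends on the interface, a unit test can swap in `RecordingReplica` and assert on what was pushed without starting any actors.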
python/ray/serve/_private/replica.py
Outdated
@@ -321,6 +321,9 @@ def _configure_logger_and_profilers(
component_id=self._component_id,
)

def push_proxy_handle(self, handle):
Should we do something to the handle? Also maybe add a type hint if it's required 🙃
doing something with the handle seems unnecessary for now. I think if you pass any actor handle as an argument in a ray remote call like x.remote(actor_handle), then ray core does some processing under the hood that requires making a call to the GCS, so if this actor_handle was never "pushed" to the actor beforehand then this call hangs. "Pushing" it once is enough to keep the call unblocked when the GCS goes down.
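The behavior described above can be illustrated with a toy model. This is not Ray core's implementation: `ToyGcs`, `ToyProcess`, and `GcsUnavailable` are hypothetical, the model raises where the real call would hang, and it only encodes the two claims from this thread (actor info is cached per process, and the first sight of a handle needs the GCS).

```python
from typing import Dict, Tuple


class GcsUnavailable(Exception):
    """Raised by the toy GCS when it is down (the real call would hang)."""


class ToyGcs:
    """Hypothetical directory mapping actor IDs to addresses."""

    def __init__(self, directory: Dict[str, str]) -> None:
        self._directory = dict(directory)
        self.up = True

    def lookup(self, actor_id: str) -> str:
        if not self.up:
            raise GcsUnavailable(actor_id)
        return self._directory[actor_id]


class ToyProcess:
    """One worker process with a process-wide cache of resolved actor info."""

    def __init__(self, gcs: ToyGcs) -> None:
        self._gcs = gcs
        self._known: Dict[str, str] = {}

    def receive_handle(self, actor_id: str) -> str:
        # The first time a handle is seen, its info comes from the GCS.
        if actor_id not in self._known:
            self._known[actor_id] = self._gcs.lookup(actor_id)
        return self._known[actor_id]

    def push_proxy_handle(self, actor_id: str) -> None:
        # Intentionally does nothing with the handle beyond receiving it.
        self.receive_handle(actor_id)


def demo() -> Tuple[str, bool]:
    gcs = ToyGcs({"proxy-1": "10.0.0.5:7000"})
    warm, cold = ToyProcess(gcs), ToyProcess(gcs)
    warm.push_proxy_handle("proxy-1")  # pushed while the GCS is healthy
    gcs.up = False  # simulated GCS outage
    address = warm.receive_handle("proxy-1")  # served from the process cache
    try:
        cold.receive_handle("proxy-1")
        cold_blocked = False
    except GcsUnavailable:
        cold_blocked = True
    return address, cold_blocked
```

In the model, the process that received the handle before the outage keeps working, while the one that never saw it cannot resolve the handle until the GCS returns.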
fill in the "good comment"s before merging please :)
a98a4e7 to d1c41a1
ray-project#47053)

When a new set of `RunningReplicaInfos` is broadcast to a router, the nested actor handles are "empty" and don't hold the actor info (e.g. actor address) needed to send a request to that replica. Upon first request, the handle fetches that info from the GCS. If the GCS goes down immediately after a replica set change is broadcast to a router, all requests will be blocked until the GCS recovers.

Fix:
- Upon receiving a new replica set, the router actively probes the queue lengths for each replica.
- On proxies, also push the proxy's own actor handle to replicas upon replica set change; otherwise proxy requests to new replicas will hang when the GCS is down.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Why are these changes needed?
Context:
When a new set of `RunningReplicaInfos` is broadcast to a router, the nested actor handles are "empty" and don't hold the actor info (e.g. actor address) needed to send a request to that replica. Upon first request, the handle fetches that info from the GCS. This can cause fault tolerance issues: if the GCS goes down immediately after a replica set change is broadcast to a router, that router is unable to send requests to any replicas; they will all be blocked until the GCS recovers.
Fix:
- Upon receiving a new replica set, the router actively probes the queue lengths for each replica.
- On proxies, `receive_asgi_messages` can be blocked when the GCS is down, so also push the proxy's actor handle to replicas upon replica set change; otherwise proxy requests to new replicas will hang when the GCS is down.
Related issue number
closes #47036
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.