[Serve] Shared LongPollClient for Routers #48807

Merged: 27 commits into ray-project:master from the shared-long-poll-client branch, Jan 16, 2025

Conversation

@JoshKarpel (Contributor) commented Nov 19, 2024

Why are these changes needed?

In our use case, we run Ray Serve with many hundreds or thousands of apps, plus a "router" app that routes traffic to those apps using DeploymentHandles. Right now, that means we have a LongPollClient for each DeploymentHandle in each router app replica, which can add up to tens or hundreds of thousands of LongPollClients. This is expensive for both the Serve Controller and the router app replicas, and it is particularly problematic for resource usage on the Serve Controller: the main thing blocking us from running as many router replicas as we'd like is the stability of the controller.

This PR aims to amortize the cost of having so many LongPollClients by moving from one LongPollClient per handle to one per process. Each DeploymentHandle's Router now registers itself with a shared LongPollClient held by a singleton.

The actual implementation I've gone with is a bit clunky because it tries to bridge the gap between the current solution and a design that only has shared LongPollClients; this could be cleaned up in the future. Right now, each Router still gets a dedicated LongPollClient that runs only temporarily, until the shared client tells it to stop.
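
For context, here is a rough sketch of the one-LongPollClient-per-process pattern described above. This is not the code in this PR; the class and method names (`SharedLongPollRegistry`, `get_or_create`, `register`) are made up for illustration, and the real implementation lives in Serve's router and long poll modules.

```python
import threading
from typing import Any, Callable, Dict, Tuple

KeyType = Tuple[str, str]                    # hypothetical, e.g. (namespace, deployment_id)
UpdateStateCallable = Callable[[Any], None]  # callback invoked with the new object snapshot


class SharedLongPollRegistry:
    """Process-wide singleton that owns the single shared long poll loop."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, controller_handle) -> None:
        self._controller = controller_handle
        self._key_listeners: Dict[KeyType, UpdateStateCallable] = {}

    @classmethod
    def get_or_create(cls, controller_handle) -> "SharedLongPollRegistry":
        # All Routers in the process share this one instance.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(controller_handle)
            return cls._instance

    def register(self, key: KeyType, callback: UpdateStateCallable) -> None:
        # The shared poll loop re-reads this dict each cycle, so the new key is
        # included in the next listen_for_change request to the controller.
        self._key_listeners[key] = callback
```

The point is that the listener dictionary is process-wide: adding another DeploymentHandle only adds an entry to it, rather than another polling loop against the controller.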

Related: #45957 is the same idea but for handle autoscaling metrics pushing.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jcotant1 added the "serve" (Ray Serve Related Issue) label Nov 20, 2024
@zcin (Contributor) commented Nov 27, 2024

@JoshKarpel I can't reproduce the test failure locally, could you try merging master and seeing if the failure is still there?

@zcin self-requested a review December 11, 2024 23:56
```diff
@@ -79,7 +80,6 @@ def __init__(
         key_listeners: Dict[KeyType, UpdateStateCallable],
         call_in_event_loop: AbstractEventLoop,
     ) -> None:
-        assert len(key_listeners) > 0
```
@JoshKarpel (Contributor, Author) commented:

This case is now handled by https://github.com/ray-project/ray/pull/48807/files#diff-f138b21f7ddcd7d61c0b2704c8b828b9bbe7eb5021531e2c7fabeb20ec322e1aR280-R288 (and is necessary - when the shared client boots up for the first time it will send an RPC with no keys in it)

```diff
@@ -324,10 +324,16 @@ def make_nonblocking_calls(expected, expect_blocking=False):
     make_nonblocking_calls({"2": 2})


-def test_reconfigure_with_queries(serve_instance):
+def test_reconfigure_does_not_run_while_there_are_active_queries(serve_instance):
```
@JoshKarpel (Contributor, Author) commented:

Tried to de-flake this test 🤞🏻

@JoshKarpel marked this pull request as ready for review December 16, 2024 19:19
@zcin (Contributor) commented Dec 21, 2024

@JoshKarpel with new routers waiting for their dedicated long poll client to make the first RPC call, do we still ever need to rely on the long poll timeout for anything?

@JoshKarpel (Contributor, Author):

> @JoshKarpel with new routers waiting for their dedicated long poll client to make the first RPC call, do we still ever need to rely on the long poll timeout for anything?

Oo, interesting question...

If the shared long poll never times out, it won't pick up new key listeners until some key it's already listening to has changed. So we'd converge to the same final state over the long term, but only if those existing keys get an update at some point, which isn't guaranteed (for example, in our setup, some Serve apps receive very little traffic and never need to autoscale - it could be hours or days between replica updates for them). I'd prefer to keep the timeout to ensure that there's a time-bound on how long it takes to get to the desired state.
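
For illustration, here is a minimal sketch (not Serve's actual long poll loop) of why the timeout gives that bound: each listen call returns after at most the timeout, and the request is rebuilt from the listener set before every call, so a listener registered in the meantime is picked up within one timeout period. The 30-second value and the `listen_for_change` wrapper here are assumptions for the sketch, not the real constants or API.

```python
import asyncio

LISTEN_TIMEOUT_S = 30  # assumed default; the real constant lives in Serve's long poll module


async def poll_loop(controller, key_listeners, snapshot_ids):
    """key_listeners: {key: callback}; snapshot_ids: {key: last snapshot id seen}."""
    while True:
        # Re-read the listener set every cycle, so keys registered while the
        # previous call was in flight are included in the next request.
        request = {key: snapshot_ids.get(key, -1) for key in key_listeners}
        try:
            updates = await asyncio.wait_for(
                controller.listen_for_change(request),  # hypothetical async RPC wrapper
                timeout=LISTEN_TIMEOUT_S,
            )
        except asyncio.TimeoutError:
            continue  # nothing changed within the window; loop again with the new key set
        for key, update in updates.items():
            snapshot_ids[key] = update.snapshot_id
            key_listeners[key](update.object_snapshot)
```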

@zcin (Contributor) commented Jan 7, 2025

Oh sorry, bad phrasing on my part. I meant: do we still need to rely on the long poll timeout for routers to receive the correct updates from the controller? That is, are routers ever forced to go through a delay before receiving updates (which would have been the case without the dedicated long poll client per router)?

(Definitely we should keep the long poll timeout!)

@JoshKarpel (Contributor, Author):

> Oh sorry, bad phrasing on my part. I meant: do we still need to rely on the long poll timeout for routers to receive the correct updates from the controller? That is, are routers ever forced to go through a delay before receiving updates (which would have been the case without the dedicated long poll client per router)?
>
> (Definitely we should keep the long poll timeout!)

Oh I see what you mean!

No, I don't think there's any situation where you get a delay now: the dedicated long poll client stays alive until the shared client tells it to stop, which only happens when the shared client gets an update that includes that handle's keys, which means those keys must already be tracked by the shared client.

But I guess what I wrote above is still true:

> So we'd converge to the same final state over the long term, but only if those existing keys get an update at some point, which isn't guaranteed (for example, in our setup, some Serve apps receive very little traffic and never need to autoscale - it could be hours or days between replica updates for them). I'd prefer to keep the timeout to ensure that there's a time-bound on how long it takes to get to the desired state.

The shared client doesn't take over until the handle's key gets an update here or here. So, especially for a rarely-changing deployment, there will be some time (potentially a long time) where both the dedicated and shared clients are running concurrently 🤔

... which might defeat the purpose of this, at least for us, because we've got a lot of apps that don't change much. Lemme think on this a bit more - this would be an incremental improvement for us still but maybe doesn't fully solve the load issue.

@JoshKarpel (Contributor, Author):

> But I guess what I wrote above is still true:
>
> > So we'd converge to the same final state over the long term, but only if those existing keys get an update at some point, which isn't guaranteed (for example, in our setup, some Serve apps receive very little traffic and never need to autoscale - it could be hours or days between replica updates for them). I'd prefer to keep the timeout to ensure that there's a time-bound on how long it takes to get to the desired state.
>
> The shared client doesn't take over until the handle's key gets an update here or here. So, especially for a rarely-changing deployment, there will be some time (potentially a long time) where both the dedicated and shared clients are running concurrently 🤔
>
> ... which might defeat the purpose of this, at least for us, because we've got a lot of apps that don't change much. Lemme think on this a bit more - this would be an incremental improvement for us still but maybe doesn't fully solve the load issue.

Ah, but I forgot that when the shared client sends its first request that includes the new key listeners, the snapshot id for those keys will be -1, so it will get an immediate changed response for them!

So my concern above is not real - the shared client will always take over on the next timeout cycle when it adds the new listeners.
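
In other words, the handoff relies on the snapshot-id handshake. Here is a hedged sketch of the host-side check being described (not the actual LongPollHost code; `immediate_updates` and `UpdatedObject` are illustrative names): a key requested with snapshot id -1 can never match the host's current snapshot id, so the host responds for it right away instead of blocking until the next change.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class UpdatedObject:
    object_snapshot: Any
    snapshot_id: int


def immediate_updates(
    requested: Dict[str, int],          # key -> snapshot id the client last saw (-1 if new)
    current: Dict[str, UpdatedObject],  # key -> host's current value and snapshot id
) -> Dict[str, UpdatedObject]:
    """Return every requested key whose client-side snapshot is already out of date."""
    return {
        key: current[key]
        for key, client_snapshot_id in requested.items()
        if key in current and current[key].snapshot_id != client_snapshot_id
    }
```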

@zcin (Contributor) commented Jan 7, 2025

> ... which might defeat the purpose of this, at least for us, because we've got a lot of apps that don't change much. Lemme think on this a bit more - this would be an incremental improvement for us still but maybe doesn't fully solve the load issue.

Hmm, I see. I think this can cause a lot of overhead for the controller when apps are first deployed, but it should still improve performance once that first update is received, so that in steady state there aren't tons of separate long poll clients calling into the controller and repeatedly disconnecting/reconnecting.

However, if that overhead is still a concern, have you tried implementing the cancel that we've discussed before? I am not 100% sure since I haven't implemented it myself, but I think using cancel is safe and will give us what we want. The long poll client uses obj_ref._on_completed for the callback, which should be invoked when the call is cancelled.

@zcin (Contributor) commented Jan 7, 2025

Ah, seems like we had a race condition of replying to each other 😅. Yes after the first update the shared client should take over, so if that removes concerns then I think the current implementation is fine.

@JoshKarpel (Contributor, Author):

> Ah, seems like we had a race condition of replying to each other 😅. Yes after the first update the shared client should take over, so if that removes concerns then I think the current implementation is fine.

Hah, yep! I am not concerned with the performance of the implementation as-is.

edoakes added a commit that referenced this pull request Jan 8, 2025
Small isolated PR to (hopefully) fix flakiness issues with
`python/ray/serve/tests/test_deploy.py::test_reconfigure_with_queries`,
noticed while working on #48807 and other PRs.

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
roshankathawate, dayshah, and HYLcool pushed commits to their forks that referenced this pull request (Jan 9–13, 2025).
@JoshKarpel (Contributor, Author):

@zcin looks like I had a failing test from a merge conflict but that's resolved now - ready for another round of review

Comment on lines +742 to +744

```python
for router in self.routers[deployment_id]:
    router.update_deployment_targets(deployment_target_info)
    router.long_poll_client.stop()
```
Contributor:

Calling stop here means that the router's own long poll client won't actually stop until after its next poll completes, right? The change in the long poll host that triggered this update will also have triggered the router's own long poll client.

When there are a lot of applications/deployments and the controller is slowed down, could there be race conditions with multiple long poll clients updating the same state?

@JoshKarpel (Contributor, Author):

Correct, the router will get both updates - I guess I was assuming that those updates are and will continue to be idempotent. Is that not the case?

Contributor:

Hmm yes I believe that's true. Just want to make sure it's thought through carefully, since I haven't touched the long poll code before.

So the router's own long poll client will likely make 2 listen_for_change calls to the controller, but that is fine because the updates are idempotent (and they will time out after at most 30 seconds).

@JoshKarpel (Contributor, Author):

Sounds right to me!
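
For illustration, here is a hypothetical sketch of the idempotency argument above (`ReplicaScheduler` and `update_running_replicas` are made-up stand-ins, not Serve's actual classes): because the callback replaces the router's whole view with the latest snapshot, delivering the same snapshot twice, once from the dedicated client and once from the shared one, is a no-op.

```python
from typing import List


class ReplicaScheduler:
    def __init__(self) -> None:
        self._replica_ids: List[str] = []

    def update_running_replicas(self, replica_ids: List[str]) -> None:
        # Idempotent: state depends only on the latest snapshot, not on how
        # many times (or from which long poll client) it was delivered.
        self._replica_ids = sorted(replica_ids)


scheduler = ReplicaScheduler()
scheduler.update_running_replicas(["replica-1", "replica-2"])
scheduler.update_running_replicas(["replica-1", "replica-2"])  # duplicate delivery: no change
```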

@zcin (Contributor) left a review comment:

LGTM!

@zcin added the "go" (add ONLY when ready to merge, run all tests) label Jan 15, 2025
@zcin merged commit 7452bc6 into ray-project:master Jan 16, 2025
6 checks passed
@JoshKarpel deleted the shared-long-poll-client branch January 16, 2025 03:15