[serve] mark proxy as unready when its routers are aware of zero replicas #47002
Conversation
@@ -417,124 +417,6 @@ def check_proxy_status(proxy_status_to_count):
    serve.shutdown()


def test_healthz_and_routes_on_head_and_worker_nodes(
moved to test_cluster.py
        r = requests.post("http://localhost:8000/-/routes")
        assert r.status_code == 200


    def test_head_and_worker_nodes_no_replicas(self, ray_cluster: Cluster):
this is from test_standalone_3, unchanged
Stylistic comments inline.
This needs to handle the scale-to-zero case, which I'm not seeing (let me know if I missed it). If all applications have min_replicas=0 (or there is only one), we could end up in a scenario where all proxies fail health checks and receive no traffic. A minimal example of such an app is sketched below.
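For concreteness, a minimal (made-up, not from this PR) app that hits that case: with min_replicas=0 in its autoscaling config, the only ingress deployment can drop to zero replicas while idle, so a naive "unready until replicas are seen" check could leave every proxy failing its health check.

```python
# Illustrative sketch only: a single ingress deployment that is allowed
# to scale down to zero replicas when idle.
from ray import serve


@serve.deployment(autoscaling_config={"min_replicas": 0, "max_replicas": 2})
class Idle:
    async def __call__(self, request) -> str:
        return "ok"


app = Idle.bind()
# serve.run(app)  # after the idle timeout, the replica count can drop to 0
```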
    def update_routes(self, endpoints: Dict[DeploymentID, EndpointInfo]) -> None:
        logger.info(
            f"Got updated endpoints: {endpoints}.", extra={"log_to_stderr": False}
        )
        self._route_table_populated = True
should this also require that the size of the route_table is > 0?
Yeah, that makes sense. Changed it so that the flag is only flipped on if a nonzero list of routes was received.
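For clarity, a standalone sketch of that adjustment (the class and type aliases are simplified stand-ins, not the actual router code):

```python
from typing import Dict

# Simplified stand-ins for Serve's DeploymentID / EndpointInfo types.
DeploymentID = str
EndpointInfo = dict


class RouteTableState:
    """The flag only flips on once a nonzero set of endpoints is received."""

    def __init__(self) -> None:
        self._route_table_populated = False

    def update_routes(self, endpoints: Dict[DeploymentID, EndpointInfo]) -> None:
        if endpoints:  # a nonzero list of routes received at least once
            self._route_table_populated = True

    def route_table_populated(self) -> bool:
        return self._route_table_populated
```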
            return True, ""

        for handle in self.handles.values():
            if handle.running_replicas_populated():
This name is misleading -- it could be that the running replicas have been populated at least once, but the size of the list is 0. Suggest calling it something like has_running_replicas.
Hmm, right now what I've implemented is that the proxy is considered ready if it has received a non-zero list of replicas at least once; if the number of replicas later scales back down to 0, the flag isn't flipped off. For that behavior I think running_replicas_populated is still the more accurate name. Implementing it this way targets the "proxy just came up and hasn't had a chance to receive replicas yet before the head node goes down" case.
I can also change it so that if the number of replicas scales back down to 0, the flag is flipped off and the proxy becomes unhealthy so that it won't receive traffic. That would cover more bases, but would put more pressure on the head node proxy when all ingress deployments have scaled to zero. WDYT?
Got it. I think the existing naming (and behavior) is reasonable then.
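To pin down the agreed-upon semantics, a small sketch (illustrative, not the actual handle/router code): the flag latches on the first time a nonzero replica list is seen and is never flipped off, even if replicas later scale back to zero.

```python
from typing import List


class ReplicaAwareness:
    """Sketch of the latching running_replicas_populated() behavior."""

    def __init__(self) -> None:
        self._replicas_populated = False

    def update_running_replicas(self, replica_ids: List[str]) -> None:
        if replica_ids:
            # Latch on after the first nonzero update; intentionally not
            # flipped back off if replicas later scale down to zero.
            self._replicas_populated = True

    def running_replicas_populated(self) -> bool:
        return self._replicas_populated
```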
Synced offline: the special case for the head node is meant to handle the scale-to-zero case. The head node proxy will not fail its health check if its handles have zero replicas.
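Putting this together, a hedged sketch of the readiness rule as described in this thread (the function and parameter names are illustrative, not the actual proxy code):

```python
def proxy_ready(
    route_table_populated: bool,
    any_handle_has_seen_replicas: bool,
    is_head_node: bool,
) -> bool:
    """True if /-/healthz and /-/routes should report 200 for this proxy."""
    if not route_table_populated:
        return False
    if is_head_node:
        # Special case for scale-to-zero: the head-node proxy stays ready
        # even when its handles report zero replicas, so there is always an
        # entry point that can wake scaled-down deployments.
        return True
    return any_handle_has_seen_replicas
```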
Why are these changes needed?
Pinging /-/routes and /-/healthz returns 503 if the proxy is unavailable, signaling that the proxy is not ready to serve traffic. Previously the proxy returned 503 only if the route table had not been populated yet; this PR makes the proxy also return 503 if it hasn't received any replicas.
Specifically (wrt routing):
Related issue number
closes #46938
Checks
- I've signed off every commit (by using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.