[serve] mark proxy as unready when its routers are aware of zero replicas #47002

Merged
zcin merged 24 commits into ray-project:master from proxy-zero-replicas
Aug 20, 2024

Conversation

zcin
Contributor

@zcin zcin commented Aug 7, 2024

Why are these changes needed?

Pinging /-/routes or /-/healthz returns 503 if the proxy is unavailable, signaling that the proxy is not ready to serve traffic.
Previously, the proxy returned 503 only if the route table had not been populated yet. With this PR, the proxy also returns 503 if it hasn't received any replicas.
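
A minimal sketch of how a readiness probe might interpret these endpoints. The base URL http://localhost:8000 is taken from the tests in this PR; the helper names `http_status` and `probe_proxy` and the injectable `fetch_status` parameter are hypothetical, added only so the logic can be exercised without a running cluster.

```python
from typing import Callable, Dict
import urllib.error
import urllib.request

# Health endpoints named in this PR; the default base URL is an assumption.
ENDPOINTS = ("/-/healthz", "/-/routes")


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        # urllib raises on 4xx/5xx responses; the code is still what we want.
        return e.code


def probe_proxy(
    base_url: str = "http://localhost:8000",
    fetch_status: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each health endpoint to True if it returned 200 (ready)."""
    return {path: fetch_status(base_url + path) == 200 for path in ENDPOINTS}
```

A load balancer or test would treat any non-200 result (such as the 503 described above) as "do not route traffic to this proxy".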

Specifically (wrt routing):

  • If the route table is not populated, the proxy is not ready / unavailable.
  • If the proxy is on a worker node and the route table is populated, but none of the handles (one per endpoint) have received a nonempty set of running replicas from the controller, the proxy is not ready.
    • Note that if the proxy is on the head node, it is considered ready for traffic even if it doesn't have any replicas; this handles the scale-to-zero case. If all deployments are scaled to zero, the head node proxy will be the only remaining proxy, and it needs to receive requests to trigger upscaling.
  • Otherwise, the proxy is ready.
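
The rules above can be sketched as a single decision function. This is an illustration under stated assumptions, not the code merged in this PR: the function signature, the boolean inputs, and the reason strings are all hypothetical.

```python
from typing import Iterable, Tuple


def ready_for_traffic(
    route_table_populated: bool,
    is_head: bool,
    handles_populated: Iterable[bool],
) -> Tuple[bool, str]:
    """Return (ready, reason) following the three rules listed above.

    `handles_populated` holds one flag per handle: True if that handle has
    received a nonempty set of running replicas. Reason strings are made up.
    """
    if not route_table_populated:
        return False, "route table not populated"
    # The head node proxy stays ready so it can receive the request that
    # triggers upscaling when every deployment is scaled to zero.
    if is_head:
        return True, ""
    # A worker node proxy needs at least one handle that has seen replicas.
    if any(handles_populated):
        return True, ""
    return False, "no replicas received"
```

Checking `is_head` before inspecting the handles is what keeps at least one proxy reachable in the scale-to-zero case.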

Related issue number

closes #46938

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

zcin added 5 commits August 7, 2024 11:53
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin force-pushed the proxy-zero-replicas branch from 641ddd5 to c2650e5 Compare August 13, 2024 00:22
zcin added 5 commits August 13, 2024 09:57
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@@ -417,124 +417,6 @@ def check_proxy_status(proxy_status_to_count):
    serve.shutdown()


def test_healthz_and_routes_on_head_and_worker_nodes(
Contributor Author

moved to test_cluster.py

zcin added 5 commits August 13, 2024 14:43
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
    r = requests.post("http://localhost:8000/-/routes")
    assert r.status_code == 200

def test_head_and_worker_nodes_no_replicas(self, ray_cluster: Cluster):
Contributor Author

this is from test_standalone_3, unchanged

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin marked this pull request as ready for review August 14, 2024 20:01
@zcin zcin requested review from edoakes and GeneDer August 14, 2024 20:01
@zcin zcin changed the title [serve] mark proxy as unhealthy when there are zero replicas [serve] mark proxy as unready when its routers are aware of zero replicas Aug 14, 2024
zcin added 2 commits August 14, 2024 13:11
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Contributor

@edoakes edoakes left a comment

Stylistic comments inline.

This needs to handle the scale-to-zero case, which I'm not seeing (let me know if missed). If all applications have min_replicas=0 (or there is only one), we could end up in a scenario where all proxies fail health checks and receive no traffic.


def update_routes(self, endpoints: Dict[DeploymentID, EndpointInfo]) -> None:
    logger.info(
        f"Got updated endpoints: {endpoints}.", extra={"log_to_stderr": False}
    )
    self._route_table_populated = True
Contributor

should this also require that the size of the route_table is > 0?

Contributor Author

yeah that makes sense, changed it so that the flag is only flipped on if a nonempty list of routes was received.
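
A minimal sketch of that change, using a toy class rather than the real proxy router. The class name `RouteTable` and its attributes are hypothetical. The sketch also leaves the flag set once it has been flipped on, which is an assumption: the comment above only says when the flag is flipped on.

```python
from typing import Dict


class RouteTable:
    """Toy stand-in for the proxy router's route bookkeeping (hypothetical)."""

    def __init__(self) -> None:
        self.routes: Dict[str, str] = {}
        self.populated = False

    def update_routes(self, endpoints: Dict[str, str]) -> None:
        """Store the latest routes and record whether any were ever received."""
        self.routes = dict(endpoints)
        # Flip the flag on only for a nonempty update. An empty update does
        # not count as "populated" (and, by assumption, does not reset it).
        if endpoints:
            self.populated = True
```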

    return True, ""

for handle in self.handles.values():
    if handle.running_replicas_populated():
Contributor

this name is misleading -- it could be that the running replicas have been populated at least once, but the size of the list is 0

suggest to call it something like has_running_replicas

Contributor Author

hmm so right now what I've implemented is that the proxy is considered ready if it has received a nonempty list of replicas at least once; if the number of replicas later scales back down to 0, the flag isn't flipped off. For this I think running_replicas_populated is probably still more accurate. Implementing it this way mainly targets the "proxy just came up and hasn't had a chance to receive replicas yet before the head node goes down" case.

I can also change it so that if the number of replicas scales back down to 0, the flag is flipped off and the proxy becomes unhealthy, so it won't receive traffic. This would cover more cases, but would put more pressure on the head node proxy when all ingress deployments have scaled to zero. WDYT?
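
The difference between the two behaviors discussed here can be sketched with a toy class. `running_replicas_populated` is the name used in this PR and `has_running_replicas` is the name suggested in review; the class `HandleState`, the method `update_running_replicas`, and the internals are hypothetical.

```python
from typing import List


class HandleState:
    """Toy model of a handle's replica tracking (hypothetical names)."""

    def __init__(self) -> None:
        self._replicas: List[str] = []
        self._populated_once = False

    def update_running_replicas(self, replicas: List[str]) -> None:
        """Record the latest replica list pushed by the controller."""
        self._replicas = list(replicas)
        if replicas:
            # Sticky: set on the first nonempty update and never reset.
            self._populated_once = True

    def running_replicas_populated(self) -> bool:
        """True once a nonempty replica list has ever been received."""
        return self._populated_once

    def has_running_replicas(self) -> bool:
        """True only while the current replica list is nonempty."""
        return bool(self._replicas)
```

With the sticky flag, a worker node proxy that has seen replicas once stays ready after a scale-down to zero. The stricter check would mark it unready again and leave only the head node proxy to take traffic.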

Contributor

@edoakes edoakes Aug 19, 2024

Got it. I think the existing naming (and behavior) is reasonable then.

zcin added 2 commits August 19, 2024 11:01
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin
Contributor Author

zcin commented Aug 19, 2024

This needs to handle the scale-to-zero case, which I'm not seeing (let me know if missed). If all applications have min_replicas=0 (or there is only one), we could end up in a scenario where all proxies fail health checks and receive no traffic.

Synced offline: the special case for the head node is meant to handle the scale-to-zero case. The head node proxy will not fail its health check if its handles have zero replicas.

Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin added the go add ONLY when ready to merge, run all tests label Aug 19, 2024
zcin added 2 commits August 19, 2024 15:41
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
Signed-off-by: Cindy Zhang <cindyzyx9@gmail.com>
@zcin zcin merged commit e9c0809 into ray-project:master Aug 20, 2024
5 checks passed
@zcin zcin deleted the proxy-zero-replicas branch August 21, 2024 15:11
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[serve] Mark proxy as not ready when it hasn't received any replicas yet
2 participants