[serve] mark proxy as unready when its routers are aware of zero replicas #47002
Conversation
@@ -417,124 +417,6 @@ def check_proxy_status(proxy_status_to_count):
    serve.shutdown()


def test_healthz_and_routes_on_head_and_worker_nodes(
moved to test_cluster.py
        r = requests.post("http://localhost:8000/-/routes")
        assert r.status_code == 200


    def test_head_and_worker_nodes_no_replicas(self, ray_cluster: Cluster):
this is from test_standalone_3, unchanged
Stylistic comments inline.
This needs to handle the scale-to-zero case, which I'm not seeing (let me know if I missed it). If all applications have min_replicas=0 (or there is only one), we could end up in a scenario where all proxies fail health checks and receive no traffic. A minimal example of such an app is sketched below.
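For concreteness, a minimal (made-up, not from this PR) app that hits that case: with min_replicas=0 in its autoscaling config, the only ingress deployment can drop to zero replicas while idle, so a naive "unready until replicas are seen" check could leave every proxy failing its health check.

```python
# Illustrative sketch only: a single ingress deployment that is allowed
# to scale down to zero replicas when idle.
from ray import serve


@serve.deployment(autoscaling_config={"min_replicas": 0, "max_replicas": 2})
class Idle:
    async def __call__(self, request) -> str:
        return "ok"


app = Idle.bind()
# serve.run(app)  # after the idle timeout, the replica count can drop to 0
```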
    def update_routes(self, endpoints: Dict[DeploymentID, EndpointInfo]) -> None:
        logger.info(
            f"Got updated endpoints: {endpoints}.", extra={"log_to_stderr": False}
        )
        self._route_table_populated = True
should this also require that the size of the route_table is > 0?
Yeah, that makes sense. Changed it so that the flag is only flipped on if a nonzero list of routes was received.
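For clarity, a standalone sketch of that adjustment (the class and type aliases are simplified stand-ins, not the actual router code):

```python
from typing import Dict

# Simplified stand-ins for Serve's DeploymentID / EndpointInfo types.
DeploymentID = str
EndpointInfo = dict


class RouteTableState:
    """The flag only flips on once a nonzero set of endpoints is received."""

    def __init__(self) -> None:
        self._route_table_populated = False

    def update_routes(self, endpoints: Dict[DeploymentID, EndpointInfo]) -> None:
        if endpoints:  # a nonzero list of routes received at least once
            self._route_table_populated = True

    def route_table_populated(self) -> bool:
        return self._route_table_populated
```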
            return True, ""

        for handle in self.handles.values():
            if handle.running_replicas_populated():
This name is misleading -- it could be that the running replicas have been populated at least once, but the size of the list is 0. Suggest calling it something like has_running_replicas.
Hmm, right now what I've implemented is that the proxy is considered ready if it has received a non-zero list of replicas at least once; if the number of replicas later scales back down to 0, the flag isn't flipped off. For that behavior I think running_replicas_populated is still the more accurate name. Implementing it this way targets the "proxy just came up and hasn't had a chance to receive replicas yet before the head node goes down" case.
I can also change it so that if the number of replicas scales back down to 0, the flag is flipped off and the proxy becomes unhealthy so that it won't receive traffic. That would cover more bases, but would put more pressure on the head node proxy when all ingress deployments have scaled to zero. WDYT?
Got it. I think the existing naming (and behavior) is reasonable then.
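To pin down the agreed-upon semantics, a small sketch (illustrative, not the actual handle/router code): the flag latches on the first time a nonzero replica list is seen and is never flipped off, even if replicas later scale back to zero.

```python
from typing import List


class ReplicaAwareness:
    """Sketch of the latching running_replicas_populated() behavior."""

    def __init__(self) -> None:
        self._replicas_populated = False

    def update_running_replicas(self, replica_ids: List[str]) -> None:
        if replica_ids:
            # Latch on after the first nonzero update; intentionally not
            # flipped back off if replicas later scale down to zero.
            self._replicas_populated = True

    def running_replicas_populated(self) -> bool:
        return self._replicas_populated
```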
Synced offline: the special case for the head node is meant to handle the scale-to-zero case. The head node proxy will not fail its health check if its handles have zero replicas.
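Putting this together, a hedged sketch of the readiness rule as described in this thread (the function and parameter names are illustrative, not the actual proxy code):

```python
def proxy_ready(
    route_table_populated: bool,
    any_handle_has_seen_replicas: bool,
    is_head_node: bool,
) -> bool:
    """True if /-/healthz and /-/routes should report 200 for this proxy."""
    if not route_table_populated:
        return False
    if is_head_node:
        # Special case for scale-to-zero: the head-node proxy stays ready
        # even when its handles report zero replicas, so there is always an
        # entry point that can wake scaled-down deployments.
        return True
    return any_handle_has_seen_replicas
```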
Why are these changes needed?
Pinging /-/routes and /-/healthz returns 503 if the proxy is unavailable, signaling that the proxy is not ready to serve traffic. Previously the proxy returned 503 only if the route table had not been populated yet; this PR makes the proxy also return 503 if it hasn't received any replicas.
Specifically (wrt routing):
Related issue number
closes #46938
Checks
- I've signed off every commit (by using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.