add annotation that labels jobs as unsafe to evict #1370

Closed

Conversation

tylerpotts
Contributor

@tylerpotts tylerpotts commented Apr 25, 2023

Adds the annotation `.annotation("autoscaler.kubernetes.io/safe-to-evict", "false")` to all Metaflow jobs on Kubernetes.
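
For illustration only (this is not the actual diff in this PR), the equivalent pod metadata could look roughly like the sketch below, assuming you build the spec with the official `kubernetes` Python client. The cluster autoscaler documents this annotation as `cluster-autoscaler.kubernetes.io/safe-to-evict`:

```python
# Illustrative sketch, not the PR's implementation: pod metadata carrying the
# annotation that tells the cluster autoscaler not to evict this pod when it
# scales the node group down.
from kubernetes import client

pod_metadata = client.V1ObjectMeta(
    name="metaflow-task",  # hypothetical pod name for the example
    annotations={
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
    },
)

print(pod_metadata.annotations)
```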

@savingoyal
Collaborator

@tylerpotts - would it be helpful to use an OPA (Open Policy Agent) admission controller to add this annotation to all the relevant pods instead of making a client-side change?

@tylerpotts
Contributor Author

@savingoyal We certainly could do that. I understand that in general it's good to keep cluster-specific configurations separate from a service like Metaflow.

In this case, however, I think it's prudent to add this functionality on the client side. Metaflow jobs/pods aren't stateless, so rescheduling them has a real impact on users: whenever the autoscaler tries to move a Metaflow pod, the user loses all progress for that flow. Additionally, Metaflow's @retry doesn't currently handle the resulting error, so the entire flow dies.

I also think that requiring an OPA raises the barrier to entry for a stable experience with Metaflow on k8s. Not only does a user need to diagnose the cause of the failing Metaflow jobs, they also need to know what an OPA is and go through the installation and configuration process.

@shrinandj do you have any additional thoughts here?

@roofurmston
Copy link

roofurmston commented May 3, 2023

FWIW we use Argo Workflows for both production & development Metaflow pipelines in Kubernetes. We do this for various reasons, such as consistent management of resources on the cluster for our platform team.

One additional benefit is that Argo workflows are not susceptible to these autoscaling issues, because the cluster autoscaler will not evict pods that are not backed by a controller object (i.e., not created by a Deployment, ReplicaSet, Job, StatefulSet, etc.), which is the case for Argo's pods.

We have built a bit of machinery on our side to use Argo for both production & development, so it is not out of the box functionality you get with Metaflow. However, I guess it might still be of interest to you.

@shrinandj
Contributor

IMHO, before this annotation, Metaflow leaned towards cost-efficiency instead of reliability. That's why we saw nodes getting terminated even when there were some pods running on them. Technically, a user could use @retry and restart the tasks. But that could've resulted in the same behavior AFAICT.

It seems reasonable to actually lean towards reliability and ensure that flows/tasks get the best chance of running to completion. Therefore, adding this annotation makes sense.

We could make this configurable if there are cases where someone wants the old behavior. But it seems unlikely that anyone would.

We should certainly note this change in behavior in the release notes so that users aren't surprised.

@savingoyal
Collaborator

savingoyal commented May 7, 2023

@tylerpotts I understand the need for an out-of-the-box solution, but setting safe-to-evict to false might not be desirable for all workloads (in some scenarios, it is in fact desirable for the workload to be terminated and the cluster to be scaled down). @retry not handling this scenario is a bug that we will investigate. One workaround here would be to enable custom annotations to be passed to Metaflow tasks through the Metaflow config, which would allow you to ensure the desired behavior.
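
As a rough illustration of that workaround (the config key and helper below are hypothetical, not Metaflow's actual API), annotations could be supplied as JSON through the config/environment and merged into each task's pod metadata:

```python
import json
import os

# Hypothetical config key -- NOT Metaflow's actual API; shown only to
# illustrate the shape of a config-driven annotation mechanism.
ANNOTATIONS_ENV_VAR = "METAFLOW_KUBERNETES_ANNOTATIONS"


def pod_annotations_from_config():
    """Read a JSON object of pod annotations from the environment/config."""
    raw = os.environ.get(ANNOTATIONS_ENV_VAR, "{}")
    return json.loads(raw)


# Example:
#   export METAFLOW_KUBERNETES_ANNOTATIONS='{"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}'
print(pod_annotations_from_config())
```

Operators who want safe-to-evict behavior could opt in through such a config, while everyone else keeps the current defaults.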

@tylerpotts
Contributor Author

> One workaround here would be to enable custom annotations to be passed to Metaflow tasks through the Metaflow config, which would allow you to ensure the desired behavior.

I think custom annotations are a great idea here.

@savingoyal
Collaborator

Closing this PR in favor of future work that will enable custom annotations.
