add annotation that labels jobs as unsafe to evict #1370

Closed

Conversation

tylerpotts
Contributor

@tylerpotts tylerpotts commented Apr 25, 2023

Adds the annotation `.annotation("autoscaler.kubernetes.io/safe-to-evict", "false")` to all Metaflow jobs on Kubernetes.
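
For illustration only (this is not the actual diff in this PR), the equivalent pod metadata could look roughly like the sketch below, assuming you build the spec with the official `kubernetes` Python client. The cluster autoscaler documents this annotation as `cluster-autoscaler.kubernetes.io/safe-to-evict`:

```python
# Illustrative sketch, not the PR's implementation: pod metadata carrying the
# annotation that tells the cluster autoscaler not to evict this pod when it
# scales the node group down.
from kubernetes import client

pod_metadata = client.V1ObjectMeta(
    name="metaflow-task",  # hypothetical pod name for the example
    annotations={
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
    },
)

print(pod_metadata.annotations)
```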

@savingoyal
Collaborator

@tylerpotts - would it be helpful to use an OPA (Open Policy Agent) admission controller to add this annotation to all the relevant pods instead of making a client-side change?

@tylerpotts
Contributor Author

@savingoyal We certainly could do that. I understand that in general it's good to keep cluster-specific configurations separate from a service like Metaflow.

In this case, however, I think it's prudent to add this functionality on the client side. Metaflow jobs/pods aren't stateless, so rescheduling them has a real impact on users: whenever the autoscaler tries to move a Metaflow pod, the user loses all progress for that flow. Additionally, Metaflow's @retry doesn't currently handle the resulting error, so the entire flow dies.

I also think that requiring an OPA raises the barrier to entry for a stable experience with Metaflow on k8s. Not only does a user need to diagnose the cause of the failing Metaflow jobs, they also need to know what an OPA is and go through the installation and configuration process.

@shrinandj do you have any additional thoughts here?

@roofurmston
Copy link

roofurmston commented May 3, 2023

FWIW we use Argo Workflows for both production & development Metaflow pipelines in Kubernetes. We do this for various reasons, such as consistent management of resources on the cluster for our platform team.

One additional benefit is that Argo workflows are not susceptible to these autoscaling issues, because the cluster autoscaler will not evict pods that are not backed by a controller object (i.e., not created by a Deployment, ReplicaSet, Job, StatefulSet, etc.), which is the case for Argo's pods.

We have built a bit of machinery on our side to use Argo for both production & development, so it is not out of the box functionality you get with Metaflow. However, I guess it might still be of interest to you.

@shrinandj
Contributor

IMHO, before this annotation, Metaflow leaned towards cost-efficiency instead of reliability. That's why we saw nodes getting terminated even when there were some pods running on them. Technically, a user could use @retry and restart the tasks. But that could've resulted in the same behavior AFAICT.

It seems reasonable to actually lean towards reliability and ensure that flows/tasks get the best chance of running to completion. Therefore, adding this annotation makes sense.

We could make this configurable if there are cases where someone wants the old behavior. But it seems unlikely that anyone would.

We should certainly note this change in behavior in the release notes so that users aren't surprised.

@savingoyal
Collaborator

savingoyal commented May 7, 2023

@tylerpotts I understand the need for an out-of-the-box solution, but setting safe-to-evict to false might not be desirable for all workloads (in some scenarios, it is in fact desirable for the workload to be terminated and the cluster to be scaled down). @retry not handling this scenario is a bug that we will investigate. One workaround here would be to enable custom annotations to be passed to Metaflow tasks through the Metaflow config, which would allow you to ensure the desired behavior.
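
As a rough illustration of that workaround (the config key and helper below are hypothetical, not Metaflow's actual API), annotations could be supplied as JSON through the config/environment and merged into each task's pod metadata:

```python
import json
import os

# Hypothetical config key -- NOT Metaflow's actual API; shown only to
# illustrate the shape of a config-driven annotation mechanism.
ANNOTATIONS_ENV_VAR = "METAFLOW_KUBERNETES_ANNOTATIONS"


def pod_annotations_from_config():
    """Read a JSON object of pod annotations from the environment/config."""
    raw = os.environ.get(ANNOTATIONS_ENV_VAR, "{}")
    return json.loads(raw)


# Example:
#   export METAFLOW_KUBERNETES_ANNOTATIONS='{"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}'
print(pod_annotations_from_config())
```

Operators who want safe-to-evict behavior could opt in through such a config, while everyone else keeps the current defaults.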

@tylerpotts
Contributor Author

> One workaround here would be to enable custom annotations to be passed to Metaflow tasks through the Metaflow config, which would allow you to ensure the desired behavior.

I think custom annotations are a great idea here.

@savingoyal
Collaborator

Closing this PR in favor of future work that will enable custom annotations.
