add annotation that labels jobs as unsafe to evict #1370
Conversation
@tylerpotts - would it be helpful to use an OPA to add this label to all the relevant pods instead of making a client-side change?
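An OPA/Gatekeeper setup would express this policy in its own configuration language rather than Python, but the cluster-side approach being suggested boils down to a mutating admission webhook that patches pod metadata at creation time. The following is a hypothetical, language-neutral sketch of that idea in Python; the `/mutate` route, the `app=metaflow` label selector, and all names are illustrative assumptions, not Metaflow or OPA code:

```python
import base64
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/mutate", methods=["POST"])
def mutate():
    review = request.get_json()
    pod = review["request"]["object"]
    labels = pod.get("metadata", {}).get("labels", {})

    patch = []
    # Illustrative selector: only touch pods that look like Metaflow tasks.
    if labels.get("app") == "metaflow":
        if "annotations" not in pod.get("metadata", {}):
            patch.append(
                {"op": "add", "path": "/metadata/annotations", "value": {}}
            )
        # "/" inside a JSONPatch path segment is escaped as "~1".
        patch.append({
            "op": "add",
            "path": "/metadata/annotations/"
                    "cluster-autoscaler.kubernetes.io~1safe-to-evict",
            "value": "false",
        })

    # Return a v1 AdmissionReview response with the base64-encoded patch.
    return jsonify({
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": review["request"]["uid"],
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    })
```

The trade-off under discussion is exactly this: the cluster-side version keeps Metaflow's client untouched, but someone has to deploy and operate that webhook (or an OPA) on every cluster.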
@savingoyal We certainly could do that. I understand that, in general, it's good to keep cluster-specific configuration separate from a service like Metaflow. In this case, however, I think it is prudent to add this functionality on the client side. Metaflow jobs/pods aren't stateless, so there is a real impact on users if Kubernetes attempts to reschedule them: whenever the autoscaler moves a Metaflow pod, the user loses all progress for that flow. I also think that requiring an OPA raises the barrier to entry for a stable Metaflow-on-Kubernetes experience. Not only does a user need to diagnose the cause of the failing Metaflow jobs, they also need to know what an OPA is and go through its installation and configuration. @shrinandj do you have any additional thoughts here?
FWIW, we use Argo Workflows for both production and development Metaflow pipelines on Kubernetes. We do this for various reasons, such as consistent management of resources on the cluster by our platform team. One additional benefit is that workflows in Argo are not susceptible to these auto-scaling issues. See here. We have built a fair amount of machinery on our side to use Argo for both production and development, so this is not out-of-the-box Metaflow functionality, but it might still be of interest to you.
IMHO, before this annotation, Metaflow leaned towards cost-efficiency instead of reliability. That's why we saw nodes getting terminated even when some pods were still running on them. Technically, a user could use …

It seems reasonable to actually lean towards reliability and ensure that flows/tasks get the best chance of running to completion. Therefore, adding this annotation makes sense. We could make this configurable if there are cases where someone wants the old behavior, but it seems unlikely that anyone would (see the sketch below).

We should certainly note this change in behavior in the release notes so that users don't get surprised.
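To make the "make this configurable" idea concrete, here is a minimal sketch of what such a knob could look like. The `METAFLOW_KUBERNETES_SAFE_TO_EVICT` environment variable and the helper below are hypothetical, sketched for this discussion, not actual Metaflow configuration:

```python
import os

# Hypothetical knob: defaults to the new, reliability-first behavior.
# Setting METAFLOW_KUBERNETES_SAFE_TO_EVICT=true would restore the old,
# cost-efficient behavior where the autoscaler may evict task pods.
SAFE_TO_EVICT = os.environ.get("METAFLOW_KUBERNETES_SAFE_TO_EVICT", "false")

def autoscaler_annotations():
    # Annotations to attach to every Metaflow task pod.
    return {"cluster-autoscaler.kubernetes.io/safe-to-evict": SAFE_TO_EVICT}
```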
@tylerpotts I understand the need for an out-of-the-box solution, but setting this annotation unconditionally on every job may not be the right default for every cluster.
I think custom annotations are a great idea here.
Closing this PR in favor of future work that will enable custom annotations.
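For illustration, user-supplied custom annotations could look roughly like the following from a flow author's perspective. The `annotations` parameter on `@kubernetes` shown here is a hypothetical interface sketched for this discussion, not an API this PR adds:

```python
from metaflow import FlowSpec, kubernetes, step

class EvictionSafeFlow(FlowSpec):

    # Hypothetical: pass arbitrary pod annotations through the decorator,
    # letting users opt individual steps out of autoscaler eviction.
    @kubernetes(annotations={
        "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
    })
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    EvictionSafeFlow()
```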
Adds the annotation `.annotation("cluster-autoscaler.kubernetes.io/safe-to-evict", "false")` to all Metaflow jobs on Kubernetes.
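To show where that one-liner lands, here is a simplified sketch of the chained-builder pattern it hooks into; this is an illustrative stand-in, not Metaflow's actual `KubernetesJob` implementation. Each `.annotation()` call accumulates metadata that ends up on the task pod:

```python
# Illustrative stand-in for Metaflow's job builder (names are assumptions).
class KubernetesJob:
    def __init__(self, name):
        self._name = name
        self._annotations = {}

    def annotation(self, key, value):
        # Record one pod annotation and return self, so calls can be chained
        # exactly like the one-line change in this PR.
        self._annotations[key] = value
        return self

    def to_pod_metadata(self):
        # Metadata that would be rendered into the pod spec.
        return {"name": self._name, "annotations": dict(self._annotations)}

job = (
    KubernetesJob("metaflow-task")
    .annotation("cluster-autoscaler.kubernetes.io/safe-to-evict", "false")
)
print(job.to_pod_metadata())
# {'name': 'metaflow-task',
#  'annotations': {'cluster-autoscaler.kubernetes.io/safe-to-evict': 'false'}}
```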