Enhancement: Marking a pending pod as failed after a certain timeout #113211
Comments
@kannon92: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
/sig apps
A regex is certainly not an API we would like to encourage, as the messages can change from one release to the next. Is it fair to say that you would like to target these specific Pending scenarios?
Are there other Pending scenarios you think are useful to target? The additional requirement for the API is where to define the timeouts.
Maybe @smarterclayton has given this a thought already :)
So I asked around. Other areas that we found useful are InvalidConfigMaps, InvalidSecrets, and bad PVCs. You covered a lot of them in the Non-Goals of the KEP. We also have issues with the GPU plugin not being able to initialize.
I guess we can cover most of them in a single PodCondition.
@alculquicondor Could you explain a bit more about why this couldn't be the responsibility of the Job controller? The one thing that comes to mind is that this is a bigger problem than just for users scheduling jobs. But I don't quite understand the distinction between the responsibilities of sig-node and sig-apps. Is it primarily because the kubelet is responsible for determining if a Pod has failed?
Yes.
Could we provide this functionality as an out-of-tree add-on to Kubernetes? Is there anything about this enhancement proposal that requires a change to Kubernetes itself?
It depends. I'm not sure whether, in all the cases that @kannon92 mentioned, the Pod is stuck in the Pending phase. If so, yes, I think an out-of-tree controller could implement the functionality. If the kubelet is the one that transitions the pods to Failed, then we need kubelet changes, because we need to add a Pod condition atomically.
An important question, though, is how an out-of-tree controller can detect these failures. Is there a stable Pod condition type or reason that can be monitored? If the controller has to run regexes on error messages, that can easily break from one release to another. So, at the very least, upstream should have consistent Pod conditions.
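To illustrate the "stable condition instead of regex" point, here is a minimal sketch (not an existing controller; the function name and parameters are assumptions) of what an out-of-tree component could key on today:

```go
// Minimal sketch: detect a long-Pending pod via its phase and a stable
// condition, rather than by regexing human-readable messages.
package pendingsketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// stuckPendingFor returns how long the pod has held the given condition in the
// given status while still in the Pending phase, or 0 if it does not match.
func stuckPendingFor(pod *corev1.Pod, condType corev1.PodConditionType, status corev1.ConditionStatus) time.Duration {
	if pod.Status.Phase != corev1.PodPending {
		return 0
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == condType && cond.Status == status {
			// LastTransitionTime records when the condition entered its current state.
			return time.Since(cond.LastTransitionTime.Time)
		}
	}
	return 0
}
```

For example, stuckPendingFor(pod, corev1.PodScheduled, corev1.ConditionFalse) would report how long a pod has been unschedulable without parsing any message text; the hard part, as noted above, is that not every failure mode is surfaced through a stable condition today.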
@alculquicondor and I followed up offline. I created a repo that goes through the common cases our batch users experience with the Pending status, and a document that walks through these use cases and shows what conditions, statuses, and events each state produces: https://github.com/kannon92/PendingPodChecks/blob/main/README.md
I did a little POC of this in my own fork: the PRs are kannon92#2 and kannon92#1. In those PRs I added the ability for the Job controller to fail pods based on conditions for pending pods. A brief review from @alculquicondor explained that we will need a much stricter API to deal with transient conditions. Transient conditions are ones that transition False -> True -> False throughout a pod's lifecycle. I don't think there is any guarantee that Pod conditions get set once and never touched again, so an API needs to take this into consideration. Our hope is to bring up with sig-node the possibility of adding a new API to the kubelet that can terminate pods based on conditions. This API will need to match on conditions but also take transient condition setting into account. In general I see two mutually exclusive feature requests to achieve this:
Conditions exist already, so I believe 2 can be addressed separately from 1. Some of these conditions could be UnableToSchedule or PodHasNetwork (Beta in 1.26), so it is possible to at least flesh out an API with these conditions.
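Purely as a sketch of the "stricter API" idea above (none of these type or field names exist in Kubernetes), one way to express "fail the pod only if a condition has held continuously for some minimum time", which sidesteps transient False -> True -> False flips, might look like this:

```go
// Hypothetical API sketch only; these types do not exist in Kubernetes.
package pendingsketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PendingFailureRule matches a pod condition plus a minimum duration the
// condition must hold continuously before the pod may be failed, so a brief
// transient flip of the condition does not trigger the rule.
type PendingFailureRule struct {
	ConditionType   corev1.PodConditionType
	Status          corev1.ConditionStatus
	MinimumDuration metav1.Duration
}

// Matches checks the rule against the pod's currently reported conditions,
// relying on LastTransitionTime so only a continuously held condition qualifies.
func (r PendingFailureRule) Matches(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == r.ConditionType && cond.Status == r.Status {
			return time.Since(cond.LastTransitionTime.Time) >= r.MinimumDuration.Duration
		}
	}
	return false
}
```

The duration requirement is the piece that makes the rule robust to transient flips: a condition that briefly went bad and recovered resets its LastTransitionTime and never accumulates enough time to match.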
/cc @thockin
This is an interesting idea, but I wonder if it's a sub-case of a more general capability. Something like "tolerance" (a bad name because it's too close to tolerations), which allows a pod to express bounded-duration situations it can or cannot deal with: "I can accept up to 30 seconds of node-unready; after that, kill this pod." "I can accept up to 5 minutes of pending; after that, kill this pod."
Your suggestion makes it sound like some kind of readiness check. Sort of a check to make sure the Pod can continue, rather than a test of the container's ability to be ready. For this general case, what kinds of fields should we watch for "tolerance"? Conditions seem obvious to me because they kind of tell the progress of the pod, but I'm not sure if there are other items I should think of.
An "environmental suitability check" :) I don't think you want to define it exclusively (or maybe at all?) in terms of conditions, but I haven't thought about that. The one that came up in the past was "node unready" and/or "node unreachable". Some apps have no state and want to be killed more "eagerly" if the node seems down. Other apps have local state that may be expensive to rebuild, and would prefer to wait longer. I could see "disk full" being another or "node OOM rate". |
@msau42 wrt stateful stuff
In general, I started thinking about this issue because users can submit invalid pods (invalid secrets, volumes mounted from non-existent secrets, wrong taints/tolerations for nodes), and all those states occur while a pod is Pending. There are various ways to deduce what went wrong, but it is difficult to programmatically handle all these states in a third-party controller. In the project I work on, we use the reason field (in ContainerStateWaiting) for cases related to invalid Docker image names and invalid config maps, the condition UnableToSchedule for scheduling issues, and events for volume-mounting issues. I wrote up a lot of these cases in https://github.com/kannon92/PendingPodChecks/blob/main/README.md if you are curious. One condition that is being added to Kubernetes is PodHasNetwork, and that would also represent a case where I think the Pod is stuck. So that is why I was thinking of conditions.
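To make the "several different signals" point concrete, here is a small sketch of the ContainerStateWaiting part. The reason strings listed are ones commonly reported by the kubelet/runtime in these situations; they are illustrative, not a guaranteed or stable set:

```go
// Sketch of inspecting container waiting reasons on a Pending pod.
package pendingsketch

import corev1 "k8s.io/api/core/v1"

// stuckReasons lists waiting reasons that, in practice, tend to mean the pod
// will not make progress without user action (illustrative selection only).
var stuckReasons = map[string]bool{
	"InvalidImageName":           true,
	"ErrImagePull":               true,
	"ImagePullBackOff":           true,
	"CreateContainerConfigError": true, // e.g. referencing a missing ConfigMap or Secret key
}

// waitingReason returns the first "stuck" waiting reason found on the pod's
// containers, or "" if none is present.
func waitingReason(pod *corev1.Pod) string {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Waiting != nil && stuckReasons[cs.State.Waiting.Reason] {
			return cs.State.Waiting.Reason
		}
	}
	return ""
}
```

This is exactly the kind of string matching the thread is trying to avoid as a long-term API: the strings come from the kubelet and runtime and are not guaranteed stable across releases, which is the argument for consistent Pod conditions.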
@thockin we also wrote Job failure policies to match conditions, but this is limited to Pods that are also marked as Failed: https://kubernetes.io/docs/concepts/workloads/controllers/job/#pod-failure-policy
Commented in kubernetes-sigs/descheduler#1183 (comment) downstream, but I think this falls between needing work from the scheduler and the descheduler. The descheduler currently only looks at pods that are on a node, which we could change. But making a deterministic decision about eviction would be better with some standard condition on the pod. IMO, a core controller such as the scheduler should be the one to mutate the pod to add that condition; then downstream components like the descheduler could act on it. The scheduler should also know to remove those pods from the scheduling queue.
The scheduler already adds the PodScheduled=False condition for pods it cannot place. Is there anything else needed?
Sorry, I haven't read through this whole thread, but if all this takes is evicting pods with PodScheduled=False...
This thread is long, but the main point we are discussing is evicting pods that are stuck in the Pending phase. I don't think it is as simple as PodScheduled=False (though that is one of the cases!). In the KEP I tried to summarize common cases that our users get tripped up on. When I created the KEP I wasn't aware of the PodReadyToStartContainers condition, so I wanted to target the more complicated cases. In most of the cases that I have found, the Pod gets scheduled and then gets stuck after the scheduling stage.
I am assuming that this functionality is not available yet, right? Is there any workaround using kubelet configuration, perhaps? Is it the same on k3s distributions? Is there maybe any sort of workaround, such as a CronJob that regularly checks for pending pods and terminates the ones that have been there too long?
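For illustration only (this is not an upstream feature): under the assumption that a periodic sweep run from a CronJob with sufficient RBAC is acceptable, a client-go program along these lines could delete pods that have been Pending longer than a threshold. Note that it deletes the pods rather than marking them Failed, which is part of why the in-tree behavior is being requested; the 15-minute threshold and in-cluster configuration are assumptions.

```go
// Rough sketch of a CronJob-style sweep for long-Pending pods.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the sweep runs inside the cluster
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const maxPending = 15 * time.Minute
	ctx := context.Background()

	// List only pods the API server reports as Pending.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		FieldSelector: "status.phase=Pending",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Creation time is used as a simple proxy for time spent Pending,
		// since a Pending pod has not started running yet.
		if time.Since(p.CreationTimestamp.Time) < maxPending {
			continue
		}
		fmt.Printf("deleting long-pending pod %s/%s\n", p.Namespace, p.Name)
		if err := client.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			fmt.Println("delete failed:", err)
		}
	}
}
```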
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the same rules as above. You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the same rules as above. You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What would you like to be added?
We would like to add configurable timeouts to allow pending pods to transition to failed.
It would be ideal if it could be configurable based on container events or messages rather than a catch-all timeout.
A possible API could be as follows:
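(The example from the original issue text is not reproduced in this copy; purely as an illustration of the idea, a hypothetical shape might look like the sketch below, where every type and field name is invented.)

```go
// Hypothetical illustration only; none of these types or fields exist in Kubernetes.
package pendingsketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// PendingTimeoutPolicy would let a workload bound how long its pods may stay
// Pending, overall and per kind of blockage.
type PendingTimeoutPolicy struct {
	// OverallTimeout caps the total time spent in the Pending phase.
	OverallTimeout metav1.Duration
	// Rules optionally apply tighter timeouts to specific blockages, e.g. a
	// short timeout for image-pull errors but a longer one for scheduling delays.
	Rules []PendingTimeoutRule
}

// PendingTimeoutRule matches a container waiting reason or pod condition type.
type PendingTimeoutRule struct {
	Reason  string
	Timeout metav1.Duration
}
```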
This would allow pods with certain events/statuses to transition to Failed.
Why is this needed?
In the Armada project, we found that for batch users it is useful to control how long Pods stay Pending before marking them as failed. This allows our scheduler to remove the pods from the cluster and make room for new pods to be scheduled. It also allows users to be notified of invalid pods so they can resubmit them with correct configurations.
We have discussed this idea in conjunction with the non goals of https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures#non-goals and @alculquicondor requested that I create an issue.