deschedule pods that fail to start or restart too often #62
Comments
Seems like a reasonable ask. @kabakaev I am planning to defer this to the 0.6 release or later. Hope you are ok with that.
Ref also kubernetes/kubernetes#14796
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
A whole bunch of issues were referred to this, and then it gets auto-closed. Should users just write a controller and delete pods after too many restarts, etc.?
/reopen
@mbdas: You can't reopen an issue/PR unless you authored it or you are a collaborator.
/reopen
@k82cn: Reopened this issue.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
/reopen
@ravisantoshgudimetla: Reopened this issue.
#89 tried addressing this. Let's make sure that we're getting this in before the next release.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue.
/reopen
This looks like a reasonable proposal. Ephemeral inline volumes have the same problem: a pod gets scheduled onto a node, then the CSI driver's …
For my own understanding: is the proposal to delete such a failed pod and then let a higher-level controller (like a StatefulSet) create a new pod, or is the proposal to just move the pod off the node and schedule it again?
If I'm not mistaken, a pod gets scheduled only once for its entire lifetime. So unless it's deleted and replaced by a controller/operator, new scheduling will not happen. There is a chance the pod may be scheduled back to the bad node (for that specific use case), but proper fleet management will essentially remove a node that has a high rate of failure. In most cases it will land on a good node. For use cases where a fresh pod launch is required, any node is fine.
It should also be noted that the descheduler only considers pods for eviction that have an OwnerReference (unless this is explicitly overridden), so pods that aren't managed by a controller that would recreate them are not evicted by default.
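For reference, a minimal sketch of the ownerReferences metadata that marks a pod as controller-managed; the field names are the standard Pod API, but the pod/ReplicaSet names and uid below are invented for illustration:

```yaml
# Illustrative values only: a pod created by a ReplicaSet carries an owner
# reference like this, which is what makes it an eviction candidate by default.
metadata:
  name: web-7d4b9c6f9b-abcde
  ownerReferences:
    - apiVersion: apps/v1
      kind: ReplicaSet
      name: web-7d4b9c6f9b
      uid: 11111111-2222-3333-4444-555555555555
      controller: true
```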
That's what I thought, thanks for the confirmation.
That may be true for "broken" nodes, but not for a node that simply doesn't have enough storage capacity left for a certain kind of inline ephemeral volume. I was proposing to add capacity tracking to ensure that Kubernetes will eventually pick a node that has enough capacity, but that KEP has been postponed.
New strategy: looking at the original requirements provided in the issue description, there is a request to add a strategy that can ...
@kabakaev do you still have a need for this feature?
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@seanmalloy, I've tested the …. Unfortunately, the second part ….

I've tested the second case by breaking the CSI node plugin on one of the k8s nodes. It led to a new pod hanging in …. It seems all the necessary info is already written in the pod object:
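For illustration, a sketch of the kind of status such a stuck pod carries; the field names are standard Pod API fields, but the values here are invented for the example, not output from an actual cluster:

```yaml
# Illustrative only: a pod stuck because the CSI node plugin cannot set it up.
status:
  phase: Pending
  conditions:
    - type: PodScheduled
      status: "True"
    - type: ContainersReady
      status: "False"
      reason: ContainersNotReady
  containerStatuses:
    - name: app
      ready: false
      started: false
      restartCount: 0
      state:
        waiting:
          reason: ContainerCreating
```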
I'd imagine an extra …

/remove-lifecycle stale
@kabakaev thanks for the info. How about using the PodLifeTime strategy? We would need to add an additional strategy parameter to handle …. Maybe something like this ...

```yaml
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 300
      podStatusPhase:
        - pending
```

@damemi @ingvagabund @lixiang233 please add any additional ideas you have. Thanks!
@seanmalloy I think …
@lixiang233 yes, based on my understanding of the problem, I think it might be reasonable to have an option to only consider pods with a certain …
@ingvagabund I think the use case is to deschedule pods that are …. I know we have a lot of recently added configuration options for including/excluding pods based on different criteria (i.e. namespace and priority). But what do you think of adding one more? We could try adding this to only the … strategy.

@kabakaev do my above comments make sense to you? Would this newly proposed feature handle your use case?
Yeah, with such a short period of time, it makes sense to limit the phase. Though, maybe not to every phase. Pending is the first phase when a pod is accepted. I can't find any field in a pod's status saying when a pod transitioned into a given phase. Also, other phases are completely ignored (Failed, Succeeded), which leaves only …
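As a point of reference (an illustrative sketch with made-up values, not real cluster output): pod conditions do record a lastTransitionTime, but .status.phase itself carries no equivalent transition timestamp.

```yaml
# Illustrative values only: conditions record when they last changed,
# but there is no corresponding timestamp for the phase field itself.
status:
  phase: Running
  conditions:
    - type: Ready
      status: "False"
      lastTransitionTime: "2020-08-01T12:34:56Z"
```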
I think this can be implemented now. The consensus is to add a new strategy parameter to the … strategy. Maybe something like this ...

```yaml
---
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      maxPodLifeTimeSeconds: 300
      podStatusPhases: # <=== this is the default if not specified
        - pending
        - running
```

Another example ...
@seanmalloy @ingvagabund @kabakaev Does anyone plan to work on this? If not, I'd love to help with the feature.
/assign
@lixiang233 this feature enhancement is all yours. Thanks!
I am not sure if the implementation proposed here is addressing what was actually requested in #62 (comment). Correct me if I'm wrong, but it seems like we're talking about adding a parameter that will only evict …, which led to the request of …
But my understanding of the problem above is more that a pod that was in …

(descheduler/pkg/descheduler/pod/pods.go, line 75 at 03dbc93)
Is there somewhere else in the code where we are only selecting …? Also, is there a use case for excluding all …?
If the CNI/CSI plugin fails to set up a pod, or the pod's image is not available on a node, the pod will be in ….

@damemi Do you mean we should let every strategy customize its excluded phases?
@damemi, the first statement is true, but it is because I didn't enable …
Yes, my understanding is that …
Ah I see, you want to set a short LifeTime for these pending pods and not evict many running pods because of that. Sounds good, I understand now. Thanks for clarifying!
It is not uncommon that pods get scheduled on nodes that are not able to start them.
For example, a node may have network issues and be unable to mount a networked persistent volume, or it cannot pull a docker image, or it has some docker configuration issue which is seen only on container startup.
Another common issue is when a container gets restarted by its liveness check because of some local node issue (e.g. wrong routing table, slow storage, network latency or packet drop). In that case, the pod is unhealthy most of the time and hangs in a restart state forever without a chance of being migrated to another node.
As of now, there is no possibility to re-schedule pods with faulty containers. It may be helpful to introduce two new strategies (a hypothetical configuration sketch follows below):

- …: deschedule a pod if it was not ready for $notReadyPeriod seconds and one of its containers was restarted $maxRestartCount times.
- …: deschedule a pod if it failed to start within $maxStartupTime seconds.

A similar issue is filed against kubernetes: kubernetes/kubernetes#13385
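A purely hypothetical sketch of what such strategies could look like as DeschedulerPolicy configuration; the strategy names (RestartingTooOften, FailedToStart) and parameter names below are invented here to mirror the $notReadyPeriod/$maxRestartCount/$maxStartupTime parameters in the proposal and are not part of the descheduler API:

```yaml
---
# Hypothetical sketch only: strategy and parameter names are illustrative,
# not existing descheduler strategies.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RestartingTooOften":            # evict pods that keep crashing on a node
    enabled: true
    params:
      notReadyPeriodSeconds: 600   # $notReadyPeriod from the proposal
      maxRestartCount: 5           # $maxRestartCount from the proposal
  "FailedToStart":                 # evict pods that never managed to start
    enabled: true
    params:
      maxStartupTimeSeconds: 300   # $maxStartupTime from the proposal
```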