nodeFit flag should be removed from some strategies #845
Comments
nodeFit is an eviction-targeted feature. It's acceptable to call
It is valid to allow any evicted pod to be re-schedulable to any node, including the same one. It's also valid to require any evicted pod to be re-schedulable to any node but the current one. From the kube-scheduler's perspective this makes no difference. However, a pod can be failing for reasons induced by its current node (a networking issue, a hardware issue), so it may be more beneficial to try to re-schedule the pod to another node.
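As a rough illustration of that distinction, here is a minimal Go sketch. The helper names (`fits`, `podFitsAnyNode`, `podFitsAnyOtherNode`) and the CPU-only fit check are hypothetical simplifications, not the descheduler's actual API:

```go
// Illustrative only: simplified helpers sketching the difference between
// "fits any node" and "fits any node other than the current one".
package main

import "fmt"

type node struct {
	name    string
	freeCPU int
}

type pod struct {
	name        string
	currentNode string
	requestCPU  int
}

// fits stands in for a real scheduling-fit check (resources, taints,
// affinity, ...); here it only compares a CPU request against free capacity.
func fits(p pod, n node) bool {
	return p.requestCPU <= n.freeCPU
}

// podFitsAnyNode allows the pod to land anywhere, including its current node.
func podFitsAnyNode(p pod, nodes []node) bool {
	for _, n := range nodes {
		if fits(p, n) {
			return true
		}
	}
	return false
}

// podFitsAnyOtherNode is the stricter variant: the current node is excluded,
// which is what a nodeFit-style check requires.
func podFitsAnyOtherNode(p pod, nodes []node) bool {
	for _, n := range nodes {
		if n.name == p.currentNode {
			continue
		}
		if fits(p, n) {
			return true
		}
	}
	return false
}

func main() {
	nodes := []node{{name: "node-a", freeCPU: 2}, {name: "node-b", freeCPU: 0}}
	p := pod{name: "web-1", currentNode: "node-a", requestCPU: 1}
	fmt.Println(podFitsAnyNode(p, nodes))      // true: node-a still has room
	fmt.Println(podFitsAnyOtherNode(p, nodes)) // false: only node-b is left and it is full
}
```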
I didn't fully understand the first one. To me it still makes sense to remove the field; I might still be missing context about the pre-eviction and eviction steps. On the second one, I don't fully agree. From a user's/admin's perspective: if the pod is failing, I set nodeFit=true, and other nodes don't have capacity available, then this pod will not get evicted, right? I think it makes sense to [avoid evicting] a pod if it is at least running, from any user's perspective. I understand your reasoning of [forcing it to go to other nodes] if what is making it fail is the current node, but nodeFit=true does not seem to help there either: if a pod is failing (with nodeFit=false) and we evict it, the scheduler will try to re-schedule it, and if it lands on the same node and the node is the culprit, the descheduler will evict it again, until it lands on a node where it can run. Avoiding the eviction does not seem to help anyone.
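To make that loop concrete, here is a toy Go simulation of the scenario described above; the node names, the culprit node, and the random "scheduler pick" are invented for illustration and do not reflect real scheduler behavior:

```go
// Toy simulation: the descheduler evicts a failing pod, the scheduler may
// re-place it on the same (culprit) node, and the cycle repeats until the
// pod lands on a healthy node. Entirely hypothetical logic.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	culprit := "node-a" // the node that makes the pod fail

	for round := 1; ; round++ {
		current := nodes[rand.Intn(len(nodes))] // scheduler's pick after eviction
		fmt.Printf("round %d: pod scheduled on %s\n", round, current)
		if current != culprit {
			fmt.Println("pod is running; loop ends")
			break
		}
		fmt.Println("pod fails again on the culprit node; descheduler evicts once more")
	}
}
```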
I agree this can happen. We cannot predict how the scheduler reacts here; it's all a best-effort guess. Which reduces the issue in question to: "do we want to allow users to have the strategy require a different node to be available for scheduling, or not?"
Marking pods for eviction is the core functionality of a strategy. In theory, every pod can be marked for eviction. The pre-eviction step, by contrast, checks whether a pod can actually be evicted (e.g. its priority class is lower than a specified threshold, the pod is backed by a controller, ...). You can see the marking-for-eviction step as "I have some rule which I apply to select the pods I think need to be removed", and the pre-eviction step as "I was given a list of pods nominated for eviction, but there are system limitations the descheduler has to enforce to avoid breaking some system contracts".
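A minimal sketch of that two-phase split, with entirely hypothetical types, rules, and thresholds (not the descheduler's real code), might look like this:

```go
// Illustrative only: phase 1 marks candidates by a strategy-specific rule,
// phase 2 applies system-wide pre-eviction limitations.
package main

import "fmt"

type podInfo struct {
	name     string
	failed   bool
	priority int
	hasOwner bool
}

// Phase 1: the strategy nominates pods according to its own rule
// ("pods I think need to be removed"), e.g. every failed pod.
func markForEviction(pods []podInfo) []podInfo {
	var candidates []podInfo
	for _, p := range pods {
		if p.failed {
			candidates = append(candidates, p)
		}
	}
	return candidates
}

// Phase 2: the pre-eviction filter enforces system limitations
// (priority threshold, pods not backed by a controller, ...).
func canEvict(p podInfo, priorityThreshold int) bool {
	if p.priority >= priorityThreshold {
		return false // too important to evict
	}
	if !p.hasOwner {
		return false // nothing would recreate the pod
	}
	return true
}

func main() {
	pods := []podInfo{
		{name: "batch-1", failed: true, priority: 0, hasOwner: true},
		{name: "ctrl-plane", failed: true, priority: 1000, hasOwner: true},
	}
	for _, p := range markForEviction(pods) {
		fmt.Printf("%s evictable: %v\n", p.name, canEvict(p, 100))
	}
}
```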
I think my initial reasoning was that users would look at this option and basically never use it (thinking that if a pod is failing, they don't want to avoid evicting it at all). But now, thinking a bit more about it, they might use it if they want to avoid throwing unnecessary work at the scheduler (and even the descheduler). Then I would say that these are the two things to ponder:
I am not sure the second point is a good enough reason to keep the field. Maybe it is, since it is just another option. I wonder if we could survey a few users about this kind of thing.
In case of
On this, I agree wholeheartedly. Having a boolean for it, if it is anyways hard-coded to call

However, the code in certain places in the descheduler seems to have the same

One of our use-cases for the descheduler is actually based on telemetry, and some of the telemetry triggers are planned to evict select pods "no matter what", which means we'd really love the descheduler to do what we expect a descheduler to do: evict. And then the best-effort mechanisms follow, as usual.

To make it clear, I actually like having the boolean control, because I can understand both kinds of use-cases being fulfilled by the descheduler. Some want the descheduler to evict as seldom and as little as possible, and others as quickly as possible no matter what. Both are valid approaches depending on the use-case, and in the best case both would be supported via configurable options. It's just that this NodeFit boolean control doesn't seem to work as expected in some strategies.

I'll do my best at offering a fix for the node-affinity strategy, but it seems new issues keep popping up.

Ref issues #863 and #640
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
/remove-lifecycle stale

I also could benefit from this. In my use case, I have Karpenter managing the nodes. When a Kubernetes eviction occurs from a replica/scale-down event and the pods that were not evicted end up on the same node (since Karpenter removed all other nodes), there is only one node available and the descheduler refuses to deschedule/evict the pods. If the descheduler were able to "forcibly" evict the pod, Kubernetes would try to reschedule it and the pod would become unschedulable because of the TopologySpreadConstraint. Karpenter would then kick in and provision another node in another AZ, so Kubernetes could place the pod there. Therefore, high availability is restored.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
this should be closed in favor of #1149

/close
@a7i: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What version of descheduler are you using?
descheduler version: v0.24.0
The `nodeFit` flag doesn't make much sense in the following strategies:

- `RemovePodsViolatingTopologySpreadConstraint` - This strategy calls `nodeutil.PodFitsAnyOtherNode` twice: once when calculating topology domains and once more during eviction. As a result, it's always enabled.
- `RemoveFailedPods` - Failed pods can be scheduled back onto the same node they were previously on, and checking whether the pod fits on any other node doesn't make much sense, as the intent is to give the pod another shot to see if it can go back into a running state.
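For illustration, here is a heavily simplified Go sketch of why the flag is redundant in the topology-spread case: by the time a pod reaches eviction it has already passed a "fits on another node" style check, so the nodeFit toggle cannot change the outcome. All names and the `fitsElsewhere` boolean shortcut are hypothetical and do not mirror the real strategy code:

```go
// Illustrative only: a two-step shape where the strategy pre-filters by fit
// and a nodeFit-style pre-eviction check then repeats the very same test.
package main

import "fmt"

type candidate struct {
	pod           string
	fitsElsewhere bool // result of a "fits on any other node" style check
}

// balanceDomains mimics phase 1: while picking which pods to move between
// topology domains, the strategy already skips pods that cannot fit anywhere else.
func balanceDomains(candidates []candidate) []candidate {
	var selected []candidate
	for _, c := range candidates {
		if c.fitsElsewhere {
			selected = append(selected, c)
		}
	}
	return selected
}

// evict mimics phase 2: a nodeFit-style pre-eviction filter repeats the same
// test, so toggling it off changes nothing for this strategy.
func evict(c candidate, nodeFit bool) bool {
	if nodeFit && !c.fitsElsewhere {
		return false
	}
	return true
}

func main() {
	pods := []candidate{{pod: "web-1", fitsElsewhere: true}, {pod: "web-2", fitsElsewhere: false}}
	for _, c := range balanceDomains(pods) {
		// every pod that reaches eviction already passed the fit check,
		// so the nodeFit branch can never reject it here
		fmt.Printf("%s evicted: %v (nodeFit=false would give the same result)\n", c.pod, evict(c, true))
	}
}
```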