
nodeFit flag should be removed from some strategies #845

Closed
a7i opened this issue Jun 9, 2022 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@a7i
Contributor

a7i commented Jun 9, 2022

What version of descheduler are you using?

descheduler version: v0.24.0

The nodeFit flag does not make much sense in the following strategies:

  1. RemovePodsViolatingTopologySpreadConstraint - This strategy calls nodeutil.PodFitsAnyOtherNode twice. Once when calculating topology domains and once more during eviction. As a result, it's always enabled.

  2. RemoveFailedPods - Failed Pods can be scheduled back onto the same node they were previously on, so checking whether a Pod fits on any other node doesn't make much sense; the intent is to give the Pod another shot at getting back into a Running state.
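Both points hinge on what the nodeFit check actually decides. As a rough mental model, here is a minimal sketch in Go of a nodeFit-style check; podFitsAnyOtherNode below is a hypothetical, simplified stand-in (it only compares requests against allocatable capacity and skips the pod's current node), not the descheduler's real nodeutil.PodFitsAnyOtherNode, which also accounts for taints, affinity, and actual node utilization.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podFitsAnyOtherNode is a simplified, illustrative stand-in for the real check:
// it decides whether the pod could land on some schedulable node other than the
// one it is currently bound to, based purely on requests vs. allocatable.
func podFitsAnyOtherNode(pod *v1.Pod, nodes []*v1.Node) bool {
	// Sum up the pod's resource requests across containers.
	requests := v1.ResourceList{}
	for _, c := range pod.Spec.Containers {
		for name, qty := range c.Resources.Requests {
			total := requests[name]
			total.Add(qty)
			requests[name] = total
		}
	}
	for _, node := range nodes {
		if node.Name == pod.Spec.NodeName || node.Spec.Unschedulable {
			continue // nodeFit only cares about *other*, schedulable nodes
		}
		fits := true
		for name, qty := range requests {
			alloc, ok := node.Status.Allocatable[name]
			if !ok || alloc.Cmp(qty) < 0 {
				fits = false
				break
			}
		}
		if fits {
			return true
		}
	}
	return false
}

func main() {
	pod := &v1.Pod{}
	pod.Spec.NodeName = "node-a"
	pod.Spec.Containers = []v1.Container{{
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("500m")},
		},
	}}
	nodeB := &v1.Node{}
	nodeB.Name = "node-b"
	nodeB.Status.Allocatable = v1.ResourceList{v1.ResourceCPU: resource.MustParse("2")}

	// true: the pod's requests fit on node-b, which is not its current node.
	fmt.Println(podFitsAnyOtherNode(pod, []*v1.Node{nodeB}))
}
```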

@a7i a7i added the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2022
@ingvagabund
Contributor

RemovePodsViolatingTopologySpreadConstraint - This strategy calls nodeutil.PodFitsAnyOtherNode twice. Once when calculating topology domains and once more during eviction. As a result, it's always enabled.

nodeFit is an eviction-targeted feature. It's acceptable to call nodeutil.PodFitsAnyOtherNode in a strategy as part of its decision making. Even though it may seem like a duplicate, it's a valid case to mark pods for eviction which are meant to be "re-scheduled" to a different node while actually allowing them to be scheduled on the same node during the pre-eviction checks. In this case, invoking nodeutil.PodFitsAnyOtherNode through nodeFit=true|false produces a no-op, as the case of scheduling a pod onto the same node is already ruled out.
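To make the "no-op" point concrete, here is a small illustration with hypothetical names (not descheduler code): if the strategy itself already filters its candidates with the same "fits on another node" predicate, running that predicate again as the nodeFit pre-eviction check cannot reject anything new, so the evicted set is identical for nodeFit=true and nodeFit=false.

```go
package main

import "fmt"

type pod struct {
	name          string
	fitsElsewhere bool // assume this mirrors the PodFitsAnyOtherNode verdict
}

func main() {
	pods := []pod{
		{name: "a", fitsElsewhere: true},
		{name: "b", fitsElsewhere: false},
	}

	// Strategy phase: only pods that already fit on another node are nominated.
	var candidates []pod
	for _, p := range pods {
		if p.fitsElsewhere {
			candidates = append(candidates, p)
		}
	}

	for _, nodeFit := range []bool{true, false} {
		evicted := 0
		for _, p := range candidates {
			// Pre-eviction phase: with nodeFit=true the same predicate runs again,
			// but every candidate already satisfies it, so nothing changes.
			if nodeFit && !p.fitsElsewhere {
				continue
			}
			evicted++
		}
		fmt.Printf("nodeFit=%v -> evicted %d pod(s)\n", nodeFit, evicted)
	}
}
```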

RemoveFailedPods - Failed Pods can be scheduled back onto the same node they were previously on, so checking whether a Pod fits on any other node doesn't make much sense; the intent is to give the Pod another shot at getting back into a Running state.

It is valid to allow any evicted pod to be re-schedulable to any node, including the same one. It's also valid to require any evicted pod to be re-schedulable to any node but the current one. From the kube-scheduler's perspective this makes no difference. However, a pod can be failing for reasons induced by the current node (a networking issue, a hardware issue). So it may be more beneficial to try to re-schedule the pod to another node.

@knelasevero
Contributor

knelasevero commented Jun 10, 2022

I didn't fully understand the first one. To me it still makes sense to remove the field. I might still be missing context about pre-eviction and eviction steps.

On the second one, I don't fully agree. Looking at it from the user's/admin's perspective: if the pod is failing, and I set nodeFit=true, and other nodes don't have capacity available, this pod will not get evicted, right? I think it makes sense to [avoid evicting] a pod if it is at least running, from any user's perspective. I understand your reasoning of [forcing it to go to other nodes] if what is making it fail is the current node, but nodeFit=true does not seem to help here either: if a pod is failing (with nodeFit=false) and we evict it, the scheduler will try to re-schedule it, and if it schedules the pod onto the same node and that node is the culprit, the descheduler will evict it again until it lands on a node where it can run. Avoiding the eviction does not seem to help anyone.

@ingvagabund
Contributor

but nodeFit=true does not seem to help here either: if a pod is failing (with nodeFit=false) and we evict it, the scheduler will try to re-schedule it, and if it schedules the pod onto the same node and that node is the culprit, the descheduler will evict it again until it lands on a node where it can run. Avoiding the eviction does not seem to help anyone.

I agree this can happen. We cannot predict how the scheduler reacts here; it's all a best-effort guess. That reduces the issue in question to: "do we want to allow users to have the strategy require a different node to be available for scheduling, or not?"

@ingvagabund
Contributor

ingvagabund commented Jun 10, 2022

I didn't fully understand the first one. To me it still makes sense to remove the field. I might still be missing context about pre-eviction and eviction steps.

Marking pods for eviction is the core functionality of a strategy; in theory every pod can be marked for eviction. The pre-eviction step, on the other hand, checks whether a pod can actually be evicted (e.g. its priority class is lower than a specified threshold, the pod is backed by a controller, ...). You can see the marking-for-eviction step as "I have some rule which I apply to select the pods that I think need to be removed", and the pre-eviction step as "I was given a list of pods nominated for eviction, but there are system limitations which the descheduler has to enforce so that evictions don't break system contracts".
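A minimal sketch of that two-step flow, using hypothetical types rather than the descheduler's actual API: a strategy nominates pods, and a separate set of pre-eviction filters (priority threshold, controller backing, nodeFit, ...) decides which nominations may actually be evicted.

```go
package main

import (
	"errors"
	"fmt"
)

type pod struct {
	name        string
	priority    int32
	hasOwnerRef bool
}

// preEvictionFilter returns nil if the pod may be evicted, or an error
// explaining which system contract would be violated.
type preEvictionFilter func(p pod) error

// evictNominated applies every pre-eviction filter to each nominated pod and
// only "evicts" the pods that pass all of them.
func evictNominated(nominated []pod, filters []preEvictionFilter) {
	for _, p := range nominated {
		blocked := false
		for _, f := range filters {
			if err := f(p); err != nil {
				fmt.Printf("skipping %s: %v\n", p.name, err)
				blocked = true
				break
			}
		}
		if !blocked {
			fmt.Printf("evicting %s\n", p.name)
		}
	}
}

func main() {
	// Step 1: a strategy has already applied its own rule and nominated these pods.
	nominated := []pod{
		{name: "web-1", priority: 0, hasOwnerRef: true},
		{name: "system-agent", priority: 2000, hasOwnerRef: true},
		{name: "orphan", priority: 0, hasOwnerRef: false},
	}

	// Step 2: pre-eviction checks, e.g. a priority threshold and controller backing.
	filters := []preEvictionFilter{
		func(p pod) error {
			if p.priority >= 1000 {
				return errors.New("priority above threshold")
			}
			return nil
		},
		func(p pod) error {
			if !p.hasOwnerRef {
				return errors.New("not backed by a controller")
			}
			return nil
		},
	}

	evictNominated(nominated, filters)
}
```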

@knelasevero
Contributor

knelasevero commented Jun 10, 2022

I agree this can happen. We cannot predict how the scheduler reacts here; it's all a best-effort guess. That reduces the issue in question to: "do we want to allow users to have the strategy require a different node to be available for scheduling, or not?"

I think my initial reasoning was that users would look at this option and basically never use it (thinking that if a pod is failing, they don't want to avoid evicting it at all), but now, thinking a bit more about it, they might use it if they want to avoid throwing unnecessary work at the scheduler (and even the descheduler). Then I would say these are the two things to ponder:

  • Pods are failing, so there isn't a good enough reason to avoid evicting them. Just keep evicting them until we get a good node.

  • Or, we actually have a good enough reason to avoid evicting them: reducing load on the scheduler/descheduler in a situation where the current node is the culprit of a pod failing (and other nodes are not a fit [yet]).

I am not sure if this second point is a good enough reason to keep the field. Maybe it is, since this is just another option. I wonder if we could survey a few users about this kind of thing.

@a7i
Contributor Author

a7i commented Jun 14, 2022

In the case of RemovePodsViolatingTopologySpreadConstraint, setting nodeFit to true or false will yield the same result, which is confusing and should (at minimum) be documented.

@uniemimu

In the case of RemovePodsViolatingTopologySpreadConstraint, setting nodeFit to true or false will yield the same result, which is confusing and should (at minimum) be documented.

On this, I agree wholeheartedly. Having a boolean for it when the strategy is hard-coded to call NodeFit via PodFitsAnyOtherNode anyway is rather confusing. When I read the documentation of the NodeFit feature, I expected it to check whether some other node can take the pod before evicting. I also really, really hoped the opposite to be true when it is set to false, because there are also use-cases for evicting "no matter what".

However, the code in certain places in the descheduler seems to have the same NodeFit check baked in regardless of the config, RemovePodsViolatingTopologySpreadConstraint being one. The comment in there even says "ignoring pod for eviction as it does not fit on any other node". I'm no expert on that particular strategy, but without studying its details one could think that evictions.go and the NodeFit boolean control would be just the right tool for that part, with the benefit of letting users control it.

One of our use-cases for the descheduler is actually based on telemetry, and some of the telemetry triggers are planned to evict select pods "no matter what", which means we'd really love the descheduler to do what we expect a descheduler to do: evict. And then best-effort mechanisms follow, as usual.

To be clear, I actually like having the boolean control, because I can see both kinds of use-cases being fulfilled with the descheduler. Some want the descheduler to evict as seldom and as little as possible, others as quickly as possible no matter what. Both are valid approaches depending on the use-case, and ideally both would be supported via configurable options. It's just that this NodeFit boolean control doesn't seem to work as expected in some strategies. I'll do my best at offering a fix for the node-affinity strategy, but it seems new issues keep popping up. Ref issues #863 and #640.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 26, 2022
@ingvagabund
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2022
@ingvagabund
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2023
@felipewnp

/remove-lifecycle stale

I could also benefit from this.

In my use case, I have Karpenter managing the nodes.

When a K8s eviction occurs due to a replica/scale-down event and the remaining pods end up on the same node (since Karpenter removed all other nodes), there is only one node available and the descheduler refuses to deschedule/evict the pods.

If the descheduler were able to "forcibly" evict a pod, K8s would try to reschedule it and it would become unschedulable because of the TopologySpreadConstraint.

Karpenter would then kick in and provision another node in another AZ so that K8s could schedule the pod there.

Therefore, High Availability is restored.
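For context, this is roughly the kind of constraint involved in that scenario; the example below is a hedged illustration with made-up values, not taken from the issue. With whenUnsatisfiable: DoNotSchedule, a pod evicted from the last remaining zone stays Pending, which is exactly the signal a node autoscaler such as Karpenter reacts to by provisioning capacity in another AZ.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Example pod (hypothetical names/values) spread across zones, one replica per zone.
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "web-0",
			Labels: map[string]string{"app": "web"},
		},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{Name: "web", Image: "nginx"}},
			TopologySpreadConstraints: []v1.TopologySpreadConstraint{{
				MaxSkew:     1,
				TopologyKey: "topology.kubernetes.io/zone",
				// DoNotSchedule keeps the re-created pod Pending when no zone
				// satisfies the skew, instead of packing it onto the lone node.
				WhenUnsatisfiable: v1.DoNotSchedule,
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "web"},
				},
			}},
		},
	}

	// Print the constraint section as YAML for readability.
	out, err := yaml.Marshal(pod.Spec.TopologySpreadConstraints)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```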

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2023
@ingvagabund ingvagabund removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 29, 2023
@a7i
Contributor Author

a7i commented Oct 27, 2023

this should be closed in favor of #1149

/close

@k8s-ci-robot
Contributor

@a7i: Closing this issue.

In response to this:

this should be closed in favor of #1149

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
