
nodeFit flag should be removed from some strategies #845

Closed
a7i opened this issue Jun 9, 2022 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@a7i
Contributor

a7i commented Jun 9, 2022

What version of descheduler are you using?

descheduler version: v0.24.0

The nodeFit flag does not make much sense in the following strategies:

  1. RemovePodsViolatingTopologySpreadConstraint - This strategy calls nodeutil.PodFitsAnyOtherNode twice. Once when calculating topology domains and once more during eviction. As a result, it's always enabled.

  2. RemoveFailedPods - Failed Pods can be scheduled back onto the same node they were previously on, so checking whether a Pod fits on any other node doesn't make much sense; the intent is to give the Pod another shot at getting back into a Running state.
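Both points hinge on what the nodeFit check actually decides. As a rough mental model, here is a minimal sketch in Go of a nodeFit-style check; podFitsAnyOtherNode below is a hypothetical, simplified stand-in (it only compares requests against allocatable capacity and skips the pod's current node), not the descheduler's real nodeutil.PodFitsAnyOtherNode, which also accounts for taints, affinity, and actual node utilization.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podFitsAnyOtherNode is a simplified, illustrative stand-in for the real check:
// it decides whether the pod could land on some schedulable node other than the
// one it is currently bound to, based purely on requests vs. allocatable.
func podFitsAnyOtherNode(pod *v1.Pod, nodes []*v1.Node) bool {
	// Sum up the pod's resource requests across containers.
	requests := v1.ResourceList{}
	for _, c := range pod.Spec.Containers {
		for name, qty := range c.Resources.Requests {
			total := requests[name]
			total.Add(qty)
			requests[name] = total
		}
	}
	for _, node := range nodes {
		if node.Name == pod.Spec.NodeName || node.Spec.Unschedulable {
			continue // nodeFit only cares about *other*, schedulable nodes
		}
		fits := true
		for name, qty := range requests {
			alloc, ok := node.Status.Allocatable[name]
			if !ok || alloc.Cmp(qty) < 0 {
				fits = false
				break
			}
		}
		if fits {
			return true
		}
	}
	return false
}

func main() {
	pod := &v1.Pod{}
	pod.Spec.NodeName = "node-a"
	pod.Spec.Containers = []v1.Container{{
		Resources: v1.ResourceRequirements{
			Requests: v1.ResourceList{v1.ResourceCPU: resource.MustParse("500m")},
		},
	}}
	nodeB := &v1.Node{}
	nodeB.Name = "node-b"
	nodeB.Status.Allocatable = v1.ResourceList{v1.ResourceCPU: resource.MustParse("2")}

	// true: the pod's requests fit on node-b, which is not its current node.
	fmt.Println(podFitsAnyOtherNode(pod, []*v1.Node{nodeB}))
}
```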

@a7i a7i added the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2022
@ingvagabund
Contributor

RemovePodsViolatingTopologySpreadConstraint - This strategy calls nodeutil.PodFitsAnyOtherNode twice. Once when calculating topology domains and once more during eviction. As a result, it's always enabled.

nodeFit is an eviction-targeted feature. It's acceptable to call nodeutil.PodFitsAnyOtherNode in a strategy as part of its decision making. Even though it may seem like a duplicate, it's a valid case to mark pods for eviction which are meant to be "re-scheduled" to a different node while actually allowing them to be scheduled on the same node during the pre-eviction checks. In this case, invoking nodeutil.PodFitsAnyOtherNode through nodeFit=true|false produces a no-op, as the case of scheduling a pod onto the same node is already ruled out.
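To make the "no-op" point concrete, here is a small illustration with hypothetical names (not descheduler code): if the strategy itself already filters its candidates with the same "fits on another node" predicate, running that predicate again as the nodeFit pre-eviction check cannot reject anything new, so the evicted set is identical for nodeFit=true and nodeFit=false.

```go
package main

import "fmt"

type pod struct {
	name          string
	fitsElsewhere bool // assume this mirrors the PodFitsAnyOtherNode verdict
}

func main() {
	pods := []pod{
		{name: "a", fitsElsewhere: true},
		{name: "b", fitsElsewhere: false},
	}

	// Strategy phase: only pods that already fit on another node are nominated.
	var candidates []pod
	for _, p := range pods {
		if p.fitsElsewhere {
			candidates = append(candidates, p)
		}
	}

	for _, nodeFit := range []bool{true, false} {
		evicted := 0
		for _, p := range candidates {
			// Pre-eviction phase: with nodeFit=true the same predicate runs again,
			// but every candidate already satisfies it, so nothing changes.
			if nodeFit && !p.fitsElsewhere {
				continue
			}
			evicted++
		}
		fmt.Printf("nodeFit=%v -> evicted %d pod(s)\n", nodeFit, evicted)
	}
}
```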

RemoveFailedPods - Failed Pods can be scheduled back onto the same node they were previously on, so checking whether a Pod fits on any other node doesn't make much sense; the intent is to give the Pod another shot at getting back into a Running state.

It is valid to allow any evicted pod to be re-schedulable to any node, including the same one. It's also valid to require any evicted pod to be re-schedulable to any node but the current one. From the kube-scheduler's perspective this makes no difference. However, a pod can be failing for reasons induced by the current node (a networking issue, a hardware issue). So it may be more beneficial to try to re-schedule the pod to another node.

@knelasevero
Contributor

knelasevero commented Jun 10, 2022

I didn't fully understand the first one. To me it still makes sense to remove the field. I might still be missing context about pre-eviction and eviction steps.

On the second one, I don't fully agree. Looking at it from the user's/admin's perspective: if the pod is failing, and I set nodeFit=true, and other nodes don't have capacity available, this pod will not get evicted, right? I think it makes sense to [avoid evicting] a pod if it is at least running, from any user's perspective. I understand your reasoning of [forcing it to go to other nodes] if what is making it fail is the current node, but nodeFit=true does not seem to help here either: if a pod is failing (with nodeFit=false) and we evict it, the scheduler will try to re-schedule it, and if it schedules the pod onto the same node and that node is the culprit, the descheduler will evict it again until it lands on a node where it can run. Avoiding the eviction does not seem to help anyone.

@ingvagabund
Contributor

but nodeFit=true does not seem to help here either: if a pod is failing (with nodeFit=false) and we evict it, the scheduler will try to re-schedule it, and if it schedules the pod onto the same node and that node is the culprit, the descheduler will evict it again until it lands on a node where it can run. Avoiding the eviction does not seem to help anyone.

I agree this can happen. We cannot predict how the scheduler reacts here; it's all a best-effort guess. That reduces the issue in question to: "do we want to allow users to have the strategy require a different node to be available for scheduling, or not?"

@ingvagabund
Contributor

ingvagabund commented Jun 10, 2022

I didn't fully understand the first one. To me it still makes sense to remove the field. I might still be missing context about pre-eviction and eviction steps.

Marking pods for eviction is the core functionality of a strategy; in theory every pod can be marked for eviction. The pre-eviction step, on the other hand, checks whether a pod can actually be evicted (e.g. its priority class is lower than a specified threshold, the pod is backed by a controller, ...). You can see the marking-for-eviction step as "I have some rule which I apply to select the pods that I think need to be removed", and the pre-eviction step as "I was given a list of pods nominated for eviction, but there are system limitations which the descheduler has to enforce so that evictions don't break system contracts".
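A minimal sketch of that two-step flow, using hypothetical types rather than the descheduler's actual API: a strategy nominates pods, and a separate set of pre-eviction filters (priority threshold, controller backing, nodeFit, ...) decides which nominations may actually be evicted.

```go
package main

import (
	"errors"
	"fmt"
)

type pod struct {
	name        string
	priority    int32
	hasOwnerRef bool
}

// preEvictionFilter returns nil if the pod may be evicted, or an error
// explaining which system contract would be violated.
type preEvictionFilter func(p pod) error

// evictNominated applies every pre-eviction filter to each nominated pod and
// only "evicts" the pods that pass all of them.
func evictNominated(nominated []pod, filters []preEvictionFilter) {
	for _, p := range nominated {
		blocked := false
		for _, f := range filters {
			if err := f(p); err != nil {
				fmt.Printf("skipping %s: %v\n", p.name, err)
				blocked = true
				break
			}
		}
		if !blocked {
			fmt.Printf("evicting %s\n", p.name)
		}
	}
}

func main() {
	// Step 1: a strategy has already applied its own rule and nominated these pods.
	nominated := []pod{
		{name: "web-1", priority: 0, hasOwnerRef: true},
		{name: "system-agent", priority: 2000, hasOwnerRef: true},
		{name: "orphan", priority: 0, hasOwnerRef: false},
	}

	// Step 2: pre-eviction checks, e.g. a priority threshold and controller backing.
	filters := []preEvictionFilter{
		func(p pod) error {
			if p.priority >= 1000 {
				return errors.New("priority above threshold")
			}
			return nil
		},
		func(p pod) error {
			if !p.hasOwnerRef {
				return errors.New("not backed by a controller")
			}
			return nil
		},
	}

	evictNominated(nominated, filters)
}
```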

@knelasevero
Contributor

knelasevero commented Jun 10, 2022

I agree this can happen. We cannot predict how the scheduler reacts here; it's all a best-effort guess. That reduces the issue in question to: "do we want to allow users to have the strategy require a different node to be available for scheduling, or not?"

I think my initial reasoning was that users would look at this option and basically never use it (thinking that if a pod is failing, they don't want to avoid evicting it at all), but now, thinking a bit more about it, they might use it if they want to avoid throwing unnecessary work at the scheduler (and even the descheduler). Then I would say these are the two things to ponder:

  • Pods are failing, so there isn't a good enough reason to avoid evicting them. Just keep evicting them until we get a good node.

  • Or, we actually have a good enough reason to avoid evicting them: reducing load on the scheduler/descheduler in a situation where the current node is the culprit of a pod failing (and other nodes are not a fit [yet]).

I am not sure if this second point is a good enough reason to keep the field. Maybe it is, since this is just another option. I wonder if we could survey a few users about this kind of thing.

@a7i
Contributor Author

a7i commented Jun 14, 2022

In the case of RemovePodsViolatingTopologySpreadConstraint, setting nodeFit to true or false will yield the same result, which is confusing and should (at minimum) be documented.

@uniemimu

In the case of RemovePodsViolatingTopologySpreadConstraint, setting nodeFit to true or false will yield the same result, which is confusing and should (at minimum) be documented.

On this, I agree wholeheartedly. Having a boolean for it when the strategy is hard-coded to call NodeFit via PodFitsAnyOtherNode anyway is rather confusing. When I read the documentation of the NodeFit feature, I expected it to check whether some other node can take the pod before evicting. I also really, really hoped the opposite to be true when it is set to false, because there are also use-cases for evicting "no matter what".

However, the code in certain places in the descheduler seems to have the same NodeFit check baked in regardless of the config, RemovePodsViolatingTopologySpreadConstraint being one. The comment in there even says "ignoring pod for eviction as it does not fit on any other node". I'm no expert on that particular strategy, but without studying its details one could think that evictions.go and the NodeFit boolean control would be just the right tool for that part, with the benefit of letting users control it.

One of our use-cases for the descheduler is actually based on telemetry, and some of the telemetry triggers are planned to evict select pods "no matter what", which means we'd really love the descheduler to do what we expect a descheduler to do: evict. And then best-effort mechanisms follow, as usual.

To be clear, I actually like having the boolean control, because I can see both kinds of use-cases being fulfilled with the descheduler. Some want the descheduler to evict as seldom and as little as possible, others as quickly as possible no matter what. Both are valid approaches depending on the use-case, and ideally both would be supported via configurable options. It's just that this NodeFit boolean control doesn't seem to work as expected in some strategies. I'll do my best at offering a fix for the node-affinity strategy, but it seems new issues keep popping up. Ref issues #863 and #640.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 26, 2022
@ingvagabund
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 26, 2022
@ingvagabund
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2023
@felipewnp

/remove-lifecycle stale

I could also benefit from this.

In my use case, I have Karpenter managing the nodes.

When a K8s eviction occurs due to a replica/scale-down event and the remaining pods end up on the same node (since Karpenter removed all other nodes), there is only one node available and the descheduler refuses to deschedule/evict the pods.

If the descheduler were able to "forcibly" evict a pod, K8s would try to reschedule it and it would become unschedulable because of the TopologySpreadConstraint.

Karpenter would then kick in and provision another node in another AZ so that K8s could schedule the pod there.

Therefore, High Availability is restored.
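For context, this is roughly the kind of constraint involved in that scenario; the example below is a hedged illustration with made-up values, not taken from the issue. With whenUnsatisfiable: DoNotSchedule, a pod evicted from the last remaining zone stays Pending, which is exactly the signal a node autoscaler such as Karpenter reacts to by provisioning capacity in another AZ.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Example pod (hypothetical names/values) spread across zones, one replica per zone.
	pod := v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "web-0",
			Labels: map[string]string{"app": "web"},
		},
		Spec: v1.PodSpec{
			Containers: []v1.Container{{Name: "web", Image: "nginx"}},
			TopologySpreadConstraints: []v1.TopologySpreadConstraint{{
				MaxSkew:     1,
				TopologyKey: "topology.kubernetes.io/zone",
				// DoNotSchedule keeps the re-created pod Pending when no zone
				// satisfies the skew, instead of packing it onto the lone node.
				WhenUnsatisfiable: v1.DoNotSchedule,
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "web"},
				},
			}},
		},
	}

	// Print the constraint section as YAML for readability.
	out, err := yaml.Marshal(pod.Spec.TopologySpreadConstraints)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```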

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 25, 2023
@ingvagabund ingvagabund removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 29, 2023
@a7i
Contributor Author

a7i commented Oct 27, 2023

this should be closed in favor of #1149

/close

@k8s-ci-robot
Contributor

@a7i: Closing this issue.

In response to this:

this should be closed in favor of #1149

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
