
nodeFit = false doesn't work as expected with RemovePodsViolatingNodeAffinity #640

Open
uniemimu opened this issue Oct 5, 2021 · 34 comments
Labels: kind/bug, lifecycle/frozen

Comments

@uniemimu commented Oct 5, 2021

What version of descheduler are you using?

descheduler version: 0.22

Does this issue reproduce with the latest release?
Yes

Which descheduler CLI options are you using?

        - "--policy-config-file"
        - "/policy-dir/policy.yaml"
        - "--descheduling-interval"
        - "30s"
        - "--v"
        - "4"

Please provide a copy of your descheduler policy config file

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemovePodsViolatingNodeAffinity":
        enabled: true
        params:
          nodeAffinityType:
            - "requiredDuringSchedulingIgnoredDuringExecution"
          labelSelector:
            matchExpressions:
              - {key: "foo", operator: In, values: [bar]}
          nodeFit: false

What k8s version are you using (kubectl version)?

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T21:01:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.9", GitCommit:"7a576bc3935a6b555e33346fd73ad77c925e9e4a", GitTreeState:"clean", BuildDate:"2021-07-15T20:56:38Z", GoVersion:"go1.15.14", Compiler:"gc", Platform:"linux/amd64"}

What did you do?

I created a deployment with a node selector forcing the pods to land on a specific single node. The deployment pod template had a node affinity rule, which was not triggered during scheduling. After scheduling, I changed the node labels so that the affinity should trigger descheduling, and labeled the pods with the required foo=bar label.
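For reference, a minimal sketch of the kind of deployment described above; the node name, affinity label key, and image are illustrative placeholders rather than my exact manifest. Initially node-1 does not carry the example.com/unhealthy label, so the required affinity term is satisfied; labeling node-1 with example.com/unhealthy=true afterwards violates the term, while the nodeSelector leaves the pod with nowhere else to fit:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: affinity-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: affinity-demo
  template:
    metadata:
      labels:
        app: affinity-demo
        foo: bar                           # matched by the strategy's labelSelector
    spec:
      nodeSelector:
        kubernetes.io/hostname: node-1     # pins the pod to one specific node
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: example.com/unhealthy # illustrative label key
                operator: NotIn
                values:
                - "true"
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9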

What did you expect to see?
I expected the descheduler to evict my pod, and the pod to end up in Pending state, as there is no node where it could fit.

What did you see instead?
The descheduler prints that the pod will not fit on another node, but it does not evict it.

The culprit is in the strategy, at node_affinity.go line 89:

nodeutil.PodFitsAnyNode(pod, nodes)

The node affinity strategy always checks whether the pod fits some node, regardless of how nodeFit is set. The documentation of the nodeFit filtering suggests that the default is not to check for node fit:

descheduler/README.md

Lines 703 to 707 in 5b55794

If set to `true` the descheduler will consider whether or not the pods that meet eviction criteria will fit on other nodes before evicting them. If a pod cannot be rescheduled to another node, it will not be evicted. Currently the following criteria are considered when setting `nodeFit` to `true`:
- A `nodeSelector` on the pod
- Any `Tolerations` on the pod and any `Taints` on the other nodes
- `nodeAffinity` on the pod
- Whether any of the other nodes are marked as `unschedulable`

It would be logical that when nodeFit=false, the pod is evicted regardless of whether it fits somewhere or not. A one-line change would fix this; line 89 could, for example, be changed to something like:
(!nodeFit || nodeutil.PodFitsAnyNode(pod, nodes))
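For illustration, here is the gating logic as a self-contained sketch. The fit check is stubbed with a plain boolean so the control flow can run on its own; in the real strategy that value comes from nodeutil.PodFitsAnyNode(pod, nodes):

package main

import "fmt"

// shouldEvict sketches the proposed condition: with nodeFit=false the
// fit check is skipped entirely, which is the behavior this issue asks
// for; with nodeFit=true the current behavior is preserved.
func shouldEvict(violatesAffinity, nodeFit, fitsAnyNode bool) bool {
	return violatesAffinity && (!nodeFit || fitsAnyNode)
}

func main() {
	// nodeFit=false: evict even though no other node fits.
	fmt.Println(shouldEvict(true, false, false)) // true
	// nodeFit=true: require a feasible node before evicting.
	fmt.Println(shouldEvict(true, true, false)) // false
}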

@uniemimu added the kind/bug label Oct 5, 2021.
@damemi (Contributor) commented Oct 5, 2021

/cc @RyanDevlin (if you have the time)
The description seems reasonable to me at first glance; do you have any objections? You know the NodeFit code best.

@RyanDevlin (Contributor) commented
@damemi Sure, I can take a look. If I had to guess, this is because not all strategies are aligned to the same NodeFit considerations. PR #636 should fix this, but I just noticed something I forgot to change. Let me report back here in a bit once I have a chance to test something.

@RyanDevlin (Contributor) commented
/assign

@RyanDevlin (Contributor) commented
@damemi @uniemimu Reporting back... this is now fully fixed in #636.

@uniemimu (Author) commented Oct 7, 2021

@RyanDevlin I verified that your patches fix this.

@k8s-triage-robot commented
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 5, 2022.
@uniemimu (Author) commented Jan 5, 2022

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jan 5, 2022.
@uniemimu (Author) commented
Since @RyanDevlin is busy and #636 isn't moving overly fast, may I propose an alternative approach for making sure this nearly six-month-old bug gets fixed:

  1. I push my one-liner fix into a PR, and it is merged by e.g. @damemi.
  2. Ryan later rebases on top of it, and makes my one-liner go away. I don't mind, and that rebase isn't going to be difficult.

@damemi (Contributor) commented Feb 14, 2022

Hi @uniemimu,
I'm very sorry about the delay. Kubernetes and its subprojects (like this one) are maintained by a significant number of volunteers, many of whom review issues and PRs outside of their day jobs. This can lead to some issues being unfortunately neglected for some time. It's always okay in this case to kindly ping someone to check on the status of an issue that is affecting you, otherwise some things (like this one) can get buried under the many other topics that come up. Thank you for understanding this.

Regarding your suggestion, this change may be more complicated than the one-liner you proposed. While it would resolve this issue, it would also change the current default behavior of the NodeAffinity strategy. Since this strategy has been implemented to not evict pods that don't fit on another node by default, users may be operating under that assumption.

This actually brings up a good point for #636, since we default NodeFit to false (with the same idea of preserving default behavior, though with a feature-level implementation like this we can give users appropriate notice before eventually making this the new default). Defaulting to false in this case would add the same change, by conflating the NodeSelector fit check with the others provided by that feature. I believe this is why the patch worked to solve your problem.

tl;dr is that this change, while technically simple, would break existing user workflows for different use cases that assume the current default. Additionally, I think what you have pointed out raises a bigger concern for the changes in #636, which might also break the default behavior in this strategy (perhaps others too?). @RyanDevlin, does this seem right with what you were working on?

@uniemimu (Author) commented
@damemi I took a look at the legacy here:

40bb490#diff-92b554150104f06be4ac18898266bad157dc2f611a6268d873765bf9608aadb3R56-R60

And indeed you are correct: the legacy behavior matches NodeFit being true, considering how things have worked since this particular strategy was introduced.

For all the other strategies of that era, and to my knowledge also now, the behavior matched NodeFit defaulting to false: none of the other strategies checked whether the pod fits any other node. The difference between the strategies is somewhat unfortunate when trying to pick one default value for the field that covers all strategies. I'm no expert on the matter, but the API defaults.go and zz_generated.defaults.go are rather empty, which sort of leaves me assuming the zero value false is the one being defaulted to. Also, all strategies now set a local nodeFit variable to false, after which the value is pulled from a snippet such as this:

nodeFit := false
if strategy.Params != nil {
    nodeFit = strategy.Params.NodeFit
}

So I suppose it is safe to say the current default value is false; how the NodeAffinity strategy works is another matter.

Now, in order to preserve how things used to work in the original NodeAffinity strategy, how things have worked in the rest of the strategies, and how the nodeFit feature is documented to work, I would presume that the correct fix is to default NodeFit to true for the NodeAffinity strategy and to false for the rest, and then potentially to remove this check in NodeAffinity:
&& nodeutil.PodFitsAnyNode(pod, nodes)

That check is already in the PodEvictor and controlled by NodeFit.
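To make that concrete, here is a hypothetical sketch of per-strategy defaulting. This is not descheduler code, and it assumes the NodeFit field were changed to a *bool so that "unset" can be distinguished from an explicit false:

package main

import "fmt"

// StrategyParameters mirrors only the field relevant here; the real API
// declares NodeFit as a plain bool, so the pointer is an assumption made
// for this sketch.
type StrategyParameters struct {
	NodeFit *bool
}

// resolveNodeFit gives NodeAffinity a default of true (preserving its
// legacy PodFitsAnyNode behavior) and every other strategy the zero-value
// default of false; an explicit setting always wins.
func resolveNodeFit(strategyName string, params *StrategyParameters) bool {
	nodeFit := strategyName == "RemovePodsViolatingNodeAffinity"
	if params != nil && params.NodeFit != nil {
		nodeFit = *params.NodeFit
	}
	return nodeFit
}

func main() {
	fmt.Println(resolveNodeFit("RemovePodsViolatingNodeAffinity", nil)) // true
	fmt.Println(resolveNodeFit("PodLifeTime", nil))                     // false
}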

If you need more hands in the project, e.g. for reviews, I would gladly volunteer.

@RyanDevlin (Contributor) commented
@damemi @uniemimu I apologize for my absence; I didn't have time to contribute until this week. I have much more time now to focus on closing this issue and merging my PR, if that's still the desired outcome here.

With regard to the defaults discussion above, #636 is going to be in conflict with some strategies no matter how we default it. This is because, as @damemi pointed out, some strategies have a few of the NodeFit checks baked in, while others perform no checks. If we want to align all strategies under the single NodeFit mechanism, we will eventually need to change the behavior of some of them.

One way to do this more gracefully might be to overlap the strategy and NodeFit checks. Since NodeFit considers node affinity before evicting, and the NodeAffinity strategy also considers this, we could merge the checks in node.go of #636, but leave the NodeAffinity strategy as is. Then in a later PR we can remove the checks in NodeAffinity that overlap with NodeFit. This causes some redundancy and inefficiency when running some strategies with NodeFit = true, but it allows for stable behavior that can be gradually changed.

@uniemimu (Author) commented
@RyanDevlin at least I'd like to see this bug sorted one way or the other. I like the nodeFit feature and I'd like to be able to use it with the node affinity strategy as documented. As for how the defaults should be done, I suppose that is up to @damemi to decide. My take on the matter is in the previous comment: I don't think the default should be the same for all strategies, unless a decision is made that the previous way of working doesn't matter.

@damemi (Contributor) commented Feb 23, 2022

This is a tough one, but it's not entirely my call 🙂 I agree with @uniemimu that the default doesn't necessarily have to be the same for all strategies. This takes more work on @RyanDevlin's part to refactor his NodeFit changes, but it might be the best way to avoid breaking the existing behavior.

@ingvagabund you've done a lot of the review for NodeFit so far, wdyt?

tl;dr is that NodeFit changes the default affinity checks for the NodeAffinity strategy, because it bakes in an affinity check already. So moving this check to NodeFit will have to either:

  • remove the affinity check by default for this strategy (if we default to disabling NodeAffinity), or
  • include the other NodeFit checks by default (if we default to enabling NodeAffinity)

I think the 2nd option is better because while it is a change, it leans toward evicting fewer pods from existing setups rather than more. Thoughts?

@RyanDevlin (Contributor) commented
@damemi I agree that the least disruptive way to add this feature would be to enable it by default. This way (like you said) the worst-case scenario would involve evicting fewer pods than in a previous version.

I still have some cleanup in my PR to meet the changes requested by @ingvagabund, but I'm going to hold off until we agree on a direction here. If we think this is the way to go I will make that PR ready to merge.

@uniemimu (Author) commented
@damemi @ingvagabund this seems to have been left without a decision and thus Ryan hasn't been able to work on this.

@RyanDevlin are you still active on those patches of yours or should I just implement a PR for this bug based on e.g. what Mike said on Feb 23 as his preference?

@ingvagabund (Contributor) commented
The 2nd option is preferable.

@ingvagabund (Contributor) commented
@RyanDevlin very sorry for the delay. I completely missed the last comments. I have rebased your PR and addressed the review comments in #790. Feel free to cherry-pick the changes if you still want to finish your PR.

@uniemimu (Author) commented May 6, 2022

> The 2nd option is preferable.

@ingvagabund please review #793. And perhaps somebody could trigger ok-to-test?

@k8s-triage-robot commented
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Aug 4, 2022.
@ingvagabund (Contributor) commented
Current state:

  • RemovePodsViolatingNodeAffinity does not fully respect the NodeFit enabled/disabled functionality, because historically there was no shared concept of checking whether a node can fit a replacement for an evicted pod.
  • RemovePodsViolatingNodeAffinity has additional PodFitsCurrentNode and PodFitsAnyNode checks before the actual eviction, which are hard-coded and cannot be disabled.
  • The requested functionality is to allow eviction of a pod even though there is no node feasible for scheduling its replacement.

Current framework direction:
Once the evictor filter gets migrated into a DefaultEvictor plugin, one of the goals is to split it into filter and preEvictionFilter extension points. The filter extension point will correspond to the current evictor filter, just without the current NodeFit function. The NodeFit function gets turned into a preEvictionFilter extension point with its own arguments. Depending on the discussion we have during the review, we can identify arguments like:

  • minimalNodeFit: apply only PodMatchNodeSelector, TolerationsTolerateTaints and IsNodeUnschedulable checks
  • resourceFit: apply fitsRequest (when disabled a new pending pod can trigger a preemption)
  • currentNodeFit: when set to false, exclude the pod's current node from node feasibility checks

Implementing the requested functionality of allowing eviction of pods without checking for node feasibility will be equivalent to disabling the preEvictionFilter extension point of the DefaultEvictor plugin, effectively bypassing any non-essential checks. It is open to debate whether we still need to run the filter extension point (or its essential subset) right before the actual eviction is performed, to make sure a plugin does not evict a system-critical pod by accident (unless overruled).
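As a rough illustration of that split (the interface names and signatures here are assumptions, not the final framework API):

package framework

import v1 "k8s.io/api/core/v1"

// FilterPlugin keeps the current evictor-filter semantics (priorities,
// mirror/DaemonSet pods, and so on), with the NodeFit logic removed.
type FilterPlugin interface {
	Filter(pod *v1.Pod) bool
}

// PreEvictionFilterPlugin runs the NodeFit-style feasibility checks right
// before an eviction; disabling it is what would allow evicting a pod even
// when no node is feasible for its replacement.
type PreEvictionFilterPlugin interface {
	PreEvictionFilter(pod *v1.Pod) bool
}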

For backward compatibility every v1alpha1 strategy configuration will be converted into a separate v1alpha2 profile internally. In the case of the RemovePodsViolatingNodeAffinity strategy/plugin, the backward compatible setting will correspond to:

  • when NodeFit=false (PodFitsAnyNode + !PodFitsCurrentNode) or NodeFit=true (PodFitsAnyOtherNode + PodFitsAnyNode + !PodFitsCurrentNode = PodFitsAnyNode + !PodFitsCurrentNode) (please double check :)):
apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: RemovePodsViolatingNodeAffinity
  pluginConfig:
  - name: DefaultEvictor
    args:
      ...
      currentNodeFit: false
  - name: RemovePodsViolatingNodeAffinity
    args:
      ...
  plugins:
    deschedule:
      enabled:
      - RemovePodsViolatingNodeAffinity

In the case of other strategies, the defaults will be:

  • for PodLifeTime:
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerPolicy
    profiles:
    - name: PodLifeTime
      pluginConfig:
      - name: DefaultEvictor
        args:
          ...
          currentNodeFit: true # all nodes are feasible for rescheduling
      - name: PodLifeTime
        args:
          ...
      plugins:
        deschedule:
          enabled:
          - PodLifeTime
  • for RemoveFailedPods:
    apiVersion: descheduler/v1alpha2
    kind: DeschedulerPolicy
    profiles:
    - name: RemoveFailedPods
      pluginConfig:
      - name: DefaultEvictor
        args:
          ...
          currentNodeFit: true # all nodes are feasible for rescheduling
      - name: RemoveFailedPods
        args:
          ...
      plugins:
        deschedule:
          enabled:
          - RemoveFailedPods

@k8s-triage-robot commented
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Sep 15, 2022.
@knelasevero (Contributor) commented
/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Sep 15, 2022.
@knelasevero (Contributor) commented
I am starting to work on the preEvictionFilter extension point. Do we want to discuss the 👉 "Depending on the discussion we will have during the review, we can identify arguments like:" part in the PR review, or here in this issue? I think it would make sense to make some decisions here first, but I guess I could come up with something and check with you all what you think.

cc @damemi @ingvagabund @uniemimu

@damemi (Contributor) commented Sep 15, 2022

Adding those options to NodeFit sounds good to me, but maybe we should split that into a separate follow-up task after converting the current NodeFit to a plugin.

Then at that point it's just a discussion of a change for a single plugin. I don't think that has any architectural impact on the framework refactor.

@knelasevero (Contributor) commented
Makes sense!

@uniemimu (Author) commented
> For backward compatibility every v1alpha1 strategy configuration will be converted into a separate v1alpha2 profile internally. In the case of the RemovePodsViolatingNodeAffinity strategy/plugin, the backward compatible setting will correspond to:
>
>   • when NodeFit=false (PodFitsAnyNode + !PodFitsCurrentNode) or NodeFit=true (PodFitsAnyOtherNode + PodFitsAnyNode + !PodFitsCurrentNode = PodFitsAnyNode + !PodFitsCurrentNode) (please double check :)):

Looks pretty backward compatible, since NodeFit seems to work just the same whether it is set to true or false.

The proposed and already merged preEvictionFilter and DefaultEvictor plugin look fine to me, but the key to making this bug solvable is being able to take out these checks:

!nodeutil.PodFitsCurrentNode(d.handle.GetPodsAssignedToNodeFunc(), pod, node) &&
    nodeutil.PodFitsAnyNode(d.handle.GetPodsAssignedToNodeFunc(), pod, nodes)

A NodeFit plugin sounds interesting.

@k8s-triage-robot commented
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 26, 2022.
@ingvagabund (Contributor) commented
/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jan 4, 2023.
@k8s-triage-robot commented
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 4, 2023.
@knelasevero (Contributor) commented
/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label Apr 4, 2023.
@knelasevero (Contributor) commented
I think we should freeze all feature requests that we plan to tackle (for sure) at some point in the future. It does not make sense to keep managing stale or rotten labels on those if we are sure we want to address them at some point.

@damemi (Contributor) commented Apr 4, 2023

+1, I am totally fine with making liberal use of the lifecycle/frozen label for feature requests.

@ingvagabund (Contributor) commented May 29, 2024

> I created a deployment with a node selector forcing the pods to land on a specific single node. The deployment pod template had a node affinity rule, which was not triggered during scheduling. After scheduling, I changed the node labels so that the affinity should trigger descheduling, and labeled the pods with the required foo=bar label.

@uniemimu I'd like to thank you for all the patience and energy you put into this issue. I can totally understand if you are no longer interested in pursuing this issue. Yet, if you still are and can spare a few moments, I'd like to ask for more clarification. Do you still remember what you meant by "The deployment pod template had a node affinity rule, which was not triggered during scheduling"? Can you provide an example of the deployment spec?

It has been more than 2 years since this issue was opened. I am currently in the process of writing a proposal for the NodeFit improvements, picking up as many cases/requests for improvement as I can. Any clarification of your use cases is appreciated.

@uniemimu (Author) commented

> Do you still remember what you meant by "The deployment pod template had a node affinity rule, which was not triggered during scheduling"? Can you provide an example of the deployment spec?

As you probably guessed, this issue is no longer critical for us. But it is an issue nevertheless, and I can clarify. What we were doing was telemetry-aware scheduling, where the pods had this sort of affinity rule (at the end of the deployment):

https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/deploy/health-metric-demo/demo-pod.yaml

and the whole scheme is explained here:

https://github.com/intel/platform-aware-scheduling/tree/master/telemetry-aware-scheduling#descheduling-workloads

The description of this issue (#640) describes a more manual way of adjusting the node-affinity status, but we were doing it with that telemetry-aware scheduling solution, which basically adjusts node labels based on telemetry events. My twist on that setup was that the pod had a node selector which forced it to land on a specific node, meaning there were no alternative nodes if that one node was deemed unsuitable. There was also some pod-label-based limiting so that only certain pods got selected by the descheduler, but for the bug that detail did not matter; it worked just the same if all the pods in the deployment were treated the same. The end result was always that the descheduler did not evict the pod if there wasn't another suitable node available.
