nodeFit = false doesn't work as expected with RemovePodsViolatingNodeAffinity #640
/cc @RyanDevlin (if you have the time)
/assign
@RyanDevlin I verified that your patches fix this.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
Since @RyanDevlin is busy and #636 isn't moving overly fast, may I propose an alternative approach for making sure this nearly six-month-old bug gets fixed?
Hi @uniemimu, regarding your suggestion: this change may be more complicated than the one-liner you proposed. While it would resolve this issue, it would also change the current default behavior of the NodeAffinity strategy. Since this strategy has been implemented to not evict pods that don't fit on another node by default, users may be operating under that assumption. This actually brings up a good point for #636 as well.
tl;dr is that this change, while technically simple, would break existing user workflows for different use cases that assume the current default. Additionally, I think what you have pointed out raises a bigger concern for the changes in #636, which might also break the default behavior in this strategy (and perhaps others too). @RyanDevlin does this seem right with what you were working on?
@damemi I took a look at the legacy here:
40bb490#diff-92b554150104f06be4ac18898266bad157dc2f611a6268d873765bf9608aadb3R56-R60
And indeed you are correct in that the legacy behavior is to check whether the pod fits somewhere else. All the other strategies of that era, and to my knowledge also now, used to work like the default in:
descheduler/pkg/descheduler/strategies/pod_antiaffinity.go
Lines 74 to 77 in 3c251fb
So I suppose it is safe to say the current effective default for the NodeAffinity strategy is to always check for node fit.
Now, in order to maintain how things used to work in the original NodeAffinity strategy, how things have worked in the rest of the strategies, and how the nodeFit feature is documented to be working, I would presume that the correct fix would be to have the default value of nodeFit for the NodeAffinity strategy be true, and to let the nodeFit setting actually control the check. Since that check is already in the PodEvictor and controlled by nodeFit, the duplicate check in the strategy itself could then be dropped.
If you need more hands in the project, like for reviews, I would gladly volunteer.
@damemi @uniemimu I apologize for my absence, I didn't have time to contribute until this week. I have much more time now to focus on closing this issue and merging my PR, if that's still the desired outcome here.
With regard to the defaults discussion above, #636 is going to be in conflict with some strategies no matter how we default it. This is because, like @damemi pointed out, some strategies already have a few of the NodeFit checks baked in. One way to do this more gracefully might be to overlap the strategy defaults and the NodeFit defaults.
@RyanDevlin at least I'd like to see this bug sorted one way or the other. I like the nodeFit feature and I'd like to be able to use it with the node affinity strategy as it is documented. As to how the defaults should be done, I suppose that is up to @damemi to decide. My take on the matter is in the previous comment, and I don't think the default should be the same for all strategies, unless a decision is made that the previous way of working doesn't matter.
This is a tough one, but it's not entirely my call 🙂 I agree with @uniemimu that the default doesn't necessarily have to be the same for all strategies. This takes more work on @RyanDevlin's part to refactor his NodeFit changes, but it might be the best way to avoid breaking the existing behavior. @ingvagabund you've done a lot of the review for NodeFit so far, wdyt?
tl;dr is that NodeFit changes the default affinity checks for the NodeAffinity strategy, because it bakes in an affinity check already. So moving this check to NodeFit will either have to:
1. drop the built-in check unless NodeFit is explicitly enabled (evicting more pods than today for existing setups), or
2. enable NodeFit by default for the NodeAffinity strategy (at worst evicting fewer pods than today).
I think the 2nd option is better because, while it is a change, it leans toward evicting fewer pods from existing setups rather than more. Thoughts?
@damemi I agree that the least disruptive way to add this feature would be to enable it by default. This way (like you said) the worst case scenario would involve evicting fewer pods than in a previous version. I still have some cleanup to do in my PR to meet the changes requested by @ingvagabund, but I'm going to hold off until we agree on a direction here. If we think this is the way to go I will make that PR ready to merge.
@damemi @ingvagabund this seems to have been left without a decision, and thus Ryan hasn't been able to work on this. @RyanDevlin are you still active on those patches of yours, or should I just implement a PR for this bug based on, e.g., what Mike said on Feb 23 as his preference?
The 2nd option is preferable.
@RyanDevlin very sorry for the delay. I completely missed the last comments. I have rebased your PR and addressed the review comments in #790. Feel free to cherry-pick the changes if you still want to finish your PR.
@ingvagabund please review #793. And perhaps somebody could trigger ok-to-test?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Current state:
Framework current direction:
Implementing the requested functionality of allowing eviction of pods without checking for node feasibility will be equivalent to disabling the nodeFit check in the DefaultEvictor. For backward compatibility, every v1alpha1 strategy configuration will be converted into a separate v1alpha2 profile internally. In the case of RemovePodsViolatingNodeAffinity the converted profile will look like:

apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
- name: RemovePodsViolatingNodeAffinity
  pluginConfig:
  - name: DefaultEvictor
    args:
      ...
      nodeFit: false
  - name: RemovePodsViolatingNodeAffinity
    args:
      ...
  plugins:
    deschedule:
      enabled:
      - RemovePodsViolatingNodeAffinity

In case of other strategies the defaults will be:
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
I am starting to work on the preEvictionFilter extension point; do we want to discuss this 👉
Adding those options to NodeFit sounds good to me, but maybe we should split that into a separate follow-up task after converting the current NodeFit to a plugin. Then at that point it's just a discussion of a change for a single plugin. I don't think that has any architectural impact on the framework refactor.
Makes sense!
Looks pretty backwards compatible, since NodeFit seems to be working just the same whether it is set to true or false. The proposed and already merged preEvictionFilter and DefaultEvictor plugin look fine to me, but the key for this bug to be solvable is to be able to take out the checks at:
descheduler/pkg/framework/plugins/removepodsviolatingnodeaffinity/node_affinity.go
Lines 93 to 94 in da3ebb7
A NodeFit plugin sounds interesting.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle frozen
I think we should freeze all feature requests that we plan to tackle (for sure) at some point in the future. It does not make sense to keep managing stale or rotten labels on those if we are sure we want to address them at some point.
+1, I am totally fine with making liberal use of the lifecycle/frozen label for feature requests.
@uniemimu I'd like to thank you for all the patience and energy you put into this issue. I can totally understand if you are no longer interested in pursuing it. Yet, if you still are and can spare a few moments, I'd like to ask for more clarification. Do you still remember what you meant by "The deployment pod template had a node affinity rule, which was not triggered during scheduling"? Can you provide an example of the deployment spec?
It has been more than 2 years since this issue was opened. I am currently in the process of writing a proposal for the nodeFit improvements, picking up as many cases/requests for improvement as I can. Any clarification of your use cases is appreciated.
As you probably guessed, this issue is no longer critical for us. But it is an issue nevertheless, and I can clarify. What we were doing was telemetry aware scheduling, where the pods had a node affinity rule of the kind sketched below (at the end of the deployment), and the whole scheme is explained in the telemetry aware scheduling documentation.
The description of this issue (#640) described more manual ways of adjusting the node affinity status, but we were doing it with that telemetry aware scheduling solution, which basically adjusts node labels based on telemetry events. My twist to that setup was that the pod had a node selector which forced it to land on a specific node, meaning there were no alternative nodes if that one node was deemed unsuitable. There was also some pod-label-based limiting happening so that only certain pods got selected by the descheduler, but for the bug that detail did not matter; it worked just the same if all the pods in the deployment were treated the same. The end result was always that the descheduler did not evict the pod if there wasn't another suitable node available.
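For illustration only, the affinity rule was roughly of this shape, keyed on a node label that the telemetry controller flips when a policy is violated. The label key and value below are hypothetical placeholders, not the exact ones used by the telemetry aware scheduling project:

```yaml
# Sketch of the node affinity rule in the deployment's pod template.
# The label key/value are placeholders for a telemetry-driven node label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: telemetry-policy/violating   # hypothetical label set by the telemetry controller
          operator: NotIn
          values:
          - "true"
```

Once the controller labels the node with that key set to "true", the already running pod violates its required node affinity, which is the condition the descheduler is expected to act on.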
What version of descheduler are you using?
descheduler version: 0.22
Does this issue reproduce with the latest release?
Yes
Which descheduler CLI options are you using?
Please provide a copy of your descheduler policy config file
What k8s version are you using (kubectl version)?
What did you do?
I created a deployment with a node selector forcing the pods to land on a specific single node. The deployment pod template had a node affinity rule, which was not triggered during scheduling. After scheduling, I changed the node labels so that the affinity should trigger descheduling, and labeled the pods with the required foo=bar label.
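A minimal sketch of that kind of deployment, with hypothetical node and label names (not the reporter's actual manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodefit-demo                       # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      foo: bar
  template:
    metadata:
      labels:
        foo: bar                           # label the descheduler is configured to select
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-1   # forces the pod onto one specific node
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: example.com/maintenance   # hypothetical label changed on the node after scheduling
                operator: NotIn
                values:
                - "true"
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
```

After the pod is running, labeling the node with example.com/maintenance=true makes the pod violate its required node affinity; since the nodeSelector pins it to that one node, there is no other node it could fit on.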
What did you expect to see?
I expected the descheduler to evict my pod, and the pod to end up in PENDING state as there is no node where it could fit.
What did you see instead?
Descheduler prints that the pod will not fit the node, but it won't evict it.
The culprit is in the strategy, at node_affinity.go line 89:
descheduler/pkg/descheduler/strategies/node_affinity.go
Line 89 in 5b55794
The node affinity strategy always checks whether the pod fits on another node, regardless of how nodeFit is set. The documentation of the nodeFit filtering suggests that the default is not to check for node fitting:
descheduler/README.md
Lines 703 to 707 in 5b55794
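For illustration, a v1alpha1 policy of the kind being discussed might look like the sketch below (not the reporter's actual config file; only nodeAffinityType and nodeFit are shown, other fields omitted):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    params:
      nodeAffinityType:
      - "requiredDuringSchedulingIgnoredDuringExecution"
      nodeFit: false   # expected to disable the "fits on another node" check, which is what this issue is about
```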
It would be logical that when nodeFit=false, the pod is evicted regardless of whether it fits somewhere or not. A one-line change would fix this; line 89 could, for example, be changed to something like:
(!nodeFit || nodeutil.PodFitsAnyNode(pod, nodes))