RemovePodsViolatingNodeAffinity should not rely on node capacity when checking if a pod fits on its current node #873

Closed
seleznev opened this issue Jul 7, 2022 · 6 comments · Fixed by #916

@seleznev

seleznev commented Jul 7, 2022

What version of descheduler are you using?

descheduler version: 0.24.1

Does this issue reproduce with the latest release?

Yes.

Which descheduler CLI options are you using?

- --policy-config-file
- /policy-dir/policy.yaml
- --v
- "4"

Please provide a copy of your descheduler policy config file

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
evictLocalStoragePods: true
ignorePvcPods: true
maxNoOfPodsToEvictPerNamespace: 1
maxNoOfPodsToEvictPerNode: 1
strategies:
  LowNodeUtilization:
    enabled: false
    params:
      nodeResourceUtilizationThresholds:
        targetThresholds:
          cpu: 50
          memory: 50
          pods: 50
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
  RemoveDuplicates:
    enabled: false
  RemovePodsViolatingInterPodAntiAffinity:
    enabled: false
  RemovePodsViolatingNodeAffinity:
    enabled: true
    params:
      nodeAffinityType:
      - requiredDuringSchedulingIgnoredDuringExecution
  RemovePodsViolatingNodeTaints:
    enabled: false

What k8s version are you using (kubectl version)?

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.14", GitCommit:"57a3aa3f13699cf3db9c52d228c18db94fa81876", GitTreeState:"clean", BuildDate:"2021-12-15T14:52:33Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.14", GitCommit:"57a3aa3f13699cf3db9c52d228c18db94fa81876", GitTreeState:"clean", BuildDate:"2021-12-15T14:47:10Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

What did you do?

Steps to reproduce:

  1. Create 2 k8s nodes with the same labels and the same capacity.
  2. Create a Deployment with a matching NodeAffinity, replicas: 1, and requests >50% of a node's capacity (so two pods cannot be scheduled on the same node).
  3. Run the descheduler with the RemovePodsViolatingNodeAffinity strategy enabled.

What did you expect to see?

The descheduler does nothing.

What did you see instead?

The descheduler evicts the pod every iteration (because it fits on the other node, but not on its current one).

First iteration (I removed irrelevant messages about other pods and nodes):

I0707 07:15:07.523600       1 node_affinity.go:91] "Processing node" node="node1"
I0707 07:15:07.524743       1 node.go:166] "Pod does not fit on node" pod="test/app-69d6bbd5b4-bzh9j" node="node1"
I0707 07:15:07.524768       1 node.go:168] "insufficient cpu"
I0707 07:15:07.525726       1 node.go:163] "Pod fits on node" pod="test/app-69d6bbd5b4-bzh9j" node="node2"
I0707 07:15:07.525953       1 node_affinity.go:108] "Evicting pod" pod="test/app-69d6bbd5b4-bzh9j"
I0707 07:15:07.565879       1 evictions.go:163] "Evicted pod" pod="test/app-69d6bbd5b4-bzh9j" reason="NodeAffinity" strategy="NodeAffinity" node="node1"
I0707 07:15:07.566870       1 event.go:294] "Event occurred" object="test/app-69d6bbd5b4-bzh9j" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="Descheduled" message="pod evicted by sigs.k8s.io/deschedulerNodeAffinity"

Second:

I0707 07:20:07.464621       1 node_affinity.go:91] "Processing node" node="node2"
I0707 07:20:07.465559       1 node.go:183] "Pod does not fit on node" pod="test/app-69d6bbd5b4-4msq7" node="node2"
I0707 07:20:07.465597       1 node.go:185] "insufficient cpu"
I0707 07:20:07.467974       1 node.go:163] "Pod fits on node" pod="test/app-69d6bbd5b4-4msq7" node="node1"
I0707 07:20:07.468864       1 node_affinity.go:108] "Evicting pod" pod="test/app-69d6bbd5b4-4msq7"
I0707 07:20:07.545268       1 evictions.go:163] "Evicted pod" pod="test/app-69d6bbd5b4-4msq7" reason="NodeAffinity" strategy="NodeAffinity" node="node2"
I0707 07:20:07.546559       1 event.go:294] "Event occurred" object="test/app-69d6bbd5b4-4msq7" fieldPath="" kind="Pod" apiVersion="v1" type="Normal" reason="Descheduled" message="pod evicted by sigs.k8s.io/deschedulerNodeAffinity"
@seleznev added the kind/bug label on Jul 7, 2022
@JaneLiuL
Member

JaneLiuL commented Jul 7, 2022

Hi @seleznev, thanks for raising this issue.
It seems that RemovePodsViolatingNodeAffinity needs to calculate whether the pod fits on another node before it starts to evict.
So each time before we evict, we need to figure out whether there is another node the pod could be scheduled onto.

@damemi
Contributor

damemi commented Jul 7, 2022

@JaneLiuL I think @seleznev is actually reporting the opposite. It seems that RemovePodsViolatingNodeAffinity is evicting pods that do not violate the NodeAffinity solely because they don't fit on their current node (for whatever reason). The bug here is that pods should not be evicted just because their requests don't fit on their current node.

Pretty sure it's doing this because:

  1. RemovePodsViolatingNodeAffinity calls !PodFitsOnCurrentNode(), intending to check whether the pod no longer fits based on its node selectors, which is what that function used to do. However,
  2. The NodeFit feature implementation changed this function to call NodeFit().
  3. Now, calling NodeFit() checks the node selector and resource requests, failing on either.

So the NodeFit implementation needs to somehow decouple the node selector check from the fitsRequest check. We should also add a test for this. cc @RyanDevlin @ingvagabund
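
For illustration, here is a minimal sketch of what a selector/affinity-only check could look like once it is decoupled from the request accounting. The function name PodMatchesRequiredNodeAffinity is hypothetical (it is not the descheduler's actual API); the only real dependency assumed is the upstream k8s.io/component-helpers nodeaffinity helper:

package node

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/component-helpers/scheduling/corev1/nodeaffinity"
)

// PodMatchesRequiredNodeAffinity (hypothetical name) reports whether the pod's
// required node selector / node affinity terms match the given node. It
// deliberately ignores resource requests, so an affinity-only strategy can use
// it without also pulling in the fitsRequest check.
func PodMatchesRequiredNodeAffinity(pod *v1.Pod, node *v1.Node) bool {
	match, err := nodeaffinity.GetRequiredNodeAffinity(pod).Match(node)
	return err == nil && match
}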

@JaneLiuL
Member

JaneLiuL commented Jul 8, 2022

Thanks @damemi, that looks clear to me now. I would like to take this and add some tests for it.

@uniemimu

2. The NodeFit feature implementation [changed this function to call NodeFit()](https://github.com/kubernetes-sigs/descheduler/commit/16eb9063b6525bd7fc3cfd57b506ce2843818804#diff-fc4033754cd376d78b09ff84a934296f26e774702505885309268f816ba3817cR178-R179).

And there lies the problem and the cause of many issues. The same sort of baked-in NodeFit() call also exists in the PodFitsAnyNode function. This means that the strategies (or plugins) which use these functions in the filtering phase have de facto hard-coded NodeFit functionality that cannot be turned off. That creates even more issues, since NodeFit does have a controlling configuration boolean, which would imply it could be turned off, but it can't: that boolean only takes effect during the actual eviction step. This is unfortunate, as it prevents using these strategies (or plugins) for use cases where getting an eviction done is more important than avoiding it entirely.

If those NodeFit() calls weren't hard-coded into these functions, meaning NodeFit() were only called when it is configured to be enabled, then issue #845 would be fixed and issue #863 would at least have a workaround of disabling NodeFit.

Issue #640 probably requires a bit more, but it is very much related; removing the PodFitsAnyNode check and relying on the configurable boolean that controls the eviction-step NodeFit() call is part of the offered PR.
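
To make that gating concrete, here is a rough sketch of a PodFitsAnyNode-style helper whose NodeFit() call is controlled by the policy's boolean. The name, signature, and the nodeFitEnabled parameter are illustrative rather than the project's actual API; it assumes NodeFit(nodeIndexer, pod, node) returns a slice of errors, as in the linked node.go:

package node

import (
	v1 "k8s.io/api/core/v1"

	podutil "sigs.k8s.io/descheduler/pkg/descheduler/pod"
)

// podFitsAnyOtherNode (illustrative) reports whether the pod could land on some
// node other than the one it is currently assigned to.
func podFitsAnyOtherNode(nodeIndexer podutil.GetPodsAssignedToNodeFunc, pod *v1.Pod, nodes []*v1.Node, nodeFitEnabled bool) bool {
	for _, node := range nodes {
		// The pod's own node is never a relocation target, so its free
		// capacity is irrelevant here.
		if node.Name == pod.Spec.NodeName {
			continue
		}
		// With nodeFit disabled in the policy, do not second-guess the
		// scheduler: treat any other node as an acceptable destination.
		if !nodeFitEnabled {
			return true
		}
		// With nodeFit enabled, require the full check to pass.
		if len(NodeFit(nodeIndexer, pod, node)) == 0 {
			return true
		}
	}
	return false
}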

@ingvagabund
Contributor

ingvagabund commented Aug 16, 2022

This is definitely a bug in https://github.com/kubernetes-sigs/descheduler/blob/master/pkg/descheduler/node/node.go#L122. Invoking fitsRequest needs to be made conditional:

if pod.Spec.NodeName != node.Name {
	// Check if the pod can fit on a node based off its requests
	ok, reqErrors := fitsRequest(nodeIndexer, pod, node)
	if !ok {
		errors = append(errors, reqErrors...)
	}
}

There's no reason to check whether a pod fits within the resources of the node it is already assigned to.
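
For context, the guarded section would sit inside NodeFit() roughly like this; only the pod.Spec.NodeName comparison and the fitsRequest call come from the snippet above, the rest is a sketch with the other checks elided:

// Illustrative shape of NodeFit with the guard in place; the node selector,
// taint and schedulability checks the real function performs are elided here.
func NodeFit(nodeIndexer podutil.GetPodsAssignedToNodeFunc, pod *v1.Pod, node *v1.Node) []error {
	var errors []error

	// ... node selector / affinity, taint and unschedulable checks elided ...

	// Resource requests only matter when evaluating a node the pod is NOT
	// already running on; "insufficient cpu" on the pod's own node is expected
	// and is not a reason to evict it.
	if pod.Spec.NodeName != node.Name {
		ok, reqErrors := fitsRequest(nodeIndexer, pod, node)
		if !ok {
			errors = append(errors, reqErrors...)
		}
	}

	return errors
}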

@ingvagabund
Contributor

Given this issue is not about decoupling NodeFit, I am dropping the generic approach comment into #640 (comment) instead; please follow the discussion there.
