bug 1950026: Sync with upstream #62

Merged: 51 commits, Jun 10, 2021
Commits
92cb1b2
add helm-test configurations
pravarag Mar 14, 2021
079bd61
e2e: TestLowNodeUtilization: normalize nodes before running the strategy
ingvagabund Mar 28, 2021
bb55741
Update vendor
ingvagabund Mar 29, 2021
534a30a
e2e: deleteRC: replace loop with wait.PollImmediate
ingvagabund Mar 28, 2021
2781106
TestEvictAnnotation: replace LowNodeUtilization strategy with PodLife…
ingvagabund Mar 28, 2021
f4e24a4
Drop klog
ingvagabund Mar 31, 2021
8b5c4e8
update docs with helm test info
pravarag Apr 14, 2021
c4afb6b
code cleanup: remove check on length
binacs Apr 25, 2021
780ac7a
Merge pull request #554 from BinacsLee/binacs-utils-predicates-cleanup
k8s-ci-robot Apr 26, 2021
feae158
Invert main strategy loop for performance and customizability
damemi Apr 26, 2021
724ff8a
Merge pull request #556 from damemi/change-main-loop
k8s-ci-robot Apr 28, 2021
6bde95c
Use Structured Logging For Unknown Strategy Log Message
seanmalloy Apr 29, 2021
161f66a
Merge pull request #558 from KohlsTechnology/structured-logging
k8s-ci-robot Apr 29, 2021
58408d7
Update error messages in verify scripts to be more informative
damemi Apr 30, 2021
2a3529c
Merge pull request #560 from damemi/update-verify-messages
k8s-ci-robot May 1, 2021
54f6726
Define TolerationsEqual
ingvagabund May 5, 2021
4edbecc
Define NodeSelectorsEqual predicate
ingvagabund May 9, 2021
9b69962
Merge pull request #535 from ingvagabund/e2e-refactor
k8s-ci-robot May 11, 2021
fc83c13
Add test cases for soft constraints/multi constraints
damemi May 10, 2021
9b26abd
Merge pull request #565 from damemi/issue-564
k8s-ci-robot May 13, 2021
24c0ca2
Take node's taints into consideration when balancing domains
a7i May 14, 2021
a1709e9
Merge pull request #567 from a7i/topology-taint-toleration
k8s-ci-robot May 14, 2021
fe8e17f
fix staticcheck failure for pkg/descheduler/descheduler_test.go
binacs May 16, 2021
a9ff644
Merge pull request #568 from BinacsLee/binacs-pkg-descheduler-desched…
k8s-ci-robot May 17, 2021
5396282
RemoveDuplicates: take node taints, node affinity and node selector i…
ingvagabund May 5, 2021
0397425
Bump Kind To v0.11.0
seanmalloy May 19, 2021
8480e03
Fail unit and e2e tests on any errors
seanmalloy May 19, 2021
11143d5
Merge pull request #570 from KohlsTechnology/bump-kind-version
k8s-ci-robot May 19, 2021
31fd097
Merge pull request #527 from pravarag/add-helm-test
k8s-ci-robot May 19, 2021
449383c
Add run descheduler as deployment files and update README
May 21, 2021
3b9d3d9
Merge pull request #563 from ingvagabund/removeduplicates-take-taints…
k8s-ci-robot May 21, 2021
646c13a
Fix grammar and indentation issue for deployment resource
May 21, 2021
41d46d0
Working nodeFit feature
RyanDevlin Apr 22, 2021
bfd5fea
Merge pull request #559 from RyanDevlin/nodeFit
k8s-ci-robot May 31, 2021
a54b59f
Use stable batch/v1 API Group for Kubernetes 1.21
a7i May 25, 2021
f07089d
Bump Helm Chart, kind, and Kubernetes version for helm-test
a7i May 25, 2021
012ca23
Filter pods by labelSelector during eviction for TopologySpreadConstr…
a7i May 24, 2021
d7dc0ab
Merge pull request #576 from a7i/amira/topology-spread-label-filter
k8s-ci-robot Jun 2, 2021
e40620e
Remove namespace from ClusterRoleBinding
jsravn Jun 4, 2021
6e71068
Refractoring lownodeutilization - extracting common functions
bytetwin Jun 1, 2021
2f18864
Refractor - Modify the common functions to be used by high utilisation
bytetwin Jun 1, 2021
4cd1e66
Adding highnodeutilization strategy
bytetwin May 7, 2021
839a237
Merge pull request #581 from jsravn/patch-1
k8s-ci-robot Jun 7, 2021
3843a2d
Merge pull request #550 from hanumanthan/highnodeutilisation
k8s-ci-robot Jun 8, 2021
fe8d4c0
Merge pull request #572 from audip/feature/add-deployment-k8s-yaml-files
k8s-ci-robot Jun 8, 2021
f51ea72
Merge pull request #577 from a7i/amira/cronjob-ga
k8s-ci-robot Jun 8, 2021
d998d82
HighNodeUtilization: add NodeFit feature
ingvagabund Jun 8, 2021
b59995e
Merge pull request #583 from ingvagabund/highnodeutil-nodefit
k8s-ci-robot Jun 8, 2021
eb1f0ec
Update Go report card badge
damemi Jun 8, 2021
0f785b9
Merge pull request #584 from damemi/update-go-badge
k8s-ci-robot Jun 8, 2021
00c1931
Merge branch 'master' into release-4.8
ingvagabund Jun 9, 2021
155 changes: 148 additions & 7 deletions README.md
@@ -1,4 +1,4 @@
[![Go Report Card](https://goreportcard.com/badge/kubernetes-sigs/descheduler)](https://goreportcard.com/report/sigs.k8s.io/descheduler)
[![Go Report Card](https://goreportcard.com/badge/sigs.k8s.io/descheduler)](https://goreportcard.com/report/sigs.k8s.io/descheduler)
![Release Charts](https://github.com/kubernetes-sigs/descheduler/workflows/Release%20Charts/badge.svg)

# Descheduler for Kubernetes
@@ -28,12 +28,14 @@ Table of Contents
- [Quick Start](#quick-start)
- [Run As A Job](#run-as-a-job)
- [Run As A CronJob](#run-as-a-cronjob)
- [Run As A Deployment](#run-as-a-deployment)
- [Install Using Helm](#install-using-helm)
- [Install Using Kustomize](#install-using-kustomize)
- [User Guide](#user-guide)
- [Policy and Strategies](#policy-and-strategies)
- [RemoveDuplicates](#removeduplicates)
- [LowNodeUtilization](#lownodeutilization)
- [HighNodeUtilization](#highnodeutilization)
- [RemovePodsViolatingInterPodAntiAffinity](#removepodsviolatinginterpodantiaffinity)
- [RemovePodsViolatingNodeAffinity](#removepodsviolatingnodeaffinity)
- [RemovePodsViolatingNodeTaints](#removepodsviolatingnodetaints)
@@ -44,6 +46,7 @@ Table of Contents
- [Namespace filtering](#namespace-filtering)
- [Priority filtering](#priority-filtering)
- [Label filtering](#label-filtering)
- [Node Fit filtering](#node-fit-filtering)
- [Pod Evictions](#pod-evictions)
- [Pod Disruption Budget (PDB)](#pod-disruption-budget-pdb)
- [Metrics](#metrics)
@@ -56,7 +59,7 @@ Table of Contents

## Quick Start

The descheduler can be run as a Job or CronJob inside of a k8s cluster. It has the
The descheduler can be run as a `Job`, `CronJob`, or `Deployment` inside of a k8s cluster. It has the
advantage of being able to be run multiple times without needing user intervention.
The descheduler pod is run as a critical pod in the `kube-system` namespace to avoid
being evicted by itself or by the kubelet.
@@ -77,6 +80,14 @@ kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/cronjob/cronjob.yaml
```

### Run As A Deployment

```
kubectl create -f kubernetes/base/rbac.yaml
kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/deployment/deployment.yaml
```

### Install Using Helm

Starting with release v0.18.0 there is an official helm chart that can be used to install the
@@ -99,16 +110,29 @@ Run As A CronJob
kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/cronjob?ref=v0.21.0' | kubectl apply -f -
```

Run As A Deployment
```
kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/deployment?ref=v0.21.0' | kubectl apply -f -
```

## User Guide

See the [user guide](docs/user-guide.md) in the `/docs` directory.

## Policy and Strategies

Descheduler's policy is configurable and includes strategies that can be enabled or disabled.
Eight strategies `RemoveDuplicates`, `LowNodeUtilization`, `RemovePodsViolatingInterPodAntiAffinity`,
`RemovePodsViolatingNodeAffinity`, `RemovePodsViolatingNodeTaints`, `RemovePodsViolatingTopologySpreadConstraint`,
`RemovePodsHavingTooManyRestarts`, and `PodLifeTime` are currently implemented. As part of the policy, the
Nine strategies
1. `RemoveDuplicates`
2. `LowNodeUtilization`
3. `HighNodeUtilization`
4. `RemovePodsViolatingInterPodAntiAffinity`
5. `RemovePodsViolatingNodeAffinity`
6. `RemovePodsViolatingNodeTaints`
7. `RemovePodsViolatingTopologySpreadConstraint`
8. `RemovePodsHavingTooManyRestarts`
9. `PodLifeTime`
are currently implemented. As part of the policy, the
parameters associated with the strategies can be configured too. By default, all strategies are enabled.
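To make the policy shape concrete, here is a minimal sketch of such a file — illustrative only and not part of this diff; the parameter layout follows the `PodLifeTime` examples used elsewhere in this README, and the 86400-second value is arbitrary:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true                      # strategies are toggled individually
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400   # arbitrary example: evict pods older than 24h
  "RemoveDuplicates":
    enabled: false                     # an explicitly disabled strategy
```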

The following diagram provides a visualization of most of the strategies to help
@@ -157,6 +181,7 @@ should include `ReplicaSet` to have pods created by Deployments excluded.
|`namespaces`|(see [namespace filtering](#namespace-filtering))|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**
```yaml
@@ -204,6 +229,7 @@ strategy evicts pods from `overutilized nodes` (those with usage above `targetThresholds`)
|`numberOfNodes`|int|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**

@@ -226,8 +252,10 @@ strategies:
```

Policy should pass the following validation checks:
* Three basic native types of resources are supported: `cpu`, `memory` and `pods`. If any of these resource types is not specified, all its thresholds default to 100% to avoid nodes going from underutilized to overutilized.
* Extended resources are supported. For example, resource type `nvidia.com/gpu` is specified for GPU node utilization. Extended resources are optional, and will not be used to compute node's usage if it's not specified in `thresholds` and `targetThresholds` explicitly.
* Three basic native types of resources are supported: `cpu`, `memory` and `pods`.
If any of these resource types is not specified, all its thresholds default to 100% to avoid nodes going from underutilized to overutilized.
* Extended resources are supported. For example, resource type `nvidia.com/gpu` is specified for GPU node utilization. Extended resources are optional,
and will not be used to compute node's usage if it's not specified in `thresholds` and `targetThresholds` explicitly.
* `thresholds` or `targetThresholds` can not be nil and they must configure exactly the same types of resources.
* The valid range of the resource's percentage value is \[0, 100\]
* Percentage value of `thresholds` can not be greater than `targetThresholds` for the same resource.
@@ -237,6 +265,63 @@ This parameter can be configured to activate the strategy only when the number of under utilized nodes
are above the configured value. This could be helpful in large clusters where a few nodes could go
under utilized frequently or for a short period of time. By default, `numberOfNodes` is set to zero.
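The main `LowNodeUtilization` example is collapsed in this diff view, so here is a sketch of how `numberOfNodes` slots into the configuration — threshold values are illustrative, and `numberOfNodes` is assumed to sit under `nodeResourceUtilizationThresholds` alongside the thresholds:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:          # below all of these, a node counts as underutilized
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:    # above any of these, a node counts as overutilized
          "cpu": 50
          "memory": 50
          "pods": 50
        numberOfNodes: 3     # only act once at least 3 nodes are underutilized
```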

### HighNodeUtilization

This strategy finds nodes that are under utilized and evicts pods in the hope that these pods will be scheduled compactly into fewer nodes.
This strategy **must** be used with the
scheduler strategy `MostRequestedPriority`. The parameters of this strategy are configured under `nodeResourceUtilizationThresholds`.
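Because this strategy only helps if the scheduler itself favors packing, a matching kube-scheduler profile is needed. The sketch below targets the Kubernetes 1.21 era, where the plugin counterpart of the legacy `MostRequestedPriority` policy is `NodeResourcesMostAllocated`; this is an assumption outside this diff, so check the scheduler documentation for your release:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          - name: NodeResourcesMostAllocated   # score nodes higher as they fill up
        disabled:
          - name: NodeResourcesLeastAllocated  # the default spreading behaviour
```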

The under utilization of nodes is determined by a configurable threshold `thresholds`. The threshold
`thresholds` can be configured for cpu, memory, number of pods, and extended resources in terms of percentage. The percentage is
calculated as the current resources requested on the node vs [total allocatable](https://kubernetes.io/docs/concepts/architecture/nodes/#capacity).
For pods, this means the number of pods on the node as a fraction of the pod capacity set for that node.

If a node's usage is below the thresholds for all resources (cpu, memory, number of pods, and any extended resources), the node is considered underutilized.
Currently, pods' resource requests are considered when computing a node's resource utilization.
Any node with usage above `thresholds` is considered appropriately utilized and is not considered for eviction.

The `thresholds` param could be tuned as per your cluster requirements. Note that this
strategy evicts pods from `underutilized nodes` (those with usage below `thresholds`)
so that they can be recreated in appropriately utilized nodes.
The strategy will abort if the number of `underutilized nodes` or the number of `appropriately utilized nodes` is zero.

**Parameters:**

|Name|Type|
|---|---|
|`thresholds`|map(string:int)|
|`numberOfNodes`|int|
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"HighNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"cpu" : 20
"memory": 20
"pods": 20
```

Policy should pass the following validation checks:
* Three basic native types of resources are supported: `cpu`, `memory` and `pods`. If any of these resource types is not specified, all its thresholds default to 100%.
* Extended resources are supported. For example, resource type `nvidia.com/gpu` is specified for GPU node utilization. Extended resources are optional, and will not be used to compute node's usage if it's not specified in `thresholds` explicitly.
* `thresholds` can not be nil.
* The valid range of the resource's percentage value is \[0, 100\]

There is another parameter associated with the `HighNodeUtilization` strategy, called `numberOfNodes`.
This parameter can be configured to activate the strategy only when the number of under utilized nodes
is above the configured value. This could be helpful in large clusters where a few nodes could go
under utilized frequently or for a short period of time. By default, `numberOfNodes` is set to zero.

### RemovePodsViolatingInterPodAntiAffinity

This strategy makes sure that pods violating interpod anti-affinity are removed from nodes. For example,
Expand All @@ -253,6 +338,7 @@ node.
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`namespaces`|(see [namespace filtering](#namespace-filtering))|
|`labelSelector`|(see [label filtering](#label-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**

@@ -291,6 +377,7 @@ podA gets evicted from nodeA.
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`namespaces`|(see [namespace filtering](#namespace-filtering))|
|`labelSelector`|(see [label filtering](#label-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**

@@ -320,6 +407,7 @@ and will be evicted.
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`namespaces`|(see [namespace filtering](#namespace-filtering))|
|`labelSelector`|(see [label filtering](#label-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**

@@ -340,6 +428,8 @@ This strategy requires k8s version 1.18 at a minimum.
By default, this strategy only deals with hard constraints, setting parameter `includeSoftConstraints` to `true` will
include soft constraints.

Strategy parameter `labelSelector` is not utilized when balancing topology domains and is only applied during eviction to determine if the pod can be evicted.

**Parameters:**

|Name|Type|
@@ -348,6 +438,8 @@ include soft constraints.
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`namespaces`|(see [namespace filtering](#namespace-filtering))|
|`labelSelector`|(see [label filtering](#label-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**
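(The example block itself is collapsed in this diff view; a minimal configuration consistent with the parameter table above might look like the sketch below, where the `app: my-app` selector is hypothetical.)

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true
    params:
      includeSoftConstraints: true   # also rebalance "ScheduleAnyway" constraints
      labelSelector:                 # applied only at eviction time (see note above)
        matchLabels:
          app: my-app
      nodeFit: true
```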

@@ -379,6 +471,7 @@ which determines whether init container restarts should be factored into that calculation.
|`thresholdPriority`|int (see [priority filtering](#priority-filtering))|
|`thresholdPriorityClassName`|string (see [priority filtering](#priority-filtering))|
|`namespaces`|(see [namespace filtering](#namespace-filtering))|
|`nodeFit`|bool (see [node fit filtering](#node-fit-filtering))|

**Example:**

@@ -529,6 +622,7 @@ to filter pods by their labels:
* `RemovePodsViolatingNodeTaints`
* `RemovePodsViolatingNodeAffinity`
* `RemovePodsViolatingInterPodAntiAffinity`
* `RemovePodsViolatingTopologySpreadConstraint`

This allows running strategies among pods the descheduler is interested in.

@@ -551,6 +645,53 @@ strategies:
- {key: environment, operator: NotIn, values: [dev]}
```


### Node Fit filtering

The following strategies accept a `nodeFit` boolean parameter which can optimize descheduling:
* `RemoveDuplicates`
* `LowNodeUtilization`
* `HighNodeUtilization`
* `RemovePodsViolatingInterPodAntiAffinity`
* `RemovePodsViolatingNodeAffinity`
* `RemovePodsViolatingNodeTaints`
* `RemovePodsViolatingTopologySpreadConstraint`
* `RemovePodsHavingTooManyRestarts`

If set to `true`, the descheduler will consider whether or not the pods that meet eviction criteria will fit on other nodes before evicting them. If a pod cannot be rescheduled to another node, it will not be evicted. Currently, the following criteria are considered when setting `nodeFit` to `true`:
- A `nodeSelector` on the pod
- Any `Tolerations` on the pod and any `Taints` on the other nodes
- `nodeAffinity` on the pod
- Whether any of the other nodes are marked as `unschedulable`

E.g.

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
"LowNodeUtilization":
enabled: true
params:
nodeResourceUtilizationThresholds:
thresholds:
"cpu" : 20
"memory": 20
"pods": 20
targetThresholds:
"cpu" : 50
"memory": 50
"pods": 50
nodeFit: true
```

Note that node fit filtering references the current pod spec, and not that of its owner.
Thus, if the pod is owned by a ReplicationController (and that ReplicationController was modified recently),
the pod may be running with an outdated spec, which the descheduler will reference when determining node fit.
This is expected behavior as the descheduler is a "best-effort" mechanism.

Using Deployments instead of ReplicationControllers provides an automated rollout of pod spec changes, therefore ensuring that the descheduler has an up-to-date view of the cluster state.

## Pod Evictions

When the descheduler decides to evict pods from a node, it employs the following general mechanism:
4 changes: 2 additions & 2 deletions charts/descheduler/Chart.yaml
@@ -1,7 +1,7 @@
apiVersion: v1
name: descheduler
version: 0.20.0
appVersion: 0.20.0
version: 0.21.0
appVersion: 0.21.0
description: Descheduler for Kubernetes is used to rebalance clusters by evicting pods that can potentially be scheduled on better nodes. In the current implementation, descheduler does not schedule replacement of evicted pods but relies on the default scheduler for that.
keywords:
- kubernetes
1 change: 1 addition & 0 deletions charts/descheduler/README.md
@@ -50,6 +50,7 @@ The following table lists the configurable parameters of the _descheduler_ chart
| `image.pullPolicy` | Docker image pull policy | `IfNotPresent` |
| `nameOverride` | String to partially override `descheduler.fullname` template (will prepend the release name) | `""` |
| `fullnameOverride` | String to fully override `descheduler.fullname` template | `""` |
| `cronJobApiVersion` | CronJob API Group Version | `"batch/v1"` |
| `schedule` | The cron schedule to run the _descheduler_ job on | `"*/2 * * * *"` |
| `startingDeadlineSeconds` | If set, configure `startingDeadlineSeconds` for the _descheduler_ job | `nil` |
| `successfulJobsHistoryLimit` | If set, configure `successfulJobsHistoryLimit` for the _descheduler_ job | `nil` |
2 changes: 1 addition & 1 deletion charts/descheduler/templates/cronjob.yaml
@@ -1,4 +1,4 @@
apiVersion: batch/v1beta1
apiVersion: {{ .Values.cronJobApiVersion | default "batch/v1" }}
kind: CronJob
metadata:
name: {{ template "descheduler.fullname" . }}
29 changes: 29 additions & 0 deletions charts/descheduler/templates/tests/test-descheduler-pod.yaml
@@ -0,0 +1,29 @@
apiVersion: v1
kind: Pod
metadata:
name: descheduler-test-pod
annotations:
"helm.sh/hook": test
spec:
restartPolicy: Never
serviceAccountName: descheduler-ci
containers:
- name: descheduler-test-container
image: alpine:latest
imagePullPolicy: IfNotPresent
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- All
privileged: false
runAsNonRoot: false
command: ["/bin/ash"]
args:
- -c
- >-
apk --no-cache add curl &&
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl &&
chmod +x ./kubectl &&
mv ./kubectl /usr/local/bin/kubectl &&
/usr/local/bin/kubectl get pods --namespace kube-system --token "$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" | grep "descheduler" | grep "Completed"
1 change: 1 addition & 0 deletions charts/descheduler/values.yaml
@@ -19,6 +19,7 @@ resources:
nameOverride: ""
fullnameOverride: ""

cronJobApiVersion: "batch/v1" # Use "batch/v1beta1" for k8s version < 1.21.0. TODO(@7i) remove with 1.23 release
schedule: "*/2 * * * *"
#startingDeadlineSeconds: 200
#successfulJobsHistoryLimit: 1
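Given the new `cronJobApiVersion` value above, a consumer on a cluster older than Kubernetes 1.21 could keep the previous API group with a small values override (a sketch, not part of this diff), passed via `helm install -f custom-values.yaml`:

```yaml
# custom-values.yaml (hypothetical override file)
cronJobApiVersion: "batch/v1beta1"   # clusters without the stable batch/v1 CronJob API
```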
13 changes: 13 additions & 0 deletions docs/contributor-guide.md
@@ -39,5 +39,18 @@ make test-unit
make test-e2e
```

## Run Helm Tests
Run the helm tests for a particular descheduler release by setting the variables below:
```
HELM_IMAGE_REPO="descheduler"
HELM_IMAGE_TAG="helm-test"
HELM_CHART_LOCATION="./charts/descheduler"
```
The helm tests run as part of the descheduler CI. To run them manually from the descheduler root:

```
make test-helm
```

### Miscellaneous
See the [hack directory](https://github.com/kubernetes-sigs/descheduler/tree/master/hack) for additional tools and scripts used for developing the descheduler.