Modify Eviction Strategy to take Priority into account #946

Merged 2 commits on Aug 26, 2017
37 changes: 23 additions & 14 deletions contributors/design-proposals/kubelet-eviction.md
@@ -241,6 +241,20 @@ the `kubelet` will select a subsequent pod.

## Eviction Strategy

+The `kubelet` will implement an eviction strategy oriented around
+[Priority](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-priority-api.md)
+and pod usage relative to requests. It will target pods that have the lowest
+Priority and are the largest consumers of the starved resource relative to
+their scheduling request.
+
+It will target pods whose usage of the starved resource exceeds their requests.
+Of those pods, it will rank by a function of priority and usage - requests.
+If system daemons are exceeding their allocation (see [Strategy Caveat](#strategy-caveat) below),
Review comment (Member):

Besides system daemons, there are critical system pods whose eviction may break the functionality of the node or even the whole cluster. In the future (the next couple of months) we will set a priority class for critical system pods. There are two priority classes for system pods: system-cluster-critical and system-node-critical. The former should be set for critical system pods that must be present in every cluster but do not need to run on every node. The latter should be set for critical system pods that must be present on every node, e.g., kube-proxy. Pods with the system-node-critical priority class should be treated like system daemons and, as far as possible, should not be evicted.

+and all pods are using less than their requests, then it will evict a pod
+whose usage is less than its requests, based on the same function of priority
+and usage - requests.
+
+Prior to v1.8:
The `kubelet` will implement a default eviction strategy oriented around
the pod quality of service class.
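
As an illustration of the ranking proposed in this hunk, the following Go sketch orders pods for eviction. It is not the kubelet's code: the `Pod` type, its fields, and `rankForEviction` are hypothetical names, and the proposal only says the rank is "a function of priority, and usage - requests" without fixing the exact function. The sketch assumes one plausible ordering: pods over their request first, then lowest priority, then largest `usage - requests`.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, simplified view of a pod for one starved resource.
type Pod struct {
	Name     string
	Priority int32 // higher value means more important
	Usage    int64 // observed usage of the starved resource
	Request  int64 // scheduling request for that resource
}

// rankForEviction returns a copy of pods ordered from first to last
// eviction candidate. The ordering is an assumption, not the proposal's
// definition: over-request pods first, then lower priority, then larger
// usage minus request.
func rankForEviction(pods []Pod) []Pod {
	ranked := append([]Pod(nil), pods...)
	sort.SliceStable(ranked, func(i, j int) bool {
		a, b := ranked[i], ranked[j]
		aOver, bOver := a.Usage > a.Request, b.Usage > b.Request
		if aOver != bOver {
			return aOver // pods exceeding their request rank first
		}
		if a.Priority != b.Priority {
			return a.Priority < b.Priority // lower priority ranks first
		}
		return a.Usage-a.Request > b.Usage-b.Request // larger excess ranks first
	})
	return ranked
}

func main() {
	pods := []Pod{
		{Name: "web", Priority: 100, Usage: 900, Request: 500},
		{Name: "batch", Priority: 0, Usage: 600, Request: 500},
		{Name: "cache", Priority: 0, Usage: 800, Request: 500},
		{Name: "db", Priority: 0, Usage: 300, Request: 1000},
	}
	for _, p := range rankForEviction(pods) {
		fmt.Println(p.Name)
	}
}
```

With the sample data in `main`, the order is `cache`, `batch`, `web`, `db`: the two priority-0 pods over their request come first (largest excess first), the higher-priority pod over its request follows, and the pod under its request is last.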

@@ -258,14 +272,16 @@ starved resource.
relative to their request are killed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.

-A guaranteed pod is guaranteed to never be evicted because of another pod's
-resource consumption. That said, guarantees are only as good as the underlying
-foundation they are built upon. If a system daemon
+### Strategy Caveat
+
+A pod consuming less resources than its requests is guaranteed to never be
+evicted because of another pod's resource consumption. That said, guarantees
+are only as good as the underlying foundation they are built upon. If a system daemon
 (i.e. `kubelet`, `docker`, `journald`, etc.) is consuming more resources than
-were reserved via `system-reserved` or `kube-reserved` allocations, and the node
-only has guaranteed pod(s) remaining, then the node must choose to evict a
-guaranteed pod in order to preserve node stability, and to limit the impact
-of the unexpected consumption to other guaranteed pod(s).
+were reserved via `system-reserved` or `kube-reserved` allocations, then the node
+must choose to evict a pod, even if it is consuming less than its requests.
+It must take action in order to preserve node stability, and to limit the impact
+of the unexpected consumption to other well-behaved pod(s).
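
The caveat can be read as a rule for choosing which pods are eligible for eviction at all. The Go sketch below is one interpretation under stated assumptions, not the kubelet's implementation: `PodStats`, `evictionCandidates`, and the `daemonsOverReserved` flag are hypothetical, and how the node detects that system daemons exceed `system-reserved` or `kube-reserved` is left out.

```go
package main

import "fmt"

// PodStats is a hypothetical record of one pod's usage and request
// for the starved resource.
type PodStats struct {
	Name    string
	Usage   int64
	Request int64
}

// evictionCandidates returns the pods eligible for eviction.
// Assumed reading of the caveat: pods over their request are always
// eligible; pods under their request become eligible only when system
// daemons have exceeded their reservation and no pod is over its request.
func evictionCandidates(pods []PodStats, daemonsOverReserved bool) []PodStats {
	var over []PodStats
	for _, p := range pods {
		if p.Usage > p.Request {
			over = append(over, p)
		}
	}
	if len(over) > 0 {
		return over
	}
	if daemonsOverReserved {
		return pods // well-behaved pods lose their guarantee
	}
	return nil // guarantee holds: nothing to evict
}

func main() {
	wellBehaved := []PodStats{
		{Name: "a", Usage: 100, Request: 200},
		{Name: "b", Usage: 150, Request: 200},
	}
	fmt.Println(len(evictionCandidates(wellBehaved, false)))
	fmt.Println(len(evictionCandidates(wellBehaved, true)))
}
```

The eligible pods would then be ranked by priority and `usage - requests` as described under Eviction Strategy.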

## Disk based evictions

@@ -458,13 +474,6 @@ for eviction. Instead `DaemonSet` should ideally include Guaranteed pods only.
The pod eviction may evict more pods than needed due to stats collection timing gap. This can be mitigated by adding
the ability to get root container stats on an on-demand basis (https://github.com/google/cadvisor/issues/1247) in the future.

-### How kubelet ranks pods for eviction in response to inode exhaustion
-
-At this time, it is not possible to know how many inodes were consumed by a particular container. If the `kubelet` observes
-inode exhaustion, it will evict pods by ranking them by quality of service. The following issue has been opened in cadvisor
-to track per container inode consumption (https://github.com/google/cadvisor/issues/1422) which would allow us to rank pods
-by inode consumption. For example, this would let us identify a container that created large numbers of 0 byte files, and evict
-that pod over others.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/proposals/kubelet-eviction.md?pixel)]()