Design proposal for adding priority to Kubernetes API #604
Merged (2 commits), Jul 13, 2017

243 additions, 0 deletions: contributors/design-proposals/pod-priority-api.md
# Priority in Kubernetes API

@bsalamat

May 2017
* [Objective](#objective)
* [Non-Goals](#non-goals)
* [Background](#background)
* [Overview](#overview)
* [Detailed Design](#detailed-design)
* [Effect of priority on scheduling](#effect-of-priority-on-scheduling)
* [Effect of priority on preemption](#effect-of-priority-on-preemption)
* [Priority in PodSpec](#priority-in-podspec)
* [Priority Classes](#priority-classes)
* [Resolving priority class names](#resolving-priority-class-names)
* [Ordering of priorities](#ordering-of-priorities)
* [System Priority Class Names](#system-priority-class-names)
* [Modifying Priority Classes](#modifying-priority-classes)
* [Drawbacks of changing priority classes](#drawbacks-of-changing-priority-classes)
* [Priority and QoS classes](#priority-and-qos-classes)


## Objective

This document covers:

* How to specify priority for workloads in the Kubernetes API.
* How the ordering of these priorities is defined.
* How new priority levels are added.
* The effect of priority on scheduling and preemption.

### Non-Goals

This document does not cover:

* How preemption works in Kubernetes.
* How quota allocation and accounting work for each priority.

## Background

It is fairly common for clusters to have more tasks than their resources can
handle. Often the workload is a mix of high-priority critical tasks and
non-urgent tasks that can wait. Cluster management should be able to
distinguish these workloads in order to decide which should acquire resources
sooner and which can wait. The priority of a workload is one of the key
signals that provides this information to the cluster. This document is a more
detailed design proposal for part of the high-level architecture described in
[Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA).

## Overview
> **Member:** Can we add a use case section? I am interested in the following use case not discussed:
>
> 1. As a user of the cluster, I would like to discover the set of priorities supported by my cluster, with descriptions about their intended usage.
>
> **Author (@bsalamat):** A ConfigMap with a documented name provides the list of PriorityNames and their corresponding values. The intended usage of each priority (other than system) is cluster-specific. We can add a description for each PriorityName in the ConfigMap to provide intended usage.
>
> **Member:** This is inconsistent with how we are building similar features in other parts of Kubernetes; StorageClasses and ResourceClasses (as currently proposed) come to mind. Both of these have defined API specs that are validated at creation time and that a UI can reliably build on top of to present the user with the available options. As someone who builds a UI on top of k8s, I prefer validated resources to throwing things into ConfigMaps. If this were just a name -> int mapping, maybe a ConfigMap. But it is going to be a unique name, possibly a display name, a longer description, and a priority value; so already you have every ConfigMap key mapping to a JSON blob.
>
> If you put this kind of information into a ConfigMap, you put a lot of extra burden on the cluster administrator to make sure the data blobs in the ConfigMap are valid. You will also have to make sure that particular named ConfigMap, in whatever namespace you add it to, has View permissions for every authenticated user in the cluster. With something like a PriorityClass object, it can be properly defined as a cluster-scoped resource, and the View role can be applied to all PriorityClasses for all users by default. That also gains you the option in the future to make certain PriorityClasses invisible to certain users.
>
> **Author (@bsalamat):** I am fine with using a new object called PriorityClass to define the mapping from PriorityNames to their integer values. I agree that it will make things easier. If everyone is OK with this idea, we can create the PriorityClass object in the policy API group as an alpha object.

This design doc introduces the concept of priorities for pods in Kubernetes
and how priority impacts scheduling and preemption of pods when the cluster
runs out of resources. A pod can specify a priority at creation time. The
priority must be one of the valid values and there is a total order on the
values. The priority of a pod is independent of its workload type. The
priority is global and not specific to a particular namespace.

> **Member:** Do we have admission to check it?
>
> **Author (@bsalamat):** Yes, described a little farther below.

## Detailed Design

### Effect of priority on scheduling

One could generally expect that a pod with higher priority has a higher chance
of getting scheduled than the same pod with lower priority. However, there are
many other parameters that affect scheduling decisions, so a high-priority pod
may or may not be scheduled before lower-priority pods. The details of what
determines the order in which pods are scheduled are beyond the scope of this
document.

### Effect of priority on preemption

Generally, lower-priority pods are more likely to get preempted by
higher-priority pods when the cluster has reached a threshold. In such a case,
the scheduler may decide to preempt lower-priority pods to release enough
resources for higher-priority pending pods. As mentioned before, there are
many other parameters that affect scheduling decisions, such as affinity and
anti-affinity. If the scheduler determines that a high-priority pod cannot be
scheduled even if lower-priority pods are preempted, it will not preempt
lower-priority pods. The scheduler may have other restrictions on preempting
pods; for example, it may refuse to preempt a pod if its PodDisruptionBudget
would be violated. The details of scheduling and preemption decisions are
beyond the scope of this document.

> **Contributor:** How and where to define the threshold? BTW: I think it should be when a *node* reaches a threshold, not the cluster?
>
> **Author (@bsalamat):** I am not sure the threshold is relevant here. I guess you are referring to node out-of-resource preemption, which is not the focus of this doc. Please add more context if you are referring to priority-based preemption.
>
> **Contributor:** @bsalamat In my understanding, even with priority-based preemption, this can only happen when there are not enough resources on a specific node. The problem is: with priority-based preemption, when do we need to trigger the preemption?
>
> **Author (@bsalamat):** This design does not make any assumption about what percentage of a node's resources are usable. So, it should work identically whether 100% or a much smaller percentage of a node's resources are usable.
>
> **Contributor (@gyliu513, May 16, 2017):** I see. So, to be more accurate, how about updating here with s/when cluster has reached a threshold/when a specified node has reached a threshold?
>
> **Author (@bsalamat):** Actually, I think the current wording is more accurate, as the available resources of the cluster are much more significant when making scheduling and preemption decisions than an individual node's resources. That said, this document is not meant to be a reference for how preemption will work. I am going to write a separate document on preemption, which will hopefully clarify the details of the preemption logic better.
>
> **Contributor:** sgtm, thanks @bsalamat

### Priority in PodSpec

Pods may have a priority in their pod spec. PodSpec will have two new fields:
"PriorityClassName", which is specified by the user, and "Priority", which is
populated by Kubernetes. The user-specified priority (PriorityClassName) is a
string, and all of the valid priority classes are defined by a system-wide
mapping that maps each string to an integer. The PriorityClassName specified
in a pod spec must be found in this map, or the pod creation request will be
rejected. If PriorityClassName is empty, it resolves to the default priority
(see below for more on name resolution). Once the PriorityClassName is
resolved to an integer, it is placed in the "Priority" field of PodSpec.


```
type PodSpec struct {
	...
	PriorityClassName string
	// Populated by the admission controller. Users are not allowed to set it directly.
	Priority *int32
}
```

### Priority Classes

The cluster may have many user-defined priority classes for various use cases.
The following list is an example of how the priorities and their values might
look. Kubernetes will also have special priority class names reserved for
critical system pods. Please see [System Priority Class Names](#system-priority-class-names)
for more information. Any priority value above 1 billion is reserved for
system use. Aside from those system priority classes, Kubernetes does not ship
with predefined priority classes usable by user pods. The main goal of having
no built-in priority classes for user pods is to avoid creating de facto
standard names which may be hard to change in the future.

```
system 2147483647 (int_max)
tier1  4000
tier2  2000
tier3  1000
```

> **Contributor (@vishh):** Given the detailed list below, why define just 4 default priorities? If defining the default set is an issue, why not define just two, default and system? Any pod without a priority field maps to the default priority.
>
> **Author (@bsalamat):** Both the above and below lists are examples. That said, in the list below some items can have equal priority. Moreover, not all clusters have all the types of workloads given below, and some clusters may have many more types of workloads with different priorities. So, I think we should have a small number of default priorities, like the two that you mentioned, and let customers add more when they need more.
>
> **Member:** @vishh also the doc specifies that it is configurable.
>
> **Contributor (@vishh):** By default I meant an empty PriorityName field. There is a clear need for defining priority for system pods. Can we avoid defining standard priorities for all other use cases?
>
> **Author (@bsalamat):** I think we will ship with only system as a built-in value and the rest will be user-configurable. If PriorityName is empty, I'd propose resolving the priority to zero.
The following shows a list of example workloads in a Kubernetes cluster, in
decreasing order of priority:

* Kubernetes system daemons (per-node like fluentd, and cluster-level like
Heapster)
* Critical user infrastructure (e.g. storage servers, monitoring systems like
Prometheus, etc.)
* Components that are in the user-facing request serving path and must be able
to scale up arbitrarily in response to load spikes (web servers, middleware,
etc.)
* Important interruptible workloads that need a strong guarantee of
schedulability and of not being interrupted
* Less important interruptible workloads that need a less strong guarantee of
schedulability and of not being interrupted
* Best effort / opportunistic

> **Member:** Another factor is the amount of disruption budget. Do we expect users to create more priority levels to differentiate between highly replicated (lower priority) and SPOF (higher priority) pods, or do we expect to take that information into account in a more first-class way?
>
> **Author (@bsalamat):** The expectation is that clusters will have enough priority levels to differentiate between replicated high-priority pods and SPOF high-priority pods. @erictune suggested that we should ship Kubernetes with a few predefined (but changeable) priority classes for these scenarios. The predefined classes give users an example for creating more priority classes if needed. They also make the cluster usable in most of the well-known scenarios without needing much more configuration.

> **Contributor:** Based on the above, it seems that a bigger priority value means a higher priority, as you are defining system as 2147483647. But in Linux, priority is usually identified with smaller values, such as the nice value:
>
> ```
> root@k8s001:~# nice --help
> Usage: nice [OPTION] [COMMAND [ARG]...]
> Run COMMAND with an adjusted niceness, which affects process scheduling.
> With no COMMAND, print the current niceness.  Niceness values range from
> -20 (most favorable to the process) to 19 (least favorable to the process).
> ```
>
> So shall we follow a similar way to nice, setting the value from -xxx to xxx as a range, where -xxx has the highest priority and xxx the lowest? If this makes sense, then how about defining the default priority value as 0, which would make the default a middle-level priority?
>
> **Author (@bsalamat):** Why should we follow nice? I think it is more intuitive when a higher value means higher priority.
>
> **Contributor:** Some system administrators may have been using nice for a long time, so it would be better if the Kubernetes design also followed an existing policy. I also searched quite a lot and found that many priority designs use a lower value as higher priority, such as http://research.cs.wisc.edu/htcondor/manual/v8.6/2_7Priorities_Preemption.html:
>
> > Machines are allocated to users based upon a user's priority. **A lower numerical value for user priority means higher priority**, so a user with priority 5 will get more resources than a user with priority 50. User priorities in HTCondor can be examined with the condor_userprio command. HTCondor administrators can set and change individual user priorities with the same utility.
>
> **Author (@bsalamat):** This is from the link that you sent:
>
> > A job priority can be any integer, and larger values are of higher priority. So, 0 is a higher job priority than -3, and 6 is a higher job priority than 5.
>
> Based on this design proposal, defining priority classes is a one-time operation, and after that users (and admins) will work mostly with the priority class names, not with the actual integer value of the priority. So, hopefully they shouldn't worry much about the actual integer value. Besides, if we wanted to use lower values as higher priority, we would have had to choose a different name for the field instead of priority, maybe something like nice. When "niceness" is lower, it makes sense for the process not to yield to other processes, but it does not make sense for a lower value of a field called priority to mean higher priority.
>
> **Contributor:** There is also some discussion about this: https://softwareengineering.stackexchange.com/questions/77365/priority-value-meaning, perhaps you can take a look at this as a reference ;-)
>
> **Contributor:** I was also thinking about how to define the default priority when the PriorityName is empty. In this design it will be zero, which means lowest priority; why have such pods at the lowest priority? How about giving them a middle-level priority?
>
> **Author (@bsalamat):** Based on recent versions of the doc, one of the priority classes can be marked as "default", and that priority class will be used when a pod spec does not have any priority class name. If there is no default, an empty priority class name is resolved to zero, which is not the lowest priority. Zero is in the middle of the range, as we allow negative values.
>
> **Contributor:** Thanks @bsalamat, we are close now ;-) "Zero is in the middle of the range as we allow negative values": I did not see this in the document, am I missing anything? Also, will we keep the logic that a larger value has higher priority, or use a smaller value as higher priority?
>
> **Author (@bsalamat):** It is not mentioned explicitly, but priority is an int, not an unsigned int. I have not placed any restriction on the value in my PR to add the field. At the moment, I believe a higher value should be higher priority, unless more people want to change it.
>
> **Contributor:** Good to know, thanks @bsalamat
### Resolving priority class names

User requests sent to Kubernetes may have `PriorityClassName` in their PodSpec.
The admission controller resolves a PriorityClassName to its corresponding
number and populates the "Priority" field of the pod spec. The rest of the
Kubernetes components look at the "Priority" field of the pod spec and work
with the integer value. In other words, `PriorityClassName` will be ignored by
the rest of the system.

We are going to add a new API object called PriorityClass. The priority class
defines the mapping between the priority name and its value. It can have an
optional description, which is an arbitrary string provided only as a
guideline for users.

A priority class can be marked as "Global Default" by setting its
`GlobalDefault` field to true. If a pod does not specify any
`PriorityClassName`, the system resolves it to the value of the global default
priority class, if one exists. If there is no global default, the pod's
priority will be resolved to zero. The priority admission controller ensures
that there is only one global default priority class.
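The resolution rules above can be sketched as follows. This is a minimal illustration under simplified assumptions, not the actual admission controller; the `PriorityClass` type and `resolvePriority` function here are stand-ins:

```go
package main

import "fmt"

// Simplified stand-in for the proposed PriorityClass API object.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// resolvePriority mimics the admission controller's name resolution:
// a named class resolves to its value; an empty name resolves to the
// global default class if one exists, otherwise to zero; an unknown
// name causes the pod creation request to be rejected.
func resolvePriority(name string, classes []PriorityClass) (int32, error) {
	if name == "" {
		for _, c := range classes {
			if c.GlobalDefault {
				return c.Value, nil
			}
		}
		return 0, nil
	}
	for _, c := range classes {
		if c.Name == name {
			return c.Value, nil
		}
	}
	return 0, fmt.Errorf("no PriorityClass named %q", name)
}

func main() {
	classes := []PriorityClass{
		{Name: "tier1", Value: 4000},
		{Name: "tier3", Value: 1000, GlobalDefault: true},
	}
	p, _ := resolvePriority("tier1", classes)
	fmt.Println(p) // 4000
	p, _ = resolvePriority("", classes)
	fmt.Println(p) // falls back to the global default: 1000
}
```

Note that the zero fallback only applies when no class is marked `GlobalDefault`, matching the paragraph above.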

```
type PriorityClass struct {
	metav1.TypeMeta
	// +optional
	metav1.ObjectMeta

	// The value of this priority class. This is the actual priority that pods
	// receive when they have the above name in their pod spec.
	Value int32
	GlobalDefault bool
	Description string
}
```

> **Member:** Just to clarify, it's possible that both "foo" and "bar" map to priority value "1", correct? I assume we will let users update Value after the fact?
>
> **Author (@bsalamat):** Multiple priority classes can have the same value, but we do not allow changing the value. We may allow a user to delete a priority class if s/he has permission and no pods in the system have that priority class name.
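Creation-time validation of a PriorityClass, as implied by this proposal (at most one global default, and user-created values kept out of the reserved system range), might look like the following sketch. The function name and the exact threshold constant are illustrative assumptions, not the real implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// Values above this are reserved for critical system pods; see
// "System Priority Class Names". The exact constant is an assumption.
const systemPriorityThreshold = 1000000000

type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// validateNewClass checks a PriorityClass being created against the
// existing set, per the rules in this proposal.
func validateNewClass(newClass PriorityClass, existing []PriorityClass) error {
	if newClass.Value > systemPriorityThreshold {
		return fmt.Errorf("value %d is reserved for system use", newClass.Value)
	}
	if newClass.GlobalDefault {
		for _, c := range existing {
			if c.GlobalDefault {
				return errors.New("a global default priority class already exists")
			}
		}
	}
	return nil
}

func main() {
	existing := []PriorityClass{{Name: "tier3", Value: 1000, GlobalDefault: true}}
	fmt.Println(validateNewClass(PriorityClass{Name: "tier1", Value: 4000}, existing)) // <nil>
	// A second global default is rejected with a non-nil error.
	fmt.Println(validateNewClass(PriorityClass{Name: "d2", Value: 5, GlobalDefault: true}, existing))
}
```

Because names and values are immutable after creation, this check runs once per class; updates to `GlobalDefault` would rerun only the default-uniqueness part.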

### Ordering of priorities

As mentioned earlier, a PriorityClassName is resolved by the admission
controller to its integer value, and Kubernetes components use that integer
value. The higher the value, the higher the priority.

### System Priority Class Names
There will be special priority class names reserved for system use only. These
classes have a value larger than one billion. The priority admission
controller ensures that new priority classes will not be created with those
names. They are used for critical system pods that must not be preempted. We
set default policies that deny creation of pods with PriorityClassNames
corresponding to these priorities. Cluster admins can authorize users or
service accounts to create pods with these priorities. When non-authorized
users set PriorityClassName to one of these priority classes in their pod
spec, their pod creation request will be rejected. For pods created by
controllers, the service account must be authorized by cluster admins.
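The reserved-name check described above might look like the following sketch. The document does not enumerate the actual system class names, so the prefix used here is purely a hypothetical placeholder:

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical reserved prefix; the proposal reserves specific system
// class names but does not enumerate them in this document.
const systemClassPrefix = "system-"

// rejectReservedName models the priority admission controller rule that
// user-created PriorityClasses may not use reserved system names.
func rejectReservedName(name string) error {
	if strings.HasPrefix(name, systemClassPrefix) {
		return fmt.Errorf("priority class name %q is reserved for system use", name)
	}
	return nil
}

func main() {
	fmt.Println(rejectReservedName("tier1")) // <nil>
	// Reserved names are rejected with a non-nil error.
	fmt.Println(rejectReservedName("system-critical"))
}
```

Note this only blocks *creating classes* with reserved names; blocking unauthorized *pods* from using them is an authorization policy, handled separately as the text describes.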

### Modifying priority classes

Priority classes can be added or removed, but their name and value cannot be
updated. We allow updating `GlobalDefault` and `Description`, as long as at
most one class remains the global default. While Kubernetes can work fine if
priority classes are changed at run-time, the change can be confusing to
users: pods with a given priority class that were created before the change
will have a different priority value than those created after it. Deletion of
priority classes is allowed, despite the fact that there may be existing pods
that specify such priority class names in their pod spec. In other words,
there will be no referential integrity for priority classes. This is another
reason that all system components should only work with the integer value of
the priority and not with the `PriorityClassName`.

One could delete an existing priority class and create another one with the same
name and a different value. By doing so, they can achieve the same effect as
updating a priority class, but we still do not allow updating priority classes
to prevent accidental changes.

Newly added priority classes cannot have a value higher than what is reserved
for "system". The reason for this restriction
is that Kubernetes critical system pods will have one of the "system" priorities
and no pod should be able to preempt them.

#### Drawbacks of changing priority classes

While Kubernetes effectively allows changing priority classes (by deleting and
adding them with a different value), it should be done only when
absolutely needed. Changing priority classes has the following disadvantages:


* May remove config portability: pod specs written for one cluster are no
longer guaranteed to work on a different cluster if the same priority classes
do not exist in the second cluster.
* If quota is specified for existing priority classes (at the time of this writing,
we don't have this feature in Kubernetes), adding or deleting priority classes
will require reconfiguration of quota allocations.
* An existing pod may have an integer value of priority that does not reflect
the current value of its PriorityClass.

### Priority and QoS classes

Kubernetes has [three QoS
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from the requests and limits of pods. Priority is introduced
as an independent concept, meaning that any QoS class may have any valid
priority. When a node is out of resources and pods need to be preempted, we
give priority a higher weight than QoS class. In other words, we preempt the
lowest-priority pod and break ties with some other metrics, such as QoS class,
usage above request, etc. This is not finalized yet. We will discuss and
finalize preemption in a separate doc.

> **Member (@derekwaynecarr):** I continue to believe it is an anti-pattern to allow a best-effort pod to have higher priority than pods in other QoS classes. From a scheduling perspective, if a node is reporting memory pressure and no longer accepts BestEffort pods, would the scheduler really preempt pods in the other QoS classes to make room? There is no guarantee room will ever be made available, so what is the point? Best-effort pods may not have access to any node resources once scheduled, given how QoS is implemented, and will effectively just be starved anyway.
>
> **Author (@bsalamat):** All pod priorities will be zero when pods don't specify any. Once they do, why should we make it harder for users to understand the meaning of priority by adding QoS to the mix? IMO, if a user wants their BestEffort pods to have very high priority, they should be able to do so. The concern that BestEffort pods may be scheduled and starved exists even today. A best-effort pod may land on a machine with no free resources and may be starved. This is acceptable behavior of the system when dealing with BestEffort pods.
>
> **Member:** The concern is that if a guaranteed pod can be evicted because of a high-priority best-effort pod, the guaranteed QoS doesn't mean much.
>
> **Member:** Also, it would be unclear how many guaranteed pods should be preempted in order to free sufficient resources for a best-effort pod that does not declare a resource request.
>
> **Member:** Hm, Guaranteed, Burstable, BestEffort is also a kind of priority; the question is how we handle them together. One option is to make priority take effect only within the same QoS class.
>
> **Author (@bsalamat):** We do not want to limit the range of usable priorities based on QoS class. So, if a user chooses to use a higher priority for their best-effort pods, they can do so, but you are right that the best-effort pod may be scheduled on a node with no resources (as could happen today). And yes, priority will be considered by the kubelet for out-of-resource evictions.
>
> **Member:** The concern is: the scheduler and kubelet will take different actions based on priority & QoS class; I agree to discuss this in the preemption doc in the future :).
>
> **Member:** "Guaranteed" is guaranteed resource/performance QoS while scheduled on the node, not with respect to durability.
>
> **Member:** How about a mechanism to say that a priority can only be used with an allowed set of QoS classes? I want a way to prevent what I find is an abuse vector (a BE pod using high priorities), as I think it will be a future support case when it causes an unforeseen consequence.
>
> **Contributor:** @derekwaynecarr What unforeseen consequences are you thinking of? If we have a mechanism to control the set of priorities a namespace can use, and if we can restrict the QoS classes a namespace can use (using LimitRanges), why would we need an additional priority -> QoS link?
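The ordering stated above (priority first, QoS class only as a tie-breaker) can be sketched as a victim comparator. This illustrates the stated rule only; the scheduler's actual preemption logic is explicitly deferred to a separate doc, and the QoS ranking here is an assumption:

```go
package main

import (
	"fmt"
	"sort"
)

// QoS ranks used only for tie-breaking; lower rank = preempt first.
// (BestEffort is assumed the most preemptible within a priority level.)
const (
	bestEffort = iota
	burstable
	guaranteed
)

type pod struct {
	name     string
	priority int32
	qos      int
}

// lessPreemptFirst orders pods so that preemption victims come first:
// lowest priority first, with ties broken by QoS class.
func lessPreemptFirst(a, b pod) bool {
	if a.priority != b.priority {
		return a.priority < b.priority
	}
	return a.qos < b.qos
}

func main() {
	pods := []pod{
		{"web", 4000, guaranteed},
		{"batch", 0, burstable},
		{"scratch", 0, bestEffort},
	}
	sort.Slice(pods, func(i, j int) bool { return lessPreemptFirst(pods[i], pods[j]) })
	fmt.Println(pods[0].name) // scratch: lowest priority, weakest QoS
}
```

The key property is that QoS never outranks priority: a guaranteed pod at priority 0 is still a victim before a best-effort pod at priority 4000.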