Design proposal for adding priority to Kubernetes API #604
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,243 @@ | ||
# Priority in Kubernetes API | ||
|
||
@bsalamat | ||
|
||
May 2017 | ||
* [Objective](#objective) | ||
* [Non-Goals](#non-goals) | ||
* [Background](#background) | ||
* [Overview](#overview) | ||
* [Detailed Design](#detailed-design) | ||
* [Effect of priority on scheduling](#effect-of-priority-on-scheduling) | ||
* [Effect of priority on preemption](#effect-of-priority-on-preemption) | ||
* [Priority in PodSpec](#priority-in-podspec) | ||
* [Priority Classes](#priority-classes) | ||
* [Resolving priority class names](#resolving-priority-class-names) | ||
* [Ordering of priorities](#ordering-of-priorities) | ||
* [System Priority Class Names](#system-priority-class-names) | ||
* [Modifying Priority Classes](#modifying-priority-classes) | ||
* [Drawbacks of changing priority classes](#drawbacks-of-changing-priority-classes) | ||
* [Priority and QoS classes](#priority-and-qos-classes) | ||
|
||
|
||
## Objective | ||
|
||
|
||
|
||
* How to specify priority for workloads in Kubernetes API. | ||
* Define how the order of these priorities is specified. | ||
* Define how new priority levels are added. | ||
* Effect of priority on scheduling and preemption. | ||
|
||
### Non-Goals | ||
|
||
|
||
|
||
* How preemption works in Kubernetes. | ||
* How quota allocation and accounting works for each priority. | ||
|
||
## Background | ||
|
||
It is fairly common for a cluster to have more tasks than its | ||
resources can handle. Often the workload is a mix of high priority | ||
critical tasks and non-urgent tasks that can wait. Cluster management should be | ||
able to distinguish these workloads in order to decide which ones should acquire | ||
the resources sooner and which ones can wait. Priority of the workload is one of | ||
the key signals that provide this information to the cluster. This document is a | ||
more detailed design proposal for part of the high-level architecture described | ||
in [Resource sharing architecture for batch and serving workloads in Kubernetes](https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA). | ||
|
||
## Overview | ||
|
||
This design doc introduces the concept of priorities for pods in Kubernetes and | ||
how the priority impacts scheduling and preemption of pods when the cluster | ||
runs out of resources. A pod can specify a priority at creation time. The | ||
Review comment: Do we have admission to check it?
Review comment: Yes, described a little farther below. |
||
priority must be one of the valid values and there is a total order on the | ||
values. The priority of a pod is independent of its workload type. The priority | ||
is global and not specific to a particular namespace. | ||
|
||
## Detailed Design | ||
|
||
### Effect of priority on scheduling | ||
|
||
One could generally expect a pod with higher priority to have a higher chance of | ||
getting scheduled than the same pod with lower priority. However, there are | ||
many other parameters that affect scheduling decisions. So, a high priority pod | ||
may or may not be scheduled before lower priority pods. The details of | ||
what determines the order at which pods are scheduled are beyond the scope of | ||
this document. | ||
|
||
### Effect of priority on preemption | ||
|
||
Generally, lower priority pods are more likely to get preempted by higher | ||
priority pods when cluster has reached a threshold. In such a case, scheduler | ||
Review comment: How and where to define the threshold? BTW: I think that here should be when a
Review comment: I am not sure if threshold is relevant here. I guess your are referring to node out of resource preemption which is not the focus of this doc. Please add more context if you are referring to priority-based preemption.
Review comment: @bsalamat In my understanding, even with priority-based preemption, I think that this can only happen when there are not enough resources on a specified node, the problem is that with priority-based preemption, when we need to trigger the preemption?
Review comment: This design does not make any assumption about what percentage of a node resources are usable. So, it should work identically whether 100% or a much smaller percentage of a node resources are usable.
Review comment: I see, so to be more accurate, how about update here a bit by s/when cluster has reached a threshold/when a specified node has reached a threshold ?
Review comment: Actually, I think the current wording is more accurate as the available resources of the cluster are much more significant when making scheduling and preemption decisions than an individual node resources. That said, this document is not meant to be a reference for how preemption will work. I am going to write a separate document on preemption. So, hopefully that document will clarify the details of preemption logic better.
Review comment: sgtm, thanks @bsalamat |
||
may decide to preempt lower priority pods to release enough resources for higher | ||
priority pending pods. As mentioned before, there are many other parameters | ||
that affect scheduling decisions, such as affinity and anti-affinity. If | ||
scheduler determines that a high priority pod cannot be scheduled even if lower | ||
priority pods are preempted, it will not preempt lower priority pods. Scheduler | ||
may have other restrictions on preempting pods, for example, it may refuse to | ||
preempt a pod if PodDisruptionBudget is violated. The details of scheduling and | ||
preemption decisions are beyond the scope of this document. | ||
|
||
### Priority in PodSpec | ||
|
||
Pods may have priority in their pod spec. PodSpec will have two new fields: | ||
"PriorityClassName", which is specified by the user, and "Priority", which will | ||
be populated by Kubernetes. The user-specified priority (PriorityClassName) is a | ||
string, and all of the valid priority classes are defined by a system-wide | ||
mapping that maps each string to an integer. The PriorityClassName specified in | ||
a pod spec must be found in this map or the pod creation request will be | ||
rejected. If PriorityClassName is empty, it will resolve to the default | ||
priority (see below for more info on name resolution). Once the | ||
PriorityClassName is resolved to an integer, it is placed in the "Priority" field of | ||
PodSpec. | ||
|
||
|
||
``` | ||
type PodSpec struct { | ||
... | ||
PriorityClassName string | ||
Priority *int32 // Populated by Admission Controller. Users are not allowed to set it directly. | ||
} | ||
``` | ||
|
||
### Priority Classes | ||
|
||
The cluster may have many user-defined priority classes for | ||
various use cases. The first list below is an example of what the priorities and | ||
their values may look like. | ||
Kubernetes will also have special priority class names reserved for critical system | ||
pods. Please see [System Priority Class Names](#system-priority-class-names) for | ||
more information. Any priority value above 1 billion is reserved for system use. | ||
Aside from those system priority classes, Kubernetes does not ship with predefined | ||
priority classes usable by user pods. The main goal of having no built-in | ||
priority classes for user pods is to avoid creating de facto standard names, which | ||
may be hard to change in the future. | ||
|
||
``` | ||
system 2147483647 (int_max) | ||
tier1 4000 | ||
tier2 2000 | ||
tier3 1000 | ||
Review comment: Given the detailed list below, why define just
Review comment: Both the above and below lists are examples. That said, in the list below some items can have equal priority. Moreover, not all clusters have all types of workloads given below and some clusters may have many more types of workloads with different priorities. So, I think we should have a small number of default priorities, like the two that you mentioned, and let customers add more when they need more.
Review comment: @vishh also the doc specifies that it is configurable.
Review comment: By
Review comment: I think we will ship with only |
||
``` | ||
|
||
The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority: | ||
Review comment: Another factor is amount of disruption budget. Do we expect users to create more priority levels to differentiate between highly replicated (lower priority) and SPOF (higher priority) pods, or do we expect to take that information into account in a more first-class way?
Review comment: The expectation is that clusters will have enough priority levels to differentiate between replicated high priority pods and SPOF high priority pods. @erictune suggested that we should ship Kubernetes with a few predefined (but changeable) priority classes for these scenarios. The predefined classes give users an example for creating more priority classes if needed. They also make the cluster usable in most of the well known scenarios without needing much more configuration. |
||
|
||
* Kubernetes system daemons (per-node like fluentd, and cluster-level like | ||
Review comment: Based on above, seems that a big priority value will have a high priority as you are defining
So shall we follow the similar way of If this make sense, then how about define the default value of priority still as 0, which means the default priority would be a middle level priority pod.
Review comment: Why should we follow
Review comment: Some system administrator may already using Also I searched quite a lot and found that many designing for priority are using lower value as high priority, such as http://research.cs.wisc.edu/htcondor/manual/v8.6/2_7Priorities_Preemption.html etc
Review comment: This is from the link that you sent:
Based on this design proposal, defining priority classes is a one-time operation and after that users (and admins) will work mostly with the priority class names and not with the actual integer value of the priority. So, hopefully they shouldn't worry much about the actual integer value.
Review comment: There are also some discussion about this: https://softwareengineering.stackexchange.com/questions/77365/priority-value-meaning , perhaps you can take a look at this as a reference ;-)
Review comment: I was also thinking how to define the
Review comment: Based on recent versions of the doc, one of the priority classes can be marked as "default" and that priority class will be used when a pod spec does not have any priority class name. If there is no default, an empty priority class name is resolved to zero, which is not the lowest priority. Zero is in the middle of the range as we allow negative values.
Review comment: Thanks @bsalamat , we are close now ;-)
Also we will keep the logic that large value have high priority? or using smaller value as high priority?
Review comment: It is not mentioned explicitly, but priority is an int, not an unsigned int. I have not placed any restriction on the value in my PR to add the field.
At the moment, I believe higher value should be higher priority, unless more people want to change it.
Review comment: Good to know, thanks @bsalamat |
||
Heapster) | ||
* Critical user infrastructure (e.g. storage servers, monitoring system like | ||
Prometheus, etc.) | ||
* Components that are in the user-facing request serving path and must be able | ||
to scale up arbitrarily in response to load spikes (web servers, middleware, | ||
etc.) | ||
* Important interruptible workloads that need a strong guarantee of | ||
schedulability and of not being interrupted | ||
* Less important interruptible workloads that need a less strong guarantee of | ||
schedulability and of not being interrupted | ||
* Best effort / opportunistic | ||
|
||
### Resolving priority class names | ||
|
||
User requests sent to Kubernetes may have `PriorityClassName` in their PodSpec. | ||
The admission controller resolves a PriorityClassName to its corresponding number | ||
and populates the "Priority" field of the pod spec. The rest of the Kubernetes | ||
components look at the "Priority" field of the pod spec and work with the integer | ||
value. In other words, `PriorityClassName` will be ignored by the rest of the | ||
system. | ||
|
||
We are going to add a new API object called PriorityClass. The priority class | ||
defines the mapping between the priority name and its value. A priority class can | ||
have an optional description, which is an arbitrary string provided | ||
only as a guideline for users. | ||
|
||
A priority class can be marked as "Global Default" by setting its | ||
`GlobalDefault` field to true. If a pod does not specify any `PriorityClassName`, | ||
the system resolves it to the value of the global default priority class, if one | ||
exists. If there is no global default, the pod's priority will be resolved to | ||
zero. The priority admission controller ensures that there is at most one global | ||
default priority class. | ||
|
||
``` | ||
type PriorityClass struct { | ||
metav1.TypeMeta | ||
// +optional | ||
metav1.ObjectMeta | ||
|
||
// The value of this priority class. This is the actual priority that pods | ||
// receive when they have the name of this priority class in their pod spec. | ||
Value int32 | ||
Review comment: just to clarify, its possible that both "foo" and "bar" map to priority value "1", correct? i assume we will let users update Value after the fact?
Review comment: Multiple priority classes can have the same value, but we do not allow changing the value. We may allow a user to delete a priority class if s/he has permission and no pods in the system has that priority class name. |
||
GlobalDefault bool | ||
Description string | ||
} | ||
``` | ||
|
||
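The resolution rules above can be sketched as a small lookup. This is only an illustration under stated assumptions, not the admission controller's actual code: `resolvePriority` and `ErrUnknownPriorityClass` are hypothetical names, and a plain `Name` field stands in for `metav1.ObjectMeta` so the example has no external dependencies.

```go
package main

import (
	"errors"
	"fmt"
)

// PriorityClass mirrors the fields in the proposal. Name is a stand-in for
// metav1.ObjectMeta.Name (an assumption made to keep the sketch dependency-free).
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
	Description   string
}

// ErrUnknownPriorityClass is a hypothetical error for names missing from the mapping.
var ErrUnknownPriorityClass = errors.New("unknown priority class")

// resolvePriority is a hypothetical helper that maps a PriorityClassName to
// the integer stored in the pod's Priority field.
func resolvePriority(classes []PriorityClass, name string) (int32, error) {
	if name == "" {
		// Empty name: use the global default if one exists, otherwise zero.
		for _, pc := range classes {
			if pc.GlobalDefault {
				return pc.Value, nil
			}
		}
		return 0, nil
	}
	for _, pc := range classes {
		if pc.Name == name {
			return pc.Value, nil
		}
	}
	// Unknown name: the pod creation request is rejected.
	return 0, fmt.Errorf("%w: %q", ErrUnknownPriorityClass, name)
}

func main() {
	classes := []PriorityClass{
		{Name: "tier1", Value: 4000},
		{Name: "tier3", Value: 1000, GlobalDefault: true},
	}
	for _, name := range []string{"tier1", "", "missing"} {
		p, err := resolvePriority(classes, name)
		fmt.Printf("%q -> %d, err=%v\n", name, p, err)
	}
}
```

Because the integer is copied into the pod at admission time, later components never need to consult the class list again.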
### Ordering of priorities | ||
|
||
As mentioned earlier, a PriorityClassName is resolved by the admission controller to | ||
its integral value and Kubernetes components use the integral value. The higher | ||
the value, the higher the priority. | ||
|
||
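As a minimal illustration of the comparison direction only, the sketch below orders pods so that larger integer values come first. The `pod` type and `sortByPriorityDesc` are hypothetical, and this is not a statement about the order in which the scheduler actually places pods, which depends on many other parameters.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, trimmed-down stand-in for a pod with a resolved priority.
type pod struct {
	Name     string
	Priority int32
}

// sortByPriorityDesc orders pods so that higher values (higher priority) come
// first. The sort is stable, so pods with equal priority keep their input order.
func sortByPriorityDesc(pods []pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		return pods[i].Priority > pods[j].Priority
	})
}

func main() {
	// Negative values are allowed, so zero sits in the middle of the range.
	pods := []pod{{"batch", -10}, {"web", 4000}, {"default", 0}, {"cache", 4000}}
	sortByPriorityDesc(pods)
	for _, p := range pods {
		fmt.Println(p.Name, p.Priority)
	}
}
```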
### System Priority Class Names | ||
There will be special priority class names reserved for system use only. These | ||
classes have a value larger than one billion. | ||
The priority admission controller ensures that new priority classes will not be | ||
created with those names. They are used for critical system pods that must not | ||
be preempted. We set default policies that deny creation of pods with | ||
PriorityClassNames corresponding to these priorities. Cluster admins can | ||
authorize users or service accounts to create pods with these priorities. When | ||
non-authorized users set PriorityClassName to one of these priority classes in | ||
their pod spec, their pod creation request will be rejected. For pods created by | ||
controllers, the service account must be authorized by cluster admins. | ||
|
||
### Modifying priority classes | ||
|
||
Priority classes can be added or removed, but their name and value cannot be | ||
updated. We allow updating `GlobalDefault` and `Description` as long as there is | ||
a maximum of one global default. While | ||
Kubernetes can work fine if priority classes are changed at run-time, the change | ||
can be confusing to users, because pods created with a given priority class | ||
before the change will have a different priority value than those created after | ||
the change. Deletion of priority classes is allowed, despite the fact that there | ||
may be existing pods that have specified such priority class names in their pod | ||
spec. In other words, there will be no referential integrity for priority | ||
classes. This is another reason that all system components should only work with | ||
the integer value of the priority and not with the `PriorityClassName`. | ||
|
||
One could delete an existing priority class and create another one with the same | ||
name and a different value. By doing so, they can achieve the same effect as | ||
updating a priority class, but we still do not allow updating priority classes | ||
to prevent accidental changes. | ||
|
||
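The update rules in this section could be checked along the following lines. This is a sketch under assumptions: `validateUpdate` is a hypothetical function, `Name` stands in for `metav1.ObjectMeta.Name`, and the error messages are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// PriorityClass mirrors the proposal's fields; Name stands in for ObjectMeta.Name.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
	Description   string
}

// validateUpdate is a hypothetical check for an update from old to updated,
// given all classes currently in the cluster.
func validateUpdate(existing []PriorityClass, old, updated PriorityClass) error {
	if updated.Name != old.Name {
		return errors.New("name cannot be updated")
	}
	if updated.Value != old.Value {
		return errors.New("value cannot be updated")
	}
	// GlobalDefault and Description may change, as long as at most one
	// class in the cluster is the global default.
	if updated.GlobalDefault {
		for _, pc := range existing {
			if pc.GlobalDefault && pc.Name != old.Name {
				return fmt.Errorf("%q is already the global default", pc.Name)
			}
		}
	}
	return nil
}

func main() {
	old := PriorityClass{Name: "tier2", Value: 2000}
	existing := []PriorityClass{old, {Name: "tier3", Value: 1000, GlobalDefault: true}}

	descOnly := old
	descOnly.Description = "batch jobs"
	fmt.Println(validateUpdate(existing, old, descOnly))

	newValue := old
	newValue.Value = 2500
	fmt.Println(validateUpdate(existing, old, newValue))
}
```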
Newly added priority classes cannot have a value higher than what is reserved | ||
for "system". The reason for this restriction | ||
is that Kubernetes critical system pods will have one of the "system" priorities | ||
and no pod should be able to preempt them. | ||
|
||
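A creation-time check for the restriction above might look like the sketch below. The names `validateCreate`, `systemPriorityThreshold` and `reservedNames` are hypothetical, and `"system"` is used as the only reserved name purely because it appears in the example list earlier; this document does not enumerate the reserved names.

```go
package main

import "fmt"

// PriorityClass mirrors the proposal's fields; Name stands in for ObjectMeta.Name.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
	Description   string
}

// systemPriorityThreshold reflects the rule that any value above one billion
// is reserved for system use.
const systemPriorityThreshold int32 = 1000000000

// reservedNames is an assumption for illustration; the real reserved names
// are not listed in this document.
var reservedNames = map[string]bool{"system": true}

// validateCreate is a hypothetical check applied to user-created priority classes.
func validateCreate(pc PriorityClass) error {
	if reservedNames[pc.Name] {
		return fmt.Errorf("priority class name %q is reserved for system use", pc.Name)
	}
	if pc.Value > systemPriorityThreshold {
		return fmt.Errorf("value %d exceeds %d, which is reserved for system use",
			pc.Value, systemPriorityThreshold)
	}
	return nil
}

func main() {
	fmt.Println(validateCreate(PriorityClass{Name: "tier1", Value: 4000}))
	fmt.Println(validateCreate(PriorityClass{Name: "huge", Value: 2000000000}))
	fmt.Println(validateCreate(PriorityClass{Name: "system", Value: 10}))
}
```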
#### Drawbacks of changing priority classes | ||
|
||
While Kubernetes effectively allows changing priority classes (by deleting and | ||
adding them with a different value), it should be done only when | ||
absolutely needed. Changing priority classes has the following disadvantages: | ||
|
||
|
||
* May remove config portability: pod specs written for one cluster are no | ||
longer guaranteed to work on a different cluster if the same priority classes | ||
do not exist in the second cluster. | ||
* If quota is specified for existing priority classes (at the time of this writing, | ||
we don't have this feature in Kubernetes), adding or deleting priority classes | ||
will require reconfiguration of quota allocations. | ||
* An existing pod may have an integer value of priority that does not reflect | ||
the current value of its PriorityClass. | ||
|
||
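The last drawback can be made concrete with a small sketch. It assumes a simplified in-memory registry (a map from class name to value) and a hypothetical `admit` function that copies the resolved value into the pod once; rejection of unknown names is omitted for brevity.

```go
package main

import "fmt"

// pod is a hypothetical, trimmed-down pod spec.
type pod struct {
	Name              string
	PriorityClassName string
	Priority          int32 // resolved once, at admission time
}

// admit is a hypothetical stand-in for the admission controller: it copies the
// class's current value into the pod. Handling of unknown names is omitted.
func admit(classes map[string]int32, name, className string) pod {
	return pod{Name: name, PriorityClassName: className, Priority: classes[className]}
}

func main() {
	classes := map[string]int32{"tier2": 2000}
	before := admit(classes, "before", "tier2")

	// Delete the class and re-create it with the same name and a new value.
	delete(classes, "tier2")
	classes["tier2"] = 3000
	after := admit(classes, "after", "tier2")

	// Both pods name the same class but carry different integer priorities.
	fmt.Println(before.PriorityClassName, before.Priority)
	fmt.Println(after.PriorityClassName, after.Priority)
}
```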
### Priority and QoS classes | ||
|
||
Kubernetes has [three QoS | ||
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes) | ||
which are derived from request and limit of pods. Priority is introduced as an | ||
independent concept; meaning that any QoS class may have any valid priority. | ||
When a node is out of resources and pods need to be preempted, we give | ||
Review comment: i continue to believe it is an anti-pattern to allow a best effort pod to have higher priority than pods in other qos classes. from a scheduling perspective, if a node is reporting memory pressure, and no longer accepts BestEffort pods, would the scheduler really preempt pods in the other QoS classes to make room? there is no guarantee room will ever be made available, so what is the point? best effort pods may not have access to any node resources once scheduled given how qos is implemented, and will effectively just be starved anyway.
Review comment: All the pod priorities will be zero, when pods don't specify any. Once they do, why should we make it harder for users to understand the meaning of priority by adding QoS to the mix? IMO, if a user wants to have their BestEffort pods to have very high priority, they should be able to do so.
Review comment: The concern is that if a guaranteed pod can be evicted because of a high priority besteffort pod, the guaranteed qos doesn't mean much.
Review comment: Also, it would be unclear how many guaranteed pods should be preempted in order to free sufficient resources for a best effort pod that does not declare a resource request.
Review comment: hm, Guarantee, Burstable, BestEffort is also a kind of priority; the question is how we handle them together. One option is to make priority only effect within same QoS class.
Review comment: We do not want to limit the range of usable priorities based on QoS class. So, if a user chooses to use a higher priority for their besteffort pods, they can do so, but you are right that the besteffort class may be scheduled on a node with no resources (as it could happen today). And, yes priority will be considered by kubelet for out of resource evictions.
Review comment: The concern is: scheduler and kubelet will take different action against priority & QoS class; I agree to discuss this in preemption doc in future :).
Review comment: "Guaranteed" is guaranteed resource/performance QoS while scheduled on the node, not with respect to durability.
Review comment: how about a mechanism to say that this priority can only be used with an allowed set of QoS class? i want a way to prevent what i find is an abuse vector (a BE pod using high priorities), as I think it will be a future support case when it causes an unforeseen consequence.
Review comment: @derekwaynecarr What unforeseen consequences are you thinking of? |
||
priority a higher weight over QoS classes. In other words, we preempt the lowest | ||
priority pod and break ties with some other metrics, such as QoS class, usage | ||
above request, etc. This is not finalized yet. We will discuss and finalize | ||
preemption in a separate doc. |
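Since the tie-breaking rules are explicitly not finalized, the following is only one possible reading of "priority first, QoS class as a tie-breaker". The `candidate` type, `orderVictims`, and the choice to consider BestEffort before Burstable before Guaranteed among equal-priority pods are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// qosClass ranks QoS classes for tie-breaking. Treating BestEffort as the
// first to be preempted is an assumption, not a decision made in this document.
type qosClass int

const (
	BestEffort qosClass = iota
	Burstable
	Guaranteed
)

// candidate is a hypothetical preemption candidate.
type candidate struct {
	Name     string
	Priority int32
	QoS      qosClass
}

// orderVictims sorts candidates so the first element is preempted first:
// lowest priority wins, and QoS class only breaks ties between equal priorities.
func orderVictims(c []candidate) {
	sort.SliceStable(c, func(i, j int) bool {
		if c[i].Priority != c[j].Priority {
			return c[i].Priority < c[j].Priority
		}
		return c[i].QoS < c[j].QoS
	})
}

func main() {
	pods := []candidate{
		{"a", 1000, Guaranteed},
		{"b", 0, Burstable},
		{"c", 0, BestEffort},
		{"d", 4000, BestEffort},
	}
	orderVictims(pods)
	for _, p := range pods {
		fmt.Println(p.Name, p.Priority, p.QoS)
	}
}
```

Note that under this reading a high priority BestEffort pod (`d`) outlives a lower priority Guaranteed pod (`a`), which is the behavior debated in the review thread.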
Review comment: Can we add a use case section?
I am interested in the following use case not discussed:
Review comment: A ConfigMap with a documented name provides the list of PriorityNames and their corresponding values. The intended usage of each priority (other than `system`) is cluster-specific. We can add a description for each PriorityName in the ConfigMap to provide intended usage.
Review comment: This is inconsistent with how we are building similar features in other parts of kubernetes, StorageClasses and ResourceClasses (as currently proposed) come to mind. Both of these have defined API specs that are validated at creation time and that a UI can reliably build on top of to present the user with the available options. As someone who builds a UI on top of k8s I prefer validated resources to throwing things into ConfigMaps. If this was just a name -> int mapping, maybe a config map. But its going to be a unique name, possibly a display name, a longer description, and a priority value. So already you have every config map key mapping to a JSON blob.
If you put this kind of information into a ConfigMap you are putting a lot of extra burden on the cluster administrator to make sure their data blobs in the config map are valid. You will also have to make sure that particular named ConfigMap, in whatever namespace you add it to, has View permissions for every authenticated user in the cluster. Whereas with something like a PriorityClass object it can be properly defined as a cluster scoped resource and you can have the View role generally applied to all PriorityClasses for all users by default. That also gains you the option in the future to make certain PriorityClasses invisible to certain users.
Review comment: I am fine with using a new object called PriorityClass to define mapping from PriorityNames to their integer values. I agree that it will make things easier.
If everyone is ok with this idea, we can create the PriorityClass object in `policy` API group as an alpha object.