Design proposal for adding priority to Kubernetes API #604
Conversation
A Kubernetes cluster has four predefined (built-in) priority names. The top
priority is reserved for the system pods and its value is the largest 32-bit
integer. The following list is an example of what the priorities and their
values might look like. I am open to suggestions for the names and values of the
It might be good to give an example of what kinds of workloads might map to each priority level. For example in decreasing order of priority
- Kubernetes system daemons (per-node like fluentd, and cluster-level like Heapster)
- critical user infrastructure (e.g. storage servers, monitoring system like Prometheus, etc.)
- components that are in the user-facing request serving path and must be able to scale up arbitrarily in response to load spikes (web servers, middleware, etc.)
- important interruptible workloads that need strong-ish guarantee of schedulability and of not being interrupted
- less important interruptible workloads that need a less strong guarantee of schedulability and of not being interrupted
- best effort / opportunistic
Added the list.
This design doc introduces the concept of priorities for pods in Kubernetes and
how the priority impacts scheduling and preemption of pods when the cluster
runs out of resources. A pod can specify a priority at creation time. The
Do we have an admission check for it?
Yes, described a little farther below.
One could generally expect that a pod with higher priority has a higher chance of
getting scheduled than the same pod with lower priority. However, there are
many other parameters that affect scheduling decisions. So, a high priority pod
may or may not be scheduled sooner than lower priority pods. The details of
I'm a little confused :(; my understanding is that priority does not guarantee scheduling order, e.g. a lower priority pod may be scheduled first because of a race condition; but the scheduler may preempt lower priority pods if it decides to schedule a higher priority pod to that host, with the decision made according to the current scheduler algorithm.
That's more or less what I was trying to say as well: a higher priority pod is not guaranteed to be scheduled sooner than lower priority ones, as there are many other parameters that affect scheduling order.
how about "earlier" instead of "sooner"?
I am not sure which one is grammatically more suitable. I will change "sooner than" to "before".
string to an integer. The PriorityName specified in a pod spec must be found in
this map or the pod creation request will be rejected. If PriorityName is
empty, "tier3" (see below) is assumed. I am open to other suggestions about
what the default priority should be. Once the PriorityName is resolved to an
I'd prefer to set the priority to the lowest one via the AdmissionController.
specs. A new ConfigMap may be fed to the admission controller to define new
priority names or change the integer value of existing names (other than
"system"). These changes affect only the pods created after the change;
existing pods will keep their existing Priority.
Changing a priority may confuse the cluster & admin; for example, if we changed the config from tier_1:1000, tier_2:2000 to tier_1:2000, tier_2:1000, it would be hard for an admin/dev to debug issues (with new/old Pods). There is a similar issue with deleting.
A draft option in my mind is to mark the old priority as deprecated: new Pods cannot be created with that priority, but the priority for old Pods is still there, and the admin deletes it once all objects using it exit (for now, we would not provide a way to delete all objects with the same priority).
I am not a huge fan of changing priorities at runtime, but it looks like there are customers and engineers who want this capability.
I am not sure if I understand the second paragraph.
re second paragraph: if we rename "tier1" to "tier2", we just create a new name "tier2" and let the admin delete "tier1" when all objects are deleted; meanwhile "tier1" is deprecated (new pods cannot use it).
I am not sure if we should add all the logic to obsolete one name and restrict the use of names. Pretty much all the components of the system will use the integer value of Priority anyway.
hmm, I think we can just document it: it's better to delete/change a name after deleting all related objects. Adding such logic may make it complex :).
One can only add or delete objects.
I don't think deleting should be possible. Once you have created a priority, it shouldn't be possible to delete it either.
Once you have created a priority, it shouldn't be possible to delete it either.
What is the concern with deleting? If one deletes an existing PriorityClass, no more pods can be created with that name. I think we should give cluster admins the flexibility to delete PriorityClasses if needed. They may later create one with the same name as the deleted one and a different value. I don't see any problem with that either.
@wojtek-t, I wrote "document it"; I'd like to keep it flexible enough for users :). But the admin can delete it: it needs the admin to make sure all related objects are deleted.
With PriorityClass, k8s can help to delete them.
@bsalamat - my main problem with deleting is basically that "delete + create = update". That said, relying on the admin checking if it is fine to delete a priority and deleting it is not a good solution in my opinion.
If we really want to allow for deleting a priority, then the system itself has to somehow check if it is safe to remove a given priority and disallow it if there are any pods with it in the system.
But to be honest, I'm not sure how to do it in a non-racy way, and that's why I'm very, very against allowing this.
Another problem with allowing deleting a priority is performance - if we want to allow it, I would like to see a similar discussion of how we would like to solve the races that we have in #611
@wojtek-t We can add a restriction on allowing deletion only when there is no pod with that priority class in the system, but completely disallowing deletion of priority classes wouldn't be practical IMO. Over time these classes might get accumulated and there would be no way to clean them up.
A race condition will exist from performing the check until the class is actually deleted, but I guess it will be OK given that deletion is probably infrequent and the fact that if a pod manages to sneak in due to a race, it won't break the cluster.
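The deletion guard discussed in this thread can be sketched as follows. This is a minimal, hypothetical Go sketch (the type and helper names are invented for illustration); it also makes the check-then-delete race visible, since nothing stops a new pod from referencing the class between the check and the actual delete.

```go
package main

import "fmt"

// Pod is a simplified stand-in for the real API type.
type Pod struct {
	Name         string
	PriorityName string
}

// canDeletePriorityClass refuses deletion while any pod still references
// the class. Note the check-then-delete sequence is inherently racy, as
// discussed above: a pod using the class could sneak in between this
// check and the actual delete.
func canDeletePriorityClass(class string, pods []Pod) error {
	for _, p := range pods {
		if p.PriorityName == class {
			return fmt.Errorf("priority class %q is still used by pod %q", class, p.Name)
		}
	}
	return nil
}

func main() {
	pods := []Pod{{Name: "web-1", PriorityName: "tier1"}}
	fmt.Println(canDeletePriorityClass("tier1", pods) != nil) // true: still in use
	fmt.Println(canDeletePriorityClass("tier2", pods))        // <nil>: safe to delete
}
```

The consensus in the thread (deletion is infrequent, a sneaked-in pod does not break the cluster) is what makes this best-effort check acceptable.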
system  2147483647 (int_max)
tier1   4000
tier2   2000
tier3   1000
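The example name-to-value table can be read as the mapping the admission controller would consult. A minimal Go sketch, where the names and values are illustrative only (the proposal itself leaves them open):

```go
package main

import (
	"fmt"
	"math"
)

// Illustrative priority-name table mirroring the example values above.
// "system" is pinned to the largest 32-bit integer and reserved for
// system pods; the other names and values are examples, not final API.
var priorityValues = map[string]int32{
	"system": math.MaxInt32, // 2147483647
	"tier1":  4000,
	"tier2":  2000,
	"tier3":  1000,
}

func main() {
	fmt.Println(priorityValues["system"])                         // 2147483647
	fmt.Println(priorityValues["tier1"] > priorityValues["tier2"]) // true: tier1 outranks tier2
}
```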
Given the detailed list below, why define just 4 default priorities? If defining the default set is an issue, why not define just two - "default" and "system"? Any pod without a priority field maps to the "default" priority.
Both the above and below lists are examples. That said, in the list below some items can have equal priority. Moreover, not all clusters have all types of workloads given below and some clusters may have many more types of workloads with different priorities. So, I think we should have a small number of default priorities, like the two that you mentioned, and let customers add more when they need more.
@vishh also the doc specifies that it is configurable.
By "default" I meant an empty PriorityName field. There is a clear need for defining priority for system pods. Can we avoid defining standard priorities for all other use cases?
I think we will ship with only "system" as a built-in value and the rest will be user configurable. If PriorityName is empty, I'd propose resolving the priority to zero.
### Resolving priority names

User requests sent to Kubernetes may have "PriorityName" in their PodSpec.
What is the purpose of "PriorityName"? Is it meant to define portable names? If so, we need a set of default Priority names that can satisfy the requirements for most users.
If it's only a matter of defaulting, can pod presets be used instead?
PriorityNames limit pods to have a small set of priorities, as opposed to any valid integer. A small set makes it easier to predict schedulability of pods. It also makes it easier to create policies based on known priorities. For example, we may add quota for various priorities in the future.
On the other hand an explicit name brings in portability constraints.
On the other hand an explicit name brings in portability constraints.
That's why I think we should ship Kubernetes with a few predefined priority names. These predefined priority names can be changed (except for the "system" one), but they bring config portability to the default Kubernetes deployment.
system pods will have "system" priority and no pod should be able to preempt them.

Alternatively, we may ship Kubernetes with only the highest priority level
Where will the "system" priority level be hard-coded? In the admission controller or across the codebase? Admission controllers are moderately easy to plug in, so how will this be enforced?
I haven't thought much about placement of code, but I guess the admission controller is the only place that should know about priority names. When one replaces an admission controller that resolves priority names, they should add proper priority name resolution logic to their controller. Whether it is the "system" priority or any other priority name, if it is not resolved to the right priority, the rest of the system may not behave correctly.
specs. A new ConfigMap may be fed to the admission controller to define new
priority names or change the integer value of existing names (other than
"system"). These changes affect only the pods created after the change;
existing pods will keep their existing Priority.
Priority is essentially an API. Updating or deleting existing priority names is a breaking change. Why should we optimize for handling such a situation?
+1 for documentation.
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from request and limit of pods. Priority is introduced as an
independent concept; meaning that any QoS class may have any valid priority.
When a node is out of resources and pods need to be preempted, we may give
This has been discussed in multiple forums. In the case of a tie, I'd prefer kubelet computing relative priority based on absolute priority and usage above request.
That's fine with me.
### Effect of priority on preemption

Generally, lower priority pods are more likely to get preempted by higher
priority pods when the cluster is out of resources. In such a case, the scheduler may
s/is out of resources/has reached a threshold.
type PodStatus struct {
    ...
    Priority int // Populated by Admission Controller.
s/Priority/EffectivePriority ;-) b/c there may be scaling applied at different layers, e.g. admission.
I am still trying to avoid adding adjectives. ;-) Even with scaling applied, can't we use just "Priority"? It looks cleaner to me, but it is just personal preference.
I'm just trying to eliminate any ambiguity an operator might have in the case where they might have a (PriorityName, Value) association ... and the expectation that Priority will always equal the Value.
EffectivePriority to me indicates that it could be scaled by some other value (e.g. PriorityFactor - which could be based on user or group).
Just my two cents, I think "effective" will raise more questions in the user's mind than it answers. For example, if they see "effective" they might assume the system has some scaling factor like what you described, but it may not have one. I think calling it "Priority" and making it clear in the comment that this is the priority used by the system to make decisions might be sufficient.
I'm not losing sleep over the name, but we should have someone other than us comment on this too.
/cc @smarterclayton
I think calling it "Priority" and making it clear in the comment that this is the priority used by the system to make decisions might be sufficient.
+1
I think the part that is unclear is that the PriorityName could end up with a different value over time, but that value is latched on this pod, and not updated.
The integer value of PriorityName could be changed (although some people believe we should not allow it), but pods created before the change will keep the old value. I believe it is fine to clearly say that a change applies only to the pods created after the change.
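The latching behavior discussed here can be sketched in Go. This is a simplified sketch with hypothetical types (the real admission flow involves the full API machinery): the pod's integer Priority is captured once at admission time, so a later change to the name mapping leaves existing pods untouched.

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type PodStatus struct {
	Priority int32 // populated once at admission, never re-resolved
}

type Pod struct {
	PriorityName string
	Status       PodStatus
}

// admit resolves PriorityName to its current integer value and latches
// it on the pod, as the proposed admission controller would.
func admit(p *Pod, mapping map[string]int32) {
	p.Status.Priority = mapping[p.PriorityName]
}

func main() {
	mapping := map[string]int32{"tier1": 1000}

	older := &Pod{PriorityName: "tier1"}
	admit(older, mapping) // latches 1000

	mapping["tier1"] = 2000 // admin changes the value later

	newer := &Pod{PriorityName: "tier1"}
	admit(newer, mapping) // latches 2000

	// The earlier pod keeps its old value; only new pods see the change.
	fmt.Println(older.Status.Priority, newer.Status.Priority) // 1000 2000
}
```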
/cc @derekwaynecarr you will want to review this one.
BTW here is the info on how to add an alpha (or beta) field to a GA API object (which is what you're doing in this feature)
cc/ @dchen1107 @dashpole
I plan to review this this week.
cc/ @wojtek-t
@kubernetes/sig-apps-misc this is going to affect the workload controllers
@Kargakis: These labels do not exist in this repository: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Can you say why? Priority (or PriorityName, to be more precise) would just go in the pod template and shouldn't really need to be visible to the workload controller.
@davidopp: These labels do not exist in this repository: In response to this:
IIUC this feature can be used for implementing a pruning policy for pods, i.e. by killing pods with lower priority first. The latter feature will be part of the controllers.
See kubernetes/kubernetes#4301 kubernetes/kubernetes#45339 kubernetes/kubernetes#45509
No, that's different. For that we are proposing to use something called
The priority @bsalamat is introducing here is global across all controllers/sets and namespaces, and is used by the scheduler to decide which pods to evict to make room for a pending pod (think of it as a fancy version of the critical pod rescheduler).
of the valid priority names are defined by a system-wide mapping that maps each
string to an integer. The PriorityName specified in a pod spec must be found in
this map or the pod creation request will be rejected. If PriorityName is
empty, it will resolve to priority zero. Once the PriorityName is resolved to an
What will be resolving PriorityName and when? I imagine that it will be some admission plugin, but no matter what I think this information should be added here.
OK - it seems it's described below.
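The resolution rule under discussion (an empty PriorityName resolves to zero; unknown names cause the pod creation request to be rejected) can be sketched in Go. The helper name and signature are hypothetical, not part of the proposal:

```go
package main

import "fmt"

// resolvePriority mirrors the proposed admission-time rule: an empty
// PriorityName resolves to priority zero, and a name missing from the
// system-wide mapping causes the pod creation request to be rejected.
func resolvePriority(name string, mapping map[string]int32) (int32, error) {
	if name == "" {
		return 0, nil // empty PriorityName resolves to priority zero
	}
	v, ok := mapping[name]
	if !ok {
		return 0, fmt.Errorf("rejecting pod: unknown PriorityName %q", name)
	}
	return v, nil
}

func main() {
	mapping := map[string]int32{"tier1": 4000}

	v, err := resolvePriority("", mapping)
	fmt.Println(v, err) // 0 <nil>

	_, err = resolvePriority("gold", mapping)
	fmt.Println(err != nil) // true: the request would be rejected
}
```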
specs. A new ConfigMap may be fed to the admission controller to define new
priority names or change the integer value of existing names (other than
"system"). These changes affect only the pods created after the change;
existing pods will keep their existing Priority.
Sorry for jumping late to discussion.
I completely agree with all the concerns mentioned above. I don't see any strong argument for the need of changing the "integer value of a given priority name".
What is more, if someone does this it will be super extremely confusing for the users. As an example imagine the scenario:
- we have priorityX: 1000
- then we create a number of pods with this priority - it is resolved to the integer priority 1000
- then we change the priority value to priorityX: 2000
- then we create a number of pods with this priority - it is resolved to the integer priority 2000

And what do we end up with? We have a number of pods with the priorityX PriorityName, but some of them are resolved to priority 1000 and the others to priority 2000.
This in my opinion would be extremely confusing to users and I'm really against introducing such an operation.
That said, I think that users want the ability to add new priority names at runtime, and that we should allow.
@davidopp - for opinions
/cc @liggitt
classes](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-qos.md#qos-classes)
which are derived from request and limit of pods. Priority is introduced as an
independent concept; meaning that any QoS class may have any valid priority.
When a node is out of resources and pods need to be preempted, we give
i continue to believe it is an anti-pattern to allow a best effort pod to have higher priority than pods in other qos classes.
from a scheduling perspective, if a node is reporting memory pressure, and no longer accepts BestEffort pods, would the scheduler really preempt pods in the other QoS classes to make room? there is no guarantee room will ever be made available, so what is the point? best effort pods may not have access to any node resources once scheduled given how qos is implemented, and will effectively just be starved anyway.
All the pod priorities will be zero when pods don't specify any. Once they do, why should we make it harder for users to understand the meaning of priority by adding QoS to the mix? IMO, if a user wants their BestEffort pods to have very high priority, they should be able to do so.
The concern that BestEffort pods may be scheduled and starved exists even today. A best effort pod may land on a machine with no free resources and may be starved. This is acceptable behavior of the system when dealing with BestEffort pods.
The concern is that if a guaranteed pod can be evicted because of a high priority besteffort pod, the guaranteed qos doesn't mean much.
Also, it would be unclear how many guaranteed pods should be preempted in order to free sufficient resources for a best effort pod that does not declare a resource request.
hm, Guaranteed, Burstable, BestEffort is also a kind of priority; the question is how we handle them together. One option is to make priority only take effect within the same QoS class.
We do not want to limit the range of usable priorities based on QoS class. So, if a user chooses to use a higher priority for their besteffort pods, they can do so, but you are right that the besteffort class may be scheduled on a node with no resources (as it could happen today). And, yes priority will be considered by kubelet for out of resource evictions.
The concern is: the scheduler and kubelet will take different actions regarding priority & QoS class; I agree to discuss this in the preemption doc in the future :).
"Guaranteed" is guaranteed resource/performance QoS while scheduled on the node, not with respect to durability.
how about a mechanism to say that this priority can only be used with an allowed set of QoS classes? i want a way to prevent what i find is an abuse vector (a BE pod using high priorities), as I think it will be a future support case when it causes an unforeseen consequence.
@derekwaynecarr What unforeseen consequences are you thinking of?
If we have a mechanism to control the set of priorities a namespace can use and if we can restrict the QoS classes a namespace can use (using LimitRanges) why would we need an additional priority -> QoS link?
type PodStatus struct {
    ...
    Priority int // Populated by Admission Controller.
nit: shouldn't this be an int64?
nm, i was informed by @liggitt that int is fine for this use case.
should be int32 (no unsized ints)
I am trying to understand how I will support quota by priority in the future relative to the current proposal. Today, I am able to support quota by storage-class via the following:
In the above example, I have 5000Gi of storage quota, but only 500Gi of it can have gold class. Now that StorageClass is promoted, we could look to model this differently (more akin to option 2 below). The scope of this proposal appears to be resources that surround a pod (i.e. cpu, memory, etc.) and not storage.
Option 1: priority based resource names - This would make the quota controller PriorityClass name aware.
Option 2: priority constraint - The priorityValue could be set on admission similar to your Pod example. The quota would be restricted to the set of resources that are allowed on pods. If we went this route, I would want to augment quota by storage class to use the same pattern, but just restrict quota to values allowed on a PVC.
@derekwaynecarr What I am writing is not finalized, but what I currently have in mind about quota is like your option 2. Basically, we add
@davidopp @timothysc @derekwaynecarr
@bsalamat -- see my question re: ability to say this priority can only be used with this QoS. i also want to discuss another use case that came up with storage-classes. i want to restrict the set of allowed priorities that can be used in a namespace w/o having to create an explicit quota for each enumerated priority in my cluster. for example, just because the cluster operator added a "super privileged system daemon priority", the operator should not have to update every existing namespace/project to say they have 0 "super privileged system daemon priority" resources. do you guys have similar concerns? can i have a whitelist of allowed priorities associated with a namespace?
@derekwaynecarr We have LimitRanges and, in the near future, ResourceQuota to limit what users can create in a namespace and/or at a particular priority. Why should we limit the priorities that can be used with QoS classes? Linking priority to QoS might make the management of priorities hard. Regarding your second question, once we add priority to ResourceQuota, an admin can create a zero ResourceQuota with no particular priority. Such quota will prevent users from creating pods at all priorities. The admin can then assign quota to users at particular priorities. So, when new priorities are added, admins don't need to explicitly give zero quota at the new priority.
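The zero-default quota argument above can be sketched in Go. This is entirely hypothetical (quota-by-priority was deferred beyond this proposal, and the type and method names are invented): an absent entry means zero quota, so new priorities are blocked until an admin grants some.

```go
package main

import "fmt"

// QuotaByPriority is a hypothetical sketch of a ResourceQuota indexed by
// priority name: the count of pods each priority may hold. An absent
// entry means zero quota, matching the argument that a default zero
// quota blocks all priorities until the admin grants some.
type QuotaByPriority map[string]int

// allows reports whether one more pod at the given priority fits.
func (q QuotaByPriority) allows(priorityName string, existing int) bool {
	return existing < q[priorityName]
}

func main() {
	q := QuotaByPriority{"tier2": 10} // admin granted quota only for tier2

	fmt.Println(q.allows("tier2", 3))  // true: 3 of 10 used
	fmt.Println(q.allows("system", 0)) // false: zero quota by default
}
```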
@derekwaynecarr It would be great if we could resolve your comments so we can merge the proposal. My two cents: All of the quotas that ResourceQuota controlls today should be additionally indexed by priority (I haven't thought about whether it should be PriorityClass or absolute priority). Controlling which priorities can use which QoSes should be done as an extension to LimitRange, as @bsalamat suggested But this doc explicitly lists quota as outside the scope of the doc so maybe its OK to merge as-is and continue discussion of quota elsewhere? |
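On the per-namespace whitelist question raised above: the mechanism Kubernetes later shipped addresses it through the ResourceQuota admission plugin's `limitedResources` configuration, which rejects pods at the listed priorities unless an explicit quota in the namespace covers them. A sketch of that configuration (the priority class name is illustrative, and the field names reflect the API that eventually shipped, not this proposal):

```yaml
# API-server admission configuration. Pods using the "cluster-services"
# priority class are rejected in any namespace that has no ResourceQuota
# explicitly covering that priority class, so admins do not have to
# pre-create a zero quota everywhere when a new class is added.
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: "ResourceQuota"
  configuration:
    apiVersion: apiserver.config.k8s.io/v1
    kind: ResourceQuotaConfiguration
    limitedResources:
    - resource: pods
      matchScopes:
      - scopeName: PriorityClass
        operator: In
        values: ["cluster-services"]   # illustrative class name
```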
sorry, previous comment was for @derekwaynecarr, fixed now |
@davidopp - i think i am fine deferring the quota problem to another proposal.
The cluster may have many user defined priority classes for | ||
various use cases. The following list is an example of how the priorities and | ||
their values may look like. | ||
We decided to ship Kubernetes with no predefined priority classes, but any |
The section later called "system PriorityClassName" seems to imply that this sentence is not correct?
Fixed it. Thanks!
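To make the "user defined priority classes" idea in the quoted hunk concrete, here is an example PriorityClass object in the shape that eventually shipped; the name, value, and description are only illustrative:

```yaml
# Example user-defined priority class; Kubernetes predefines none
# of these names or values.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority          # illustrative name
value: 1000000                 # user-defined values must not exceed one billion
globalDefault: false           # at most one class in the cluster may set this true
description: "For user-facing, request-serving workloads."
```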
Priority admission controller ensures that new priority classes will not be | ||
created with those names. They are used for critical system pods that must not | ||
be preempted. We set default policies that deny creation of pods with these | ||
`system` priorities. Cluster admins can authorize users or service accounts to |
You seem to be using `system` both to refer to a specific PriorityClassName and to refer to a set of different PriorityClassNames (`system` being one of them) that have the property of requiring pre-approval to use them. You should use different words for these two concepts. Also, earlier you say that everything above one billion is reserved for these kinds of priority classes, but then you don't mention that here.
We initially wanted to have only one system priority class name, but we later decided to have more than one. I fixed the text.
Priority admission controller ensures that new priority classes will not be | ||
created with those names. They are used for critical system pods that must not | ||
be preempted. We set default policies that deny creation of pods with these | ||
system priorities. Cluster admins can authorize users or service accounts to |
Would it be more accurate to say "We set default policies that deny creation of pods with PriorityClassNames corresponding to these priorities"?
Yes. Done.
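For reference, the reserved system classes discussed in this thread landed as `system-cluster-critical` and `system-node-critical`, both valued above one billion. A pod requests one by name (the pod name and image below are placeholders):

```yaml
# Pod opting into a reserved system priority class by name.
apiVersion: v1
kind: Pod
metadata:
  name: node-critical-daemon         # placeholder name
  namespace: kube-system             # use of system classes is restricted by admission
spec:
  priorityClassName: system-node-critical   # reserved name, value 2000001000
  containers:
  - name: daemon
    image: example.com/daemon:1.0    # placeholder image
```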
Just one comment, then LGTM
/lgtm
* Add a design proposal for adding priority to Kubernetes API
@kubernetes/sig-scheduling-api-reviews
@kubernetes/api-reviewers