Addressed reviewers comments
bsalamat committed Sep 8, 2017
1 parent 8c5ebe3 commit 409d23b
Showing 1 changed file with 55 additions and 35 deletions: docs/concepts/configuration/pod-priority-preemption.md
@@ -2,12 +2,16 @@
approvers:
- davidopp
- wojtek-t
title: Pod Priority and Preemption (Alpha)
---

[Pods](/docs/user-guide/pods) in Kubernetes 1.8 and later can have priority. Priority
indicates the importance of a pod relative to other pods. When a pod cannot be scheduled, the scheduler tries
to preempt (evict) lower priority pods to make scheduling of the pending pod possible.
Soon, priority will also affect out-of-resource eviction ordering on the node.

Note that preemption does not respect PodDisruptionBudget; see
[the limitations section](#poddisruptionbudget-is-not-supported) for more details.

* TOC
{:toc}
@@ -24,32 +28,34 @@ The following sections provide more information about these steps.

## Enable Priority and Preemption
Pod priority and preemption is disabled by default in Kubernetes 1.8 as it is an
__alpha__ feature. It can be enabled by a command-line flag on both the API server and the scheduler:

```
--feature-gates=PodPriority=true
```
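
How the flag reaches the two components depends on how your control plane is deployed. As an illustrative sketch only, assuming the scheduler runs as a static pod (the image tag and the omission of other flags are assumptions, not part of this doc), the manifest fragment could look like:

```yaml
# Sketch: a kube-scheduler static pod with the alpha gate enabled.
# The image and any omitted flags are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  containers:
  - name: kube-scheduler
    image: gcr.io/google_containers/kube-scheduler:v1.8.0  # illustrative image
    command:
    - kube-scheduler
    - --feature-gates=PodPriority=true
```

The same `--feature-gates=PodPriority=true` flag must also be passed to the API server, however it is launched.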

Once enabled, you can add [PriorityClasses](#priorityclass) and create pods with [`PriorityClassName`](#pod-priority) set.
If you try the feature and then decide to disable it, you must remove this command-line flag or
set it to false and restart the API server and scheduler. Once disabled, existing
pods will keep their priority fields, but preemption will be disabled and the priority
fields will be ignored. You will also no longer be able to set `PriorityClassName` in new pods.

**Note:** Alpha features should not be used in production systems! Alpha
features are more likely to have bugs and future changes to them are not guaranteed to
be backward compatible.

## PriorityClass
PriorityClass is a non-namespaced object that defines a mapping from a PriorityClassName to the integer
value of the priority. The higher the value, the higher the priority. The value is
specified in the required `value` field. PriorityClass
objects can have any 32-bit integer value smaller than or equal to 1 billion. Larger
numbers are reserved for critical system pods that should not normally be preempted or
evicted.

PriorityClass also has two optional fields: `globalDefault` and `description`.
`globalDefault` indicates that the value of this PriorityClass should be used for
pods without a `PriorityClassName`. Only one PriorityClass with `globalDefault`
set to true can exist in the system. If there is no PriorityClass with `globalDefault`
set, priority of pods with no `PriorityClassName` will be zero.
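
For illustration, a cluster-wide default class might look like the following sketch. The name and value here are invented for this example, and the `apiVersion`/`kind` simply mirror the example shown further below:

```yaml
apiVersion: v1                # mirrors the apiVersion used in the example below
kind: PriorityClass
metadata:
  name: default-priority      # illustrative name, not defined in this doc
value: 1000                   # illustrative value
globalDefault: true           # pods without a PriorityClassName get this priority
description: "Assumed cluster-wide default priority class."
```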

`description` is an arbitrary string. It is meant to tell users of the cluster
@@ -63,6 +69,10 @@ of your existing pods will be considered to be zero.
change the priority of existing pods. The value of such a PriorityClass will be used only
for pods created after the PriorityClass is added.

**Note 3:** If you delete a PriorityClass, existing pods that use the name of the
deleted priority class remain unchanged, but you will not be able to create new pods
that use that name.

#### Example PriorityClass
```yaml
apiVersion: v1
@@ -76,10 +86,13 @@ description: "This priority class should be used for XYZ service pods only."
## Pod Priority
Once you have one or more PriorityClasses, you can create pods which specify one
of those PriorityClass names in their spec. The priority admission controller uses the
`priorityClassName` field to populate the integer value of the priority. If the priority
class is not found, the pod is rejected.
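
For example, a minimal pod spec that sets this field might look like the following sketch; the pod name, container image, and priority class name are illustrative assumptions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: important-app              # illustrative name
spec:
  containers:
  - name: app
    image: nginx                   # illustrative image
  priorityClassName: high-priority # must name an existing PriorityClass
```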

The following YAML is an example of a pod configuration that uses the PriorityClass
created above. The priority admission controller checks the spec and resolves the
priority of the pod to 1000000.


```yaml
@@ -100,32 +113,30 @@ spec:
## Preemption
When pods are created, they go to a queue and wait to be scheduled. Scheduler picks a pod
from the queue and tries to schedule it on a node. If no node is found that satisfies
all the specified requirements (predicates) of the pod, preemption logic is triggered
for the pending pod. Let's call the pending pod P.
Preemption logic tries to find a node where removal of one or more pods with lower priority
than P would enable P to be scheduled on that node. If such a node is found, one or more lower priority pods will
be deleted from the node. Once the pods are gone, P may be scheduled on the node.

### Limitations of Preemption (alpha version)

#### Starvation of Preempting Pod
When pods are preempted, the victims get their
[graceful termination period](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
That is how much time they have to finish their work and exit. If they don't, they will be
killed. This graceful termination period creates a time gap between the point when the
scheduler preempts the victims and the point when the pending pod (P) can be scheduled on the node (N).
In the meantime, the scheduler keeps scheduling other pending pods. When one or more victims
exit or get terminated, the scheduler may place other pending pods on the node if those pods are
ahead of P in the scheduling queue. In such a case, it is likely that
when all victims exit, pod P won't fit on node N anymore. So, scheduler will have to
preempt other pods on node N or another node to let P schedule. This scenario may
be repeated for the second and subsequent rounds of preemption, and P may not
get scheduled for a while. This scenario can cause problems in various clusters, but
is particularly problematic in clusters where many new pods are created all the time.

We will address this problem in the beta version of pod preemption. The solution
we plan to implement is [provided here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-preemption.md#preemption-mechanics).

#### PodDisruptionBudget is not supported
@@ -144,34 +155,43 @@ The current implementation of preemption considers a node for preemption only when
the answer to this question is positive: "If all the pods with lower priority than
the pending pod are removed from the node, can the pending pod be scheduled on
the node?"
(Note that preemption does not always remove all lower-priority pods; for example, the
pending pod may fit after only some of them are removed. However, this
test must always pass for a node to be considered for preemption.)

If the answer is no, that node will not be considered for preemption. If the pending
pod has inter-pod affinity to one or more of those lower priority pods on the node, the
inter-pod affinity rule cannot be satisfied in the absence of the lower priority
pods, and the scheduler will find the pending pod infeasible on the node. As a result,
it will not try to preempt any pods on that node.
The scheduler will try to find other nodes for preemption and could possibly find another
one, but there is no guarantee that such a node will be found.

We may address this issue in future versions, but we don't have a clear plan and cannot
promise that it will be fixed in Beta or GA. Part
of the reason is that finding the set of lower priority pods that satisfy all
inter-pod affinity/anti-affinity rules is computationally expensive and adds
substantial complexity to the preemption logic. Besides, even if preemption keeps the lower
priority pods to satisfy inter-pod affinity, the lower priority pods may be preempted
later by other pods, which removes the benefits of having the complex logic of
respecting inter-pod affinity to lower priority pods.

Our recommended solution for this problem is to create inter-pod affinity towards
equal or higher priority pods.
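
As a rough sketch of that recommendation, the pending pod could carry a required pod affinity toward a label that, by convention, only equal-or-higher-priority pods have. The label key/value and the topology key below are assumptions, since priority itself cannot be selected on:

```yaml
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          tier: critical              # assumed label carried only by equal-or-higher-priority pods
      topologyKey: kubernetes.io/hostname
```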

#### Cross Node Preemption
When considering a node N for preemption in order to schedule a pending pod P,
P may become feasible on N only if pods on other nodes are preempted. For example, P may
have zone anti-affinity with some currently-running, lower-priority pod Q. P may not be
scheduled on Q's node even if it preempts Q, for example because P is larger than Q (so
preempting Q does not free up enough space on Q's node) and P is not high-priority enough
to preempt other pods on Q's node. But P might theoretically be able to be scheduled on
another node M by preempting Q and some pod(s) on M (preempting Q removes the
anti-affinity violation, and preempting pod(s) on M frees up space for P to be scheduled
there). The current preemption algorithm does not detect and execute such preemptions;
that is, when determining whether P can be scheduled onto N, it only considers preempting
pods on N.
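
For reference, the kind of rule on P that creates this situation is a zone-scoped pod anti-affinity, roughly like the fragment below. The label selecting Q and the zone label are assumptions for illustration:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: q-workload             # assumed label identifying the lower-priority pods (Q)
      topologyKey: failure-domain.beta.kubernetes.io/zone
```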

We may consider adding cross node preemption in future versions if we find an
algorithm with reasonable performance, but we cannot promise anything at this point.
