Update KEP with criteria for GA
mimowo committed Jan 26, 2023
1 parent 34b46e8 commit c3df52a
Showing 2 changed files with 95 additions and 50 deletions.
141 changes: 93 additions & 48 deletions keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
@@ -108,7 +108,8 @@ tags, and then generate with `hack/update-toc.sh`.
- [Resource limits exceeded](#resource-limits-exceeded)
- [JobSpec API alternatives](#jobspec-api-alternatives)
- [Failing delete after a condition is added](#failing-delete-after-a-condition-is-added)
- [Marking pods as Failed](#marking-pods-as-failed)
- [Marking pods as Failed by PodGC](#marking-pods-as-failed-by-podgc)
- [Marking Scheduled Pending and Terminating pods as Failed by Kubelet](#marking-scheduled-pending-and-terminating-pods-as-failed-by-kubelet)
- [Risks and Mitigations](#risks-and-mitigations)
- [Garbage collected pods](#garbage-collected-pods)
- [Evolving condition types](#evolving-condition-types)
@@ -230,7 +231,7 @@ thousands of nodes requires usage of pod restart policies in order
to account for infrastructure failures.

Currently, kubernetes Job API offers a way to account for infrastructure
failures by setting `.backoffLimit > 0`. However, this mechanism intructs the
failures by setting `.backoffLimit > 0`. However, this mechanism instructs the
job controller to restart all failed pods - regardless of the root cause
of the failures. Thus, in some scenarios this leads to unnecessary
restarts of many pods, resulting in a waste of time and computational
@@ -354,7 +355,7 @@ As a machine learning researcher, I run jobs comprising thousands
of long-running pods on a cluster comprising thousands of nodes. The jobs often
run at night or over weekend without any human monitoring. In order to account
for random infrastructure failures we define `.backoffLimit: 6` for the job.
However, a signifficant portion of the failures happen due to bugs in code.
However, a significant portion of the failures happen due to bugs in code.
Moreover, the failures may happen late during the program execution time. In
such case, restarting such a pod results in wasting a lot of computational time.

@@ -729,7 +730,7 @@ in different messages for pods.
- Reproduction: Run kube-controller-manager with disabled taint-manager (with the
flag `--enable-taint-manager=false`). Then, run a job with a long-running pod and
disconnect the node
- Comments: handled by node lifcycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
- Comments: handled by node lifecycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
However, the pod phase remains `Running`.
- Pod status:
- status: Unknown
@@ -762,7 +763,7 @@ In Alpha, there is no support for Pod conditions for failures or disruptions ini

For Beta we introduce handling of Pod failures initiated by Kubelet by adding
the pod disruption condition (introduced in Alpha) in case of disruptions
initiated by kubetlet (see [Design details](#design-details)).
initiated by Kubelet (see [Design details](#design-details)).

Kubelet can also evict a pod in some scenarios which are not covered with
adding a pod failure condition:
@@ -818,9 +819,12 @@ for event handling and
for the constant definition)
and CRI-O (see
[here](https://github.com/cri-o/cri-o/blob/edf889bd277ae9a8aa699c354f12baaef3d9b71d/server/container_status.go#L88-L89)).
- setting the `reason` field to `OOMKilled` is not standardized, either. We have
started an effort to standardize the handling of OOM killed containers
(see: [Documentation for the CRI API reason field to standardize the field for containers terminated by OOM killer](https://github.com/kubernetes/kubernetes/pull/112977)).
- setting the `reason` field to `OOMKilled` is not standardized, either. During
the Beta phase implementation we discussed the standardization issue within the
community (involving CNCF Technical Advisory Group for Runtime, SIG-node,
container runtime implementations). We have also started an effort to standardize
the handling of OOM killed containers (see:
[Documentation for the CRI API reason field to standardize the field for containers terminated by OOM killer](https://github.com/kubernetes/kubernetes/pull/112977)).
However, in the process it turned out that in some configurations
(for example the CRI-O with cgroupv2, see:
[Add e2e_node test for oom killed container reason](https://github.com/kubernetes/kubernetes/pull/113205)),
@@ -830,8 +834,12 @@ but also when the system is running low on memory. In such scenario there
can be race conditions in which both the `DisruptionTarget` condition and the
`ResourceExhausted` could be added.

Thus, we decide not to annotate the scenarios with the `ResourceExhausted`
condition. While there are not known issues with detection of the exceeding of
Thus, we've decided not to annotate the scenarios with the `ResourceExhausted`
condition in this KEP. While the effort is welcomed by the community,
handling of exceeded limits might be done as a follow-up KEP once the
groundwork of introducing pod failure conditions is done in this KEP.

While there are no known issues with detection of the exceeding of
Pod's ephemeral storage limits, we prefer to avoid future extension of the
semantics of the new condition type. Alternatively, we could introduce a pair of
dedicated pod condition types: `OOMKilled` and `EphemeralStorageLimitExceeded`.
@@ -863,13 +871,13 @@ dies) between appending a pod condition and deleting the pod.
In particular, scheduler can possibly decide to preempt
a different pod the next time (or none). This would leave a pod with a
condition that it was preempted, when it actually wasn't. This in turn
could lead to inproper handling of the pod by the job controller.
could lead to improper handling of the pod by the job controller.

As a solution we implement a worker, part of the disruption
controller, which clears the pod condition added if `DeletionTimestamp` is
not added to the pod for a long enough time (for example 2 minutes).
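
As an illustration, below is a minimal sketch of the staleness check such a
worker could perform. This is not the actual kube-controller-manager code: the
helper name is ours, and the real worker would additionally patch the pod
status to clear the condition once the check fires.

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// disruptionTarget mirrors the DisruptionTarget pod condition type introduced by this KEP.
const disruptionTarget = corev1.PodConditionType("DisruptionTarget")

// staleDisruptionCondition returns the index of a DisruptionTarget condition that
// should be cleared because the pod was never actually marked for deletion within
// the timeout (for example 2 minutes), or -1 if there is nothing to clean up.
func staleDisruptionCondition(pod *corev1.Pod, now time.Time, timeout time.Duration) int {
	if pod.DeletionTimestamp != nil {
		return -1 // deletion is in progress, so the condition is legitimate
	}
	for i, cond := range pod.Status.Conditions {
		if cond.Type == disruptionTarget && cond.Status == corev1.ConditionTrue &&
			now.Sub(cond.LastTransitionTime.Time) >= timeout {
			return i
		}
	}
	return -1
}
```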

#### Marking pods as Failed
#### Marking pods as Failed by PodGC

As indicated by our experiments (see: [Disconnected node](#disconnected-node))
a failed pod may get stuck in the `Running` phase when there is no kubelet
@@ -884,19 +892,48 @@ However, as disabling taint-manager is deprecated it is not a concern for this
KEP.

For Alpha, we implemented a fix for this issue by setting the pod phase as `Failed`
in podgc.

For Beta, we are going to simplify the code in job controller responsible for
detecting pod failures. For this purpose we need to make sure that pods stuck
in the pending phase which are terminating (with set deletionTimestamp) are
eventually marked as failed. Such pods may either be not scheduled
(for example when they are unschedulable due to node affinity configuration for
which no node exists in the cluster) or scheduled (for example when they fail
to pull the the docker image). In case of non-scheduled pods they will be moved
to the `Failed` phase by the PodGC controller as it marks as failed all
non-scheduled terminating pods (after the Alpha changes for this feature, see
above). We plan to extend the PodGC controller to also move to the failed phase
(and delete) scheduled pods in the pending state which are terminating.
in PodGC. Similarly, for Alpha we modified PodGC to mark as `Failed`
unscheduled Pending pods which are Terminating.
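
A minimal sketch of how such pods can be recognized (an illustration of the
rule described above, not the actual PodGC code; the helper name is ours):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// unscheduledTerminatingNonTerminal reports whether PodGC would move the pod to
// the Failed phase under the Alpha change described above: the pod is terminating
// (deletionTimestamp set), was never scheduled to a node, and has not reached a
// terminal phase on its own.
func unscheduledTerminatingNonTerminal(pod *corev1.Pod) bool {
	terminating := pod.DeletionTimestamp != nil
	unscheduled := pod.Spec.NodeName == ""
	terminal := pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
	return terminating && unscheduled && !terminal
}
```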

#### Marking Scheduled Pending and Terminating pods as Failed by Kubelet

A scheduled pod may also get stuck in a Pending phase (for example due to an
invalid image reference). In that case, as long as the pod is not Terminating,
it could be a transient issue and Kubelet may make progress after retrying. However,
once such a Pod (in the Pending phase) is also Terminating (due to an external
controller or a user's action), Kubelet is not going to attempt to start the pod.

This could be problematic for Job pods, which have finalizers, because
such a pod would get stuck in the `Pending` phase if the Job controller only
removed the finalizer once the pod reached a terminal phase. Thus, the Job controller
currently (prior to this KEP) considers such a pod as failed, counts it
accordingly, and removes its finalizer, allowing the API server to complete the
pod's deletion.

However, it is important for the podFailurePolicy to match against pods in a
terminal phase, to make sure there are no race conditions with pod failure
conditions or exit codes being added while the pod is matched against the
policy. Thus, in the current Beta implementation we require pods to be in a
terminal phase only when `podFailurePolicy` is specified (and only in that case).
This allows users to opt out of this behavior, without disabling the feature gates,
in case the handling of Pending and Terminating pods is problematic for a Beta user
(see Issue: [Job controller should wait for Pods to terminate to match the failure policy](https://github.com/kubernetes/kubernetes/issues/113855)).
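
A simplified sketch of this conditional behavior follows; the helper name is
ours, and the actual Job controller logic handles more cases (for example
finalizer tracking and uncounted pods):

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// countedAsFailed sketches the Beta-level rule: with a podFailurePolicy the Job
// controller waits for a terminal phase, so that conditions and exit codes are
// final before the pod is matched against the policy; without a policy it keeps
// the previous behavior and also counts a terminating pod that never reached a
// terminal phase.
func countedAsFailed(job *batchv1.Job, pod *corev1.Pod) bool {
	if job.Spec.PodFailurePolicy != nil {
		return pod.Status.Phase == corev1.PodFailed
	}
	return pod.Status.Phase == corev1.PodFailed ||
		(pod.DeletionTimestamp != nil && pod.Status.Phase != corev1.PodSucceeded)
}
```

In this sketch, the opt-out described above amounts to leaving
`podFailurePolicy` unset, which keeps the pre-existing counting behavior.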

For the second iteration of Beta (1.27), in order to solve this issue, we plan
to modify Kubelet to transition such pods (scheduled, Pending and Terminating)
into the `Failed` phase, allowing the Job controller to count them, match them
against the (optional) pod failure policy and remove their Job finalizers, so
that the API server can complete their deletion. A prototype implementation is
prepared here: [WIP: Mark Pending Terminating pods as Failed by Kubelet](https://github.com/kubernetes/kubernetes/pull/115331).
The PR scopes the logic to pods with finalizers only, to avoid an unnecessary API
call by Kubelet for the transition when there is no finalizer, as pods without
finalizers are about to be deleted by the API server anyway.
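
A sketch of the Kubelet-side check introduced by the prototype (the helper name
is ours; the linked PR is the source of truth):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// shouldTransitionToFailed reports whether Kubelet would move a pod to the Failed
// phase under the planned change: the pod is scheduled to the node, never left the
// Pending phase, is already terminating, and still carries finalizers (otherwise
// the API server deletes it anyway and the extra status update would be wasted).
func shouldTransitionToFailed(pod *corev1.Pod) bool {
	return pod.Spec.NodeName != "" &&
		pod.Status.Phase == corev1.PodPending &&
		pod.DeletionTimestamp != nil &&
		len(pod.Finalizers) > 0
}
```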

One release after the `PodDisruptionConditions` feature gate graduates to GA
we plan to simplify the Job controller to consider as failed (and count as such)
only pods which are in a terminal phase, regardless of whether the `podFailurePolicy`
is specified, fixing the issue:
[Job controller should wait for Pods to terminate to match the failure policy](https://github.com/kubernetes/kubernetes/issues/113855).

### Risks and Mitigations

@@ -1218,7 +1255,7 @@ the pod failure does not match any of the specified rules, then default
handling of failed pods applies.
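
To make the evaluation order concrete, below is a minimal sketch of first-match
rule evaluation with a fallback to default handling. It uses local stand-in
types that only mirror the shape of the `podFailurePolicy` fields (an action,
plus an `onExitCodes` requirement with an `In`/`NotIn` operator and a list of
values) rather than the real `batch/v1` types, and it only evaluates exit-code
rules.

```go
package sketch

// exitCodesRequirement mirrors the shape of an onExitCodes requirement.
type exitCodesRequirement struct {
	operator string  // "In" or "NotIn"
	values   []int32 // exit codes the operator is applied to
}

// rule mirrors the shape of a single pod failure policy rule.
type rule struct {
	action      string // e.g. "FailJob", "Ignore", "Count"
	onExitCodes *exitCodesRequirement
}

// evaluate returns the action of the first rule whose onExitCodes requirement
// matches the (non-zero) exit code of a failed container, or "" to signal that
// default handling of the failed pod applies.
func evaluate(rules []rule, exitCode int32) string {
	for _, r := range rules {
		if r.onExitCodes == nil {
			continue // rules matching on pod conditions are out of scope for this sketch
		}
		listed := false
		for _, v := range r.onExitCodes.values {
			if v == exitCode {
				listed = true
				break
			}
		}
		if (r.onExitCodes.operator == "In" && listed) ||
			(r.onExitCodes.operator == "NotIn" && !listed) {
			return r.action
		}
	}
	return "" // no rule matched: default handling of failed pods applies
}
```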

If we limit this feature to use `onExitCodes` only when `restartPolicy=Never`
(see: [limitting this feature](#limitting-this-feature)), then the rules using
(see: [limiting this feature](#limitting-this-feature)), then the rules using
`onExitCodes` are evaluated only against the exit codes in the `state` field
(under `terminated.exitCode`) of `pod.status.containerStatuses` and
`pod.status.initContainerStatuses`. We may also need to check for the exit codes
@@ -1279,9 +1316,9 @@ the following scenarios will be covered with unit tests:
- handling of a pod failure, in accordance with the specified `spec.podFailurePolicy`,
when the failure is associated with
- a failed container with non-zero exit code,
- a dedicated Pod condition indicating termmination originated by a kubernetes component
- a dedicated Pod condition indicating termination originated by a kubernetes component
- adding of the `DisruptionTarget` by Kubelet in case of:
- eviciton due to graceful node shutdown
- eviction due to graceful node shutdown
- eviction due to node pressure
<!--
Additionally, for Alpha try to enumerate the core package you will be touching
@@ -1313,7 +1350,7 @@ The following scenarios will be covered with integration tests:
- pod failure is caused by a failed container with a non-zero exit code

More integration tests might be added to ensure good code coverage based on the
actual implemention.
actual implementation.

<!--
This question should be filled when targeting a release.
@@ -1402,19 +1439,11 @@ Below are some examples to consider, in addition to the aforementioned [maturity
#### GA

- Address reviews and bug reports from Beta users
- The feature is unconditionally enabled
- Simplify the code in job controller responsible for detection of failed pods
based on the fix for pods stuck in the running phase (see: [Marking pods as Failed](marking-pods-as-failed)).
Also, extend PodGC to mark as failed (and delete) terminating pods
(with set deletionTimestamp) which stuck in the pending.
- Discuss within the community (involving CNCF Technical Advisory Group for
Runtime, SIG-node, container runtime implementations) the standarization of
the CRI API to communicate an OOM kill occurrence by contatiner runtime to
Kubelet. In particular, suggest that the API should allow to convey the reason
for OOM killer being invoked (to distinguish if the container was killed due
to exceeding its limits or due to system running low on memory).
- Review of the implementation of hanling OOM kill events by Kubelet depending
on the outcome of the discussions and the standardized API.
- Extend Kubelet to mark scheduled Pending and Terminating pods as Failed (see: [Marking Scheduled Pending and Terminating pods as Failed by Kubelet](#marking-scheduled-pending-and-terminating-pods-as-failed-by-kubelet)).
- Write a blog post about the feature
- Graduate e2e tests as conformance tests
- Lock the `PodDisruptionConditions` and `JobPodFailurePolicy` feature-gates
- Declare deprecation of the `PodDisruptionConditions` and `JobPodFailurePolicy` feature-gates in documentation

<!--
**Note:** Generally we also wait at least two releases between beta and
@@ -1430,7 +1459,14 @@ in back-to-back releases.

#### Deprecation

N/A
In GA+1 release:
- Modify the code to ignore the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates
- Simplify the Job controller to wait for pods to terminate before counting them
as failed or matching them against the pod failure policy, regardless of whether
the pod failure policy is specified (see [kubernetes/kubernetes#113855](https://github.com/kubernetes/kubernetes/issues/113855)).

In GA+2 release:
- Remove the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates

### Upgrade / Downgrade Strategy

@@ -1439,7 +1475,7 @@ N/A
An upgrade to a version which supports this feature should not require any
additional configuration changes. In order to use this feature after an upgrade
users will need to configure their Jobs by specifying `spec.podFailurePolicy`. The
only noticeable difference in behaviour, without specifying `spec.podFailurePolicy`,
only noticeable difference in behavior, without specifying `spec.podFailurePolicy`,
is that Pods terminated by kubernetes components will have an additional
condition appended to `status.conditions`.
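
For illustration, such an appended condition looks roughly like the following
`corev1.PodCondition`; the `Reason` and `Message` strings are illustrative, as
they depend on which component terminated the pod.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleDisruptionCondition is an illustrative DisruptionTarget condition as it
// could appear in status.conditions of a pod deleted by the taint manager.
var exampleDisruptionCondition = corev1.PodCondition{
	Type:               corev1.PodConditionType("DisruptionTarget"),
	Status:             corev1.ConditionTrue,
	Reason:             "DeletionByTaintManager", // illustrative reason string
	Message:            "Taint manager: deleting due to NoExecute taint",
	LastTransitionTime: metav1.Now(),
}
```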

@@ -1654,7 +1690,7 @@ Manual test performed to simulate the upgrade->downgrade->upgrade scenario:
- Scenario 2:
- Create a job with a long running containers and `backoffLimit=0`.
- Verify that the job continues after the node is uncordoned
1. Disable the feature gates. Verify that the above scenarios result in default behaviour:
1. Disable the feature gates. Verify that the above scenarios result in default behavior:
- In scenario 1: the job restarts pods failed with exit code `42`
- In scenario 2: the job is failed due to exceeding the `backoffLimit` as the failed pod failed during the draining
1. Re-enable the feature gates
@@ -1953,7 +1989,7 @@ technics apply):
is an increase of the Job controller processing time.
- Inspect the Job controller's `job_pods_finished_total` metric to check if
the numbers of pod failures handled by specific actions (counted
by the `failure_policy_action` label) agree with the expetations.
by the `failure_policy_action` label) agree with the expectations.
For example, if a user configures job failure policy with `Ignore` action for
the `DisruptionTarget` condition, then a node drain is expected to increase
the metric for `failure_policy_action=Ignore`.
@@ -1963,7 +1999,7 @@

- 2022-06-23: Initial KEP merged
- 2022-07-12: Preparatory PR "Refactor gc_controller to do not use the deletePod stub" merged
- 2022-07-14: Preparatory PR "efactor taint_manager to do not use getPod and getNode stubs" merged
- 2022-07-14: Preparatory PR "Refactor taint_manager to do not use getPod and getNode stubs" merged
- 2022-07-20: Preparatory PR "Add integration test for podgc" merged
- 2022-07-28: KEP updates merged
- 2022-08-01: Additional KEP updates merged
@@ -1972,7 +2008,7 @@ technics apply):
- 2022-08-04: PR "Support handling of pod failures with respect to the configured rules" merged
- 2022-09-09: Bugfix PR for test "Fix the TestRoundTripTypes by adding default to the fuzzer" merged
- 2022-09-26: Prepared PR for KEP Beta update. Summary of the changes:
- propsal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
- proposal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
- DisruptionTarget for evictions due to graceful node shutdown, admission errors, node pressure or Pod admission errors
- ResourceExhausted for evictions due to OOM killer and exceeding Pod's ephemeral-storage limits
- extended the review of pod eviction scenarios with kubelet-initiated pod evictions:
@@ -1990,6 +2026,15 @@ technics apply):
- 2022-10-27: PR "Use SSA to add pod failure conditions" ([link](https://github.com/kubernetes/kubernetes/pull/113304))
- 2022-10-31: PR "Extend metrics with the new labels" ([link](https://github.com/kubernetes/kubernetes/pull/113324))
- 2022-11-03: PR "Fix disruption controller permissions to allow patching pod's status" ([link](https://github.com/kubernetes/kubernetes/pull/113580))
- 2022-11-08: KEP update for Beta ([link](https://github.com/kubernetes/enhancements/pull/3646)) with main changes:
- do not introduce the `ResourceExhausted` condition (it was planned to be used for pods killed due to OOM killer or exceeding ephemeral storage limits)
- do not add `DisruptionTarget` condition in case of admission failures
- 2022-11-11: PR "Fix match onExitCodes when Pod is not terminated" ([link](https://github.com/kubernetes/kubernetes/pull/113856))
- 2022-11-11: PR "Wait for Pods to finish before considering Failed in Job" ([link](https://github.com/kubernetes/kubernetes/pull/113860))
- 2022-11-15: PR "Add e2e test to ignore failures with 137 exit code" ([link](https://github.com/kubernetes/kubernetes/pull/113927))
- 2023-01-03: PR "Fix clearing of rate-limiter for the queue of checks for cleaning stale pod disruption conditions" ([link](https://github.com/kubernetes/kubernetes/pull/114770))
- 2023-01-09: PR "Adjust DisruptionTarget condition message to do not include preemptor pod metadata" ([link](https://github.com/kubernetes/kubernetes/pull/114914))
- 2023-01-13: PR "PodGC should not add DisruptionTarget condition for pods which are in terminal phase" ([link](https://github.com/kubernetes/kubernetes/pull/115056))

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
@@ -24,13 +24,13 @@ stage: beta
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.26"
latest-milestone: "v1.27"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.25"
beta: "v1.26"
stable: "v1.27"
stable: "v1.28"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled