Update KEP with criteria for GA
mimowo committed Jan 26, 2023
1 parent 34b46e8 commit c3df52a
Showing 2 changed files with 95 additions and 50 deletions.
141 changes: 93 additions & 48 deletions keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
@@ -108,7 +108,8 @@ tags, and then generate with `hack/update-toc.sh`.
- [Resource limits exceeded](#resource-limits-exceeded)
- [JobSpec API alternatives](#jobspec-api-alternatives)
- [Failing delete after a condition is added](#failing-delete-after-a-condition-is-added)
- [Marking pods as Failed](#marking-pods-as-failed)
- [Marking pods as Failed by PodGC](#marking-pods-as-failed-by-podgc)
- [Marking Scheduled Pending and Terminating pods as Failed by Kubelet](#marking-scheduled-pending-and-terminating-pods-as-failed-by-kubelet)
- [Risks and Mitigations](#risks-and-mitigations)
- [Garbage collected pods](#garbage-collected-pods)
- [Evolving condition types](#evolving-condition-types)
@@ -230,7 +231,7 @@ thousands of nodes requires usage of pod restart policies in order
to account for infrastructure failures.

Currently, kubernetes Job API offers a way to account for infrastructure
failures by setting `.backoffLimit > 0`. However, this mechanism intructs the
failures by setting `.backoffLimit > 0`. However, this mechanism instructs the
job controller to restart all failed pods - regardless of the root cause
of the failures. Thus, in some scenarios this leads to unnecessary
restarts of many pods, resulting in a waste of time and computational
@@ -354,7 +355,7 @@ As a machine learning researcher, I run jobs comprising thousands
of long-running pods on a cluster comprising thousands of nodes. The jobs often
run at night or over weekend without any human monitoring. In order to account
for random infrastructure failures we define `.backoffLimit: 6` for the job.
However, a signifficant portion of the failures happen due to bugs in code.
However, a significant portion of the failures happen due to bugs in code.
Moreover, the failures may happen late during the program execution time. In
such case, restarting such a pod results in wasting a lot of computational time.

@@ -729,7 +730,7 @@ in different messages for pods.
- Reproduction: Run kube-controller-manager with disabled taint-manager (with the
flag `--enable-taint-manager=false`). Then, run a job with a long-running pod and
disconnect the node
- Comments: handled by node lifcycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
- Comments: handled by node lifecycle controller in: `controller/nodelifecycle/node_lifecycle_controller.go`.
However, the pod phase remains `Running`.
- Pod status:
- status: Unknown
@@ -762,7 +763,7 @@ In Alpha, there is no support for Pod conditions for failures or disruptions ini

For Beta we introduce handling of Pod failures initiated by Kubelet by adding
the pod disruption condition (introduced in Alpha) in case of disruptions
initiated by kubetlet (see [Design details](#design-details)).
initiated by Kubelet (see [Design details](#design-details)).

Kubelet can also evict a pod in some scenarios which are not covered with
adding a pod failure condition:
@@ -818,9 +819,12 @@ for event handling and
for the constant definition)
and CRI-O (see
[here](https://github.com/cri-o/cri-o/blob/edf889bd277ae9a8aa699c354f12baaef3d9b71d/server/container_status.go#L88-L89)).
- setting the `reason` field to `OOMKilled` is not standardized, either. We have
started an effort to standardize the handling of OOM killed containers
(see: [Documentation for the CRI API reason field to standardize the field for containers terminated by OOM killer](https://github.com/kubernetes/kubernetes/pull/112977)).
- setting the `reason` field to `OOMKilled` is not standardized, either. During
the Beta phase implementation we discussed the standardization issue within the
community (involving CNCF Technical Advisory Group for Runtime, SIG-node,
container runtime implementations). We have also started an effort to standardize
the handling of OOM killed containers (see:
[Documentation for the CRI API reason field to standardize the field for containers terminated by OOM killer](https://github.com/kubernetes/kubernetes/pull/112977)).
However, in the process it turned out that in some configurations
(for example the CRI-O with cgroupv2, see:
[Add e2e_node test for oom killed container reason](https://github.com/kubernetes/kubernetes/pull/113205)),
@@ -830,8 +834,12 @@ but also when the system is running low on memory. In such scenario there
can be race conditions in which both the `DisruptionTarget` condition and the
`ResourceExhausted` could be added.

Thus, we decide not to annotate the scenarios with the `ResourceExhausted`
condition. While there are not known issues with detection of the exceeding of
Thus, we've decided not to annotate the scenarios with the `ResourceExhausted`
condition in this KEP. While the effort is welcomed by the community,
handling of exceeded limits might be done as a follow-up KEP once the
groundwork of introducing pod failure conditions is done in this KEP.

While there are no known issues with detection of the exceeding of
Pod's ephemeral storage limits, we prefer to avoid future extension of the
semantics of the new condition type. Alternatively, we could introduce a pair of
dedicated pod condition types: `OOMKilled` and `EphemeralStorageLimitExceeded`.
@@ -863,13 +871,13 @@ dies) between appending a pod condition and deleting the pod.
In particular, scheduler can possibly decide to preempt
a different pod the next time (or none). This would leave a pod with a
condition that it was preempted, when it actually wasn't. This in turn
could lead to inproper handling of the pod by the job controller.
could lead to improper handling of the pod by the job controller.

As a solution we implement a worker, part of the disruption
controller, which clears the pod condition added if `DeletionTimestamp` is
not added to the pod for a long enough time (for example 2 minutes).
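
As an illustration, below is a minimal sketch of the staleness check such a
worker could perform. This is not the actual kube-controller-manager code: the
helper name is ours, and the real worker would additionally patch the pod
status to clear the condition once the check fires.

```go
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// disruptionTarget mirrors the DisruptionTarget pod condition type introduced by this KEP.
const disruptionTarget = corev1.PodConditionType("DisruptionTarget")

// staleDisruptionCondition returns the index of a DisruptionTarget condition that
// should be cleared because the pod was never actually marked for deletion within
// the timeout (for example 2 minutes), or -1 if there is nothing to clean up.
func staleDisruptionCondition(pod *corev1.Pod, now time.Time, timeout time.Duration) int {
	if pod.DeletionTimestamp != nil {
		return -1 // deletion is in progress, so the condition is legitimate
	}
	for i, cond := range pod.Status.Conditions {
		if cond.Type == disruptionTarget && cond.Status == corev1.ConditionTrue &&
			now.Sub(cond.LastTransitionTime.Time) >= timeout {
			return i
		}
	}
	return -1
}
```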

#### Marking pods as Failed
#### Marking pods as Failed by PodGC

As indicated by our experiments (see: [Disconnected node](#disconnected-node))
a failed pod may get stuck in the `Running` phase when there is no kubelet
@@ -884,19 +892,48 @@ However, as disabling taint-manager is deprecated it is not a concern for this
KEP.

For Alpha, we implemented a fix for this issue by setting the pod phase as `Failed`
in podgc.

For Beta, we are going to simplify the code in job controller responsible for
detecting pod failures. For this purpose we need to make sure that pods stuck
in the pending phase which are terminating (with set deletionTimestamp) are
eventually marked as failed. Such pods may either be not scheduled
(for example when they are unschedulable due to node affinity configuration for
which no node exists in the cluster) or scheduled (for example when they fail
to pull the the docker image). In case of non-scheduled pods they will be moved
to the `Failed` phase by the PodGC controller as it marks as failed all
non-scheduled terminating pods (after the Alpha changes for this feature, see
above). We plan to extend the PodGC controller to also move to the failed phase
(and delete) scheduled pods in the pending state which are terminating.
in PodGC. Similarly, for Alpha we modified PodGC to mark as `Failed`
unscheduled Pending pods which are Terminating.
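
A minimal sketch of how such pods can be recognized (an illustration of the
rule described above, not the actual PodGC code; the helper name is ours):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// unscheduledTerminatingNonTerminal reports whether PodGC would move the pod to
// the Failed phase under the Alpha change described above: the pod is terminating
// (deletionTimestamp set), was never scheduled to a node, and has not reached a
// terminal phase on its own.
func unscheduledTerminatingNonTerminal(pod *corev1.Pod) bool {
	terminating := pod.DeletionTimestamp != nil
	unscheduled := pod.Spec.NodeName == ""
	terminal := pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed
	return terminating && unscheduled && !terminal
}
```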

#### Marking Scheduled Pending and Terminating pods as Failed by Kubelet

A scheduled pod may also get stuck in a Pending phase (for example due to an
invalid image reference). In that case, as long as the pod is not Terminating,
it could be a transient issue and Kubelet may make progress after retrying. However,
once such a Pod (in the Pending phase) is also Terminating (due to an external
controller or a user's action), Kubelet is not going to attempt to start the pod.

This could be problematic for Job pods, which have finalizers, because
such a pod would get stuck in the `Pending` phase if the Job controller only
removed the finalizer once the pod reached a terminal phase. Thus, the Job controller
currently (prior to this KEP) considers such a pod as failed, counts it
accordingly, and removes its finalizer, allowing the API server to complete the
pod's deletion.

However, it is important for the podFailurePolicy to match against pods in a
terminal phase, to make sure there are no race conditions with pod failure
conditions or exit codes being added while the pod is matched against the
policy. Thus, in the current Beta implementation we require pods to be in a
terminal phase only when `podFailurePolicy` is specified (and only in that case).
This allows users to opt out of this behavior, without disabling the feature gates,
in case the handling of Pending and Terminating pods is problematic for a Beta user
(see Issue: [Job controller should wait for Pods to terminate to match the failure policy](https://github.com/kubernetes/kubernetes/issues/113855)).
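
A simplified sketch of this conditional behavior follows; the helper name is
ours, and the actual Job controller logic handles more cases (for example
finalizer tracking and uncounted pods):

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// countedAsFailed sketches the Beta-level rule: with a podFailurePolicy the Job
// controller waits for a terminal phase, so that conditions and exit codes are
// final before the pod is matched against the policy; without a policy it keeps
// the previous behavior and also counts a terminating pod that never reached a
// terminal phase.
func countedAsFailed(job *batchv1.Job, pod *corev1.Pod) bool {
	if job.Spec.PodFailurePolicy != nil {
		return pod.Status.Phase == corev1.PodFailed
	}
	return pod.Status.Phase == corev1.PodFailed ||
		(pod.DeletionTimestamp != nil && pod.Status.Phase != corev1.PodSucceeded)
}
```

In this sketch, the opt-out described above amounts to leaving
`podFailurePolicy` unset, which keeps the pre-existing counting behavior.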

For the second iteration of Beta (1.27), in order to solve this issue, we plan
to modify Kubelet to transition such pods (scheduled, Pending and Terminating)
into the `Failed` phase, allowing the Job controller to count them, match them
against the (optional) pod failure policy and remove their Job finalizers, so
that the API server can complete their deletion. A prototype implementation is
prepared here: [WIP: Mark Pending Terminating pods as Failed by Kubelet](https://github.com/kubernetes/kubernetes/pull/115331).
The PR scopes the logic to pods with finalizers only, to avoid an unnecessary API
call by Kubelet for the transition when there is no finalizer, as pods without
finalizers are about to be deleted by the API server anyway.
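
A sketch of the Kubelet-side check introduced by the prototype (the helper name
is ours; the linked PR is the source of truth):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// shouldTransitionToFailed reports whether Kubelet would move a pod to the Failed
// phase under the planned change: the pod is scheduled to the node, never left the
// Pending phase, is already terminating, and still carries finalizers (otherwise
// the API server deletes it anyway and the extra status update would be wasted).
func shouldTransitionToFailed(pod *corev1.Pod) bool {
	return pod.Spec.NodeName != "" &&
		pod.Status.Phase == corev1.PodPending &&
		pod.DeletionTimestamp != nil &&
		len(pod.Finalizers) > 0
}
```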

One release after the `PodDisruptionConditions` feature gate graduates to GA
we plan to simplify the Job controller to consider as failed (and count as such)
only pods which are in a terminal phase, regardless of whether the `podFailurePolicy`
is specified, fixing the issue:
[Job controller should wait for Pods to terminate to match the failure policy](https://github.com/kubernetes/kubernetes/issues/113855).

### Risks and Mitigations

@@ -1218,7 +1255,7 @@ the pod failure does not match any of the specified rules, then default
handling of failed pods applies.
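
To make the evaluation order concrete, below is a minimal sketch of first-match
rule evaluation with a fallback to default handling. It uses local stand-in
types that only mirror the shape of the `podFailurePolicy` fields (an action,
plus an `onExitCodes` requirement with an `In`/`NotIn` operator and a list of
values) rather than the real `batch/v1` types, and it only evaluates exit-code
rules.

```go
package sketch

// exitCodesRequirement mirrors the shape of an onExitCodes requirement.
type exitCodesRequirement struct {
	operator string  // "In" or "NotIn"
	values   []int32 // exit codes the operator is applied to
}

// rule mirrors the shape of a single pod failure policy rule.
type rule struct {
	action      string // e.g. "FailJob", "Ignore", "Count"
	onExitCodes *exitCodesRequirement
}

// evaluate returns the action of the first rule whose onExitCodes requirement
// matches the (non-zero) exit code of a failed container, or "" to signal that
// default handling of the failed pod applies.
func evaluate(rules []rule, exitCode int32) string {
	for _, r := range rules {
		if r.onExitCodes == nil {
			continue // rules matching on pod conditions are out of scope for this sketch
		}
		listed := false
		for _, v := range r.onExitCodes.values {
			if v == exitCode {
				listed = true
				break
			}
		}
		if (r.onExitCodes.operator == "In" && listed) ||
			(r.onExitCodes.operator == "NotIn" && !listed) {
			return r.action
		}
	}
	return "" // no rule matched: default handling of failed pods applies
}
```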

If we limit this feature to use `onExitCodes` only when `restartPolicy=Never`
(see: [limitting this feature](#limitting-this-feature)), then the rules using
(see: [limiting this feature](#limitting-this-feature)), then the rules using
`onExitCodes` are evaluated only against the exit codes in the `state` field
(under `terminated.exitCode`) of `pod.status.containerStatuses` and
`pod.status.initContainerStatuses`. We may also need to check for the exit codes
@@ -1279,9 +1316,9 @@ the following scenarios will be covered with unit tests:
- handling of a pod failure, in accordance with the specified `spec.podFailurePolicy`,
when the failure is associated with
- a failed container with non-zero exit code,
- a dedicated Pod condition indicating termmination originated by a kubernetes component
- a dedicated Pod condition indicating termination originated by a kubernetes component
- adding of the `DisruptionTarget` by Kubelet in case of:
- eviciton due to graceful node shutdown
- eviction due to graceful node shutdown
- eviction due to node pressure
<!--
Additionally, for Alpha try to enumerate the core package you will be touching
@@ -1313,7 +1350,7 @@ The following scenarios will be covered with integration tests:
- pod failure is caused by a failed container with a non-zero exit code

More integration tests might be added to ensure good code coverage based on the
actual implemention.
actual implementation.

<!--
This question should be filled when targeting a release.
@@ -1402,19 +1439,11 @@ Below are some examples to consider, in addition to the aforementioned [maturity
#### GA

- Address reviews and bug reports from Beta users
- The feature is unconditionally enabled
- Simplify the code in job controller responsible for detection of failed pods
based on the fix for pods stuck in the running phase (see: [Marking pods as Failed](marking-pods-as-failed)).
Also, extend PodGC to mark as failed (and delete) terminating pods
(with set deletionTimestamp) which stuck in the pending.
- Discuss within the community (involving CNCF Technical Advisory Group for
Runtime, SIG-node, container runtime implementations) the standarization of
the CRI API to communicate an OOM kill occurrence by contatiner runtime to
Kubelet. In particular, suggest that the API should allow to convey the reason
for OOM killer being invoked (to distinguish if the container was killed due
to exceeding its limits or due to system running low on memory).
- Review of the implementation of hanling OOM kill events by Kubelet depending
on the outcome of the discussions and the standardized API.
- Extend Kubelet to mark scheduled Pending and Terminating pods as Failed (see: [Marking Scheduled Pending and Terminating pods as Failed by Kubelet](#marking-scheduled-pending-and-terminating-pods-as-failed-by-kubelet)).
- Write a blog post about the feature
- Graduate e2e tests as conformance tests
- Lock the `PodDisruptionConditions` and `JobPodFailurePolicy` feature-gates
- Declare deprecation of the `PodDisruptionConditions` and `JobPodFailurePolicy` feature-gates in documentation

<!--
**Note:** Generally we also wait at least two releases between beta and
@@ -1430,7 +1459,14 @@ in back-to-back releases.

#### Deprecation

N/A
In GA+1 release:
- Modify the code to ignore the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates
- Simplify the Job controller to wait for pods to terminate before counting them
as failed or matching them against the pod failure policy, regardless of whether
the pod failure policy is specified (see [kubernetes/kubernetes#113855](https://github.com/kubernetes/kubernetes/issues/113855)).

In GA+2 release:
- Remove the `PodDisruptionConditions` and `JobPodFailurePolicy` feature gates

### Upgrade / Downgrade Strategy

@@ -1439,7 +1475,7 @@ N/A
An upgrade to a version which supports this feature should not require any
additional configuration changes. In order to use this feature after an upgrade
users will need to configure their Jobs by specifying `spec.podFailurePolicy`. The
only noticeable difference in behaviour, without specifying `spec.podFailurePolicy`,
only noticeable difference in behavior, without specifying `spec.podFailurePolicy`,
is that Pods terminated by kubernetes components will have an additional
condition appended to `status.conditions`.
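
For illustration, such an appended condition looks roughly like the following
`corev1.PodCondition`; the `Reason` and `Message` strings are illustrative, as
they depend on which component terminated the pod.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleDisruptionCondition is an illustrative DisruptionTarget condition as it
// could appear in status.conditions of a pod deleted by the taint manager.
var exampleDisruptionCondition = corev1.PodCondition{
	Type:               corev1.PodConditionType("DisruptionTarget"),
	Status:             corev1.ConditionTrue,
	Reason:             "DeletionByTaintManager", // illustrative reason string
	Message:            "Taint manager: deleting due to NoExecute taint",
	LastTransitionTime: metav1.Now(),
}
```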

@@ -1654,7 +1690,7 @@ Manual test performed to simulate the upgrade->downgrade->upgrade scenario:
- Scenario 2:
- Create a job with a long running containers and `backoffLimit=0`.
- Verify that the job continues after the node is uncordoned
1. Disable the feature gates. Verify that the above scenarios result in default behaviour:
1. Disable the feature gates. Verify that the above scenarios result in default behavior:
- In scenario 1: the job restarts pods failed with exit code `42`
- In scenario 2: the job is failed due to exceeding the `backoffLimit` as the failed pod failed during the draining
1. Re-enable the feature gates
@@ -1953,7 +1989,7 @@ technics apply):
is an increase of the Job controller processing time.
- Inspect the Job controller's `job_pods_finished_total` metric to check if
the numbers of pod failures handled by specific actions (counted
by the `failure_policy_action` label) agree with the expetations.
by the `failure_policy_action` label) agree with the expectations.
For example, if a user configures job failure policy with `Ignore` action for
the `DisruptionTarget` condition, then a node drain is expected to increase
the metric for `failure_policy_action=Ignore`.
@@ -1963,7 +1999,7 @@

- 2022-06-23: Initial KEP merged
- 2022-07-12: Preparatory PR "Refactor gc_controller to do not use the deletePod stub" merged
- 2022-07-14: Preparatory PR "efactor taint_manager to do not use getPod and getNode stubs" merged
- 2022-07-14: Preparatory PR "Refactor taint_manager to do not use getPod and getNode stubs" merged
- 2022-07-20: Preparatory PR "Add integration test for podgc" merged
- 2022-07-28: KEP updates merged
- 2022-08-01: Additional KEP updates merged
@@ -1972,7 +2008,7 @@ technics apply):
- 2022-08-04: PR "Support handling of pod failures with respect to the configured rules" merged
- 2022-09-09: Bugfix PR for test "Fix the TestRoundTripTypes by adding default to the fuzzer" merged
- 2022-09-26: Prepared PR for KEP Beta update. Summary of the changes:
- propsal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
- proposal to extend kubelet to add the following pod conditions when evicting a pod (see [Design details](#design-details)):
- DisruptionTarget for evictions due to graceful node shutdown, admission errors, node pressure or Pod admission errors
- ResourceExhausted for evictions due to OOM killer and exceeding Pod's ephemeral-storage limits
- extended the review of pod eviction scenarios with kubelet-initiated pod evictions:
@@ -1990,6 +2026,15 @@ technics apply):
- 2022-10-27: PR "Use SSA to add pod failure conditions" ([link](https://github.com/kubernetes/kubernetes/pull/113304))
- 2022-10-31: PR "Extend metrics with the new labels" ([link](https://github.com/kubernetes/kubernetes/pull/113324))
- 2022-11-03: PR "Fix disruption controller permissions to allow patching pod's status" ([link](https://github.com/kubernetes/kubernetes/pull/113580))
- 2022-11-08: KEP update for Beta ([link](https://github.com/kubernetes/enhancements/pull/3646)) with main changes:
- do not introduce the `ResourceExhausted` condition (it was planned to be used for pods killed due to OOM killer or exceeding ephemeral storage limits)
- do not add `DisruptionTarget` condition in case of admission failures
- 2022-11-11: PR "Fix match onExitCodes when Pod is not terminated" ([link](https://github.com/kubernetes/kubernetes/pull/113856))
- 2022-11-11: PR "Wait for Pods to finish before considering Failed in Job" ([link](https://github.com/kubernetes/kubernetes/pull/113860))
- 2022-11-15: PR "Add e2e test to ignore failures with 137 exit code" ([link](https://github.com/kubernetes/kubernetes/pull/113927))
- 2023-01-03: PR "Fix clearing of rate-limiter for the queue of checks for cleaning stale pod disruption conditions" ([link](https://github.com/kubernetes/kubernetes/pull/114770))
- 2023-01-09: PR "Adjust DisruptionTarget condition message to do not include preemptor pod metadata" ([link](https://github.com/kubernetes/kubernetes/pull/114914))
- 2023-01-13: PR "PodGC should not add DisruptionTarget condition for pods which are in terminal phase" ([link](https://github.com/kubernetes/kubernetes/pull/115056))

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
@@ -24,13 +24,13 @@ stage: beta
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.26"
latest-milestone: "v1.27"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
alpha: "v1.25"
beta: "v1.26"
stable: "v1.27"
stable: "v1.28"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled