WIP

mimowo · mimowo · commit 7349dc48b7db · 2023-04-26T16:28:12.000+02:00
diff --git a/keps/NNNN-kep-template/README.md b/keps/NNNN-kep-template/README.md
@@ -133,10 +133,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
   - [ ] e2e Tests for all Beta API Operations (endpoints)
-  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
-  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
@@ -577,10 +577,10 @@ Recall that end users cannot usually observe component logs or access metrics.
 -->
 
 - [ ] Events
-  - Event Reason: 
+  - Event Reason:
 - [ ] API .status
-  - Condition name: 
-  - Other field: 
+  - Condition name:
+  - Other field:
 - [ ] Other (treat as last resort)
   - Details:
 
diff --git a/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md b/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md
@@ -13,12 +13,18 @@
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
+  - [Job API](#job-api)
+  - [Tracking the number of failures](#tracking-the-number-of-failures)
+  - [FailIndex action](#failindex-action)
   - [Test Plan](#test-plan)
       - [Prerequisite testing updates](#prerequisite-testing-updates)
       - [Unit tests](#unit-tests)
       - [Integration tests](#integration-tests)
       - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -43,10 +49,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
   - [ ] e2e Tests for all Beta API Operations (endpoints)
-  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
-  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
@@ -61,18 +67,18 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 ## Summary
 
 This KEP extends the indexed job API to support indexed jobs where each index is independent,
-and a failed index does not cause the other indices to automatically clean up.
+and a failed index does not cause the other indexes to automatically clean up.
 
 
 ## Motivation
 
-Currently, the indices of an indexed job share a single backoff limit. 
+Currently, the indexes of an indexed job share a single backoff limit.
 When the job reaches this shared backoff limit, the job controller marks the entire
-job as failed, and the resources are cleaned up, including indices that have yet
-to run to completion. 
+job as failed, and the resources are cleaned up, including indexes that have yet
+to run to completion.
 
 As a result, the current implementation does not cover the situation where the workload
-is truly embarrassingly parallel and each index is completely independent of other indices. 
+is truly embarrassingly parallel and each index is completely independent of other indexes.
 
 For instance, if indexed jobs were used as the basis for a suite of long-running integration tests,
 then each test run would only be able to find a single test failure.
@@ -82,11 +88,16 @@ showing that this is a common use case that should be supported by Kubernetes.
 
 ### Goals
 
-Support the use case where each indexed job has its own backoff limit, and all 
-indices of an indexed job can complete even when a single index fails.
+- allow counting failures towards the backoffLimit independently for each index,
+- allow failing an index (stop recreating pods for the index) using the pod failure policy.
 
 ### Non-Goals
 
+- allow specifying the number of indexes required to mark the entire job as failed or completed.
+This is left to be addressed under: https://github.com/kubernetes/kubernetes/issues/117600.
+- allow controlling the number of retries per index when the pod's `restartPolicy=OnFailure`.
+This is left to be addressed under: https://github.com/kubernetes/enhancements/issues/3322.
+
 <!--
 What is out of scope for this KEP? Listing non-goals helps to focus discussion
 and make progress.
@@ -96,16 +107,78 @@ and make progress.
 
 We propose the addition of a new enum field in PodFailurePolicy called backoffLimitTarget,
 that accepts the values Job and Index. Job (the default value) would have the same behavior
-as the current implementation of the backoff limit where the limit shared between all indices.
+as the current implementation of the backoff limit, where the limit is shared between all indexes.
 Index would represent this new set of use cases, where the backoff limit is applied to each
 index individually.
 
 We also propose the addition of a new action in PodFailurePolicy called FailIndex. This would be
-analagous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
+analogous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
 while the rest continue until completion.
 
 ### User Stories (Optional)
 
+#### Story 1
+
+As a CI/CD platform administrator, I want to use Indexed Jobs to run
+suites of integration tests, one suite per index. A failure of one suite should
+not interrupt the other suites. Additionally, I would like to be able
+to control the maximum number of retries per index.
+
+The following Job configuration could satisfy my use case:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed
+  backoffLimit: 1
+  backoffLimitPerIndex: true
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: job-container
+        image: job-image
+        command: ["./tests-runner"]
+```
+
+In this case, we run 10 indexes, each representing one test suite.
+Because of possible flakes, we allow 1 failure per index.
+
+#### Story 2
+
+As the CI/CD platform administrator from [Story 1](#story-1), I want to be able
+to control the failures with the pod failure policy. In particular, I want
+to be able to use the pod failure policy to avoid restarts of some indexes, based
+on exit codes.
+
+The following Job configuration could satisfy my use case:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed
+  backoffLimit: 1
+  backoffLimitPerIndex: true
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: job-container
+        image: job-image
+        command: ["./tests-runner"]
+  podFailurePolicy:
+    rules:
+    - action: FailIndex
+      onExitCodes:
+        operator: In
+        values: [42]
+```
 
 ### Notes/Constraints/Caveats (Optional)
 
@@ -132,19 +205,73 @@ Consider including folks who also work outside the SIG or subproject.
 
 ## Design Details
 
-A possible PodFailurePolicy spec might look something like this with the new additions
-
-```
-podFailurePolicy:
-  rules:
-  - action: FailJob|FailIndex
-    onExitCodes:
-      containerName: main
-      operator: In
-      values: [42]
-  backoffLimitTarget: Job|Index
+We introduce a new Job API field called `.spec.backoffLimitPerIndex`. When it is
+set to `true`, failures are still counted towards the `.spec.backoffLimit`, but
+the count is kept independently for each index. This mode is only supported when
+the pod's `restartPolicy=Never`.
+
+### Job API
+
+We extend the Job API to allow applying different actions depending
+on the conditions associated with the pod failure.
+
+```golang
+// PodFailurePolicyAction specifies how a Pod failure is handled.
+// +enum
+type PodFailurePolicyAction string
+
+const (
+	// This is an action which might be taken on a pod failure - mark the
+	// Job's index as failed to avoid pod restarts within this index.
+	PodFailurePolicyActionFailIndex PodFailurePolicyAction = "FailIndex"
+	...
+)
+...
+
+// JobSpec describes how the job execution will look like.
+type JobSpec struct {
+	...
+	// Indicates if the number of retries specified by backoffLimit is counted
+	// globally or within an index. When set to true, the number of retries is
+	// kept per index in the batch.kubernetes.io/job-index-retry-number Pod
+	// annotation. It can only be set to true when Job's completionMode=Indexed.
+	// Defaults to false.
+	// +optional
+	BackoffLimitPerIndex *bool
+	// Specifies the number of retries before marking this job failed. When
+	// BackoffLimitPerIndex=true, then it specifies the number of retries
+	// for a given index.
+	// Defaults to 6.
+	// +optional
+	BackoffLimit *int32
+	...
+}
+```
 
+### Tracking the number of failures
+
+In order to determine if the `backoffLimit` is exceeded, we need to keep track
+of the number of failures per index when `restartPolicy=Never`. For this
+purpose we use the Pod annotation `batch.kubernetes.io/job-index-retry-number`,
+which holds the number of pod retries for a given index. It is set
+to `0` for the first pod created for a given index.
+
+When a pod with `k` retries fails and the index has not failed yet, i.e.
+the number of retries is still smaller than the backoff limit per index, we
+need to create a new pod with `k+1` retries. For this purpose we need
+to delay the deletion of the old pod, and thus the removal of its finalizer, until the new pod
+corresponding to the index is created.
+
+Once the number of retries for a given index reaches the `backoffLimit`, we need
+to mark the index as failed so that we can remove the pods. For this reason
+we keep the set of failed indexes in the Job status `failedIndexes` field.
+
+### FailIndex action
+
+In order to allow early termination of indexes with the `FailIndex` action,
+we also store the set of failed indexes in the `failedIndexes` field
+in the Job status, analogous to the way `completedIndexes` is kept.
+
 ### Test Plan
 
 <!--
@@ -282,6 +409,30 @@ in back-to-back releases.
 - Deprecate the flag
 -->
 
+#### Alpha
+
+- The feature is implemented behind the `JobBackoffLimitPerIndex` feature flag
+- The support for the `FailIndex` action is implemented behind the
+`JobPodFailurePolicy` and `JobBackoffLimitPerIndex` feature flags
+- The `FailIndex` action cannot be used when creating a new Job
+- The `JobBackoffLimitPerIndex` feature flag is disabled by default
+- Tests: unit and integration
+
+#### Beta
+
+- Address reviews and bug reports from Alpha users
+- E2e tests are in Testgrid and linked in KEP
+- The `FailIndex` action can be used for newly created Jobs
+- The feature flag is enabled by default
+
+#### GA
+
+- Address reviews and bug reports from Beta users
+- Write a blog post about the feature
+- Graduate e2e tests as conformance tests
+- Lock the `JobPodFailurePolicy` and `JobBackoffLimitPerIndex` feature-gates
+- Declare deprecation of the `JobPodFailurePolicy` and `JobBackoffLimitPerIndex` feature-gates in documentation
+
 ### Upgrade / Downgrade Strategy
 
 <!--
@@ -468,10 +619,10 @@ Recall that end users cannot usually observe component logs or access metrics.
 -->
 
 - [ ] Events
-  - Event Reason: 
+  - Event Reason:
 - [ ] API .status
-  - Condition name: 
-  - Other field: 
+  - Condition name:
+  - Other field:
 - [ ] Other (treat as last resort)
   - Details:
 
diff --git a/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml b/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml
@@ -1,36 +1,43 @@
 title: Backoff Limits Per Index For Indexed Jobs
 kep-number: 3850
 authors:
+  - "@mimowo"
   - "@jensentanlo"
 owning-sig: sig-apps
 participating-sigs:
 status: provisional
-creation-date: 2023-01-23
+creation-date: 2023-04-26
 reviewers:
-  - TBD
+  - "@liggitt"
+  - "@alculquicondor"
 approvers:
-  - TBD
-see-also:
-  - "/keps/sig-apps/2214-indexed-job"
-replaces:
+  - "@soltysh"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage:
+stage: alpha
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone:
+latest-milestone: "v1.28"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
-  alpha:
-  beta:
+  alpha: "v1.28"
+  beta: "v1.29"
   stable:
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
 feature-gates:
+  - name: JobBackoffLimitPerIndex
+    components:
+      - kube-apiserver
+      - kube-controller-manager
+  - name: JobPodFailurePolicy
+    components:
+      - kube-apiserver
+      - kube-controller-manager
 disable-supported: true
 
 # The following PRR answers are required at beta release