WIP

mimowo · mimowo · commit 5230713c86a3 · 2023-04-26T12:56:45.000+02:00
diff --git a/keps/NNNN-kep-template/README.md b/keps/NNNN-kep-template/README.md
@@ -133,10 +133,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
   - [ ] e2e Tests for all Beta API Operations (endpoints)
-  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
-  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
@@ -577,10 +577,10 @@ Recall that end users cannot usually observe component logs or access metrics.
 -->
 
 - [ ] Events
-  - Event Reason: 
+  - Event Reason:
 - [ ] API .status
-  - Condition name: 
-  - Other field: 
+  - Condition name:
+  - Other field:
 - [ ] Other (treat as last resort)
   - Details:
 
diff --git a/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md b/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md
@@ -43,10 +43,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
   - [ ] e2e Tests for all Beta API Operations (endpoints)
-  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
-  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) 
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
@@ -66,13 +66,13 @@ and a failed index does not cause the other indices to automatically clean up.
 
 ## Motivation
 
-Currently, the indices of an indexed job share a single backoff limit. 
+Currently, the indices of an indexed job share a single backoff limit.
 When the job reaches this shared backoff limit, the job controller marks the entire
 job as failed, and the resources are cleaned up, including indices that have yet
-to run to completion. 
+to run to completion.
 
 As a result, the current implementation does not cover the situation where the workload
-is truly embarrassingly parallel and each index is completely independent of other indices. 
+is truly embarrassingly parallel and each index is completely independent of other indices.
 
 For instance, if indexed jobs were used as the basis for a suite of long-running integration tests,
 then each test run would only be able to find a single test failure.
@@ -82,11 +82,16 @@ showing that this is a common use case that should be supported by Kubernetes.
 
 ### Goals
 
-Support the use case where each indexed job has its own backoff limit, and all 
-indices of an indexed job can complete even when a single index fails.
+- allow counting failures towards the backoffLimit independently for each index,
+- allow failing an index (stopping the recreation of pods for that index) using pod failure policy.
 
 ### Non-Goals
 
+- allow specifying the number of indices that must fail or complete before the entire job is marked as failed or completed.
+This is left to be addressed under: https://github.com/kubernetes/kubernetes/issues/117600.
+- allow controlling the number of retries per index when restartPolicy=OnFailure.
+This is left to be addressed under: https://github.com/kubernetes/enhancements/issues/3322.
+
 <!--
 What is out of scope for this KEP? Listing non-goals helps to focus discussion
 and make progress.
@@ -101,11 +106,76 @@ Index would represent this new set of use cases, where the backoff limit is appl
 index individually.
 
 We also propose the addition of a new action in PodFailurePolicy called FailIndex. This would be
-analagous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
+analogous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
 while the rest continue until completion.
 
 ### User Stories (Optional)
 
+#### Story 1
+
+As a CI/CD platform administrator, I want to use Indexed Jobs to run
+suites of integration tests, one suite per index. A failure of one suite should
+not interrupt the other suites. Additionally, I would like to be able
+to control the maximum number of retries per index.
+
+The following Job configuration could satisfy my use case:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed
+  backoffLimit: 1
+  backoffLimitPerIndex: true
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: job-container
+        image: job-image
+        command: ["./tests-runner"]
+```
+
+In this case, we run 10 indices, each running one test suite.
+To tolerate possible flakes, we allow 1 failure per index.
+
+#### Story 2
+
+As the CI/CD platform administrator from [Story 1](#story-1), I want to be able
+to control the handling of failures with the pod failure policy. First, I want
+disruptions not to count towards the backoff limit per index. Second, I want
+to be able to use the pod failure policy to avoid restarts of some indices.
+
+The following Job configuration could satisfy my use case:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed
+  backoffLimit: 1
+  backoffLimitPerIndex: true
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: job-container
+        image: job-image
+        command: ["./tests-runner"]
+  podFailurePolicy:
+    rules:
+    - action: Ignore
+      onPodConditions:
+      - type: DisruptionTarget
+    - action: FailIndex
+      onExitCodes:
+        operator: In
+        values: [42]
+```
 
 ### Notes/Constraints/Caveats (Optional)
 
@@ -468,10 +538,10 @@ Recall that end users cannot usually observe component logs or access metrics.
 -->
 
 - [ ] Events
-  - Event Reason: 
+  - Event Reason:
 - [ ] API .status
-  - Condition name: 
-  - Other field: 
+  - Condition name:
+  - Other field:
 - [ ] Other (treat as last resort)
   - Details:
 
diff --git a/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml b/keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml
@@ -1,36 +1,43 @@
 title: Backoff Limits Per Index For Indexed Jobs
 kep-number: 3850
 authors:
+  - "@mimowo"
   - "@jensentanlo"
 owning-sig: sig-apps
 participating-sigs:
 status: provisional
-creation-date: 2023-01-23
+creation-date: 2023-04-26
 reviewers:
-  - TBD
+  - "@liggitt"
+  - "@alculquicondor"
 approvers:
-  - TBD
-see-also:
-  - "/keps/sig-apps/2214-indexed-job"
-replaces:
+  - "@soltysh"
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage:
+stage: alpha
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone:
+latest-milestone: "v1.28"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
-  alpha:
-  beta:
+  alpha: "v1.28"
+  beta: "v1.29"
   stable:
 
 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
 feature-gates:
+  - name: JobBackoffLimitPerIndex
+    components:
+      - kube-apiserver
+      - kube-controller-manager
+  - name: JobPodFailurePolicy
+    components:
+      - kube-apiserver
+      - kube-controller-manager
 disable-supported: true
 
 # The following PRR answers are required at beta release