Commit 5230713

WIP
1 parent 217a435 commit 5230713

3 files changed

+103
-26
lines changed

keps/NNNN-kep-template/README.md

Lines changed: 5 additions & 5 deletions
@@ -133,10 +133,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
133133
- [ ] (R) Design details are appropriately documented
134134
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
135135
- [ ] e2e Tests for all Beta API Operations (endpoints)
136-
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
136+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
137137
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
138138
- [ ] (R) Graduation criteria is in place
139-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
139+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
140140
- [ ] (R) Production readiness review completed
141141
- [ ] (R) Production readiness review approved
142142
- [ ] "Implementation History" section is up-to-date for milestone
@@ -577,10 +577,10 @@ Recall that end users cannot usually observe component logs or access metrics.
577577
-->
578578

579579
- [ ] Events
580-
- Event Reason:
580+
- Event Reason:
581581
- [ ] API .status
582-
- Condition name:
583-
- Other field:
582+
- Condition name:
583+
- Other field:
584584
- [ ] Other (treat as last resort)
585585
- Details:
586586

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

Lines changed: 81 additions & 11 deletions
@@ -43,10 +43,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
4343
- [ ] (R) Design details are appropriately documented
4444
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
4545
- [ ] e2e Tests for all Beta API Operations (endpoints)
46-
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
46+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
4747
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
4848
- [ ] (R) Graduation criteria is in place
49-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
49+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
5050
- [ ] (R) Production readiness review completed
5151
- [ ] (R) Production readiness review approved
5252
- [ ] "Implementation History" section is up-to-date for milestone
@@ -66,13 +66,13 @@ and a failed index does not cause the other indices to automatically clean up.
6666

6767
## Motivation
6868

69-
Currently, the indices of an indexed job share a single backoff limit.
69+
Currently, the indices of an indexed job share a single backoff limit.
7070
When the job reaches this shared backoff limit, the job controller marks the entire
7171
job as failed, and the resources are cleaned up, including indices that have yet
72-
to run to completion.
72+
to run to completion.
7373

7474
As a result, the current implementation does not cover the situation where the workload
75-
is truly embarrassingly parallel and each index is completely independent of other indices.
75+
is truly embarrassingly parallel and each index is completely independent of other indices.
7676

7777
For instance, if indexed jobs were used as the basis for a suite of long-running integration tests,
7878
then each test run would only be able to find a single test failure.
@@ -82,11 +82,16 @@ showing that this is a common use case that should be supported by Kubernetes.
8282

8383
### Goals
8484

85-
Support the use case where each indexed job has its own backoff limit, and all
86-
indices of an indexed job can complete even when a single index fails.
85+
- allow counting failures towards the backoffLimit independently for each index,
86+
- allow failing an index (stop recreating pods for that index) using the pod failure policy.
8787

8888
### Non-Goals
8989

90+
- allow specifying the number of indices required to mark the entire job as failed or completed.
91+
This is left to be addressed under: https://github.com/kubernetes/kubernetes/issues/117600.
92+
- allow controlling the number of retries per index when restartPolicy=OnFailure.
93+
This is left to be addressed under: https://github.com/kubernetes/enhancements/issues/3322.
94+
9095
<!--
9196
What is out of scope for this KEP? Listing non-goals helps to focus discussion
9297
and make progress.
@@ -101,11 +106,76 @@ Index would represent this new set of use cases, where the backoff limit is appl
101106
index individually.
102107

103108
We also propose the addition of a new action in PodFailurePolicy called FailIndex. This would be
104-
analagous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
109+
analogous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
105110
while the rest continue until completion.
106111

107112
### User Stories (Optional)
108113

114+
#### Story 1
115+
116+
As a CI/CD platform administrator, I want to use Indexed Jobs to run
117+
suites of integration tests, one suite per index. A failure of one suite should
118+
not interrupt the other suites. Additionally, I would like to be able
119+
to control the maximum number of retries per index.
120+
121+
The following Job configuration could satisfy my use case:
122+
123+
```yaml
124+
apiVersion: batch/v1
125+
kind: Job
126+
spec:
127+
  parallelism: 10
128+
  completions: 10
129+
  completionMode: Indexed
130+
  backoffLimit: 1
131+
  backoffLimitPerIndex: true
132+
  template:
133+
    spec:
134+
      restartPolicy: Never
135+
      containers:
136+
      - name: job-container
137+
        image: job-image
138+
        command: ["./tests-runner"]
139+
```
140+
141+
In this case, we run 10 indexes, each representing one test suite.
142+
To tolerate possible flakes, we allow 1 failure per index.
143+
144+
#### Story 2
145+
146+
As the CI/CD platform administrator from [Story 1](#story-1), I want to be able
147+
to control how failures are handled with the pod failure policy. First, I want
148+
disruptions not to count towards the backoff limit per index. Second, I want
149+
to be able to use the pod failure policy to avoid restarts of some indexes.
150+
151+
The following Job configuration could satisfy my use case:
152+
153+
```yaml
154+
apiVersion: batch/v1
155+
kind: Job
156+
spec:
157+
  parallelism: 10
158+
  completions: 10
159+
  completionMode: Indexed
160+
  backoffLimit: 1
161+
  backoffLimitPerIndex: true
162+
  template:
163+
    spec:
164+
      restartPolicy: Never
165+
      containers:
166+
      - name: job-container
167+
        image: job-image
168+
        command: ["./tests-runner"]
169+
  podFailurePolicy:
170+
    rules:
171+
    - action: Ignore
172+
      onPodConditions:
173+
      - type: DisruptionTarget
174+
    - action: FailIndex
175+
      onExitCodes:
176+
        operator: In
177+
        values: [42]
178+
```
109179
110180
### Notes/Constraints/Caveats (Optional)
111181
@@ -468,10 +538,10 @@ Recall that end users cannot usually observe component logs or access metrics.
468538
-->
469539

470540
- [ ] Events
471-
- Event Reason:
541+
- Event Reason:
472542
- [ ] API .status
473-
- Condition name:
474-
- Other field:
543+
- Condition name:
544+
- Other field:
475545
- [ ] Other (treat as last resort)
476546
- Details:
477547

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml

Lines changed: 17 additions & 10 deletions
@@ -1,36 +1,43 @@
11
title: Backoff Limits Per Index For Indexed Jobs
22
kep-number: 3850
33
authors:
4+
- "@mimowo"
45
- "@jensentanlo"
56
owning-sig: sig-apps
67
participating-sigs:
78
status: provisional
8-
creation-date: 2023-01-23
9+
creation-date: 2023-04-26
910
reviewers:
10-
- TBD
11+
- "@liggitt"
12+
- "@alculquicondor"
1113
approvers:
12-
- TBD
13-
see-also:
14-
- "/keps/sig-apps/2214-indexed-job"
15-
replaces:
14+
- "@soltysh"
1615

1716
# The target maturity stage in the current dev cycle for this KEP.
18-
stage:
17+
stage: alpha
1918

2019
# The most recent milestone for which work toward delivery of this KEP has been
2120
# done. This can be the current (upcoming) milestone, if it is being actively
2221
# worked on.
23-
latest-milestone:
22+
latest-milestone: "v1.28"
2423

2524
# The milestone at which this feature was, or is targeted to be, at each stage.
2625
milestone:
27-
  alpha:
28-
  beta:
26+
  alpha: "v1.28"
27+
  beta: "v1.29"
2928
  stable:
3029

3130
# The following PRR answers are required at alpha release
3231
# List the feature gate name and the components for which it must be enabled
3332
feature-gates:
33+
  - name: JobBackoffLimitPerIndex
34+
    components:
35+
      - kube-apiserver
36+
      - kube-controller-manager
37+
  - name: JobPodFailurePolicy
38+
    components:
39+
      - kube-apiserver
40+
      - kube-controller-manager
3441
disable-supported: true
3542

3643
# The following PRR answers are required at beta release
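
Reviewer note: the per-index accounting described in the README changes of this commit (Goals, Story 1 and Story 2) can be pictured with the small Python model below. It is only a sketch under stated assumptions, not the job controller's implementation: `IndexState`, `record_pod_failure`, and the `disrupted` flag are hypothetical names introduced for illustration, and the exit code 42 is taken from the Story 2 example. The model assumes an index is failed once its counted failures exceed the limit, that a disruption is ignored (not counted), and that a matching exit code fails the index immediately.

```python
from dataclasses import dataclass


@dataclass
class IndexState:
    """Hypothetical per-index bookkeeping (not a Kubernetes type)."""
    failures: int = 0     # failures counted towards the per-index limit
    failed: bool = False  # True once pods are no longer recreated for the index


def record_pod_failure(states, index, backoff_limit, exit_code,
                       disrupted=False, fail_index_codes=frozenset({42})):
    """Apply one pod failure to a single index; other indexes are untouched."""
    state = states.setdefault(index, IndexState())
    if state.failed:
        return state  # index already failed, nothing more to record
    if disrupted:
        return state  # modelled "Ignore" action: not counted towards the limit
    if exit_code in fail_index_codes:
        state.failed = True  # modelled "FailIndex" action: skip remaining retries
        return state
    state.failures += 1
    if state.failures > backoff_limit:
        state.failed = True  # limit exceeded for this index only
    return state


# Usage, mirroring Story 2 with a limit of 1 per index.
states = {}
record_pod_failure(states, 0, backoff_limit=1, exit_code=1)   # first failure: retried
record_pod_failure(states, 0, backoff_limit=1, exit_code=1)   # second failure: index 0 fails
record_pod_failure(states, 1, backoff_limit=1, exit_code=1, disrupted=True)  # ignored
record_pod_failure(states, 2, backoff_limit=1, exit_code=42)  # fails index 2 at once
```

In this model index 0 fails only after its second counted failure, index 1 is unaffected by the disruption, and index 2 fails without consuming any retries, while none of these outcomes changes the state of the other indexes.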
