keps/NNNN-kep-template/README.md
+5 -5 (5 additions & 5 deletions)
@@ -133,10 +133,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
133
133
-[ ] (R) Design details are appropriately documented
134
134
-[ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
135
135
-[ ] e2e Tests for all Beta API Operations (endpoints)
136
-
-[ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
136
+
-[ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
137
137
-[ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
138
138
-[ ] (R) Graduation criteria is in place
139
-
-[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
139
+
-[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
140
140
-[ ] (R) Production readiness review completed
141
141
-[ ] (R) Production readiness review approved
142
142
-[ ] "Implementation History" section is up-to-date for milestone
@@ -577,10 +577,10 @@ Recall that end users cannot usually observe component logs or access metrics.
keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md
+81 -11 (81 additions & 11 deletions)
@@ -43,10 +43,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
43
43
-[ ] (R) Design details are appropriately documented
44
44
-[ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
45
45
-[ ] e2e Tests for all Beta API Operations (endpoints)
46
-
-[ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
46
+
-[ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
47
47
-[ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
48
48
-[ ] (R) Graduation criteria is in place
49
-
-[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
49
+
-[ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
50
50
-[ ] (R) Production readiness review completed
51
51
-[ ] (R) Production readiness review approved
52
52
-[ ] "Implementation History" section is up-to-date for milestone
@@ -66,13 +66,13 @@ and a failed index does not cause the other indices to automatically clean up.
66
66
67
67
## Motivation
68
68
69
-
Currently, the indices of an indexed job share a single backoff limit.
69
+
Currently, the indices of an indexed job share a single backoff limit.
70
70
When the job reaches this shared backoff limit, the job controller marks the entire
71
71
job as failed, and the resources are cleaned up, including indices that have yet
72
-
to run to completion.
72
+
to run to completion.
73
73
74
74
As a result, the current implementation does not cover the situation where the workload
75
-
is truly embarrassingly parallel and each index is completely independent of other indices.
75
+
is truly embarrassingly parallel and each index is completely independent of other indices.
76
76
77
77
For instance, if indexed jobs were used as the basis for a suite of long-running integration tests,
78
78
then each test run would only be able to find a single test failure.
@@ -82,11 +82,16 @@ showing that this is a common use case that should be supported by Kubernetes.
82
82
83
83
### Goals
84
84
85
-
Support the use case where each indexed job has its own backoff limit, and all
86
-
indices of an indexed job can complete even when a single index fails.
85
+
- allow to count failures towards the backoffLimit independently for all indices,
86
+
- allow to fail an index (stop recreating pods for the index) using pod failure policy.
87
87
88
88
### Non-Goals
89
89
90
+
- allow to specify the number of indices to mark the entire job as failed or completed.
91
+
This is left to be addressed under: https://github.com/kubernetes/kubernetes/issues/117600.
92
+
- allow to control the number of retries per index when restartPolicy=OnFailure.
93
+
This is left to be addressed under: https://github.com/kubernetes/enhancements/issues/3322.
94
+
90
95
<!--
91
96
What is out of scope for this KEP? Listing non-goals helps to focus discussion
92
97
and make progress.
@@ -101,11 +106,76 @@ Index would represent this new set of use cases, where the backoff limit is appl
101
106
index individually.
102
107
103
108
We also propose the addition of a new action in PodFailurePolicy called FailIndex. This would be
104
-
analagous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
109
+
analogous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
105
110
while the rest continue until completion.
106
111
107
112
### User Stories (Optional)
108
113
114
+
#### Story 1
115
+
116
+
As a CI/CD platform administrator, I want to use Indexed Jobs to run
117
+
suites of integration tests, one suite per index. A failure of one suite should
118
+
not interrupt running of other suites. Additionally, I would like to be able
119
+
to control the maximal number of retries per index.
120
+
121
+
The following Job configuration could satisfy my use case:
122
+
123
+
```yaml
124
+
apiVersion: v1
125
+
kind: Job
126
+
spec:
127
+
  parallelism: 10
128
+
  completions: 10
129
+
  completionMode: Indexed
130
+
  backoffLimit: 1
131
+
  backoffLimitPerIndex: true
132
+
  template:
133
+
    spec:
134
+
      restartPolicy: Never
135
+
      containers:
136
+
      - name: job-container
137
+
        image: job-image
138
+
        command: ["./tests-runner"]
139
+
```
140
+
141
+
In this case, we run 10 indexes, representing running of the test suites.
142
+
Due to possible flakes we allow for 1 failure per index.
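The per-index accounting in Story 1 can be modeled roughly as follows. This is an illustrative Python sketch, not the job controller's implementation. The function name `failed_indexes` is hypothetical, and the rule that an index is marked failed once its failure count exceeds the limit is an assumption made for the example.

```python
from collections import Counter


def failed_indexes(pod_failures, backoff_limit):
    """Return the sorted indexes whose failure count exceeds backoff_limit.

    pod_failures: iterable of completion indexes, one entry per failed pod.
    Assumption: every index has its own budget of `backoff_limit` failures
    and is marked failed on the first failure beyond that budget.
    """
    counts = Counter(pod_failures)
    return sorted(i for i, n in counts.items() if n > backoff_limit)


# Index 3 fails twice, index 7 fails once (a flake), with a limit of 1.
print(failed_indexes([3, 7, 3], backoff_limit=1))  # prints [3]
```

Under this model the single flake on index 7 is tolerated, and only index 3 is reported as failed while the other indexes keep running.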
143
+
144
+
#### Story 2
145
+
146
+
As a CI/CD platform administrator from the [Story 1](#story-1) I want to be able
147
+
to control the failures with the pod failure policy. First, I want to be able
148
+
to do not count disruptions towards the backoff limit per index. Second, I want
149
+
to be able to use pod failure policy to avoid restarts of some indexes.
150
+
151
+
The following Job configuration could satisfy my use case:
152
+
153
+
```yaml
154
+
apiVersion: v1
155
+
kind: Job
156
+
spec:
157
+
  parallelism: 10
158
+
  completions: 10
159
+
  completionMode: Indexed
160
+
  backoffLimit: 1
161
+
  backoffLimitPerIndex: true
162
+
  template:
163
+
    spec:
164
+
      restartPolicy: Never
165
+
      containers:
166
+
      - name: job-container
167
+
        image: job-image
168
+
        command: ["./tests-runner"]
169
+
  podFailurePolicy:
170
+
    rules:
171
+
    - action: Ignore
172
+
      onPodConditions:
173
+
        type: DisruptionTarget
174
+
    - action: FailIndex
175
+
      onExitCodes:
176
+
        operator: In
177
+
        values: [42]
178
+
```
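The way the two rules in Story 2 interact with the per-index limit can be sketched as a small decision function. This is a hypothetical Python model for illustration only: the name `evaluate_failure`, its parameters, and its return values are assumptions, not part of the Job API. It assumes rules are checked in order and the first match wins.

```python
def evaluate_failure(exit_code, disrupted, failures_so_far, backoff_limit=1):
    """Decide what happens to an index after one of its pods fails.

    Returns "ignore", "fail_index", or "retry".
    failures_so_far: failures already counted against this index.
    """
    if disrupted:
        # Rule 1 (Ignore on DisruptionTarget): the failure is not counted.
        return "ignore"
    if exit_code == 42:
        # Rule 2 (FailIndex on exit code 42): stop recreating pods
        # for this index without waiting for the limit.
        return "fail_index"
    # No rule matched: count the failure against the index's own limit.
    if failures_so_far + 1 > backoff_limit:
        return "fail_index"
    return "retry"


print(evaluate_failure(exit_code=1, disrupted=True, failures_so_far=0))
print(evaluate_failure(exit_code=42, disrupted=False, failures_so_far=0))
print(evaluate_failure(exit_code=1, disrupted=False, failures_so_far=0))
```

In this model a disruption never consumes the index's retry budget, while exit code 42 fails the index immediately regardless of how many retries remain.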
109
179
110
180
### Notes/Constraints/Caveats (Optional)
111
181
@@ -468,10 +538,10 @@ Recall that end users cannot usually observe component logs or access metrics.