keps/NNNN-kep-template/README.md (+5 -5: 5 additions & 5 deletions)
@@ -133,10 +133,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
 - [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
@@ -577,10 +577,10 @@ Recall that end users cannot usually observe component logs or access metrics.
@@ -43,10 +49,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [ ] (R) Design details are appropriately documented
 - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
 - [ ] e2e Tests for all Beta API Operations (endpoints)
-- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
-- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
 - [ ] "Implementation History" section is up-to-date for milestone
@@ -61,18 +67,18 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 ## Summary
 
 This KEP extends the indexed job API to support indexed jobs where each index is independent,
-and a failed index does not cause the other indices to automatically clean up.
+and a failed index does not cause the other indexes to automatically clean up.
 
 
 ## Motivation
 
-Currently, the indices of an indexed job share a single backoff limit.
+Currently, the indexes of an indexed job share a single backoff limit.
 When the job reaches this shared backoff limit, the job controller marks the entire
-job as failed, and the resources are cleaned up, including indices that have yet
-to run to completion.
+job as failed, and the resources are cleaned up, including indexes that have yet
+to run to completion.
 
 As a result, the current implementation does not cover the situation where the workload
-is truly embarrassingly parallel and each index is completely independent of other indices.
+is truly embarrassingly parallel and each index is completely independent of other indexes.
 
 For instance, if indexed jobs were used as the basis for a suite of long-running integration tests,
 then each test run would only be able to find a single test failure.
@@ -82,11 +88,16 @@ showing that this is a common use case that should be supported by Kubernetes.
 
 ### Goals
 
-Support the use case where each indexed job has its own backoff limit, and all
-indices of an indexed job can complete even when a single index fails.
+- allow to count failures towards the backoffLimit independently for all indexes,
+- allow to fail an index (stop recreating pods for the index) using pod failure policy.
 
 ### Non-Goals
 
+- allow to specify the number of indexes to mark the entire job as failed or completed.
+  This is left to be addressed under: https://github.com/kubernetes/kubernetes/issues/117600.
+- allow to control the number of retries per index when pod's `restartPolicy=OnFailure`.
+  This is left to be addressed under: https://github.com/kubernetes/enhancements/issues/3322.
+
 <!--
 What is out of scope for this KEP? Listing non-goals helps to focus discussion
 and make progress.
@@ -96,16 +107,78 @@ and make progress.
 
 We propose the addition of a new enum field in PodFailurePolicy called backoffLimitTarget,
 that accepts the values Job and Index. Job (the default value) would have the same behavior
-as the current implementation of the backoff limit where the limit shared between all indices.
+as the current implementation of the backoff limit where the limit shared between all indexes.
 Index would represent this new set of use cases, where the backoff limit is applied to each
 index individually.
 
 We also propose the addition of a new action in PodFailurePolicy called FailIndex. This would be
-analagous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
+analogous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
 while the rest continue until completion.
 
 ### User Stories (Optional)
 
+#### Story 1
+
+As a CI/CD platform administrator, I want to use Indexed Jobs to run
+suites of integration tests, one suite per index. A failure of one suite should
+not interrupt running of other suites. Additionally, I would like to be able
+to control the maximal number of retries per index.
+
+The following Job configuration could satisfy my use case:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed
+  backoffLimit: 1
+  backoffLimitPerIndex: true
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: job-container
+        image: job-image
+        command: ["./tests-runner"]
+```
+
+In this case, we run 10 indexes, representing running of the test suites.
+Due to possible flakes we allow for 1 failure per index.
+
+#### Story 2
+
+As a CI/CD platform administrator from the [Story 1](#story-1) I want to be able
+to control the failures with the pod failure policy. In particular, I want
+to be able to use pod failure policy to avoid restarts of some indexes, based
+on exit codes.
+
+The following Job configuration could satisfy my use case:
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+spec:
+  parallelism: 10
+  completions: 10
+  completionMode: Indexed
+  backoffLimit: 1
+  backoffLimitPerIndex: true
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+      - name: job-container
+        image: job-image
+        command: ["./tests-runner"]
+  podFailurePolicy:
+    rules:
+    - action: FailIndex
+      onExitCodes:
+        operator: In
+        values: [42]
+```
 
 ### Notes/Constraints/Caveats (Optional)
 
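The exit-code rule in Story 2 can be sketched in Go as follows. This is only an illustration, not the job controller's code: the `rule` type and the `matches` and `actionFor` helpers are hypothetical names, container names are ignored, and `exitCode` is assumed to come from a container that already failed. It follows the existing pod failure policy convention that rules are evaluated in order and the first match wins.

```go
package main

import "fmt"

// rule is a simplified, hypothetical stand-in for one pod failure policy rule
// that matches on container exit codes only.
type rule struct {
	Action   string // e.g. "FailIndex", the action proposed in this KEP
	Operator string // "In" or "NotIn"
	Values   []int
}

// matches reports whether the exit code satisfies the rule's operator.
func matches(r rule, exitCode int) bool {
	found := false
	for _, v := range r.Values {
		if v == exitCode {
			found = true
			break
		}
	}
	switch r.Operator {
	case "In":
		return found
	case "NotIn":
		return !found
	}
	return false // unknown operators never match in this sketch
}

// actionFor returns the action of the first matching rule, or "" when no rule
// matches and the failure is simply counted towards the backoff limit.
func actionFor(rules []rule, exitCode int) string {
	for _, r := range rules {
		if matches(r, exitCode) {
			return r.Action
		}
	}
	return ""
}

func main() {
	rules := []rule{{Action: "FailIndex", Operator: "In", Values: []int{42}}}
	fmt.Printf("%q\n", actionFor(rules, 42)) // "FailIndex"
	fmt.Printf("%q\n", actionFor(rules, 1))  // ""
}
```

With the Story 2 configuration, a pod exiting with code 42 would fail its index immediately, while any other exit code would fall back to the per-index retry budget.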
@@ -132,19 +205,73 @@ Consider including folks who also work outside the SIG or subproject.
 
 ## Design Details
 
-A possible PodFailurePolicy spec might look something like this with the new additions
-
-```
-podFailurePolicy:
-  rules:
-  - action: FailJob|FailIndex
-    onExitCodes:
-      containerName: main
-      operator: In
-      values: [42]
-  backoffLimitTarget: Job|Index
+We introduce a new Job API field, called `.spec.backoffLimitPerIndex`, when set
+to `true`, then failures are counted towards the `.spec.backoffLimit`, but
+incremented independently for all indexes. This mode is only supported when
+pod's `restartPolicy=Never`.
+
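The difference between the shared and the per-index counting can be sketched in Go as follows. This is a minimal illustration under stated assumptions, not the job controller's implementation: `failedIndexes` and `jobFailedShared` are hypothetical helpers, and the sketch assumes that a budget is exhausted once the number of failures exceeds the limit, which matches the user story where `backoffLimit: 1` tolerates one failure per index.

```go
package main

import (
	"fmt"
	"sort"
)

// failedIndexes returns, in ascending order, the indexes whose own failure
// count exceeds backoffLimit. podFailures holds one entry per failed pod:
// the completion index that pod belonged to.
func failedIndexes(podFailures []int, backoffLimit int) []int {
	perIndex := map[int]int{}
	for _, idx := range podFailures {
		perIndex[idx]++
	}
	failed := []int{}
	for idx, n := range perIndex {
		if n > backoffLimit {
			failed = append(failed, idx)
		}
	}
	sort.Ints(failed)
	return failed
}

// jobFailedShared models the current behavior, where every index draws from
// one shared budget and the whole Job fails once it is exceeded.
func jobFailedShared(podFailures []int, backoffLimit int) bool {
	return len(podFailures) > backoffLimit
}

func main() {
	// Index 0 failed twice; indexes 3 and 5 failed once each.
	podFailures := []int{0, 0, 3, 5}
	fmt.Println(jobFailedShared(podFailures, 1)) // true
	fmt.Println(failedIndexes(podFailures, 1))   // [0]
}
```

With a shared limit of 1 the four failures would fail the entire Job, whereas per-index counting fails only index 0 and lets indexes 3 and 5 retry.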
+### Job API
+
+We extend the Job API in order to allow to apply different actions depending
+on the conditions associated with the pod failure.
+
+```golang
+
+// PodFailurePolicyAction specifies how a Pod failure is handled.
+// +enum
+type PodFailurePolicyAction string
+
+const (
+	// This is an action which might be taken on a pod failure - mark the
+	// Job's index as failed to avoid pod restarts within this index.