Commit 7349dc4

committed
WIP
1 parent 217a435 commit 7349dc4

File tree

3 files changed

+198
-40
lines changed

keps/NNNN-kep-template/README.md

Lines changed: 5 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -133,10 +133,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
133133
- [ ] (R) Design details are appropriately documented
134134
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
135135
- [ ] e2e Tests for all Beta API Operations (endpoints)
136-
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
136+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
137137
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
138138
- [ ] (R) Graduation criteria is in place
139-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
139+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
140140
- [ ] (R) Production readiness review completed
141141
- [ ] (R) Production readiness review approved
142142
- [ ] "Implementation History" section is up-to-date for milestone
@@ -577,10 +577,10 @@ Recall that end users cannot usually observe component logs or access metrics.
577577
-->
578578

579579
- [ ] Events
580-
- Event Reason:
580+
- Event Reason:
581581
- [ ] API .status
582-
- Condition name:
583-
- Other field:
582+
- Condition name:
583+
- Other field:
584584
- [ ] Other (treat as last resort)
585585
- Details:
586586

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

Lines changed: 176 additions & 25 deletions
Original file line numberDiff line numberDiff line change
@@ -13,12 +13,18 @@
1313
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
1414
- [Risks and Mitigations](#risks-and-mitigations)
1515
- [Design Details](#design-details)
16+
- [Job API](#job-api)
17+
- [Tracking the number of failures](#tracking-the-number-of-failures)
18+
- [FailIndex action](#failindex-action)
1619
- [Test Plan](#test-plan)
1720
- [Prerequisite testing updates](#prerequisite-testing-updates)
1821
- [Unit tests](#unit-tests)
1922
- [Integration tests](#integration-tests)
2023
- [e2e tests](#e2e-tests)
2124
- [Graduation Criteria](#graduation-criteria)
25+
- [Alpha](#alpha)
26+
- [Beta](#beta)
27+
- [GA](#ga)
2228
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
2329
- [Version Skew Strategy](#version-skew-strategy)
2430
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -43,10 +49,10 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
4349
- [ ] (R) Design details are appropriately documented
4450
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
4551
- [ ] e2e Tests for all Beta API Operations (endpoints)
46-
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
52+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
4753
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
4854
- [ ] (R) Graduation criteria is in place
49-
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
55+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
5056
- [ ] (R) Production readiness review completed
5157
- [ ] (R) Production readiness review approved
5258
- [ ] "Implementation History" section is up-to-date for milestone
@@ -61,18 +67,18 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
6167
## Summary
6268

6369
This KEP extends the indexed job API to support indexed jobs where each index is independent,
64-
and a failed index does not cause the other indices to automatically clean up.
70+
and a failed index does not cause the other indexes to automatically clean up.
6571

6672

6773
## Motivation
6874

69-
Currently, the indices of an indexed job share a single backoff limit.
75+
Currently, the indexes of an indexed job share a single backoff limit.
7076
When the job reaches this shared backoff limit, the job controller marks the entire
71-
job as failed, and the resources are cleaned up, including indices that have yet
72-
to run to completion.
77+
job as failed, and the resources are cleaned up, including indexes that have yet
78+
to run to completion.
7379

7480
As a result, the current implementation does not cover the situation where the workload
75-
is truly embarrassingly parallel and each index is completely independent of other indices.
81+
is truly embarrassingly parallel and each index is completely independent of other indexes.
7682

7783
For instance, if indexed jobs were used as the basis for a suite of long-running integration tests,
7884
then each test run would only be able to find a single test failure.
@@ -82,11 +88,16 @@ showing that this is a common use case that should be supported by Kubernetes.
8288

8389
### Goals
8490

85-
Support the use case where each indexed job has its own backoff limit, and all
86-
indices of an indexed job can complete even when a single index fails.
91+
- allow counting failures towards the `backoffLimit` independently for each index,
92+
- allow failing an index (that is, stopping recreation of pods for the index) using the pod failure policy.
8793

8894
### Non-Goals
8995

96+
- allow specifying the number of indexes required to mark the entire job as failed or completed.
97+
This is left to be addressed under: https://github.com/kubernetes/kubernetes/issues/117600.
98+
- allow controlling the number of retries per index when the pod's `restartPolicy=OnFailure`.
99+
This is left to be addressed under: https://github.com/kubernetes/enhancements/issues/3322.
100+
90101
<!--
91102
What is out of scope for this KEP? Listing non-goals helps to focus discussion
92103
and make progress.
@@ -96,16 +107,78 @@ and make progress.
96107

97108
We propose the addition of a new enum field in PodFailurePolicy called backoffLimitTarget,
98109
that accepts the values Job and Index. Job (the default value) would have the same behavior
99-
as the current implementation of the backoff limit where the limit shared between all indices.
110+
as the current implementation of the backoff limit, where the limit is shared between all indexes.
100111
Index would represent this new set of use cases, where the backoff limit is applied to each
101112
index individually.
102113

103114
We also propose the addition of a new action in PodFailurePolicy called FailIndex. This would be
104-
analagous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
115+
analogous to the existing FailJob action, but would allow a single index to be failed (short-circuiting retries)
105116
while the rest continue until completion.
106117

107118
### User Stories (Optional)
108119

120+
#### Story 1
121+
122+
As a CI/CD platform administrator, I want to use Indexed Jobs to run
123+
suites of integration tests, one suite per index. A failure of one suite should
124+
not interrupt running of other suites. Additionally, I would like to be able
125+
to control the maximal number of retries per index.
126+
127+
The following Job configuration could satisfy my use case:
128+
129+
```yaml
130+
apiVersion: batch/v1
131+
kind: Job
132+
spec:
133+
  parallelism: 10
134+
  completions: 10
135+
  completionMode: Indexed
136+
  backoffLimit: 1
137+
  backoffLimitPerIndex: true
138+
  template:
139+
    spec:
140+
      restartPolicy: Never
141+
      containers:
142+
      - name: job-container
143+
        image: job-image
144+
        command: ["./tests-runner"]
145+
```
146+
147+
In this case, we run 10 indexes, each representing a run of one test suite.
148+
Due to possible flakes we allow for 1 failure per index.
149+
150+
#### Story 2
151+
152+
As a CI/CD platform administrator from the [Story 1](#story-1) I want to be able
153+
to control the failures with the pod failure policy. In particular, I want
154+
to be able to use pod failure policy to avoid restarts of some indexes, based
155+
on exit codes.
156+
157+
The following Job configuration could satisfy my use case:
158+
159+
```yaml
160+
apiVersion: batch/v1
161+
kind: Job
162+
spec:
163+
  parallelism: 10
164+
  completions: 10
165+
  completionMode: Indexed
166+
  backoffLimit: 1
167+
  backoffLimitPerIndex: true
168+
  template:
169+
    spec:
170+
      restartPolicy: Never
171+
      containers:
172+
      - name: job-container
173+
        image: job-image
174+
        command: ["./tests-runner"]
175+
  podFailurePolicy:
176+
    rules:
177+
    - action: FailIndex
178+
      onExitCodes:
179+
        operator: In
180+
        values: [42]
181+
```
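
The rule in the configuration above could be evaluated roughly as in the following Go sketch. It is only an illustration under stated assumptions: the `rule` type and the `matchAction` helper are hypothetical simplifications, not the Job controller's actual types, and the sketch only models the `In` and `NotIn` operators applied to a single container exit code, with the first matching rule winning.

```go
package main

import "fmt"

// rule is a simplified, hypothetical stand-in for one pod failure policy
// rule with an onExitCodes requirement.
type rule struct {
	action   string // e.g. "FailIndex"
	operator string // "In" or "NotIn"
	values   []int  // exit codes the operator is applied to
}

// matchAction returns the action of the first rule that matches exitCode,
// or an empty string when no rule matches.
func matchAction(rules []rule, exitCode int) string {
	for _, r := range rules {
		found := false
		for _, v := range r.values {
			if v == exitCode {
				found = true
				break
			}
		}
		if (r.operator == "In" && found) || (r.operator == "NotIn" && !found) {
			return r.action
		}
	}
	return ""
}

func main() {
	rules := []rule{{action: "FailIndex", operator: "In", values: []int{42}}}
	// Exit code 42 matches the rule, so the index would be failed without retries.
	fmt.Println(matchAction(rules, 42))
	// Any other exit code matches no rule here, so the regular backoff applies.
	fmt.Println(matchAction(rules, 1) == "")
}
```

With this rule, a pod exiting with code 42 fails its index immediately, while other failures still count towards the per-index backoff limit.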
109182
110183
### Notes/Constraints/Caveats (Optional)
111184
@@ -132,19 +205,73 @@ Consider including folks who also work outside the SIG or subproject.
132205
133206
## Design Details
134207
135-
A possible PodFailurePolicy spec might look something like this with the new additions
136-
137-
```
138-
podFailurePolicy:
139-
rules:
140-
- action: FailJob|FailIndex
141-
onExitCodes:
142-
containerName: main
143-
operator: In
144-
values: [42]
145-
backoffLimitTarget: Job|Index
208+
We introduce a new Job API field, `.spec.backoffLimitPerIndex`. When it is set
209+
to `true`, failures are counted towards the `.spec.backoffLimit`, but
210+
independently for each index. This mode is only supported when the
211+
pod's `restartPolicy=Never`.
212+
213+
### Job API
214+
215+
We extend the Job API to allow applying different actions depending
216+
on the conditions associated with the pod failure.
217+
218+
```golang
219+
220+
// PodFailurePolicyAction specifies how a Pod failure is handled.
221+
// +enum
222+
type PodFailurePolicyAction string
223+
224+
const (
225+
// This is an action which might be taken on a pod failure - mark the
226+
// Job's index as failed to avoid pod restarts within this index.
227+
PodFailurePolicyActionFailIndex PodFailurePolicyAction = "FailIndex"
228+
...
229+
)
230+
...
231+
232+
// JobSpec describes how the job execution will look like.
233+
type JobSpec struct {
234+
...
235+
// Indicates if the number of retries specified by backoffLimit is counted
236+
// globally or within an index. When set to true, the number of retries is
237+
// kept per index in the batch.kubernetes.io/job-index-retry-number Pod
238+
// annotation. It can only be set to true when Job's completionMode=Indexed.
239+
// Defaults to false
240+
// +optional
241+
BackoffLimitPerIndex *bool
242+
// Specifies the number of retries before marking this job failed. When
243+
// BackoffLimitPerIndex=true, it specifies the number of retries
244+
// for a given index.
245+
// Defaults to 6
246+
// +optional
247+
BackoffLimit *int32
248+
...
146249
```
147250

251+
### Tracking the number of failures
252+
253+
In order to determine if the `backoffLimit` is exceeded we need to keep track
254+
of the number of failures per index, when `restartPolicy=Never`. For this
255+
purpose we use the Pod annotation, `batch.kubernetes.io/job-index-retry-number`,
256+
which holds the value of the number of pod retries for a given index. It is set
257+
to `0` for the first pod created for a given index.
258+
259+
When a pod with `k` retries fails and the index has not failed yet, i.e., the
260+
number of retries is still smaller than the backoff limit per index, we
261+
need to create a new pod with `k+1` retries. For this purpose we need
262+
to delay deletion of the old pod, and thus removal of its finalizer, until the new pod
263+
corresponding to the index is created.
264+
265+
Once the number of retries for a given index reaches the `backoffLimit` we need
266+
to mark the index as failed, so that we can remove the pods. For this reason
267+
we keep the set of failed indexes in the Job `failedIndexes` field.
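
The per-index bookkeeping described in this section can be sketched in Go as follows. This is a minimal illustration, not the Job controller's implementation: the `indexAction` type and the `onPodFailure` helper are hypothetical, and the sketch assumes the comparison stated above, namely that an index is retried only while its retry number is smaller than the backoff limit.

```go
package main

import (
	"fmt"
	"strconv"
)

// retryAnnotation is the Pod annotation named in this KEP.
const retryAnnotation = "batch.kubernetes.io/job-index-retry-number"

// indexAction is a hypothetical type describing the outcome of a pod failure.
type indexAction string

const (
	recreatePod indexAction = "RecreatePod"
	failIndex   indexAction = "FailIndex"
)

// onPodFailure decides what happens to an index after its pod fails.
// annotations are the failed pod's annotations and backoffLimit is the
// per-index limit. For recreatePod it also returns the annotation value
// to set on the replacement pod.
func onPodFailure(annotations map[string]string, backoffLimit int) (indexAction, string, error) {
	retries, err := strconv.Atoi(annotations[retryAnnotation])
	if err != nil {
		return "", "", fmt.Errorf("invalid %s annotation: %w", retryAnnotation, err)
	}
	if retries >= backoffLimit {
		// The limit is reached: the index is recorded as failed.
		return failIndex, "", nil
	}
	// Still below the limit: the replacement pod carries k+1.
	return recreatePod, strconv.Itoa(retries + 1), nil
}

func main() {
	firstPod := map[string]string{retryAnnotation: "0"}
	action, next, err := onPodFailure(firstPod, 1)
	fmt.Println(action, next, err)
}
```

With `backoffLimit: 1`, the first pod of an index (retry number `0`) is replaced by a pod with retry number `1`, and a failure of that second pod marks the index as failed.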
268+
269+
### FailIndex action
270+
271+
In order to allow early termination of indexes with the `FailIndex` action
272+
we also store the set of failed indexes in the `failedIndexes` field
273+
in the Job status, analogous to the way `completedIndexes` is kept.
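
As an illustration of what "analogous to `completedIndexes`" could mean, the Go sketch below renders a set of indexes in the compressed text form that the Job API documents for `completedIndexes` (comma-separated, increasing, with three or more consecutive numbers collapsed to `first-last`). Whether `failedIndexes` uses exactly this encoding is an assumption here, and `formatIndexes` is a hypothetical helper, not an existing function.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatIndexes renders indexes as a compressed list such as "1,3-5,7".
// Runs of three or more consecutive indexes become "first-last"; shorter
// runs are listed individually.
func formatIndexes(indexes []int) string {
	if len(indexes) == 0 {
		return ""
	}
	// Work on a sorted copy so the caller's slice is left untouched.
	sorted := append([]int(nil), indexes...)
	sort.Ints(sorted)

	var parts []string
	start, prev := sorted[0], sorted[0]
	flush := func() {
		switch {
		case start == prev:
			parts = append(parts, strconv.Itoa(start))
		case prev == start+1:
			parts = append(parts, strconv.Itoa(start), strconv.Itoa(prev))
		default:
			parts = append(parts, strconv.Itoa(start)+"-"+strconv.Itoa(prev))
		}
	}
	for _, idx := range sorted[1:] {
		if idx == prev {
			continue // ignore duplicates
		}
		if idx == prev+1 {
			prev = idx
			continue
		}
		flush()
		start, prev = idx, idx
	}
	flush()
	return strings.Join(parts, ",")
}

func main() {
	fmt.Println(formatIndexes([]int{7, 3, 1, 5, 4}))
}
```

A compact string keeps the Job status small even when many indexes fail, which is the same trade-off made for completed indexes.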
274+
148275
### Test Plan
149276

150277
<!--
@@ -282,6 +409,30 @@ in back-to-back releases.
282409
- Deprecate the flag
283410
-->
284411

412+
#### Alpha
413+
414+
- The feature is implemented behind the `JobBackoffLimitPerIndex` feature flag
415+
- Support for the `FailIndex` action is implemented behind the
416+
  `JobPodFailurePolicy` and `JobBackoffLimitPerIndex` feature flags
417+
- The `FailIndex` action cannot be used when creating a new Job
418+
- The `JobBackoffLimitPerIndex` feature flag is disabled by default
419+
- Tests: unit and integration
420+
421+
#### Beta
422+
423+
- Address reviews and bug reports from Alpha users
424+
- E2e tests are in Testgrid and linked in KEP
425+
- The `FailIndex` action can be used for newly created Jobs
426+
- The feature flag is enabled by default
427+
428+
#### GA
429+
430+
- Address reviews and bug reports from Beta users
431+
- Write a blog post about the feature
432+
- Graduate e2e tests as conformance tests
433+
- Lock the `JobPodFailurePolicy` and `JobBackoffLimitPerIndex` feature-gates
434+
- Declare deprecation of the `JobPodFailurePolicy` and `JobBackoffLimitPerIndex` feature-gates in documentation
435+
285436
### Upgrade / Downgrade Strategy
286437

287438
<!--
@@ -468,10 +619,10 @@ Recall that end users cannot usually observe component logs or access metrics.
468619
-->
469620

470621
- [ ] Events
471-
- Event Reason:
622+
- Event Reason:
472623
- [ ] API .status
473-
- Condition name:
474-
- Other field:
624+
- Condition name:
625+
- Other field:
475626
- [ ] Other (treat as last resort)
476627
- Details:
477628

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml

Lines changed: 17 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,36 +1,43 @@
11
title: Backoff Limits Per Index For Indexed Jobs
22
kep-number: 3850
33
authors:
4+
- "@mimowo"
45
- "@jensentanlo"
56
owning-sig: sig-apps
67
participating-sigs:
78
status: provisional
8-
creation-date: 2023-01-23
9+
creation-date: 2023-04-26
910
reviewers:
10-
- TBD
11+
- "@liggitt"
12+
- "@alculquicondor"
1113
approvers:
12-
- TBD
13-
see-also:
14-
- "/keps/sig-apps/2214-indexed-job"
15-
replaces:
14+
- "@soltysh"
1615

1716
# The target maturity stage in the current dev cycle for this KEP.
18-
stage:
17+
stage: alpha
1918

2019
# The most recent milestone for which work toward delivery of this KEP has been
2120
# done. This can be the current (upcoming) milestone, if it is being actively
2221
# worked on.
23-
latest-milestone:
22+
latest-milestone: "v1.28"
2423

2524
# The milestone at which this feature was, or is targeted to be, at each stage.
2625
milestone:
27-
  alpha:
28-
  beta:
26+
  alpha: "v1.28"
27+
  beta: "v1.29"
2928
  stable:
3029

3130
# The following PRR answers are required at alpha release
3231
# List the feature gate name and the components for which it must be enabled
3332
feature-gates:
33+
  - name: JobBackoffLimitPerIndex
34+
    components:
35+
      - kube-apiserver
36+
      - kube-controller-manager
37+
  - name: JobPodFailurePolicy
38+
    components:
39+
      - kube-apiserver
40+
      - kube-controller-manager
3441
disable-supported: true
3542

3643
# The following PRR answers are required at beta release
