Commit de0c18c

Update KEP for Beta for "Backoff Limit Per Index"
1 parent 0a6d6c8 commit de0c18c

File tree

3 files changed

+37
-5
lines changed

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
 kep-number: 3850
 alpha:
   approver: "@wojtek-t"
+beta:
+  approver: "@wojtek-t"

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

Lines changed: 34 additions & 4 deletions
@@ -58,6 +58,7 @@
 - [Keep failedIndexes field as a bitmap](#keep-failedindexes-field-as-a-bitmap)
 - [Keep the list of failed indexes in a dedicated API object](#keep-the-list-of-failed-indexes-in-a-dedicated-api-object)
 - [Implicit limit on the number of failed indexes](#implicit-limit-on-the-number-of-failed-indexes)
+- [Skip uncountedTerminatedPods when backoffLimitPerIndex is used](#skip-uncountedterminatedpods-when-backofflimitperindex-is-used)
 - [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

@@ -77,7 +78,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 [kubernetes.io]: https://kubernetes.io/
@@ -728,9 +729,6 @@ in back-to-back releases.
 - Evaluate performance of Job controller for jobs using backoff limit per index
 with benchmarks at the integration or e2e level (discussion pointers from Alpha
 review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
-- Reevaluate ideas of not using `.status.uncountedTerminatedPods` for keeping track
-in the `.status.Failed` field. The idea is to prevent `backoffLimit` for setting.
-Discussion [link](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263879848).
 - The feature flag enabled by default

 #### GA
@@ -992,6 +990,8 @@ Recall that end users cannot usually observe component logs or access metrics.

 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

+This feature does not propose SLOs.
+
 <!--
 This is your opportunity to define what "normal" quality of service looks like
 for a feature.
@@ -1017,6 +1017,8 @@ Pick one more of these and delete the rest.
 - Metric name:
 - `job_sync_duration_seconds` (existing): can be used to see how much the
 feature enablement increases the time spent in the sync job
+- `job_finished_indexes_total` (new): can be used to determine if the indexes
+are marked failed,
 - Components exposing the metric: kube-controller-manager

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -1182,8 +1184,12 @@ details). For now, we leave it here.

 ###### How does this feature react if the API server and/or etcd is unavailable?

+No change from existing behavior of the Job controller.
+
 ###### What are other known failure modes?

+None.
+
 <!--
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -1199,6 +1205,8 @@ For each of them, fill in the following information by copying the below templat

 ###### What steps should be taken if SLOs are not being met to determine the problem?

+N/A.
+
 ## Implementation History

 <!--
@@ -1219,6 +1227,8 @@ Major milestones might include:
 - 2023-07-13: The implementation PR [Support BackoffLimitPerIndex in Jobs #118009](https://github.com/kubernetes/kubernetes/pull/118009) under review
 - 2023-07-18: Merge the API PR [Extend the Job API for BackoffLimitPerIndex](https://github.com/kubernetes/kubernetes/pull/119294)
 - 2023-07-18: Merge the Job Controller PR [Support BackoffLimitPerIndex in Jobs](https://github.com/kubernetes/kubernetes/pull/118009)
+- 2023-08-04: Merge user-facing docs PR [Docs update for Job's backoff limit per index (alpha in 1.28)](https://github.com/kubernetes/website/pull/41921)
+- 2023-08-06: Merge KEP update reflecting decisions during the implementation phase [Update for KEP3850 "Backoff Limit Per Index"](https://github.com/kubernetes/enhancements/pull/4123)

 ## Drawbacks

@@ -1457,6 +1467,26 @@ when a user sets `maxFailedIndexes` as 10^6 the Job may complete if the indexes
 and consecutive, but the Job may also fail if the size of the object exceeds the
 limits due to non-consecutive indexes failing.

+### Skip uncountedTerminatedPods when backoffLimitPerIndex is used
+
+It's been proposed (see [link](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263879848))
+that when backoffLimitPerIndex is used, then we could skip the interim step of
+recording terminated pods in `.status.uncountedTerminatedPods`.
+
+**Reasons for deferring / rejecting**
+
+First, if we stop using `.status.uncountedTerminatedPods` it means that
+`.status.failed` can no longer track the number of failed pods. Thus, it would
+require a change of semantic to denote just the number of failed indexes. This
+has downsides:
+- two different semantics of the field, depending on the used feature
+- lost information about some failed pods within an index (some users may care
+to investigate succeeded indexes with at least one failed pod)
+
+Second, it would only optimize the unhappy path, where there are failures. Also,
+the saving is only 1 request per 500 failed pods, which does not seem essential.
+
+
 ## Infrastructure Needed (Optional)

 <!--

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/kep.yaml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ stage: alpha
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.28"
+latest-milestone: "v1.29"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
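
Reviewer note: for readers unfamiliar with the feature this KEP promotes, the manifest below is a minimal sketch of an Indexed Job that sets `backoffLimitPerIndex` together with `maxFailedIndexes`, the two fields discussed in the README changes above. It is not taken from the commit. The Job name, image, command, and all numeric values are illustrative assumptions, and on clusters where the feature is still alpha it presumably also needs the corresponding feature gate enabled.

```yaml
# Hypothetical example; name, image, command, and numbers are illustrative only.
apiVersion: batch/v1
kind: Job
metadata:
  name: backoff-per-index-example
spec:
  completions: 10
  parallelism: 3
  completionMode: Indexed      # per-index backoff only applies to Indexed Jobs
  backoffLimitPerIndex: 1      # each index may be retried once before it is marked failed
  maxFailedIndexes: 5          # the Job fails once more than 5 indexes have failed
  template:
    spec:
      restartPolicy: Never     # required when backoffLimitPerIndex is set
      containers:
      - name: worker
        image: busybox:1.36
        # Odd-numbered indexes exit non-zero, so they exhaust their per-index budget.
        command: ["sh", "-c", "exit $((JOB_COMPLETION_INDEX % 2))"]
```

With a manifest along these lines, even indexes should succeed while odd indexes fail after their retry. Those failures are what `.status.failed` and the `job_finished_indexes_total` metric mentioned in the diff are meant to surface.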
