@@ -77,7 +78,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Production readiness review completed
 - [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
 
 [kubernetes.io]: https://kubernetes.io/
@@ -728,9 +729,6 @@ in back-to-back releases.
 - Evaluate performance of Job controller for jobs using backoff limit per index
 with benchmarks at the integration or e2e level (discussion pointers from Alpha
 review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
-- Reevaluate ideas of not using `.status.uncountedTerminatedPods` for keeping track
-in the `.status.Failed` field. The idea is to prevent `backoffLimit` for setting.
@@ -992,6 +990,8 @@ Recall that end users cannot usually observe component logs or access metrics.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
+This feature does not propose SLOs.
+
 <!--
 This is your opportunity to define what "normal" quality of service looks like
 for a feature.
@@ -1017,6 +1017,8 @@ Pick one more of these and delete the rest.
 - Metric name:
 - `job_sync_duration_seconds` (existing): can be used to see how much the
 feature enablement increases the time spent in the sync job
+- `job_finished_indexes_total` (new): can be used to determine if the indexes
+are marked failed,
 - Components exposing the metric: kube-controller-manager
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
@@ -1182,8 +1184,12 @@ details). For now, we leave it here.
 
 ###### How does this feature react if the API server and/or etcd is unavailable?
 
+No change from existing behavior of the Job controller.
+
 ###### What are other known failure modes?
 
+None.
+
 <!--
 For each of them, fill in the following information by copying the below template:
 - [Failure mode brief description]
@@ -1199,6 +1205,8 @@ For each of them, fill in the following information by copying the below templat
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
+N/A.
+
 ## Implementation History
 
 <!--
@@ -1219,6 +1227,8 @@ Major milestones might include:
 - 2023-07-13: The implementation PR [Support BackoffLimitPerIndex in Jobs #118009](https://github.com/kubernetes/kubernetes/pull/118009) under review
 - 2023-07-18: Merge the API PR [Extend the Job API for BackoffLimitPerIndex](https://github.com/kubernetes/kubernetes/pull/119294)
 - 2023-07-18: Merge the Job Controller PR [Support BackoffLimitPerIndex in Jobs](https://github.com/kubernetes/kubernetes/pull/118009)
+- 2023-08-04: Merge user-facing docs PR [Docs update for Job's backoff limit per index (alpha in 1.28)](https://github.com/kubernetes/website/pull/41921)
+- 2023-08-06: Merge KEP update reflecting decisions during the implementation phase [Update for KEP3850 "Backoff Limit Per Index"](https://github.com/kubernetes/enhancements/pull/4123)
 
 ## Drawbacks
 
@@ -1457,6 +1467,26 @@ when a user sets `maxFailedIndexes` as 10^6 the Job may complete if the indexes
 and consecutive, but the Job may also fail if the size of the object exceeds the
 limits due to non-consecutive indexes failing.
 
+### Skip uncountedTerminatedPods when backoffLimitPerIndex is used
+
+It's been proposed (see [link](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263879848))
+that when backoffLimitPerIndex is used, then we could skip the interim step of
+recording terminated pods in `.status.uncountedTerminatedPods`.
+
+**Reasons for deferring / rejecting**
+
+First, if we stop using `.status.uncountedTerminatedPods` it means that
+`.status.failed` can no longer track the number of failed pods. Thus, it would
+require a change of semantic to denote just the number of failed indexes. This
+has downsides:
+- two different semantics of the field, depending on the used feature
+- lost information about some failed pods within an index (some users may care
+to investigate succeeded indexes with at least one failed pod)
+
+Second, it would only optimize the unhappy path, where there are failures. Also,
+the saving is only 1 request per 500 failed pods, which does not seem essential.
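
Note on the "1 request per 500 failed pods" figure in the hunk above: the arithmetic can be sketched roughly as below. This is an illustrative model only, not code from the Job controller. The names `batchSize` and `extraStatusUpdates` are hypothetical, and rounding up to whole requests is an assumption; the only number taken from the KEP text is 500.

```go
package main

import "fmt"

// batchSize is an assumption based on the "1 request per 500 failed pods"
// figure in the KEP text; it is not read from the Job controller source.
const batchSize = 500

// extraStatusUpdates estimates how many interim status updates would be
// saved by skipping the uncounted-pods step, assuming one update per
// batch of failed pods and rounding up to whole requests.
func extraStatusUpdates(failedPods, batch int) int {
	if failedPods <= 0 || batch <= 0 {
		return 0
	}
	return (failedPods + batch - 1) / batch // ceiling division
}

func main() {
	for _, n := range []int{0, 1, 500, 501, 10000} {
		fmt.Printf("failed pods: %5d -> interim updates saved: %d\n",
			n, extraStatusUpdates(n, batchSize))
	}
}
```

Under these assumptions, even 10000 failed pods would save only about 20 requests, which is in line with the KEP's conclusion that the optimization does not seem essential.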