keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md
Lines changed: 32 additions & 4 deletions
Original file line number
Diff line number
Diff line change
@@ -724,8 +724,9 @@ in back-to-back releases.
724
724
#### Beta
725
725
726
726
- Address reviews and bug reports from Alpha users
727
-
- Propose and implement metrics
727
+
- Implement the `job_finished_indexes_total` metric
728
728
- E2e tests are in Testgrid and linked in KEP
729
+
- Move the [new reason declarations](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/pkg/controller/job/job_controller.go#L82-L89) from Job controller to the API package
729
730
- Evaluate performance of Job controller for jobs using backoff limit per index
730
731
with benchmarks at the integration or e2e level (discussion pointers from Alpha
731
732
review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
@@ -756,6 +757,9 @@ A downgrade to a version which does not support this feature should not require
756
757
any additional configuration changes. Jobs which specified
757
758
`.spec.backoffLimitPerIndex` (to make use of this feature) will be
758
759
handled in the default way, i.e., using the `.spec.backoffLimit`.
760
+
However, since the `.spec.backoffLimit` defaults to max int32 value
761
+
(see [here](#job-api)), it might require manually setting the `.spec.backoffLimit`
762
+
to ensure failed pods are not retried indefinitely.
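
As an illustration of the manual setting described above, the manifest below is a minimal sketch of an Indexed Job that sets both fields explicitly, so that a downgrade falls back to a finite `.spec.backoffLimit` rather than the max int32 default. The Job name, image, command, and all numeric values are hypothetical and chosen only for the example; they are not taken from the KEP.

```yaml
# Sketch only: name, image, command, and limits are illustrative assumptions.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-indexed-job
spec:
  completionMode: Indexed
  completions: 4
  parallelism: 2
  # Per-index retry budget, used while the feature is enabled.
  backoffLimitPerIndex: 1
  # Explicit global budget. Without it, the field defaults to max int32
  # when backoffLimitPerIndex is set, so after a downgrade failed pods
  # could be retried effectively indefinitely.
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "exit 0"]
```

The value `6` matches the usual `.spec.backoffLimit` default for Jobs that do not use `backoffLimitPerIndex`. A suitable value depends on how many retries across all indexes are acceptable once per-index counting is no longer applied.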
759
763
760
764
<!--
761
765
If applicable, how will the component be upgraded and downgraded? Make sure
@@ -876,7 +880,8 @@ The Job controller starts to handle pod failures according to the specified
876
880
877
881
###### Are there any tests for feature enablement/disablement?
878
882
879
-
No. The tests will be added in Alpha.
883
+
Yes, there is an [integration test](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/test/integration/job/job_test.go#L763)
884
+
which tests the following path: enablement -> disablement -> re-enablement.
880
885
881
886
<!--
882
887
The e2e framework does not currently support enabling or disabling feature
@@ -899,7 +904,16 @@ This section must be completed when targeting beta to a release.
899
904
900
905
###### How can a rollout or rollback fail? Can it impact already running workloads?
901
906
902
-
The change is opt-in, it doesn't impact already running workloads.
907
+
This change does not impact how a rollout or rollback can fail.
908
+
909
+
The change is opt-in, thus a rollout doesn't impact already running pods.
910
+
911
+
The rollback might affect how pod failures are handled, since they will
912
+
be counted only against `.spec.backoffLimit`, which is defaulted to max int32
913
+
value, when using `.spec.backoffLimitPerIndex` (see [here](#job-api)).
914
+
Thus, as in the case of a downgrade (see [here](#downgrade)),
915
+
it might be required to manually set `spec.backoffLimit` to ensure failed pods
916
+
are not retried indefinitely.
903
917
904
918
<!--
905
919
Try to be as paranoid as possible - e.g., what if some components will restart
@@ -1023,7 +1037,7 @@ are marked failed,
1023
1037
1024
1038
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
1025
1039
1026
-
For Beta we will consider introduction of a new metric `job_finished_indexes_total`
1040
+
For Beta we will introduce a new metric `job_finished_indexes_total`
1027
1041
with labels `status=(failed|succeeded)`, and `backoffLimit=(perIndex|global)`.
1028
1042
It will count the number of failed and succeeded indexes across jobs using
1029
1043
`backoffLimitPerIndex`, or regular Indexed Jobs (using only `.spec.backoffLimit`).
@@ -1169,6 +1183,20 @@ This through this both in small and large cases, again with respect to the