`keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md` (135 additions, 5 deletions)
```diff
@@ -724,8 +724,9 @@ in back-to-back releases.
 #### Beta
 
 - Address reviews and bug reports from Alpha users
-- Propose and implement metrics
+- Implement the `job_finished_indexes_total` metric
 - E2e tests are in Testgrid and linked in KEP
+- Move the [new reason declarations](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/pkg/controller/job/job_controller.go#L82-L89) from Job controller to the API package
 - Evaluate performance of Job controller for jobs using backoff limit per index
   with benchmarks at the integration or e2e level (discussion pointers from Alpha
   review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
```
```diff
@@ -756,6 +757,9 @@ A downgrade to a version which does not support this feature should not require
 any additional configuration changes. Jobs which specified
 `.spec.backoffLimitPerIndex` (to make use of this feature) will be
 handled in a default way, i.e. using the `.spec.backoffLimit`.
+However, since the `.spec.backoffLimit` defaults to the max int32 value
+(see [here](#job-api)), it might require a manual setting of the `.spec.backoffLimit`
+to ensure failed pods are not retried indefinitely.
 
 <!--
 If applicable, how will the component be upgraded and downgraded? Make sure
```
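The downgrade remediation above can be sketched as a manifest. This is a hypothetical example assuming the Job API described in this KEP; the Job name, image, and concrete limit values are illustrative, not taken from the KEP. The Job opts into per-index backoff but also sets `.spec.backoffLimit` explicitly, so that after a downgrade (when `.spec.backoffLimitPerIndex` is no longer honored) retries are bounded by a sensible global limit instead of the max int32 default:

```yaml
# Hypothetical manifest; names and limit values are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-example
spec:
  completionMode: Indexed     # backoffLimitPerIndex requires an Indexed Job
  completions: 5
  parallelism: 5
  backoffLimitPerIndex: 2     # per-index retry budget while the feature is enabled
  maxFailedIndexes: 3         # optional: fail the whole Job once 3 indexes fail
  backoffLimit: 6             # explicit global cap; used after a downgrade
                              # instead of the max int32 default
  template:
    spec:
      restartPolicy: Never    # per-index backoff assumes pods are not restarted in place
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo $JOB_COMPLETION_INDEX"]
```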
```diff
@@ -876,7 +880,8 @@ The Job controller starts to handle pod failures according to the specified
 
 ###### Are there any tests for feature enablement/disablement?
 
-No. The tests will be added in Alpha.
+Yes, there is an [integration test](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/test/integration/job/job_test.go#L763)
+which tests the following path: enablement -> disablement -> re-enablement.
 
 <!--
 The e2e framework does not currently support enabling or disabling feature
```
```diff
@@ -899,7 +904,16 @@ This section must be completed when targeting beta to a release.
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-The change is opt-in, it doesn't impact already running workloads.
+This change does not impact how the rollout or rollback fails.
+
+The change is opt-in, thus a rollout doesn't impact already running pods.
+
+The rollback might affect how pod failures are handled, since they will
+be counted only against `.spec.backoffLimit`, which is defaulted to the max int32
+value when using `.spec.backoffLimitPerIndex` (see [here](#job-api)).
+Thus, similarly to the case of a downgrade (see [here](#downgrade)),
+it might be required to manually set `.spec.backoffLimit` to ensure failed pods
+are not retried indefinitely.
 
 <!--
 Try to be as paranoid as possible - e.g., what if some components will restart
```
```diff
@@ -932,7 +946,109 @@ that might indicate a serious problem?
 
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
 
-It will be tested manually prior to beta launch.
+The upgrade->downgrade->upgrade testing was done manually using the `alpha`
+version in 1.28 with the following steps:
+
+1. Start the cluster with the `JobBackoffLimitPerIndex` enabled:
```
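Step 1 of the manual test above could be performed, for instance, with a `kind` cluster whose configuration turns the feature gate on. This is a hypothetical sketch, not the setup the KEP authors necessarily used; the `kind` config schema and node image choice are assumptions:

```yaml
# Hypothetical kind configuration; assumes a Kubernetes 1.28 node image.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  JobBackoffLimitPerIndex: true
nodes:
- role: control-plane
```

The cluster would be created with `kind create cluster --config <file>`; any bootstrapper that can pass `--feature-gates=JobBackoffLimitPerIndex=true` to the control-plane components would serve equally well.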