
Commit 71409f6

Remarks
1 parent 295014f commit 71409f6

File tree

  • keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs

1 file changed: +32 −4 lines

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

Lines changed: 32 additions & 4 deletions
@@ -724,8 +724,9 @@ in back-to-back releases.
 #### Beta
 
 - Address reviews and bug reports from Alpha users
-- Propose and implement metrics
+- Implement the `job_finished_indexes_total` metric
 - E2e tests are in Testgrid and linked in KEP
+- Move the [new reason declarations](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/pkg/controller/job/job_controller.go#L82-L89) from Job controller to the API package
 - Evaluate performance of Job controller for jobs using backoff limit per index
   with benchmarks at the integration or e2e level (discussion pointers from Alpha
   review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
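As a sketch of the "move the reason declarations" item above: the move could mean promoting the reasons to exported constants in `k8s.io/api/batch/v1`. The names below are illustrative guesses based on the failure modes this KEP introduces, not the linked declarations themselves:

```go
// Illustrative only: possible exported constants in k8s.io/api/batch/v1 for
// the Job failure reasons related to this KEP; the actual names live in the
// linked controller code until the move happens.
package v1

const (
	// JobReasonBackoffLimitExceeded is set on the Job's Failed condition when
	// the number of pod failures exceeds .spec.backoffLimit.
	JobReasonBackoffLimitExceeded string = "BackoffLimitExceeded"

	// JobReasonFailedIndexes is set when a Job using
	// .spec.backoffLimitPerIndex finishes with at least one failed index.
	JobReasonFailedIndexes string = "FailedIndexes"

	// JobReasonMaxFailedIndexesExceeded is set when the number of failed
	// indexes exceeds .spec.maxFailedIndexes.
	JobReasonMaxFailedIndexesExceeded string = "MaxFailedIndexesExceeded"
)
```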
@@ -756,6 +757,9 @@ A downgrade to a version which does not support this feature should not require
 any additional configuration changes. Jobs which specified
 `.spec.backoffLimitPerIndex` (to make use of this feature) will be
 handled in a default way, ie. using the `.spec.backoffLimit`.
+However, since the `.spec.backoffLimit` defaults to the max int32 value
+(see [here](#job-api)), it might require a manual setting of the `.spec.backoffLimit`
+to ensure failed pods are not retried indefinitely.
 
 <!--
 If applicable, how will the component be upgraded and downgraded? Make sure
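To illustrate the mitigation, here is a hedged sketch (not from the KEP) of a Job that opts into per-index backoff while also pinning `.spec.backoffLimit` explicitly, so a cluster downgraded to a version without the feature still enforces a finite global limit; the field and helper names are assumptions based on this KEP's Job API section:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// jobWithBoundedBackoff builds an Indexed Job that uses backoffLimitPerIndex
// but also sets an explicit global backoffLimit, instead of relying on the
// max-int32 default, so failed pods stay bounded after a downgrade.
func jobWithBoundedBackoff() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "per-index-sample"},
		Spec: batchv1.JobSpec{
			CompletionMode:       ptr.To(batchv1.IndexedCompletion),
			Completions:          ptr.To[int32](10),
			Parallelism:          ptr.To[int32](10),
			BackoffLimitPerIndex: ptr.To[int32](1),  // per-index retry budget (this feature)
			BackoffLimit:         ptr.To[int32](10), // explicit global cap instead of the max int32 default
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // required for backoffLimitPerIndex
					Containers: []corev1.Container{
						{Name: "main", Image: "busybox", Command: []string{"true"}},
					},
				},
			},
		},
	}
}
```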
@@ -876,7 +880,8 @@ The Job controller starts to handle pod failures according to the specified
 
 ###### Are there any tests for feature enablement/disablement?
 
-No. The tests will be added in Alpha.
+Yes, there is an [integration test](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/test/integration/job/job_test.go#L763)
+which tests the following path: enablement -> disablement -> re-enablement.
 
 <!--
 The e2e framework does not currently support enabling or disabling feature
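The linked test is the source of truth; below is only a rough sketch of the enablement -> disablement -> re-enablement pattern it follows, assuming the `JobBackoffLimitPerIndex` feature gate name and the standard `featuregatetesting` helper (whose cleanup semantics vary between releases):

```go
package job

import (
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

// TestBackoffLimitPerIndexGateToggle outlines the three phases covered by the
// integration test; the Job creation and pod-failure steps are elided.
func TestBackoffLimitPerIndexGateToggle(t *testing.T) {
	// Phase 1: feature enabled; failures of a Job with
	// .spec.backoffLimitPerIndex are tracked per index.
	featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.JobBackoffLimitPerIndex, true)
	// ... create the Job and fail a pod for one index ...

	// Phase 2: feature disabled; further failures count only against
	// .spec.backoffLimit and the per-index status is no longer updated.
	featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.JobBackoffLimitPerIndex, false)
	// ... fail another pod and verify the per-index fields stay unchanged ...

	// Phase 3: feature re-enabled; per-index tracking resumes and the Job
	// finishes according to the per-index limits.
	featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.JobBackoffLimitPerIndex, true)
	// ... verify the terminal Job conditions ...
}
```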
@@ -899,7 +904,16 @@ This section must be completed when targeting beta to a release.
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-The change is opt-in, it doesn't impact already running workloads.
+This change does not impact how a rollout or rollback can fail.
+
+The change is opt-in, thus a rollout doesn't impact already running pods.
+
+A rollback might affect how pod failures are handled, since they will
+be counted only against `.spec.backoffLimit`, which is defaulted to the max int32
+value when using `.spec.backoffLimitPerIndex` (see [here](#job-api)).
+Thus, similarly to the downgrade case (see [here](#downgrade)),
+it might be required to manually set `.spec.backoffLimit` to ensure failed pods
+are not retried indefinitely.
 
 <!--
 Try to be as paranoid as possible - e.g., what if some components will restart
@@ -1023,7 +1037,7 @@ are marked failed,
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-For Beta we will consider introduction of a new metric `job_finished_indexes_total`
+For Beta we will introduce a new metric `job_finished_indexes_total`
 with labels `status=(failed|succeeded)`, and `backoffLimit=(perIndex|global)`.
 It will count the number of failed and succeeded indexes across jobs using
 `backoffLimitPerIndex`, or regular Indexed Jobs (using only `.spec.backoffLimit`).
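For reference, a minimal sketch of how such a counter could be declared with the `k8s.io/component-base/metrics` helpers used by the Job controller; the metric name and labels come from this KEP, while the subsystem, help text, and registration site are assumptions:

```go
package metrics

import "k8s.io/component-base/metrics"

// JobFinishedIndexesTotal counts finished Job indexes, partitioned by whether
// the index succeeded or failed and by whether the Job uses a per-index or a
// global backoff limit.
var JobFinishedIndexesTotal = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Subsystem:      "job_controller",
		Name:           "job_finished_indexes_total",
		Help:           "The number of finished indexes of Indexed Jobs, by status and backoff limit type.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"status", "backoffLimit"},
)
```

The controller would then bump it with, for example, `JobFinishedIndexesTotal.WithLabelValues("failed", "perIndex").Inc()` when it marks an index as failed; the increment site is likewise an assumption.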
@@ -1169,6 +1183,20 @@ Think through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No. This feature does not introduce any resource-exhausting operations.
+
+<!--
+Focus not just on happy cases, but primarily on more pathological cases
+(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
+If any of the resources can be exhausted, how this is mitigated with the existing limits
+(e.g. pods per node) or new limits added by this KEP?
+
+Are there any tests that were run/should be run to understand performance characteristics better
+and validate the declared limits?
+-->
+
 ### Troubleshooting
 
 <!--
