
Commit 71409f6

Remarks
1 parent 295014f commit 71409f6

File tree

  • keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs

1 file changed: +32 −4 lines

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

Lines changed: 32 additions & 4 deletions
@@ -724,8 +724,9 @@ in back-to-back releases.
 #### Beta
 
 - Address reviews and bug reports from Alpha users
-- Propose and implement metrics
+- Implement the `job_finished_indexes_total` metric
 - E2e tests are in Testgrid and linked in KEP
+- Move the [new reason declarations](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/pkg/controller/job/job_controller.go#L82-L89) from Job controller to the API package
 - Evaluate performance of Job controller for jobs using backoff limit per index
   with benchmarks at the integration or e2e level (discussion pointers from Alpha
   review: [thread1](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1261694406) and [thread2](https://github.com/kubernetes/kubernetes/pull/118009#discussion_r1263862076))
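As a sketch of the "move the reason declarations" item above: the move could mean promoting the reasons to exported constants in `k8s.io/api/batch/v1`. The names below are illustrative guesses based on the failure modes this KEP introduces, not the linked declarations themselves:

```go
// Illustrative only: possible exported constants in k8s.io/api/batch/v1 for
// the Job failure reasons related to this KEP; the actual names live in the
// linked controller code until the move happens.
package v1

const (
	// JobReasonBackoffLimitExceeded is set on the Job's Failed condition when
	// the number of pod failures exceeds .spec.backoffLimit.
	JobReasonBackoffLimitExceeded string = "BackoffLimitExceeded"

	// JobReasonFailedIndexes is set when a Job using
	// .spec.backoffLimitPerIndex finishes with at least one failed index.
	JobReasonFailedIndexes string = "FailedIndexes"

	// JobReasonMaxFailedIndexesExceeded is set when the number of failed
	// indexes exceeds .spec.maxFailedIndexes.
	JobReasonMaxFailedIndexesExceeded string = "MaxFailedIndexesExceeded"
)
```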
@@ -756,6 +757,9 @@ A downgrade to a version which does not support this feature should not require
 any additional configuration changes. Jobs which specified
 `.spec.backoffLimitPerIndex` (to make use of this feature) will be
 handled in a default way, ie. using the `.spec.backoffLimit`.
+However, since the `.spec.backoffLimit` defaults to the max int32 value
+(see [here](#job-api)), it might require a manual setting of the `.spec.backoffLimit`
+to ensure failed pods are not retried indefinitely.
 
 <!--
 If applicable, how will the component be upgraded and downgraded? Make sure
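To illustrate the mitigation, here is a hedged sketch (not from the KEP) of a Job that opts into per-index backoff while also pinning `.spec.backoffLimit` explicitly, so a cluster downgraded to a version without the feature still enforces a finite global limit; the field and helper names are assumptions based on this KEP's Job API section:

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// jobWithBoundedBackoff builds an Indexed Job that uses backoffLimitPerIndex
// but also sets an explicit global backoffLimit, instead of relying on the
// max-int32 default, so failed pods stay bounded after a downgrade.
func jobWithBoundedBackoff() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "per-index-sample"},
		Spec: batchv1.JobSpec{
			CompletionMode:       ptr.To(batchv1.IndexedCompletion),
			Completions:          ptr.To[int32](10),
			Parallelism:          ptr.To[int32](10),
			BackoffLimitPerIndex: ptr.To[int32](1),  // per-index retry budget (this feature)
			BackoffLimit:         ptr.To[int32](10), // explicit global cap instead of the max int32 default
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever, // required for backoffLimitPerIndex
					Containers: []corev1.Container{
						{Name: "main", Image: "busybox", Command: []string{"true"}},
					},
				},
			},
		},
	}
}
```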
@@ -876,7 +880,8 @@ The Job controller starts to handle pod failures according to the specified
 
 ###### Are there any tests for feature enablement/disablement?
 
-No. The tests will be added in Alpha.
+Yes, there is an [integration test](https://github.com/kubernetes/kubernetes/blob/dc28eeaa3a6e18ef683f4b2379234c2284d5577e/test/integration/job/job_test.go#L763)
+which tests the following path: enablement -> disablement -> re-enablement.
 
 <!--
 The e2e framework does not currently support enabling or disabling feature
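The linked test is the source of truth; below is only a rough sketch of the enablement -> disablement -> re-enablement pattern it follows, assuming the `JobBackoffLimitPerIndex` feature gate name and the standard `featuregatetesting` helper (whose cleanup semantics vary between releases):

```go
package job

import (
	"testing"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	featuregatetesting "k8s.io/component-base/featuregate/testing"
	"k8s.io/kubernetes/pkg/features"
)

// TestBackoffLimitPerIndexGateToggle outlines the three phases covered by the
// integration test; the Job creation and pod-failure steps are elided.
func TestBackoffLimitPerIndexGateToggle(t *testing.T) {
	// Phase 1: feature enabled; failures of a Job with
	// .spec.backoffLimitPerIndex are tracked per index.
	featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.JobBackoffLimitPerIndex, true)
	// ... create the Job and fail a pod for one index ...

	// Phase 2: feature disabled; further failures count only against
	// .spec.backoffLimit and the per-index status is no longer updated.
	featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.JobBackoffLimitPerIndex, false)
	// ... fail another pod and verify the per-index fields stay unchanged ...

	// Phase 3: feature re-enabled; per-index tracking resumes and the Job
	// finishes according to the per-index limits.
	featuregatetesting.SetFeatureGateDuringTest(t, utilfeature.DefaultFeatureGate, features.JobBackoffLimitPerIndex, true)
	// ... verify the terminal Job conditions ...
}
```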
@@ -899,7 +904,16 @@ This section must be completed when targeting beta to a release.
 
 ###### How can a rollout or rollback fail? Can it impact already running workloads?
 
-The change is opt-in, it doesn't impact already running workloads.
+This change does not impact how a rollout or rollback can fail.
+
+The change is opt-in, thus a rollout doesn't impact already running pods.
+
+A rollback might affect how pod failures are handled, since they will
+be counted only against `.spec.backoffLimit`, which is defaulted to the max int32
+value when using `.spec.backoffLimitPerIndex` (see [here](#job-api)).
+Thus, similarly to the downgrade case (see [here](#downgrade)),
+it might be required to manually set `.spec.backoffLimit` to ensure failed pods
+are not retried indefinitely.
 
 <!--
 Try to be as paranoid as possible - e.g., what if some components will restart
@@ -1023,7 +1037,7 @@ are marked failed,
 
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
 
-For Beta we will consider introduction of a new metric `job_finished_indexes_total`
+For Beta we will introduce a new metric `job_finished_indexes_total`
 with labels `status=(failed|succeeded)`, and `backoffLimit=(perIndex|global)`.
 It will count the number of failed and succeeded indexes across jobs using
 `backoffLimitPerIndex`, or regular Indexed Jobs (using only `.spec.backoffLimit`).
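For reference, a minimal sketch of how such a counter could be declared with the `k8s.io/component-base/metrics` helpers used by the Job controller; the metric name and labels come from this KEP, while the subsystem, help text, and registration site are assumptions:

```go
package metrics

import "k8s.io/component-base/metrics"

// JobFinishedIndexesTotal counts finished Job indexes, partitioned by whether
// the index succeeded or failed and by whether the Job uses a per-index or a
// global backoff limit.
var JobFinishedIndexesTotal = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Subsystem:      "job_controller",
		Name:           "job_finished_indexes_total",
		Help:           "The number of finished indexes of Indexed Jobs, by status and backoff limit type.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"status", "backoffLimit"},
)
```

The controller would then bump it with, for example, `JobFinishedIndexesTotal.WithLabelValues("failed", "perIndex").Inc()` when it marks an index as failed; the increment site is likewise an assumption.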
@@ -1169,6 +1183,20 @@ Think through this both in small and large cases, again with respect to the
 [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
 -->
 
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No. This feature does not introduce any resource-exhausting operations.
+
+<!--
+Focus not just on happy cases, but primarily on more pathological cases
+(e.g. probes taking a minute instead of milliseconds, failed pods consuming resources, etc.).
+If any of the resources can be exhausted, how this is mitigated with the existing limits
+(e.g. pods per node) or new limits added by this KEP?
+
+Are there any tests that were run/should be run to understand performance characteristics better
+and validate the declared limits?
+-->
+
 ### Troubleshooting
 
 <!--
