Backoff limit per Job Index #3967

mimowo · 2023-04-26T10:26:01Z

One-line PR description: Support for executing all indexes in case of failures using backoff limit per Index

Issue link: Backoff Limit Per Index For Indexed Jobs #3850

Other comments:
- supersedes: Backoff Limit Per Job #3774

linux-foundation-easycla · 2023-04-26T10:26:05Z

The committers listed above are authorized under a signed CLA.

✅ login: jensentanlo (44769f5)
✅ login: mimowo / name: Michał Woźniak (cd0d8a0, 30f964f)

mimowo · 2023-04-26T10:59:37Z

FYI @jensentanlo @alculquicondor @soltysh @kerthcet

alculquicondor · 2023-04-26T17:08:56Z

/wg batch

soltysh

/lgtm

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

wojtek-t

From PRR perspective I'm fine - I wanted to follow up on scalability aspect.

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

wojtek-t · 2023-06-06T13:48:01Z

keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md

+the exponential backoff delay hasn't elapsed for any index (allowing pod
+recreation), then we requeue the next Job status update. The delay is computed
+as the minimum of all delays computed for all indexes requiring pod recreation,
+but not less than 1s.


This is not strictly related to PRR, rather wearing my scalability hat now.
[I would also be happy to discuss it separately if preferred.]

My concern is purely code-related - you have this magic immediate parameter in:
https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L499
I have two concerns:

1. Why do we want to use a separate backoff in those cases (one that effectively starts from 0 instead of 1s)?

2. Why don't we do any batching when the Job object itself changes? I think the new PR (which skips batching only when the job generation changes) seems to address this part.

(1.) I agree. IIUC the idea of backoff is to throttle pod creation in case of consecutive failures. This is already covered here: https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L1415-L1419. However, it is currently also used for throttling in other places for non-obvious reasons. It might also be a leftover from a relatively recent refactoring of how backoff works. I think this was always a little underspecified, so +1 to cleaning it up at some point. Also, I think it makes sense to modify the final backoff to be max(1s, computed backoff), but this is a detail. @alculquicondor @soltysh wdyt? Maybe we should open a separate Issue to clean up the backoff status?

(2.) yes, I don't think we want to batch (delay) job updates in case of spec updates. For example, spec updates are used to update suspend status, which could result in delayed job start. The PR covers the distinction.

Backoff is used for errors from API calls (like conflicts). It is not used for events (pod creation, updates, etc). I guess a better name for the variable would be useBackoff or have separate functions altogether.

We want to be able to respond to changes to the spec ASAP (new parallelism, suspend, etc).
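
One way to read the "separate functions altogether" suggestion is sketched below in Go. The `keyQueue` interface is a hypothetical stand-in for a controller work queue (its method names mirror `Add` and `AddRateLimited` from client-go's workqueue package, but nothing here imports it), and `enqueueForEvent` and `enqueueAfterSyncError` are made-up names. This is only an illustration of the naming idea, not how the Job controller is structured.

```go
package main

import "fmt"

// keyQueue is a hypothetical, minimal stand-in for a controller work queue.
type keyQueue interface {
	Add(key string)            // enqueue for immediate processing
	AddRateLimited(key string) // enqueue after the queue's backoff for this key
}

// recordingQueue records which method was used, so the example is observable.
type recordingQueue struct {
	calls []string
}

func (q *recordingQueue) Add(key string) {
	q.calls = append(q.calls, "immediate:"+key)
}

func (q *recordingQueue) AddRateLimited(key string) {
	q.calls = append(q.calls, "backoff:"+key)
}

// enqueueForEvent is used when the Job or one of its pods changed and the
// controller should react as soon as possible (new parallelism, suspend, ...).
func enqueueForEvent(q keyQueue, key string) {
	q.Add(key)
}

// enqueueAfterSyncError is used when a sync failed, for example on an API
// conflict, and retries should be spaced out by the queue's backoff.
func enqueueAfterSyncError(q keyQueue, key string) {
	q.AddRateLimited(key)
}

func main() {
	q := &recordingQueue{}
	enqueueForEvent(q, "default/my-job")
	enqueueAfterSyncError(q, "default/my-job")
	fmt.Println(q.calls)
}
```

Two named functions make the intent visible at each call site, which a boolean flag such as immediate or useBackoff does not.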

> Maybe we should open a separate Issue to clean up the backoff status?

Just open a PR with your proposal of what clean code looks like :)
There is a lot of heritage in the backoff code.

> Just open a PR with your proposal of what clean code looks like :)

We have one PR currently open for the fix + potentially small cleanup. I will consider another in the future, but some changes may not be just cosmetic, but also touching on semantics (so requiring discussion). Still, I suggest we move this discussion out of KEP to offline, under PRs or under Issues.

For (1) - clearly backoff isn't used only for errors from API calls, because errors from API calls don't generate events and we have those in event handlers:
https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L318-L321

@mimowo - while opening a PR is certainly a good thing, can you open an issue too, so that we won't forget about it?

For (2) - I agree.

@wojtek-t sure, opened: kubernetes/kubernetes#118527


soltysh · 2023-06-06T14:42:23Z

/lgtm

wojtek-t · 2023-06-07T07:34:50Z

/lgtm
/approve PRR

/hold
@mimowo - feel free to cancel hold once the issue is opened (and please cc me to that issue)

mimowo · 2023-06-07T08:24:30Z

/hold cancel
as the issue is now open: kubernetes/kubernetes#118527

wojtek-t · 2023-06-07T09:38:52Z

@soltysh - it needs your approval

soltysh · 2023-06-07T14:36:36Z

> @soltysh - it needs your approval

oh, I thought I already did
/approve

k8s-ci-robot · 2023-06-07T14:36:53Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~keps/prod-readiness/OWNERS~~ [wojtek-t]
~~keps/sig-apps/OWNERS~~ [soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

deads2k · 2023-06-07T19:18:44Z

Spoke on Slack and live prior. The API looked good too.

k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Apr 26, 2023

k8s-ci-robot requested review from kow3ns and soltysh April 26, 2023 10:26

k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 26, 2023

mimowo changed the title ~~Backoff limit per index~~ Backoff limit per Job Index Apr 26, 2023

mimowo force-pushed the backoff-limit-per-index branch from a739ffc to 5230713 Compare April 26, 2023 10:56

mimowo changed the title ~~Backoff limit per Job Index~~ WIP: Backoff limit per Job Index Apr 26, 2023

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2023

mimowo marked this pull request as draft April 26, 2023 10:59

mimowo mentioned this pull request Apr 26, 2023

Add a new field maxRestartTimes to podSpec when running into RestartPolicyOnFailure #3322

Open

4 tasks

mimowo force-pushed the backoff-limit-per-index branch 2 times, most recently from 451acea to 7349dc4 Compare April 26, 2023 14:28

k8s-ci-robot added the wg/batch Categorizes an issue or PR as relevant to WG Batch. label Apr 26, 2023

mimowo force-pushed the backoff-limit-per-index branch 12 times, most recently from 960cff1 to bc634d9 Compare April 27, 2023 12:30

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023

Backoff Limit Per Index

cd0d8a0

mimowo force-pushed the backoff-limit-per-index branch from 03c5335 to cd0d8a0 Compare June 5, 2023 09:56

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023

soltysh approved these changes Jun 5, 2023


k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023

wojtek-t reviewed Jun 5, 2023


k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023

mimowo force-pushed the backoff-limit-per-index branch from 16cf877 to 1f8696b Compare June 5, 2023 11:45

Remarks

30f964f

mimowo force-pushed the backoff-limit-per-index branch from 1f8696b to 30f964f Compare June 5, 2023 14:27

mimowo requested a review from wojtek-t June 6, 2023 11:59

wojtek-t reviewed Jun 6, 2023


tenzen-y mentioned this pull request Jun 6, 2023

KEP-3998: Job success/completion policy #4062

Merged

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 6, 2023

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 7, 2023

mimowo mentioned this pull request Jun 7, 2023

Cleanup the Job controller exponential backoff delay kubernetes/kubernetes#118527

Closed

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 7, 2023

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2023

k8s-ci-robot merged commit 19a6057 into kubernetes:master Jun 7, 2023

k8s-ci-robot added this to the v1.28 milestone Jun 7, 2023

mimowo mentioned this pull request Jul 11, 2023

Support BackoffLimitPerIndex in Jobs kubernetes/kubernetes#118009

Merged
