Skip to content

Conversation

@mimowo
Copy link
Contributor

@mimowo mimowo commented Apr 26, 2023

  • One-line PR description: Support for executing all indexes in case of failures using backoff limit per Index

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Apr 26, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Apr 26, 2023
@k8s-ci-robot k8s-ci-robot requested review from kow3ns and soltysh April 26, 2023 10:26
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 26, 2023
@mimowo mimowo changed the title Backoff limit per index Backoff limit per Job Index Apr 26, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index branch from a739ffc to 5230713 Compare April 26, 2023 10:56
@mimowo
Copy link
Contributor Author

mimowo commented Apr 26, 2023

@mimowo mimowo changed the title Backoff limit per Job Index WIP: Backoff limit per Job Index Apr 26, 2023
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2023
@mimowo mimowo marked this pull request as draft April 26, 2023 10:59
@mimowo mimowo force-pushed the backoff-limit-per-index branch 2 times, most recently from 451acea to 7349dc4 Compare April 26, 2023 14:28
@alculquicondor
Copy link
Member

/wg batch

@k8s-ci-robot k8s-ci-robot added the wg/batch Categorizes an issue or PR as relevant to WG Batch. label Apr 26, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index branch 12 times, most recently from 960cff1 to bc634d9 Compare April 27, 2023 12:30
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index branch from 03c5335 to cd0d8a0 Compare June 5, 2023 09:56
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 5, 2023
@mimowo mimowo force-pushed the backoff-limit-per-index branch from 16cf877 to 1f8696b Compare June 5, 2023 11:45
@mimowo mimowo force-pushed the backoff-limit-per-index branch from 1f8696b to 30f964f Compare June 5, 2023 14:27
@mimowo mimowo requested a review from wojtek-t June 6, 2023 11:59
Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From PRR perspective I'm fine - I wanted to follow up on scalability aspect.

the expotential backoff delay hasn't elapsed for any index (allowing pod
recreation), then we requeue the next Job status update. The delay is computed
as minimum of all delays computed for all indexes requiring pod recreation,
but not less that 1s.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not strictly related to PRR, rather wearing my scalability hat now.
[I would also be happy to discuss it separately if preferred.]

My concern is purely code-related - you have this magic immediate parameter in:
https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L499
I have concerns:

  1. why we want to use a different separate backoff in those cases (that effectively starts from 0 instead of 1s)
  2. why we don't do any batching when Job object is changing itself, but I think the new PR (that doesn't do batching only when the job generation changes) seems to address this part

Copy link
Contributor Author

@mimowo mimowo Jun 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1.) I agree. IIUC the idea of backoff is to throttle pod creation in case of consecutive failures. This is covered already here: https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L1415-L1419. However, currently it is also used for throttling in other places for non-obvious reasons. It also might be a left over after a relatively recent refactoring of how backoff works. I think this was always a little bit underspecified, +1 to clean it up at some point. Also, I think it makes sense to modify the final backoff as max(1s, computed backoff), but this is a detail. @alculquicondor @soltysh wdyt? Maybe we should open a separate Issue to clean up the backoff status?

(2.) yes, I don't think we want to batch (delay) job updates in case of spec updates. For example, spec updates are used to update suspend status, which could result in delayed job start. The PR covers the distinction.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Backoff is used for errors from API calls (like conflicts). It is not used for events (pod creation, updates, etc). I guess a better name for the variable would be useBackoff or have separate functions altogether.
  2. We want to be able to respond to changes to the spec ASAP (new parallelism, suspend, etc).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should open a separate Issue to clean up the backoff status?

Just open a PR with your proposal of what a clean code looks like :)
There is a lot of heritage in the backoff code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just open a PR with your proposal of what a clean code looks like :)

We have one PR currently open for the fix + potentially small cleanup. I will consider another in the future, but some changes may not be just cosmetic, but also touching on semantics (so requiring discussion). Still, I suggest we move this discussion out of KEP to offline, under PRs or under Issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For (1) - clearly backoff isn't used only for errors from API calls, because errors from API calls don't generate events and we have those in event handlers:
https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L318-L321

@mimowo - while opening PR is certainly good thing, can you open an issue to so that we won't forget about it?

For (2) - I agree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@wojtek-t wojtek-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From PRR perspective I'm fine - I wanted to follow up on scalability aspect.

@soltysh
Copy link
Contributor

soltysh commented Jun 6, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 6, 2023
@wojtek-t
Copy link
Member

wojtek-t commented Jun 7, 2023

/lgtm
/approve PRR

/hold
@mimowo - feel free to cancel hold once the issue is opened (and please cc me to that issue)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 7, 2023
@mimowo
Copy link
Contributor Author

mimowo commented Jun 7, 2023

/hold cancel
as the issue is now open: kubernetes/kubernetes#118527

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 7, 2023
@wojtek-t
Copy link
Member

wojtek-t commented Jun 7, 2023

@soltysh - it needs your approval

@soltysh
Copy link
Contributor

soltysh commented Jun 7, 2023

@soltysh - it needs your approval

oh, I thought I already did
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 7, 2023
@k8s-ci-robot k8s-ci-robot merged commit 19a6057 into kubernetes:master Jun 7, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jun 7, 2023
@deads2k
Copy link
Contributor

deads2k commented Jun 7, 2023

spoke on slack and live prior. The API looked good too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. wg/batch Categorizes an issue or PR as relevant to WG Batch.

Projects

Status: API review completed, 1.28

Development

Successfully merging this pull request may close these issues.