Backoff limit per Job Index #3967
Conversation
Force-pushed from a739ffc to 5230713
Force-pushed from 451acea to 7349dc4
/wg batch
Force-pushed from 960cff1 to bc634d9
Force-pushed from 03c5335 to cd0d8a0
soltysh left a comment:
/lgtm
keps/sig-apps/3850-backoff-limits-per-index-for-indexed-jobs/README.md (outdated review thread, resolved)
Force-pushed from 16cf877 to 1f8696b
Force-pushed from 1f8696b to 30f964f
wojtek-t left a comment:
From the PRR perspective I'm fine - I wanted to follow up on the scalability aspect.
> the exponential backoff delay hasn't elapsed for any index (allowing pod
> recreation), then we requeue the next Job status update. The delay is computed
> as the minimum of all delays computed for all indexes requiring pod recreation,
> but not less than 1s.
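As an editorial illustration of the rule quoted above (not text from the KEP and not the controller's actual code; the function and parameter names are made up), the requeue delay could be computed roughly like this in Go:

```go
package sketch

import "time"

// requeueDelay illustrates the quoted rule: requeue the Job status update
// after the minimum of the per-index backoff delays that have not yet
// elapsed, but never sooner than 1 second.
func requeueDelay(perIndexDelays []time.Duration) time.Duration {
	if len(perIndexDelays) == 0 {
		return time.Second
	}
	minDelay := perIndexDelays[0]
	for _, d := range perIndexDelays[1:] {
		if d < minDelay {
			minDelay = d
		}
	}
	if minDelay < time.Second {
		minDelay = time.Second
	}
	return minDelay
}
```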
This is not strictly related to PRR; rather, I'm wearing my scalability hat now.
[I would also be happy to discuss it separately if preferred.]
My concern is purely code-related - you have this magic `immediate` parameter in:
https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L499
I have two concerns:
- why we want to use a different, separate backoff in those cases (one that effectively starts from 0 instead of 1s)
- why we don't do any batching when the Job object itself is changing - but I think the new PR (which skips batching only when the job generation changes) seems to address this part
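For context on the "starts from 0 instead of 1s" point, here is a minimal sketch (my illustration only, not the job controller's code) of an exponential backoff that throttles pod recreation after consecutive failures; the base and cap values are assumptions, not values taken from this thread:

```go
package sketch

import "time"

// podRecreationBackoff doubles the delay with each consecutive failure,
// starting from a non-zero base and capped at maxDelay. With base = 1s,
// the first retry already waits 1s rather than 0.
func podRecreationBackoff(consecutiveFailures int, base, maxDelay time.Duration) time.Duration {
	if consecutiveFailures == 0 {
		return 0 // the very first creation attempt is not delayed
	}
	delay := base
	for i := 1; i < consecutiveFailures; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	return delay
}
```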
(1.) I agree. IIUC, the idea of the backoff is to throttle pod creation in the case of consecutive failures. This is already covered here: https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L1415-L1419. However, currently it is also used for throttling in other places, for non-obvious reasons. It might also be a leftover from a relatively recent refactoring of how the backoff works. I think this was always a little bit underspecified, so +1 to clean it up at some point. Also, I think it makes sense to compute the final backoff as max(1s, computed backoff), but this is a detail. @alculquicondor @soltysh wdyt? Maybe we should open a separate Issue to clean up the backoff status?
(2.) Yes, I don't think we want to batch (delay) handling of job updates in the case of spec updates. For example, spec updates are used to update the suspend status, and delaying them could result in a delayed job start. The PR covers this distinction.
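A sketch of the distinction from (2.), assuming a hypothetical update handler and enqueue helpers (these names are not from kubernetes/kubernetes): a spec edit bumps metadata.generation and is synced right away, while status-only updates can be batched.

```go
package sketch

import batchv1 "k8s.io/api/batch/v1"

// handleJobUpdate routes a Job update either to an immediate sync (spec
// changed, e.g. suspend flipped or parallelism edited) or to a batched,
// possibly delayed sync (status-only change, same generation).
func handleJobUpdate(oldJob, curJob *batchv1.Job, enqueueImmediate, enqueueBatched func(*batchv1.Job)) {
	if oldJob.Generation != curJob.Generation {
		enqueueImmediate(curJob)
		return
	}
	enqueueBatched(curJob)
}
```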
- Backoff is used for errors from API calls (like conflicts). It is not used for events (pod creation, updates, etc.). I guess a better name for the variable would be `useBackoff`, or we could have separate functions altogether.
- We want to be able to respond to changes to the spec ASAP (new parallelism, suspend, etc.).
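To make the "separate functions altogether" idea concrete, a hedged sketch with made-up names (not a patch proposal): two thin wrappers over the rate-limited work queue express the intent instead of a boolean parameter.

```go
package sketch

import "k8s.io/client-go/util/workqueue"

type jobController struct {
	queue workqueue.RateLimitingInterface
}

// enqueueWithBackoff is for retries after API errors (e.g. conflicts); the
// rate limiter grows the delay on repeated failures for the same key.
func (c *jobController) enqueueWithBackoff(key string) {
	c.queue.AddRateLimited(key)
}

// enqueueImmediately is for reacting to events (pod created/updated, spec
// changed) where no extra throttling is wanted.
func (c *jobController) enqueueImmediately(key string) {
	c.queue.Add(key)
}
```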
> Maybe we should open a separate Issue to clean up the backoff status?

Just open a PR with your proposal of what clean code looks like :)
There is a lot of heritage in the backoff code.
> Just open a PR with your proposal of what clean code looks like :)

We have one PR currently open for the fix plus a potential small cleanup. I will consider another one in the future, but some changes may not be just cosmetic - they may also touch on semantics (and so require discussion). Still, I suggest we move this discussion out of the KEP and continue it offline, under PRs or Issues.
For (1) - clearly the backoff isn't used only for errors from API calls, because errors from API calls don't generate events, yet we use it in the event handlers:
https://github.com/kubernetes/kubernetes/blob/6195f96e56ee1e9f52986a0e768e22ca0d1949d6/pkg/controller/job/job_controller.go#L318-L321
@mimowo - while opening a PR is certainly a good thing, can you open an issue so that we won't forget about it?
For (2) - I agree.
@wojtek-t sure, opened: kubernetes/kubernetes#118527
/lgtm
/lgtm
/hold
/hold cancel
@soltysh - it needs your approval
oh, I thought I already did
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mimowo, soltysh, wojtek-t
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Spoke on Slack and live prior. The API looked good too.