-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: delete active jobs right away when job finishes even when TTLSecondsAfterFinished is set #667
fix: delete active jobs right away when job finishes even when TTLSecondsAfterFinished is set #667
Conversation
✅ Deploy Preview for kubernetes-sigs-jobset canceled.
|
…ondsAfterFinished is set change order
c3581cb
to
5c882f9
Compare
@@ -154,6 +154,10 @@ func (r *JobSetReconciler) reconcile(ctx context.Context, js *jobset.JobSet, upd | |||
|
|||
// If JobSet is already completed or failed, clean up active child jobs and requeue if TTLSecondsAfterFinished is set. | |||
if jobSetFinished(js) { | |||
if err := r.deleteJobs(ctx, ownedJobs.active); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this change makes sense, although I think we originally put deleteJobs after the TTL check because users wanted to use TTL to allow time to inspect logs via kubectl before everything is deleted, to debug more easily without having to craft queries in cloud logging systems.
Quickly freeing up resources that may still be being used by a completed/failed JobSet is important as well, but for this purpose the user can just tune their TTL accordingly.
@CecileRobertMichon @kannon92 @ahg-g any thoughts on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sgtm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only deleting active jobs, which I think makes sense.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, CecileRobertMichon The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind bug
What this PR does / why we need it:
When
TTLSecondsAfterFinished
is not set, the active jobs get deleted right away to free up resources. However, whenTTLSecondsAfterFinished
is set, the active jobs and the completed jobs all get deleted after the TTL expires when the jobset itself gets deleted. We should always delete active jobs to free up resources when a jobset is failed or completed, then optionally cleanup the jobset itself ifTTLSecondsAfterFinished
is set.Which issue(s) this PR fixes:
Fixes #666
Special notes for your reviewer:
Does this PR introduce a user-facing change?