feat: add support for ttl cleanup for finished jobsets #443

Merged

dejanzele
Contributor

CHANGELOG

  • Added TTLSecondsAfterFinished *int32 to JobSetSpec, with validation enforcing a minimum value of 0
  • Ran make manifests and make generate to update the autogenerated code
  • Updated the JobSet controller to clean up the JobSet after the TTL expires
  • Added unit and integration tests
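
For readers skimming the changelog, here is a minimal sketch of what an optional TTL field with a minimum of 0 can look like. The trimmed `JobSetSpec`, the JSON tag, and the `validateTTL` helper are assumptions for illustration; the PR itself is expected to rely on generated CRD validation rather than a hand-written check.

```go
package main

import (
	"errors"
	"fmt"
)

// JobSetSpec is a trimmed, hypothetical stand-in for the real spec type;
// only the field discussed in this PR is shown.
type JobSetSpec struct {
	// TTLSecondsAfterFinished limits the lifetime of a finished JobSet.
	// A nil value means no TTL is configured.
	// The JSON tag below is an assumption, not copied from the PR.
	// +kubebuilder:validation:Minimum=0
	// +optional
	TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
}

// validateTTL mirrors the "minimum value 0" rule in plain Go.
// It is an illustrative helper, not part of the PR.
func validateTTL(spec JobSetSpec) error {
	if spec.TTLSecondsAfterFinished != nil && *spec.TTLSecondsAfterFinished < 0 {
		return errors.New("ttlSecondsAfterFinished must be >= 0")
	}
	return nil
}

func main() {
	negative := int32(-5)
	fmt.Println(validateTTL(JobSetSpec{TTLSecondsAfterFinished: &negative}))
	fmt.Println(validateTTL(JobSetSpec{})) // unset TTL is valid
}
```

Using a pointer lets the API distinguish "no TTL configured" (nil) from "delete immediately after finishing" (0).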

This PR closes #279

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 10, 2024
@k8s-ci-robot
Contributor

Hi @dejanzele. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kannon92 March 10, 2024 19:21
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 10, 2024
netlify bot commented Mar 10, 2024

Deploy Preview for kubernetes-sigs-jobset canceled.

Name Link
🔨 Latest commit 849ba92
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66149e2c02519d0008ceadf8

@dejanzele
Contributor Author

/cc @kannon92 @ahg-g @danielvegamyhre

@dejanzele
Contributor Author

Reference to the closed PR with original comments - #374

Should I add the metrics here also (I added the jobset_deletion_duration_seconds in the closed PR)?

Contributor

@danielvegamyhre danielvegamyhre left a comment

Thanks for working on this @dejanzele! I did a first pass reviewing the PR, once those comments are addressed I'll take another look.

if jobSetFinished(&js) {
	if err := r.deleteJobs(ctx, ownedJobs.active); err != nil {
		log.Error(err, "deleting jobs")
		return ctrl.Result{}, err
	}
	if js.Spec.TTLSecondsAfterFinished != nil {
Contributor

If TTLSecondsAfterFinished is set, we want to wait to delete child Jobs until after the TTL has expired. This gives the user time to inspect/debug the pods before everything is deleted, without having to formulate queries against cloud (or on-prem) log storage.

Contributor Author

Makes sense. I moved the job deletion so it runs below the TTL check.
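
The ordering discussed in this thread can be sketched as a small decision function. This is only an illustration under stated assumptions: the `action` type, its constants, and `nextAction` are hypothetical names, and the real controller works with requeue durations and API calls rather than labels.

```go
package main

import "fmt"

// action is a hypothetical label for what the reconciler does with a finished JobSet.
type action string

const (
	requeueUntilTTL  action = "requeue-until-ttl"  // TTL pending: keep child Jobs around for debugging
	deleteJobSet     action = "delete-jobset"      // TTL expired: remove the JobSet itself
	deleteActiveJobs action = "delete-active-jobs" // no TTL: only clean up still-active Jobs
)

// nextAction assumes the JobSet has already finished.
// The TTL is evaluated first so child Jobs are not deleted while the TTL is pending.
func nextAction(ttlSet, ttlExpired bool) action {
	switch {
	case ttlSet && !ttlExpired:
		return requeueUntilTTL
	case ttlSet && ttlExpired:
		return deleteJobSet
	default:
		return deleteActiveJobs
	}
}

func main() {
	fmt.Println(nextAction(true, false))
	fmt.Println(nextAction(true, true))
	fmt.Println(nextAction(false, false))
}
```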

@danielvegamyhre
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 11, 2024
@danielvegamyhre
Contributor

@dejanzele are you still working on this?

@dejanzele
Contributor Author

@dejanzele are you still working on this?

Yes, I was away for KubeCon. I can address your comments now.

@dejanzele dejanzele force-pushed the feat/jobset-ttl-after-finished branch 2 times, most recently from 751be2f to 463ae7a Compare March 26, 2024 12:32
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 28, 2024
@dejanzele dejanzele force-pushed the feat/jobset-ttl-after-finished branch 2 times, most recently from 60b70e3 to 586fa41 Compare March 29, 2024 17:37
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Mar 29, 2024
@dejanzele dejanzele force-pushed the feat/jobset-ttl-after-finished branch from 586fa41 to 15fcc39 Compare March 31, 2024 22:11
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2024
@danielvegamyhre
Contributor

I will take another look at this today

@danielvegamyhre
Contributor

@dejanzele can you add an example JobSet yaml to the examples/simple/ folder which utilizes this feature?

@ahg-g
Contributor

ahg-g commented Apr 1, 2024

/hold

I will take a quick look

@dejanzele dejanzele force-pushed the feat/jobset-ttl-after-finished branch from 10a2309 to 6b60eed Compare April 5, 2024 21:35
Contributor

@danielvegamyhre danielvegamyhre left a comment

LGTM, one last minor comment.

@dejanzele dejanzele force-pushed the feat/jobset-ttl-after-finished branch from 6b60eed to ef0e4c6 Compare April 8, 2024 14:04
@dejanzele
Contributor Author

@danielvegamyhre @ahg-g I think all of your comments have been addressed.

@danielvegamyhre
Contributor

/lgtm

I'll leave approval to @ahg-g so he can confirm his comments have been resolved.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 8, 2024
// If the JobSet has expired, it deletes the JobSet.
// If the JobSet has not expired, it returns the time after which the JobSet should be requeued.
// If the JobSet does not have a TTLSecondsAfterFinished set, it returns 0.
func (r *JobSetReconciler) executeTTLAfterFinishedPolicy(ctx context.Context, js *jobset.JobSet) (time.Duration, error) {
Contributor

I suggest we not make this a method on the reconciler, since it now lives in a different file (same package, but still a different file). You can simply pass in the clock as a parameter.

Contributor Author

I also added unit tests using the fake client

Comment on lines 1251 to 1252
js1 := testJobSet(ns1).TTLSecondsAfterFinished(2).Obj()
js2 := testJobSet(ns2).TTLSecondsAfterFinished(2).Obj()
Contributor

I think a better test would have one JobSet with a TTL set and one without.

Contributor Author

agreed

// The following 2 checks do sanity checking for nil pointers in case of changes to the above function.
// This logic should never be executed.
if now == nil || finishAt == nil || expireAt == nil {
	log.V(2).Info("Warning: Calculated invalid expiration time. JobSet cleanup will be deferred.")
Contributor

Please use Error instead of Info here, and remove the "Warning:" prefix.

Contributor Author

Should I log it as an error or return an error?

I'm not sure whether we should fail reconciliation if we hit this branch; in theory this logic should never execute.

Contributor

I would actually return an error here

@dejanzele dejanzele force-pushed the feat/jobset-ttl-after-finished branch from ef0e4c6 to 849ba92 Compare April 9, 2024 01:47
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 9, 2024
@dejanzele
Contributor Author

@danielvegamyhre @ahg-g is there a chance this feature makes it into the v0.5.0 release?

@kannon92
Contributor

kannon92 commented Apr 9, 2024

@danielvegamyhre @ahg-g is there a chance this feature makes it into the v0.5.0 release?

I've been lightly following this. We called it out in the release tracking issue, and the PR seems almost there, so I think it's possible. It's the last item for the release.

@danielvegamyhre
Contributor

@danielvegamyhre @ahg-g is there a chance this feature makes it into the v0.5.0 release?

I think so, we plan to cut the release tomorrow but we can wait an extra day or two if necessary.

@dejanzele
Contributor Author

Cool, I think all comments are addressed

@ahg-g
Contributor

ahg-g commented Apr 9, 2024

/label tide/merge-method-squash

/lgtm
/approve

Thanks!

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Apr 9, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 9, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, dejanzele

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 9, 2024
@dejanzele
Contributor Author

@ahg-g I think you need to unhold also

@ahg-g
Contributor

ahg-g commented Apr 10, 2024

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2024
@k8s-ci-robot k8s-ci-robot merged commit 547c8eb into kubernetes-sigs:main Apr 10, 2024
12 checks passed
testutil.JobSetCompleted(ctx, k8sClient, js2, timeout)

// Verify active jobs have been deleted after ttl has passed.
testutil.ExpectJobsDeletionTimestamp(ctx, k8sClient, js1, testutil.NumExpectedJobs(js1)-1, timeout)
Contributor

Why are we just checking for the timestamp instead of checking that the jobs are actually deleted?

if err := k8sClient.Get(ctx, client.ObjectKeyFromObject(js2), &fresh2); err != nil {
	return false
}
return !fresh1.DeletionTimestamp.IsZero() && fresh2.DeletionTimestamp.IsZero()
Contributor

Same here: why are we only checking that the deletion timestamp is set instead of checking that the JobSet is not found?
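
To make the distinction in these two review comments concrete, here is a self-contained sketch contrasting "deletion timestamp is set" with "object is gone". Everything in it (`object`, `getter`, `errNotFound`, and both helper functions) is a hypothetical stand-in; against a real cluster the second check would typically use `apierrors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors` on the error returned by the client's `Get`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for an API "not found" error in this sketch.
var errNotFound = errors.New("not found")

// object is a hypothetical, minimal view of an API object.
type object struct{ deletionTimestampSet bool }

// getter is a hypothetical stand-in for a client Get call.
type getter func(name string) (*object, error)

// markedForDeletion is the weaker check: the object still exists but deletion has been requested.
func markedForDeletion(get getter, name string) bool {
	obj, err := get(name)
	return err == nil && obj.deletionTimestampSet
}

// fullyDeleted is the stronger check: the object can no longer be fetched at all.
func fullyDeleted(get getter, name string) bool {
	_, err := get(name)
	return errors.Is(err, errNotFound)
}

func main() {
	store := map[string]*object{"js1": {deletionTimestampSet: true}}
	get := func(name string) (*object, error) {
		if obj, ok := store[name]; ok {
			return obj, nil
		}
		return nil, errNotFound
	}

	fmt.Println(markedForDeletion(get, "js1"), fullyDeleted(get, "js1"))
	delete(store, "js1") // simulate deletion completing
	fmt.Println(markedForDeletion(get, "js1"), fullyDeleted(get, "js1"))
}
```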

@danielvegamyhre danielvegamyhre mentioned this pull request Apr 15, 2024
20 tasks
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Development

Successfully merging this pull request may close these issues.

JobSet TTL to clean up completed workloads
5 participants