Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for ttl after finished controller for cleaning up finished jobsets #374

Closed

Conversation

dejanzele
Copy link
Contributor

CHANGELOG

  • Added TTLSecondsAfterFinished *int32 in JobSetSpec with validation for minimal value 0 and default value 0
  • Added TTLAfterFinished controller which watches for finished JobSets with configured TTL, similarly how it is implemented in Kubernetes for Job controller
  • Added unit and integration tests
  • Edited Dockerfile to include client-go folder in build container

This PR closes #279

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 29, 2023
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dejanzele
Once this PR has been reviewed and has the lgtm label, please assign danielvegamyhre for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Dec 29, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @dejanzele. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 29, 2023
@dejanzele
Copy link
Contributor Author

@kannon92 @danielvegamyhre as I have issues running JobSet API locally on my M1, can you please help me verify this works?

@dejanzele dejanzele force-pushed the feat/ttl-after-finished branch from a52abad to 7b80317 Compare December 29, 2023 10:35
@dejanzele
Copy link
Contributor Author

/cc @danielvegamyhre

@kannon92
Copy link
Contributor

/ok-to-test
/cc @ahg-g

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 29, 2023
@dejanzele dejanzele force-pushed the feat/ttl-after-finished branch from 7b80317 to d714d55 Compare December 30, 2023 17:29
return false
}
return !fresh.DeletionTimestamp.IsZero()
}, 15*time.Second, 500*time.Millisecond).Should(gomega.BeTrue())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a test case to verify jobs and pods are also deleted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't that be tested in the JobSet controller, as it is the responsibility of that controller? TTL After Finished controlled shouldn't know or care about cleanup logic besides marking JobSet for deletion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the integration tests, there is no job controller running, so no pods will exist.

We could add an E2E test case for this, creating a JobSet in an isolated namespace, completing it and letting the TTL expire, then checking that no resources exist in the namespace anymore.

Given this is a pretty large PR with substantial changes like adding a new controller, metrics etc I think an e2e test case would give us more confidence in merging this.

Copy link
Contributor

@danielvegamyhre danielvegamyhre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for working on this. Glad you added metrics for the TTL deletion duration as well, that is a nice addition.

I added some comments, once those are addressed I'll take another look since this is a pretty large PR.

Also, have you manually tested theses changes, or just used unit tests and integration tests so far?

@@ -157,6 +163,22 @@ func setupControllers(mgr ctrl.Manager, certsReady chan struct{}) {
os.Exit(1)
}

clientset := versioned.NewForConfigOrDie(mgr.GetConfig())
sharedInformers := externalversions.NewSharedInformerFactory(clientset, 30*time.Minute)
Copy link
Contributor

@danielvegamyhre danielvegamyhre Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the 30 minute parameter specifying? Can we define the 30 minute variable as a variable somewhere to make it more obvious what the 30 minute duration is referring to without looking up the docs for this function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to use informerfactory, kube-builder sets it up already under the client we get from the manager (mgr.GetClient())

log: log,
}

_, _ = jobSetInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these 2 values being returned that we are ignoring? If one is an error, what reason do we have for ignoring it?

log logr.Logger,
) *TTLAfterFinishedReconciler {
config := workqueue.RateLimitingQueueConfig{Name: "ttl_jobsets_to_delete"}
tc := &TTLAfterFinishedReconciler{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we name this variable something besides tc? We usually use tc as a variable name for test cases so this is a bit confusing since the following lines look like test code at first glance.

Comment on lines +206 to +211
if errors.IsNotFound(err) {
return nil
}
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be simplified to:

	if client.IgnoreNotFound(err) != nil {
		return err
	}

See similar example here:

return ctrl.Result{}, client.IgnoreNotFound(err)

Comment on lines +225 to +230
if errors.IsNotFound(err) {
return nil
}
if err != nil {
return err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified in the same way as suggested in my comment above.

Comment on lines +219 to +222
// The JobSet's TTL is assumed to have expired, but the JobSet TTL might be stale.
// Before deleting the JobSet, do a final sanity check.
// If TTL is modified before we do this check, we cannot be sure if the TTL truly expires.
// The latest JobSet may have a different UID, but it's fine because the checks will be run again.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this, it looks like we are fetching the JobSet and checking the TTL 2 times in a row in the same function, without any logic or delays in between. Why is getting the JobSet state the first time on line 202 not considered "fresh" enough?

}

now := r.clock.Now()
t, e, err := timeLeft(r.log, jobSet, &now)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this variable to something more descriptive than t and e? The corresponding variable names in the timeLeft function itself seem good: remaining and expireAt

return js.Spec.TTLSecondsAfterFinished != nil && jobSetFinished(js)
}

func getFinishAndExpireTime(js *jobsetv1alpha2.JobSet) (*time.Time, *time.Time, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we place the timeLeft function above the getFinishAndExpireTime so the layers of abstraction flow from top to bottom consistently?

return false
}
return !fresh.DeletionTimestamp.IsZero()
}, 15*time.Second, 500*time.Millisecond).Should(gomega.BeTrue())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the integration tests, there is no job controller running, so no pods will exist.

We could add an E2E test case for this, creating a JobSet in an isolated namespace, completing it and letting the TTL expire, then checking that no resources exist in the namespace anymore.

Given this is a pretty large PR with substantial changes like adding a new controller, metrics etc I think an e2e test case would give us more confidence in merging this.

log: ctrl.Log.WithValues("controller", "TTLAfterFinished"),
}

go controller.Run(ctx, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we define 1 as a variable somewhere so it's more clear it is referring to the number of workers, without the reader needing to look up this function and read it's parameters?

This applies to all places where Run() is called.

Copy link
Contributor

@ahg-g ahg-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, but kube-builder offers most the mechanics implemented in ttlafterfinished_controller.go file. I think the queue management stuff are not necessary.

Why not use the JobSet reconciler itself? Simply queue completed jobsets for the time remaining to delete it (using return ctrl.Result{RequeueAfter: timeToDelete} )

@@ -157,6 +163,22 @@ func setupControllers(mgr ctrl.Manager, certsReady chan struct{}) {
os.Exit(1)
}

clientset := versioned.NewForConfigOrDie(mgr.GetConfig())
sharedInformers := externalversions.NewSharedInformerFactory(clientset, 30*time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to use informerfactory, kube-builder sets it up already under the client we get from the manager (mgr.GetClient())

@dejanzele
Copy link
Contributor Author

@ahg-g I based it on the TTL After Finished controller for Jobs in Kubernetes.
I agree it makes sense to reuse more logic from kubebuilder. I'll refactor the PR in this way.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 23, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ahg-g
Copy link
Contributor

ahg-g commented Feb 7, 2024

Hi @dejanzele, are you still working on this?

@ahg-g
Copy link
Contributor

ahg-g commented Feb 10, 2024

closing this due to inactivity

@ahg-g ahg-g closed this Feb 10, 2024
@dejanzele
Copy link
Contributor Author

Hey @ahg-g,

I was away on some other work, I plan to get back to this now if nobody took it in the meantime?

@ahg-g
Copy link
Contributor

ahg-g commented Mar 7, 2024

@dejanzele sounds good, looking forward to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JobSet TTL to clean up completed workloads
5 participants