Prefactor works of implementing job interface #589

Conversation

kerthcet
Contributor

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

prefactor work before #544

The job controller workflow would look like:

  1. handle a finished job
  2. handle a nil workload
  3. process logic that applies only to the main job
  4. handle a suspended job
  5. handle an unsuspended job
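
The ordering above can be sketched as a small Go function. This is only an illustration under stated assumptions: `step`, `jobState`, and `plan` are hypothetical names, not Kueue types, and it assumes the first two branches return early while the remaining ones fall through.

```go
package main

// step labels one branch of the reconcile flow. The names are hypothetical.
type step string

const (
	stepFinished    step = "finished"
	stepNilWorkload step = "nil-workload"
	stepMainJobOnly step = "main-job-only"
	stepSuspended   step = "suspended"
	stepUnsuspended step = "unsuspended"
)

// jobState is a simplified stand-in for what the reconciler observes;
// it is not a Kueue type.
type jobState struct {
	finished    bool
	hasWorkload bool
	isMainJob   bool
	suspended   bool
}

// plan returns the branches a single reconcile pass would visit, in order.
// Assumption: the first two branches return early, the rest fall through.
func plan(s jobState) []step {
	if s.finished {
		return []step{stepFinished}
	}
	if !s.hasWorkload {
		return []step{stepNilWorkload}
	}
	var steps []step
	if s.isMainJob {
		steps = append(steps, stepMainJobOnly)
	}
	if s.suspended {
		return append(steps, stepSuspended)
	}
	return append(steps, stepUnsuspended)
}
```

For example, `plan(jobState{hasWorkload: true, isMainJob: true, suspended: true})` visits the main-job-only branch and then the suspended branch.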

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

@k8s-ci-robot k8s-ci-robot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Feb 21, 2023

netlify bot commented Feb 21, 2023

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit e33b886
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6400112b44a3ab0008bb81df

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 21, 2023
@kerthcet
Contributor Author

cc @mimowo

Contributor

@mimowo mimowo left a comment

/lgtm
/assign @alculquicondor
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 21, 2023
@@ -210,39 +207,41 @@ func (r *JobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.R
log := ctrl.LoggerFrom(ctx).WithValues("job", klog.KObj(&job))
ctx = ctrl.LoggerInto(ctx, log)

pwName := parentWorkload(&job)
isMainJob := parentWorkloadName(&job) == ""
Contributor

isStandaloneJob

Contributor Author

done


// 3. handle workload is nil.
if wl == nil {
if !isMainJob {
return ctrl.Result{}, nil
Contributor

In this case, shouldn't we suspend the dependent job?

So maybe it should be if wl == nil && isStandaloneJob, and below (step 6) it should say if wl == nil || wl.Spec.Admission == nil.
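
A minimal sketch of the two suggested conditions, written against simplified stand-in types rather than the real Kueue Workload API. The helper names `shouldCreateWorkload` and `notAdmitted` are hypothetical; only the boolean expressions come from the comment above.

```go
package main

// admission, workloadSpec and workload are simplified stand-ins for the
// Kueue Workload API; only the field named in the comment is modelled.
type admission struct{}

type workloadSpec struct {
	Admission *admission
}

type workload struct {
	Spec workloadSpec
}

// shouldCreateWorkload mirrors the first suggested condition: only a
// standalone job without a workload takes the "workload is nil" branch.
func shouldCreateWorkload(wl *workload, isStandaloneJob bool) bool {
	return wl == nil && isStandaloneJob
}

// notAdmitted mirrors the second suggested condition. The nil check comes
// first, so wl.Spec is never read through a nil pointer.
func notAdmitted(wl *workload) bool {
	return wl == nil || wl.Spec.Admission == nil
}
```

With this shape, a dependent job whose workload is missing skips the first branch and is still caught by `notAdmitted(nil)`, which is the behavior the comment asks for.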

Contributor Author

@kerthcet kerthcet Feb 22, 2023

We only create the workload here and leave the suspend logic to a later reconcile. Passing a nil workload further down is somewhat dangerous.

After creating the workload:

  1. if the dependent job is suspended, jump to step 5.
  2. if the dependent job is unsuspended, jump to step 6.

Contributor Author

@kerthcet kerthcet Feb 22, 2023

But there is one corner case: both the parent and child jobs are unsuspended, and then we delete the workload manually. Then the workload is either:

  • admitted immediately again with the same assignment: it works well
  • admitted immediately again with a different assignment: the node affinity may have changed, so the jobs do not work as expected
  • not admitted: the jobs will eventually be suspended.

So we may have a potential bug here. We could suspend all the jobs here when the workload is nil.
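
The three hypothesized cases can be written down as a small classifier. This is a sketch only: `outcome` and `classify` are hypothetical names, and an assignment is reduced to a plain string for illustration.

```go
package main

// outcome labels the three cases listed above. The names are hypothetical.
type outcome string

const (
	keepsRunning   outcome = "works-well"
	staleAffinity  outcome = "node-affinity-may-be-stale"
	suspendedLater outcome = "suspended-eventually"
)

// classify maps "was the workload admitted again, and with which
// assignment" to one of the three cases. Assignments are plain strings
// here, which is an assumption made to keep the sketch small.
func classify(admittedAgain bool, oldAssignment, newAssignment string) outcome {
	if !admittedAgain {
		return suspendedLater
	}
	if oldAssignment == newAssignment {
		return keepsRunning
	}
	return staleAffinity
}
```

Only the middle case, a re-admission with a different assignment while the jobs keep running, is the one described as a potential bug.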

Contributor

it could also be the case that somehow the parent job controller is crashing (and not able to create a workload) and somehow a user set the child job to suspend=false. We don't want that child job to continue running.
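
The guard this scenario calls for can be sketched as a single predicate. The `childJob` type and `mustStop` helper are hypothetical and only model the two facts mentioned in the comment; this is not the controller's actual code.

```go
package main

// childJob is a hypothetical, simplified view of a job owned by a parent.
type childJob struct {
	suspended            bool // mirrors the job's suspend flag
	parentWorkloadExists bool // whether the parent's workload was found
}

// mustStop reports whether a child job has to be suspended: it is running
// (suspend=false) although no parent workload exists to account for it.
func mustStop(j childJob) bool {
	return !j.suspended && !j.parentWorkloadExists
}
```

A child job that a user flipped to suspend=false while the parent workload is missing yields `mustStop(...) == true`; an already suspended child needs no action.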

Contributor Author

Opened a ticket for this: #595

Contributor

@mimowo mimowo Feb 27, 2023

@kerthcet can you double-check that this is currently an issue? We have this integration test:

ginkgo.It("Should suspend a job if the parent workload does not exist", func() {

IIUC the code to ensure that is here: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/workload/job/job_controller.go#L483-L495

Contributor Author

Sure, I'll take charge of this.

Contributor

I agree. This shouldn't be a problem in the existing code.

Contributor Author

Yes, I think it will not happen. I overcomplicated the case of "admitted immediately again with a different assignment: the node affinity may be changed so not working as expected". The admitted workload has to be created first, and at that moment the job will be suspended.

Contributor Author

it could also be the case that somehow the parent job controller is crashing (and not able to create a workload) and somehow a user set the child job to suspend=false. We don't want that child job to continue running.

This will also be covered by https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/workload/job/job_controller.go#L483-L495

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2023
Contributor

@alculquicondor alculquicondor left a comment

/assign @mimowo



@kerthcet
Contributor Author

Updated, PTAL.

pkg/controller/workload/job/job_controller.go
log.Error(err, "Unable to list child workloads")
return ctrl.Result{}, err
}

// 1. make sure there is only a single existing instance of the workload
Contributor

Add a comment here that if there was no workload, this will suspend the job.

Although I find it confusing. Maybe we should move that part out of the function and call stopJob here.

Contributor Author

That's the kind of thing I've already done in the job interface library; let's revisit this part in that one.

@alculquicondor
Contributor

alculquicondor commented Mar 1, 2023

/approve
@kerthcet just a couple of nits

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, kerthcet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2023
Signed-off-by: Kante Yin <kerthcet@gmail.com>
@kerthcet kerthcet force-pushed the cleanup/preparation-to-job-interface branch from 6e7146c to e33b886 Compare March 2, 2023 02:59
@kerthcet
Contributor Author

kerthcet commented Mar 2, 2023

ping @mimowo for LGTM

@mimowo
Contributor

mimowo commented Mar 2, 2023

/lgtm
Thanks!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2023
@k8s-ci-robot k8s-ci-robot merged commit 271e66a into kubernetes-sigs:main Mar 2, 2023
@kerthcet kerthcet deleted the cleanup/preparation-to-job-interface branch March 2, 2023 08:20
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants