-
Notifications
You must be signed in to change notification settings - Fork 296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prefactor works of implementing job interface #589
Prefactor works of implementing job interface #589
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
cc @mimowo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/assign @alculquicondor
Thanks!
@@ -210,39 +207,41 @@ func (r *JobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.R | |||
log := ctrl.LoggerFrom(ctx).WithValues("job", klog.KObj(&job)) | |||
ctx = ctrl.LoggerInto(ctx, log) | |||
|
|||
pwName := parentWorkload(&job) | |||
isMainJob := parentWorkloadName(&job) == "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isStandaloneJob
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
// 3. handle workload is nil. | ||
if wl == nil { | ||
if !isMainJob { | ||
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, shouldn't we should suspend the dependent job?
so maybe it should be if wl == nil && isStandaloneJob
and below (step 6) it should say if wl == nil || wl.Spec.Admission == nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only create workload here, leaving the suspend logics to another reconciler. Passing a nil workload downwards is somehow dangerous.
After creating the workload:
- if dependent job is suspended, jump to step5.
- if dependent job is unsuspended, jump to step6.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But one corner case is both parent and child jobs are unsuspended, then we delete the workload manually. Then if the workload:
- admitted immediately again with the same assignment: it works well
- admitted immediately again with a different assignment: the node affinity may be changed so not working as expected
- unadmitted, jobs will be suspended finally.
Then we may have a potential bug here. We can suspend all the jobs here when workload is nil .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could also be the case that somehow the parent job controller is crashing (and not able to create a workload) and somehow a user set the child job to suspend=false. We don't want that child job to continue running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
open a ticket here #595
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kerthcet can you double check this is an issue currently, we have this integration test:
ginkgo.It("Should suspend a job if the parent workload does not exist", func() { |
IIUC the code to ensure that is here: https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/workload/job/job_controller.go#L483-L495
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'll take charge of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. This shouldn't be a problem in the existing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it will not happen, I just complicated the situation about admitted immediately again with a different assignment: the node affinity may be changed so not working as expected
, the admitted workload should be create at first, then at that moment, the job will be suspended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could also be the case that somehow the parent job controller is crashing (and not able to create a workload) and somehow a user set the child job to suspend=false. We don't want that child job to continue running.
This will also be covered by https://github.com/kubernetes-sigs/kueue/blob/main/pkg/controller/workload/job/job_controller.go#L483-L495
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/assign @mimowo
|
||
// 3. handle workload is nil. | ||
if wl == nil { | ||
if !isMainJob { | ||
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it could also be the case that somehow the parent job controller is crashing (and not able to create a workload) and somehow a user set the child job to suspend=false. We don't want that child job to continue running.
0d25130
to
6e7146c
Compare
Updated, PTAL. |
log.Error(err, "Unable to list child workloads") | ||
return ctrl.Result{}, err | ||
} | ||
|
||
// 1. make sure there is only a single existing instance of the workload |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment here that if there was no workload, this will suspend the job.
Although I find it confusing. Maybe we should move that part out of the function and call stopJob
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's kind of thing I already done in job interface library, let's visit this part on that one.
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, kerthcet The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Signed-off-by: Kante Yin <kerthcet@gmail.com>
6e7146c
to
e33b886
Compare
ping @mimowo for LGTM |
/lgtm |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
prefactor work before #544
The job controller work flow would like:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: