Record injected node affinity in batch Job #518
Comments
cc @ahg-g as the first author.
My concern is that this will add another update request for every job. Is this cost justified?
It can be the same API call that updates the Job spec. However, maybe we should store the original node selector instead of the injected one.
Note that when the job is suspended, the controller will reset the nodeSelector on the job:
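The referenced snippet is not captured here; as a rough illustration only, a minimal sketch of that reset, assuming the original selector has already been recovered from somewhere (the function and variable names are made up for this sketch and are not Kueue's actual code):

```go
// Illustrative only: restore the Job's nodeSelector while it is suspended,
// assuming originalNodeSelector was recovered from the Workload (or elsewhere).
package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func restoreNodeSelector(ctx context.Context, c client.Client, job *batchv1.Job, originalNodeSelector map[string]string) error {
	if job.Spec.Suspend == nil || !*job.Spec.Suspend {
		// Only touch the pod template while the Job is suspended.
		return nil
	}
	patch := client.MergeFrom(job.DeepCopy())
	job.Spec.Template.Spec.NodeSelector = originalNodeSelector
	// The selector reset can ride in the same patch that flips suspend,
	// so it does not have to cost an extra API request.
	return c.Patch(ctx, job, patch)
}
```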
Yes, but there is a corner case: the job is unsuspended, but the workload gets deleted unexpectedly (for example, manually), and then the job keeps a wrongly configured nodeSelector.
/assign
Right, it is ok to record the original nodeSelector as long as it is done in the same nodeSelector update request, but I think it is worth having a discussion on whether we want to tie the workload's lifecycle to the job using finalizers.
Deleting a Workload seems like an important tool for forcing a requeue, and adding finalizers could further complicate this use case.
Add a finalizer to the workload; when the workload is about to be deleted, restore the node selector on the Job first. That seems more convincing. I prefer not to use an annotation if we can avoid it. But yes, we will have to handle the terminating workload in job reconciliation.
@mimowo could you take on a review for a future PR on this, in the context of the job integration framework?
/assign
Sure.
I'm now wondering whether we should revert this, as we have a growing number of annotations to support partial admission. In support of this feature, we gain resiliency: we can lose the Workload object and still recover the job. However, is it worth it?
I think we can just document that users (including admins) shouldn't remove a Workload object.
/reopen
@alculquicondor: Reopened this issue.
Or we could leverage finalizers here: when deleting jobs, we'd restore the node affinity and then remove the finalizer.
The finalizers alternative requires careful thinking too. We don't want to accidentally leave objects with finalizers. The problem here is that the Job object is a parent of the Workload object. As such, the Job can't be deleted unless the Workload is deleted first. Then we have a circular dependency. Another alternative is that a Workload has a finalizer only while it's admitted. But the complication is that finalizers are not part of the status, so we need additional API calls. Not sure if it's worth the effort, but worth exploring.
RE: it's not deleting the job but deleting the workload. The reason we can't restore the node affinity is that the workload might be deleted accidentally. Now we add the finalizer to the workload; when we want to delete the workload, we first restore the Job, then remove the finalizer, and finally the workload gets deleted (see the sketch below).
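A rough sketch of that flow, assuming a finalizer is added to the Workload at admission time; the finalizer name and helper below are invented for illustration and are not Kueue's actual implementation:

```go
// Hypothetical finalizer flow: restore the parent Job first, then release the
// finalizer so the Workload deletion can complete.
package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const workloadFinalizer = "kueue.x-k8s.io/restore-parent-job" // hypothetical name

// handleWorkloadDeletion runs when the Workload carries a deletion timestamp.
func handleWorkloadDeletion(ctx context.Context, c client.Client, wl client.Object, job *batchv1.Job, original map[string]string) error {
	if !controllerutil.ContainsFinalizer(wl, workloadFinalizer) {
		return nil
	}
	// Restore the Job's original selector before letting the Workload go away.
	jobPatch := client.MergeFrom(job.DeepCopy())
	job.Spec.Template.Spec.NodeSelector = original
	if err := c.Patch(ctx, job, jobPatch); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(wl, workloadFinalizer)
	return c.Update(ctx, wl)
}
```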
Yea... Our annotations are so big...
That probably works fine. But we need to evaluate whether it is worth doing given the additional API calls, as @alculquicondor says.
An alternative could be to have an additional custom resource just to back up selectors and counts (see the sketch below). This resource would be owned by the job, and since we create or update it before unsuspending the job, we do not need to keep track of its lifecycle. We will have additional API calls, but they should not trigger any controller.
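Purely as an illustration of that idea (none of these type or field names exist in Kueue), the backup object could be as small as:

```go
// Hypothetical "backup" resource owned by the Job: it only stores the original
// selectors and counts so they can be restored on suspend; garbage collection
// via the owner reference cleans it up together with the Job.
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type JobSchedulingBackup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Original nodeSelector as it was before Kueue injected anything.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	// Original counts, relevant once partial admission mutates them.
	Parallelism *int32 `json:"parallelism,omitempty"`
	Completions *int32 `json:"completions,omitempty"`
}
```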
Another CRD would have the same issue about needing a finalizer. The Workload is already an object that end-users shouldn't have permission to edit or delete. I prefer we get rid of the annotation, without any finalizer in the Workload, and re-evaluate in the future if we find a use case where end-users need to modify the Workload object.
Not exactly; there is no delete conditioning: when the job gets deleted, so is the "backup" resource.
The Job also owns the Workload. So when we delete the Job, the Workload gets deleted as well. The problem is what happens if someone (not Kueue) deletes the Workload prematurely.
Yes, the problem is when the workload gets deleted before the restore; in that case the backup resource will still exist, and the restore can be done from it. When the job gets deleted ... we don't actually care what happens to the selectors and counts.
But what's the difference between the "backup" resource and the Workload? They are both subject to unauthorized deletion. I don't see any difference, so I'd rather have one object.
The key concern here is that end-users might accidentally remove the resources storing the original job information.
I don't see accidental removal as a real problem; the chances of it happening are the same as an accidental removal of the job. However, with a different resource, a queue administrator could use RBAC to make sure that end-users cannot accidentally remove the resources storing the original job information. During the review of the original implementation, I think, workload deletion was presented as a valid way to requeue.
Right... that's an easy way of evicting a workload and putting it at the front of the queue. The alternative would be that the administrator only deletes the ... A finalizer would still be a more performant solution than having a second object. The question would be which controller removes the finalizer?
Does that mean we add that note to the documentation?
yes
Agree. Additionally, we might want to consider another way to re-enqueue the job.
SGTM, the ROI is high 😄
Does it make sense to consider a knob in the Kueue configuration that controls whether to store the annotation? Some users wouldn't be concerned about Job size (reasons may vary: 1. using non-indexed jobs, 2. using small node selectors, or 3. using indexed jobs with small parameters), but may be concerned about losing track of node selectors.
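For illustration only, such a knob might look like the following; this field does not exist in Kueue's Configuration API and the name is made up:

```go
// Hypothetical configuration field controlling whether the original
// nodeSelector is persisted as a Job annotation.
package sketch

type JobIntegrationOptions struct {
	// StoreOriginalNodeSelector records the pre-admission nodeSelector in a
	// Job annotation so it can be restored even if the Workload is lost.
	// Users worried about Job object size could leave it disabled.
	StoreOriginalNodeSelector bool `json:"storeOriginalNodeSelector,omitempty"`
}
```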
I think the motivation here is to avoid using too many annotations from the POV of Kueue. If we have the ambition to push this upstream, it will be a stumbling block. 🥲
I've opened a new PR for this: #834. It's still in draft since it is developed on top of #771. More interesting for this discussion are:
Please have a look and let me know what you think.
/assign
I prefer we don't maintain such a piece of code. We are also risking that the annotation changes name/contents from one version to the next, as we do more mutations during admission.
What would you like to be added:
When we want to suspend a Job, we'd like to restore the original nodeAffinity, but sometimes we can't find the derived workload; see
kueue/pkg/controller/workload/job/job_controller.go
Lines 397 to 402 in 045697c
I'd like to add the nodeAffinity to the Job annotations to make this reliable. It would look like:
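The original example is not captured here; a hypothetical shape, with a made-up annotation key, could be:

```go
// Hypothetical illustration: serialize the original nodeSelector into a Job
// annotation before injecting the flavor's selector. The annotation key is
// invented for this sketch and is not an existing Kueue annotation.
package sketch

import (
	"encoding/json"

	batchv1 "k8s.io/api/batch/v1"
)

const originalNodeSelectorAnnotation = "kueue.x-k8s.io/original-node-selector" // hypothetical

func recordOriginalNodeSelector(job *batchv1.Job) error {
	raw, err := json.Marshal(job.Spec.Template.Spec.NodeSelector)
	if err != nil {
		return err
	}
	if job.Annotations == nil {
		job.Annotations = map[string]string{}
	}
	job.Annotations[originalNodeSelectorAnnotation] = string(raw)
	return nil
}
```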
Why is this needed:
Always make sure that when suspending a Job, we'll restore the original nodeAffinity.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.