Queue reconciler when prebuilt workload is created after job #3131
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: IrvingMg
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Force-pushed from 5045026 to dfea000.
/test all
/test pull-kueue-test-e2e-main-1-31
/retest-required
Force-pushed from 444a464 to 06e63a5.
/test all
/cc @mbobrovskyi
Resolved review threads on outdated code in:
pkg/controller/jobs/kubeflow/jobs/paddlejob/paddlejob_controller.go (two threads)
pkg/controller/jobs/kubeflow/jobs/pytorchjob/pytorchjob_controller.go
pkg/controller/jobs/kubeflow/jobs/xgboostjob/xgboostjob_controller.go
/lgtm
LGTM label has been added. Git tree hash: 9f5ace647ee57320e9d5a0192ae78e38e0c3f510
I'm a bit concerned with the complexity of the solution as a fix for the flake. In particular, it is distributed across all Job integrations, which means extra friction for users who want to write their own external Job controllers.
IIUC this issue does not impact MultiKueue, which is the only current consumer of the prebuilt-workload feature (see comment). If this is the case, I would suggest solving the flake in a separate PR. IIUC we can solve it by making sure the Workload is reserving quota before creating the Job object. This will make sure the Workload creation event is propagated before the Job is created. Let me know if I'm missing some scenarios @trasc.
We could separately open an issue for the prebuilt-workload feature and follow up with a fix; maybe this PR is still a valid approach, but splitting will give us the necessary time to think about the right approach without needing to worry about flakes.
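For concreteness, the ordering suggested above could look roughly like the helper below in an integration test: wait for the prebuilt Workload to have quota reserved, and only then create the Job. This is a hedged sketch, not Kueue's actual test utilities; the QuotaReserved condition comes from the Workload API, but the helper name and polling intervals are assumptions.

```go
package sketch

import (
	"context"
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// waitForQuotaReservation blocks until the prebuilt Workload reports
// QuotaReserved, so the test creates the Job only after the Workload's
// creation (and admission) has propagated.
func waitForQuotaReservation(ctx context.Context, c client.Client, key client.ObjectKey) error {
	return wait.PollUntilContextTimeout(ctx, 250*time.Millisecond, 10*time.Second, true,
		func(ctx context.Context) (bool, error) {
			wl := &kueue.Workload{}
			if err := c.Get(ctx, key, wl); err != nil {
				return false, nil // not visible yet; keep polling
			}
			// QuotaReserved is the condition Kueue sets once quota is reserved.
			return apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadQuotaReserved), nil
		})
}
```

A test would call this between creating the Workload and creating the Job that labels it as prebuilt.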
Indeed, we mostly use the prebuilt workload feature in MultiKueue where, for now, this scenario is unlikely to happen. However, the prebuilt workload is a different, independent feature. This scenario happening in some other context is problematic, since the Workload can get admitted and keep the quota reserved indefinitely because its job is not aware of its existence. Regarding the complexity:
That is a bit debatable. It doesn't have a dedicated KEP or user-facing docs, so it is a bit like a "side-feature" of MK (and its expected behavior outside of MK is a bit undefined). Having said that, I understand it is part of the API surface, so fixing it makes sense. I was just thinking about decoupling the fix for the flakes, to have a bit of time to think about a proper fix for the issue, as we have competing possibilities. Due to other workstreams I wasn't yet able to assess which approach is better. I can try to do so on a best-effort basis, but probably next week. In the meantime, if @tenzen-y or @alculquicondor have some assessment here, it would be very helpful.
Thinking "aloud". Aside of friction to the developers of the integrations, another con of the approach is that only batch/Job is covered with tests (in the current PR). For other integrations we hope we didn't make a mistake when copy-pasting and renaming. OTOH, adding integration tests per CRD seems overkill. So, maybe it is worth to explore what the "retry" approach would entail. |
Thinking about it more, wouldn't it be sufficient to just return an error here: kueue/pkg/controller/jobframework/reconciler.go, line 1000 (at a768127)?
This would be an error, yes, but controller-runtime would requeue with exponential delay. IIUC this would solve the flake, work fine for MK, work so-so for the other use cases, and work OOTB for other CRDs. The benefit is that this is a one-liner approach, which we could improve based on user feedback.
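A minimal sketch of what that one-liner amounts to, assuming a lookup helper inside the reconciler; the function name and shape are illustrative, not the actual Kueue code:

```go
package sketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// fetchPrebuiltWorkload surfaces an error, rather than silently giving up,
// when the prebuilt Workload named by the job does not exist yet. A non-nil
// error returned from Reconcile makes controller-runtime requeue the same
// key with per-item exponential backoff (5ms doubling up to ~1000s by
// default), so the reconcile retries until the Workload event arrives.
func fetchPrebuiltWorkload(ctx context.Context, c client.Client, namespace, name string) (*kueue.Workload, error) {
	wl := &kueue.Workload{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, wl); err != nil {
		if apierrors.IsNotFound(err) {
			// The Workload create event may simply not have propagated yet.
			return nil, fmt.Errorf("prebuilt workload %s/%s not found yet: %w", namespace, name, err)
		}
		return nil, err
	}
	return wl, nil
}
```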
This might not be what we need; in my opinion, the reconcile should continue in the "missing workload" state, as it needs to potentially stop the job and/or do more.
The exponential delay might be too unpredictable for what we need. To keep the changes in the job integrations at a minimum, we can add a generic (jobframework) handler for missing prebuilt workloads; the only somewhat "exotic" thing this implies is the need to use […]. It still needs some work (cleanup, unit tests), but it can look like main...epam:kubernetes-kueue:prebuilt-requeue-generic (maybe we can do it as a follow-up). Edit: Something based on UnstructuredList may also be possible, but it's problematic from a caching perspective; we should either accept that the listing is uncached or enable CacheUnstructured, which may increase CPU and memory usage.
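For illustration, a generic handler along these lines might look as follows. This is a rough sketch, not the linked epam branch: the label constant mirrors Kueue's prebuilt-workload label, the handler.Funcs signatures match older controller-runtime releases (newer ones use the typed variants), and the UnstructuredList usage shows the caching trade-off mentioned above.

```go
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// prebuiltWorkloadLabel is the label a job uses to name its prebuilt Workload.
const prebuiltWorkloadLabel = "kueue.x-k8s.io/prebuilt-workload-name"

// prebuiltWorkloadHandler enqueues jobs of the given kind when a Workload
// they name as prebuilt is created after them. jobListGVK must name the
// *list* kind, e.g. batch/v1 "JobList". Using UnstructuredList lets one
// handler serve every integration, at the caching cost discussed above.
func prebuiltWorkloadHandler(c client.Client, jobListGVK schema.GroupVersionKind) handler.Funcs {
	return handler.Funcs{
		CreateFunc: func(ctx context.Context, e event.CreateEvent, q workqueue.RateLimitingInterface) {
			wl, ok := e.Object.(*kueue.Workload)
			if !ok {
				return
			}
			jobs := &unstructured.UnstructuredList{}
			jobs.SetGroupVersionKind(jobListGVK)
			// Find jobs in the Workload's namespace that point at this
			// Workload via the prebuilt-workload label.
			if err := c.List(ctx, jobs, client.InNamespace(wl.Namespace),
				client.MatchingLabels{prebuiltWorkloadLabel: wl.Name}); err != nil {
				return
			}
			for i := range jobs.Items {
				q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
					Namespace: jobs.Items[i].GetNamespace(),
					Name:      jobs.Items[i].GetName(),
				}})
			}
		},
	}
}
```

Listing by label avoids a per-CRD field index, but it means every Workload creation triggers a list call for each registered integration.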
This was just a quick code check; surely we need to test. If I'm not mistaken, this place in the code is probably better: kueue/pkg/controller/jobframework/reconciler.go, line 1011 (at a768127).
I'm hesitant whether it is a good investment of time to implement (and review) a complex solution for a problem which is outside of the current use cases (MultiKueue). The semantics of prebuilt-workload in that scenario are not specified anywhere, so I think a one-liner retry is a valid approach. Rest assured, we can have a KEP, specify the behavior, and implement Job support when we have such use cases. For now, I'm focused on fixing the flake and preventing potentially rare scenarios where we need to wait a little bit of time for the propagation of events.
@IrvingMg please open another PR to test the approach of returning an error. We can reuse the same integration test as added in this PR.
I've tested this approach locally, but it seems it doesn't work. First, I tried just returning the error: kueue/pkg/controller/jobframework/reconciler.go, line 1015 (at 391a000)
but it didn't work. Then, I've added a return
but the e2e test keeps failing in the same case.
Thanks for testing. This is a bit weird, because controller-runtime should generally retry on errors. I see we have some function to categorize errors as retryable. It would be good to investigate a bit to see how this function handles the error, and whether the modified code is actually reached.
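One hedged hypothesis, as a toy illustration only (none of the names below are the actual Kueue code): if returned errors pass through a classifier before reaching the workqueue, a "non-retryable" classification would suppress the backoff entirely.

```go
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// isRetryable is a hypothetical classifier; if a NotFound for the prebuilt
// workload were classified as non-retryable, the wrapper below would swallow
// the error and the workqueue would never apply its backoff.
func isRetryable(err error) bool { return false } // placeholder classification

// reconcileWithClassification shows how an error-classifying wrapper can
// silently drop an error that the inner Reconcile intended as a retry signal.
func reconcileWithClassification(ctx context.Context, inner reconcile.Reconciler, req reconcile.Request) (reconcile.Result, error) {
	res, err := inner.Reconcile(ctx, req)
	if err != nil && !isRetryable(err) {
		// controller-runtime sees no error here, so no requeue happens.
		return res, nil
	}
	return res, err
}
```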
I still need to investigate this issue further. For now, I've tested this solution by @mbobrovskyi: main...epam:kubernetes-kueue:fix/wait-for-prebuild-workload, but I keep getting the same error as before:
Thanks!
/cc
This is another issue. @IrvingMg I would propose creating a separate PR to fix it.
/close |
@mimowo: Closed this PR. In response to this: /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Invokes the job reconciler when the prebuilt workload appears, so that the job can take ownership of it (see the sketch at the end of this description).
Which issue(s) this PR fixes:
Fixes #3051
Special notes for your reviewer:
Does this PR introduce a user-facing change?
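For illustration, the wiring this describes could look roughly like the sketch below in controller-runtime terms. The function and parameter names are assumptions, and the actual PR distributes the change across the job integrations rather than using a single helper:

```go
package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// setupWithWorkloadWatch wires a job controller to also reconcile when a
// Workload appears. mapWorkloadToJobs is assumed to resolve which jobs name
// the Workload as prebuilt (e.g. the label-based listing sketched earlier in
// the conversation).
func setupWithWorkloadWatch(mgr ctrl.Manager, r reconcile.Reconciler,
	mapWorkloadToJobs func(context.Context, client.Object) []reconcile.Request) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&batchv1.Job{}).
		// On Workload events, enqueue the jobs that reference the Workload,
		// so the job reconciler runs and can take ownership of it.
		Watches(&kueue.Workload{}, handler.EnqueueRequestsFromMapFunc(mapWorkloadToJobs)).
		Complete(r)
}
```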