
Queue reconciler when prebuilt workload is created after job #3131

Conversation

IrvingMg
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Invokes the job reconciler when the workload appears so that the job can take ownership of it.

Which issue(s) this PR fixes:

Fixes #3051

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 24, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: IrvingMg
Once this PR has been reviewed and has the lgtm label, please assign kerthcet, trasc for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 24, 2024

netlify bot commented Sep 24, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit 4315626
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6705c95248f7da0008c6490a

@IrvingMg IrvingMg changed the title Reconcile when workload is created after job Queue reconciler when prebuilt workload is created after job Sep 24, 2024
@IrvingMg IrvingMg force-pushed the fix/kueue-when-creating-job-with-queueing-should-run-with-prebuilt-workload branch 2 times, most recently from 5045026 to dfea000 Compare September 26, 2024 16:34
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 26, 2024
@IrvingMg
Contributor Author

/test all

@IrvingMg
Contributor Author

/test pull-kueue-test-e2e-main-1-31

@IrvingMg
Contributor Author

/retest-required

@IrvingMg IrvingMg force-pushed the fix/kueue-when-creating-job-with-queueing-should-run-with-prebuilt-workload branch from 444a464 to 06e63a5 Compare October 7, 2024 18:15
@IrvingMg
Contributor Author

IrvingMg commented Oct 7, 2024

/test all

@IrvingMg IrvingMg marked this pull request as ready for review October 7, 2024 18:30
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 7, 2024
@IrvingMg
Contributor Author

IrvingMg commented Oct 7, 2024

/cc @mbobrovskyi

Contributor

@trasc trasc left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 9f5ace647ee57320e9d5a0192ae78e38e0c3f510

Contributor

@mimowo mimowo left a comment


I'm a bit concerned with the complexity of the solution as a fix for the flake. In particular, it is distributed across all Job integrations, which means extra friction for users who want to write their own external Job controllers.

IIUC this issue does not impact MultiKueue, which is the only current consumer of the prebuilt-workload (see comment). If this is the case, I would suggest solving the flake in a separate PR. IIUC we can solve it by making the Workload reserve quota before creating the Job object. This will make sure the Workload creation event is propagated before the Job is created. Let me know if I'm missing some scenarios @trasc.

We could separately open an issue for the prebuilt-workload feature and follow up with a fix, and maybe this PR is still a valid approach, but splitting will give us the necessary time to think about the right approach without the need to worry about flakes.

cc @tenzen-y @alculquicondor

@trasc
Contributor

trasc commented Oct 9, 2024

IIUC this issue does not impact MultiKueue, which is the only current consumer of the prebuilt-workload (see comment). If this is the case, I would suggest solving the flake in a separate PR. IIUC we can solve it by making the Workload reserve quota before creating the Job object. This will make sure the Workload creation event is propagated before the Job is created. Let me know if I'm missing some scenarios @trasc.

Indeed, we mostly use the prebuilt workload feature in MultiKueue where, for now, this scenario is unlikely to happen. However, the prebuilt workload is a different, independent feature.

This scenario happening in some other context is problematic, since the Workload can get admitted and keep the quota reserved indefinitely, as its job is not aware of its existence.

Regarding the complexity:

  1. This is not the only "special" thing one needs to do to support prebuilt workload for a specific job type; take JobSet's [jobset] Add prebuilt workload support #1575, for example.

  2. A slightly less optimal, but generic, proposal was made in [Flaking test] Kueue when Creating a Job With Queueing Should run with prebuilt workload #3051 (to queue a delayed reconcile for the job if the prebuilt workload is missing; this would have impacted only the jobframework), but the current one was chosen.

@mimowo
Contributor

mimowo commented Oct 9, 2024

However the prebuilt workload is a different independent feature.

That is a bit debatable. It doesn't have a dedicated KEP or user-facing docs, so it is a bit of a "side-feature" of MK (and its expected behavior outside of MK is somewhat undefined).

Having said that, I understand it is part of the API surface, so fixing it makes sense, I was just thinking about decoupling the fix for flakes, to have a bit of time to think about a proper fix for the issue, as we have competing possibilities.

Due to other workstreams I wasn't yet able to assess which approach is better. I can try to do so on a best-effort basis, but probably next week. In the meanwhile, if @tenzen-y @alculquicondor have some assessment here, it would be very helpful.

@mimowo
Contributor

mimowo commented Oct 9, 2024

Thinking "aloud". Aside from the friction for developers of the integrations, another con of this approach is that only batch/Job is covered with tests (in the current PR). For other integrations we hope we didn't make a mistake when copy-pasting and renaming. OTOH, adding integration tests per CRD seems like overkill. So, maybe it is worth exploring what the "retry" approach would entail.

@mimowo
Contributor

mimowo commented Oct 9, 2024

Thinking about it more, wouldn't it be sufficient to just return fmt.Errorf("no expected prebuilt-workload") from [code link]?

This would be an error, yes, but controller-runtime would requeue with exponential delay.

IIUC this would solve the flake, work fine for MK, and work so-so for the other use-cases, and work ootb for other CRDs. The benefit is that this is one-liner approach, which we could improve upon user-feedback.

@trasc
Contributor

trasc commented Oct 10, 2024

Thinking about it more, wouldn't it be sufficient to just return fmt.Errorf("no expected prebuilt-workload") from ..

This might not be what we need, in my opinion the reconcile should continue in the "missing workload" state as it needs to potentially stop the job and/or more.

This would be an error, yes, but controller-runtime would requeue with exponential delay.

The exponential delay might be too unpredictable for what we need.

To keep the changes in the job integrations at a minimum we can add a generic (jobframework) handler for missing prebuilt workloads; the only somewhat "exotic" thing this implies is the need to use reflect to extract the Items from a client.ObjectList.

It still needs some work (cleanup, unit tests) but it can look like main...epam:kubernetes-kueue:prebuilt-requeue-generic (maybe we can do it as a follow-up).

Edit: Something based on UnstructuredList may also be possible, but it's problematic from a caching perspective: we should either accept that the listing is uncached or enable CacheUnstructured, which may increase CPU and memory usage.
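The reflect-based extraction described above can be sketched with stdlib only. Here `Workload` and `WorkloadList` are hypothetical stand-ins for a concrete client.ObjectList type such as kueue.WorkloadList (assumed to have an `Items` slice field), and `itemNames` is an illustrative helper, not Kueue API:

```go
package main

import (
	"fmt"
	"reflect"
)

// Stand-in types mirroring the usual Kubernetes list shape: a struct with
// an Items slice of objects.
type Workload struct{ Name string }
type WorkloadList struct{ Items []Workload }

// itemNames reads the Items slice from any list-shaped struct pointer via
// reflection, the way a generic jobframework handler could iterate objects
// without knowing the concrete list type at compile time.
func itemNames(list interface{}) []string {
	items := reflect.ValueOf(list).Elem().FieldByName("Items")
	names := make([]string, 0, items.Len())
	for i := 0; i < items.Len(); i++ {
		names = append(names, items.Index(i).FieldByName("Name").String())
	}
	return names
}

func main() {
	l := &WorkloadList{Items: []Workload{{Name: "wl-a"}, {Name: "wl-b"}}}
	fmt.Println(itemNames(l)) // prints [wl-a wl-b]
}
```

This is the "exotic" part: the handler receives a typed list it cannot name statically, so reflection (or an UnstructuredList, with the caching trade-off noted above) is needed to walk its items.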

@mimowo
Contributor

mimowo commented Oct 11, 2024

This might not be what we need, in my opinion the reconcile should continue in the "missing workload" state as it needs to potentially stop the job and/or more.

This was just a quick code check; surely we need to test, if I'm not mistaken. Probably this place in the code is better: [code link].

I'm hesitant if it is a good investment of time to implement (and review) a complex solution for a problem which is outside of current use-cases (MultiKueue).

The semantics of prebuilt-workload in that scenario are not specified anywhere, so I think a one-liner retry is a valid approach.

Rest assured, we can have a KEP, specify the behavior, and implement Job support when we have such use-cases. For now, I'm focused on fixing the flake and preventing potentially rare scenarios where we need to wait a little bit of time for the propagation of events.

@mimowo
Contributor

mimowo commented Oct 15, 2024

@IrvingMg please open another PR to test the approach of returning error. We can reuse the same integration test as added in this PR.

@IrvingMg
Contributor Author

@IrvingMg please open another PR to test the approach of returning error. We can reuse the same integration test as added in this PR.

I've tested this approach locally but it seems it doesn't work. First, I tried just returning the error fmt.Errorf("no expected prebuilt-workload") here [code link], but it didn't work.

Then, I added a return ctrl.Result{Requeue: true}, nil here:

log.Error(err, "Handling job with no workload")

but the e2e test keeps failing in the same case.

@mimowo
Contributor

mimowo commented Oct 16, 2024

Thanks for testing. This is a bit weird, because controller-runtime should generally retry on errors. I see we have a function to categorize errors as retryable. It would be good to investigate how this function handles the error, and whether the modified code is used at all.

@IrvingMg
Contributor Author

I still need to investigate this issue further. For now, I've tested this solution by @mbobrovskyi: main...epam:kubernetes-kueue:fix/wait-for-prebuild-workload but I keep getting the same error as before:

Kueue when Creating a Job With Queueing [It] Should run with prebuilt workload
  [FAILED] Timed out after 45.001s.
  The function passed to Eventually failed at /Users/Irving_Mondragon/Documents/git/kueue/test/e2e/singlecluster/e2e_test.go:206 with:
  Expected
      <[]string | len:1, cap:4>: [
          "kueue.x-k8s.io/resource-in-use",
      ]
  not to contain element matching
      <string>: kueue.x-k8s.io/resource-in-use

@mbobrovskyi
Contributor

@IrvingMg please open another PR to test the approach of returning error. We can reuse the same integration test as added in this PR.

Opened the PR #3255.

@mimowo
Contributor

mimowo commented Oct 17, 2024

Thanks!

@tenzen-y
Member

/cc

@k8s-ci-robot k8s-ci-robot requested a review from tenzen-y October 17, 2024 20:07
@mbobrovskyi
Contributor

mbobrovskyi commented Oct 18, 2024

Still need to investigate more about this issue. For now, I've tested this solution by @mbobrovskyi: main...epam:kubernetes-kueue:fix/wait-for-prebuild-workload but I keep getting the same error as before:

Kueue when Creating a Job With Queueing [It] Should run with prebuilt workload
  [FAILED] Timed out after 45.001s.
  The function passed to Eventually failed at /Users/Irving_Mondragon/Documents/git/kueue/test/e2e/singlecluster/e2e_test.go:206 with:
  Expected
      <[]string | len:1, cap:4>: [
          "kueue.x-k8s.io/resource-in-use",
      ]
  not to contain element matching
      <string>: kueue.x-k8s.io/resource-in-use

This is another issue. @IrvingMg I would propose creating a separate PR to fix it.

@mimowo
Contributor

mimowo commented Oct 21, 2024

/close
in favor of the retry in #3255. I appreciate the effort on this PR to investigate and prototype the approach. Still, I think the PR adds more complexity than needed at this point. So, let's go with "best effort" support for now. We will improve the support when we have evidence or feedback from users who need it.

@k8s-ci-robot
Contributor

@mimowo: Closed this PR.

In response to this:

/close
in favor of the retry in #3255. I appreciate the effort on this PR to investigate and prototype the approach. Still, I think the PR adds more complexity than needed at this point. So, let's go with "best effort" support for now. We will improve the support when we have evidence or feedback from users who need it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Successfully merging this pull request may close these issues.

[Flaking test] Kueue when Creating a Job With Queueing Should run with prebuilt workload
6 participants