Queue reconciler when prebuilt workload is created after job #3131
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: IrvingMg
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
Force-pushed from 5045026 to dfea000.
/test all
/test pull-kueue-test-e2e-main-1-31
/retest-required
Force-pushed from 444a464 to 06e63a5.
/test all
/cc @mbobrovskyi
Resolved review threads on outdated code in:
pkg/controller/jobs/kubeflow/jobs/paddlejob/paddlejob_controller.go (two threads)
pkg/controller/jobs/kubeflow/jobs/pytorchjob/pytorchjob_controller.go
pkg/controller/jobs/kubeflow/jobs/xgboostjob/xgboostjob_controller.go
/lgtm
LGTM label has been added. Git tree hash: 9f5ace647ee57320e9d5a0192ae78e38e0c3f510
I'm a bit concerned with the complexity of the solution as a fix for the flake. In particular, it is distributed across all Job integrations, which means extra friction for users who want to write their own external Job controllers.
IIUC this issue does not impact MultiKueue, which is the only current consumer of the prebuilt-workload feature (see comment). If this is the case, I would suggest solving the flake in a separate PR. IIUC we can solve it by making sure the Workload is reserving quota before creating the Job object. This will make sure the Workload creation event is propagated before the Job is created. Let me know if I'm missing some scenarios @trasc.
We could separately open an issue for the prebuilt-workload feature and follow up with a fix; maybe this PR is still a valid approach, but splitting will give us the necessary time to think about the right approach without needing to worry about flakes.
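For concreteness, the ordering suggested above could look roughly like the helper below in an integration test: wait for the prebuilt Workload to have quota reserved, and only then create the Job. This is a hedged sketch, not Kueue's actual test utilities; the QuotaReserved condition comes from the Workload API, but the helper name and polling intervals are assumptions.

```go
package sketch

import (
	"context"
	"time"

	apimeta "k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// waitForQuotaReservation blocks until the prebuilt Workload reports
// QuotaReserved, so the test creates the Job only after the Workload's
// creation (and admission) has propagated.
func waitForQuotaReservation(ctx context.Context, c client.Client, key client.ObjectKey) error {
	return wait.PollUntilContextTimeout(ctx, 250*time.Millisecond, 10*time.Second, true,
		func(ctx context.Context) (bool, error) {
			wl := &kueue.Workload{}
			if err := c.Get(ctx, key, wl); err != nil {
				return false, nil // not visible yet; keep polling
			}
			// QuotaReserved is the condition Kueue sets once quota is reserved.
			return apimeta.IsStatusConditionTrue(wl.Status.Conditions, kueue.WorkloadQuotaReserved), nil
		})
}
```

A test would call this between creating the Workload and creating the Job that labels it as prebuilt.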
Indeed, we mostly use the prebuilt workload feature in MultiKueue where, for now, this scenario is unlikely to happen. However, the prebuilt workload is a different, independent feature. This scenario happening in some other context is problematic, since the Workload can get admitted and keep the quota reserved indefinitely because its job is not aware of its existence. Regarding the complexity:
That is a bit debatable. It doesn't have a dedicated KEP or user-facing docs, so it is a bit like a "side-feature" of MK (and its expected behavior outside of MK is a bit undefined). Having said that, I understand it is part of the API surface, so fixing it makes sense. I was just thinking about decoupling the fix for the flakes, to have a bit of time to think about a proper fix for the issue, as we have competing possibilities. Due to other workstreams I wasn't yet able to assess which approach is better. I can try to do so on a best-effort basis, but probably next week. In the meantime, if @tenzen-y or @alculquicondor have some assessment here, it would be very helpful.
Thinking "aloud". Aside of friction to the developers of the integrations, another con of the approach is that only batch/Job is covered with tests (in the current PR). For other integrations we hope we didn't make a mistake when copy-pasting and renaming. OTOH, adding integration tests per CRD seems overkill. So, maybe it is worth to explore what the "retry" approach would entail. |
Thinking about it more, wouldn't it be sufficient to just return an error here: kueue/pkg/controller/jobframework/reconciler.go, line 1000 (at a768127)?
This would be an error, yes, but controller-runtime would requeue with exponential delay. IIUC this would solve the flake, work fine for MK, work so-so for the other use cases, and work OOTB for other CRDs. The benefit is that this is a one-liner approach, which we could improve based on user feedback.
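A minimal sketch of what that one-liner amounts to, assuming a lookup helper inside the reconciler; the function name and shape are illustrative, not the actual Kueue code:

```go
package sketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// fetchPrebuiltWorkload surfaces an error, rather than silently giving up,
// when the prebuilt Workload named by the job does not exist yet. A non-nil
// error returned from Reconcile makes controller-runtime requeue the same
// key with per-item exponential backoff (5ms doubling up to ~1000s by
// default), so the reconcile retries until the Workload event arrives.
func fetchPrebuiltWorkload(ctx context.Context, c client.Client, namespace, name string) (*kueue.Workload, error) {
	wl := &kueue.Workload{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, wl); err != nil {
		if apierrors.IsNotFound(err) {
			// The Workload create event may simply not have propagated yet.
			return nil, fmt.Errorf("prebuilt workload %s/%s not found yet: %w", namespace, name, err)
		}
		return nil, err
	}
	return wl, nil
}
```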
This might not be what we need; in my opinion, the reconcile should continue in the "missing workload" state, as it needs to potentially stop the job and/or do more.
The exponential delay might be too unpredictable for what we need. To keep the changes in the job integrations at a minimum, we can add a generic (jobframework) handler for missing prebuilt workloads; the only somewhat "exotic" thing this implies is the need to use […]. It still needs some work (cleanup, unit tests), but it can look like main...epam:kubernetes-kueue:prebuilt-requeue-generic (maybe we can do it as a follow-up). Edit: Something based on UnstructuredList may also be possible, but it's problematic from a caching perspective; we should either accept that the listing is uncached or enable CacheUnstructured, which may increase CPU and memory usage.
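For illustration, a generic handler along these lines might look as follows. This is a rough sketch, not the linked epam branch: the label constant mirrors Kueue's prebuilt-workload label, the handler.Funcs signatures match older controller-runtime releases (newer ones use the typed variants), and the UnstructuredList usage shows the caching trade-off mentioned above.

```go
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/workqueue"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// prebuiltWorkloadLabel is the label a job uses to name its prebuilt Workload.
const prebuiltWorkloadLabel = "kueue.x-k8s.io/prebuilt-workload-name"

// prebuiltWorkloadHandler enqueues jobs of the given kind when a Workload
// they name as prebuilt is created after them. jobListGVK must name the
// *list* kind, e.g. batch/v1 "JobList". Using UnstructuredList lets one
// handler serve every integration, at the caching cost discussed above.
func prebuiltWorkloadHandler(c client.Client, jobListGVK schema.GroupVersionKind) handler.Funcs {
	return handler.Funcs{
		CreateFunc: func(ctx context.Context, e event.CreateEvent, q workqueue.RateLimitingInterface) {
			wl, ok := e.Object.(*kueue.Workload)
			if !ok {
				return
			}
			jobs := &unstructured.UnstructuredList{}
			jobs.SetGroupVersionKind(jobListGVK)
			// Find jobs in the Workload's namespace that point at this
			// Workload via the prebuilt-workload label.
			if err := c.List(ctx, jobs, client.InNamespace(wl.Namespace),
				client.MatchingLabels{prebuiltWorkloadLabel: wl.Name}); err != nil {
				return
			}
			for i := range jobs.Items {
				q.Add(reconcile.Request{NamespacedName: types.NamespacedName{
					Namespace: jobs.Items[i].GetNamespace(),
					Name:      jobs.Items[i].GetName(),
				}})
			}
		},
	}
}
```

Listing by label avoids a per-CRD field index, but it means every Workload creation triggers a list call for each registered integration.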
This was just a quick code check; surely we need to test. If I'm not mistaken, this place in the code is probably better: kueue/pkg/controller/jobframework/reconciler.go, line 1011 (at a768127).
I'm hesitant whether it is a good investment of time to implement (and review) a complex solution for a problem which is outside of the current use cases (MultiKueue). The semantics of prebuilt-workload in that scenario are not specified anywhere, so I think a one-liner retry is a valid approach. Rest assured, we can have a KEP, specify the behavior, and implement Job support when we have such use cases. For now, I'm focused on fixing the flake and preventing potentially rare scenarios where we need to wait a little bit of time for the propagation of events.
@IrvingMg please open another PR to test the approach of returning an error. We can reuse the same integration test as added in this PR.
I've tested this approach locally, but it seems it doesn't work. First, I tried just returning the error: kueue/pkg/controller/jobframework/reconciler.go, line 1015 (at 391a000)
but it didn't work. Then, I've added a return
but the e2e test keeps failing in the same case.
Thanks for testing. This is a bit weird, because controller-runtime should generally retry on errors. I see we have some function to categorize errors as retryable. It would be good to investigate a bit to see how this function handles the error, and whether the modified code is actually reached.
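One hedged hypothesis, as a toy illustration only (none of the names below are the actual Kueue code): if returned errors pass through a classifier before reaching the workqueue, a "non-retryable" classification would suppress the backoff entirely.

```go
package sketch

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// isRetryable is a hypothetical classifier; if a NotFound for the prebuilt
// workload were classified as non-retryable, the wrapper below would swallow
// the error and the workqueue would never apply its backoff.
func isRetryable(err error) bool { return false } // placeholder classification

// reconcileWithClassification shows how an error-classifying wrapper can
// silently drop an error that the inner Reconcile intended as a retry signal.
func reconcileWithClassification(ctx context.Context, inner reconcile.Reconciler, req reconcile.Request) (reconcile.Result, error) {
	res, err := inner.Reconcile(ctx, req)
	if err != nil && !isRetryable(err) {
		// controller-runtime sees no error here, so no requeue happens.
		return res, nil
	}
	return res, err
}
```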
I still need to investigate this issue further. For now, I've tested this solution by @mbobrovskyi: main...epam:kubernetes-kueue:fix/wait-for-prebuild-workload, but I keep getting the same error as before:
Thanks!
/cc
This is another issue. @IrvingMg I would propose creating a separate PR to fix it.
/close |
@mimowo: Closed this PR. In response to this: /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Invokes the job reconciler when the prebuilt workload appears, so that the job can take ownership of it (see the sketch at the end of this description).
Which issue(s) this PR fixes:
Fixes #3051
Special notes for your reviewer:
Does this PR introduce a user-facing change?
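For illustration, the wiring this describes could look roughly like the sketch below in controller-runtime terms. The function and parameter names are assumptions, and the actual PR distributes the change across the job integrations rather than using a single helper:

```go
package sketch

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kueue "sigs.k8s.io/kueue/apis/kueue/v1beta1"
)

// setupWithWorkloadWatch wires a job controller to also reconcile when a
// Workload appears. mapWorkloadToJobs is assumed to resolve which jobs name
// the Workload as prebuilt (e.g. the label-based listing sketched earlier in
// the conversation).
func setupWithWorkloadWatch(mgr ctrl.Manager, r reconcile.Reconciler,
	mapWorkloadToJobs func(context.Context, client.Object) []reconcile.Request) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&batchv1.Job{}).
		// On Workload events, enqueue the jobs that reference the Workload,
		// so the job reconciler runs and can take ownership of it.
		Watches(&kueue.Workload{}, handler.EnqueueRequestsFromMapFunc(mapWorkloadToJobs)).
		Complete(r)
}
```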