A workload may not be retried if setting the Admission field fails #241
Comments
Did this happen because of the update event that occurred during the scheduling cycle? In any case, we need to prioritize SSA (#164).
/assign Even without SSA, there should have been more retries. I can investigate.
I am guessing this is starting to happen because we shifted to using BestEffortFIFO in all the integration tests.
There is another flake, slightly similar but not identical: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/224/pull-kueue-test-integration-main/1521319895239757824 This case could be solved with SSA, but why are we still not trying with the newer version in the following attempts? By the way, this potential bug is probably showing up because we are testing with the job-controller using Job, so there is certainly a benefit to keeping those scheduler tests on Jobs rather than on workloads only.
I think PR #245 may solve the problem. It just makes sure that the requeued element is the newest version. At least for me, I haven't hit this error since.
I think this could happen with just one workload if it barely fits in the ClusterQueue. This is the sequence of events:
We could say that there are 2 bugs, as described in step 8 above. #245 only resolves the first part, but leaves a phantom workload in the inadmissible set. Solving just the second bug might make the first problem a non-bug. But maybe it's better to always try to re-queue the newer version?
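One way to read "always try to re-queue the newer version" is sketched below. This is a toy model, not Kueue's code: `Workload`, `Queue`, `PushIfNotPresent`, and `PushOrUpdate` are hypothetical names, and the integer `Version` field is an assumption made for illustration (a real Kubernetes `resourceVersion` is an opaque string and should not be compared numerically).

```go
package main

// Workload is a hypothetical stand-in for a queued object.
// Version is an illustrative counter, not a real resourceVersion.
type Workload struct {
	Name    string
	Version int64
}

// Queue is a toy pending set keyed by workload name.
type Queue struct {
	items map[string]Workload
}

// NewQueue returns an empty queue.
func NewQueue() *Queue {
	return &Queue{items: make(map[string]Workload)}
}

// PushIfNotPresent keeps whatever copy is already stored, so a
// requeue of an existing workload is a no-op. It reports whether
// the workload was stored.
func (q *Queue) PushIfNotPresent(w Workload) bool {
	if _, ok := q.items[w.Name]; ok {
		return false
	}
	q.items[w.Name] = w
	return true
}

// PushOrUpdate stores w unless the queue already holds a strictly
// newer copy, so the next attempt works with the newest version seen.
// It reports whether the workload was stored.
func (q *Queue) PushOrUpdate(w Workload) bool {
	if cur, ok := q.items[w.Name]; ok && cur.Version > w.Version {
		return false
	}
	q.items[w.Name] = w
	return true
}

// Get returns the stored copy of the named workload, if any.
func (q *Queue) Get(name string) (Workload, bool) {
	w, ok := q.items[name]
	return w, ok
}
```

With `PushIfNotPresent`, requeuing version 2 of a workload whose version 1 is still stored leaves version 1 in place; with `PushOrUpdate`, the stored copy becomes version 2, and a later attempt to push version 1 is rejected.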
I think it's still independent of it.
The failing test uses a StrictFIFO queue, and this problem might still happen. I have added some explanation in #245 (comment), which may be helpful.
The error in the link above https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/224/pull-kueue-test-integration-main/1521319895239757824 is on a BestEffortFIFO queue. I missed analyzing the original link https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/227/pull-kueue-test-integration-main/1520594730851766272. I'll do that now.
Analysis for https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/227/pull-kueue-test-integration-main/1520594730851766272 is quite different.
With this analysis, I now agree that #245 is also necessary.
Maybe, but this issue shows the tricky race conditions that can happen when objects are updated by different controllers. The job controller is also involved in updating the workload object, so it is beneficial to have it in the test as well.
With #245 being merged, I assume we can close this issue.
What happened:
Test flake: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/227/pull-kueue-test-integration-main/1520594730851766272
From the logs, the workload was assumed, but we failed to set the Admission field because the object had changed:
and requeuing the workload didn't actually add it to the queue because the workload already existed, which is expected:
but then the logs don't show the workload being tried again, so the scheduler never retried setting the Admission field in the apiserver.
What you expected to happen:
The workload to be retried and eventually admitted.
/kind bug