A workload may not be retried if setting the Admission field fails #241

ahg-g · 2022-05-01T03:16:48Z

What happened:

Test flake: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/227/pull-kueue-test-integration-main/1520594730851766272

From the logs , the workload was assumed but we failed to set the Admission field because the object has changed:

Could not admit workload and assigning flavors in apiserver	{"workload": {"name":"job","namespace":"core-67497"}, 
"clusterQueue": {"name":"cluster-queue-with-selector"}, "error": "Operation cannot be fulfilled on workloads.kueue.x-k8s.io 
\"job\": the object has been modified; please apply your changes to the latest version and try again"}

and requeuing the workload didn't actually add it to the queue because the workload already existed, which is expected:

scheduler	Workload re-queued	{"workload": {"name":"job","namespace":"core-67497"}, "clusterQueue": 
{"name":"cluster-queue-with-selector"}, "workload": {"name":"job","namespace":"core-67497"}, 
"queue": {"name":"queue-for-selector","namespace":"core-67497"}, "added": false, "status": "assumed"}

but then the logs don't show that the workload was tried again, and so the scheduler never got to retry to set the admission field again in the apiserver.

What you expected to happen:

The workload to be retried and eventually admitted.

/kind bug

The text was updated successfully, but these errors were encountered:

alculquicondor · 2022-05-02T13:52:00Z

and requeuing the workload didn't actually add it to the queue because the workload already existed, which is expected:

It existed because of the update event that happened during the scheduling cycle?
If so, wouldn't the pod be in the scheduling queue?

In any case, we need to prioritize SSA #164

alculquicondor · 2022-05-02T22:52:15Z

/assign

Even without SSA, there should have been more retries. I can investigate.

ahg-g · 2022-05-03T03:25:03Z

I am guessing this is starting to happen because we shifted to use BestEffortFIFO in all the integration tests.

ahg-g · 2022-05-03T03:44:34Z

There is another flake, slightly similar, but not exactly: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/224/pull-kueue-test-integration-main/1521319895239757824

This case could be solved with SSA, but still why are we not trying with the newer version in the following attempts?

btw, this potential bug is showing up probably because we are testing with the job-controller using Job, so there is certainly benefit to keep having those scheduler tests using jobs rather than workload only.

kerthcet · 2022-05-03T10:59:14Z

I think this pr #245 maybe solve the problem. I just make sure that the element requeued is the newest. At least to me, I haven't meet this error after.

alculquicondor · 2022-05-03T14:06:40Z

I think this could happen with just one workload if it barely fits in the ClusterQueue. This is the sequence of events:

Create workload
Workload is updated with pending status because Queue doesn't exist
Create queue
Workload is assumed and goes into async admission.
Event for update in 2 is received.
Scheduler tries to schedule the workload again. Since another version of the same workload is already assumed, it doesn't fit again. The workload is put in the inadmissible set.
The async admission fails because the workload had changed.
The workload is requeued, but with an old version. The newer version remains in the inadmissible set.

alculquicondor · 2022-05-03T14:10:17Z

We could say that there are 2 bugs, as described in step 8 above. #245 only resolves the first part, but leaves a phantom workload in the inadmissible set.

Solving just the second bug might make the first problem a non-bug. But maybe it's better to always try to re-queue the newer version?

alculquicondor · 2022-05-03T14:51:38Z

btw, this potential bug is showing up probably because we are testing with the job-controller using Job, so there is certainly benefit to keep having those scheduler tests using jobs rather than workload only.

I think it's still independent of it.

kerthcet · 2022-05-03T15:24:37Z

I think this could happen with just one workload if it barely fits in the ClusterQueue. This is the sequence of events:

Create workload

Workload is updated with pending status because Queue doesn't exist

Create queue

Workload is assumed and goes into async admission.

Event for update in 2 is received.

Scheduler tries to schedule the workload again. Since another version of the same workload is already assumed, it doesn't fit again. The workload is put in the inadmissible set.

The async admission fails because the workload had changed.

The workload is requeued, but with an old version. The newer version remains in the inadmissible set.

The failing test is a StrictFIFO queue, and this problem might still happen. I have add some explains here #245 (comment), maybe helpful.

alculquicondor · 2022-05-03T15:30:48Z

The error in the link above https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/224/pull-kueue-test-integration-main/1521319895239757824 is on a BestEffortFIFO queue.

I missed analyzing the original link https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/227/pull-kueue-test-integration-main/1520594730851766272. I'll do that now.

alculquicondor · 2022-05-03T17:45:58Z

Analysis for https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_kueue/227/pull-kueue-test-integration-main/1520594730851766272 is quite different.

Create workload
Admission attempt 1: fails because the namespace doesn't match. This produces a status update.
Workload is requeued immediately, because the queue is StrictFIFO
Admission attempt 2: enters the async admission.
Event for step 2 is receive, workload is queued again.
Admission attempt 3 starts, now the workload is not in the queue. The attempt fails because the workload is already assumed.
Async admission from attemp 2 fails, because there was an update in the workload object. Requeue first version of the workload. Since there is no copy of the workload in the queues, it is accepted.
Requeue from admission attempt doesn't actually requeue, because the workload is already there. This version of the workload is lost.

With this analysis, I now agree that #245 is also necessary.

ahg-g · 2022-05-03T18:31:17Z

btw, this potential bug is showing up probably because we are testing with the job-controller using Job, so there is certainly benefit to keep having those scheduler tests using jobs rather than workload only.

I think it's still independent of it.

May be, but this issue shows the tricky race conditions that could happen when objects are being updated by different controllers; the job controller is also involved in updating the workload object, and so it is beneficial to also have it in the test.

ahg-g · 2022-05-07T09:18:29Z

With #245 being merged, I assume we can close this issue

ahg-g added the kind/bug Categorizes issue or PR as related to a bug. label May 1, 2022

ahg-g changed the title ~~A workload may not be retried if setting binding fails~~ A workload may not be retried if setting the Admission field fails May 1, 2022

ahg-g mentioned this issue May 1, 2022

Change make cluster queue in wrappers to bestfifo #227

Merged

kerthcet mentioned this issue May 1, 2022

fix: revert change when assume failed #242

Closed

k8s-ci-robot assigned alculquicondor May 2, 2022

kerthcet mentioned this issue May 3, 2022

fix: update failure in requeue #245

Merged

alculquicondor mentioned this issue May 3, 2022

Immediate requeue in BestEffortFIFO should put a workload for retry #246

Merged

ahg-g closed this as completed May 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A workload may not be retried if setting the Admission field fails #241

A workload may not be retried if setting the Admission field fails #241

ahg-g commented May 1, 2022

alculquicondor commented May 2, 2022

alculquicondor commented May 2, 2022

ahg-g commented May 3, 2022

ahg-g commented May 3, 2022

kerthcet commented May 3, 2022

alculquicondor commented May 3, 2022

alculquicondor commented May 3, 2022

alculquicondor commented May 3, 2022

kerthcet commented May 3, 2022

alculquicondor commented May 3, 2022

alculquicondor commented May 3, 2022

ahg-g commented May 3, 2022

ahg-g commented May 7, 2022

A workload may not be retried if setting the Admission field fails #241

A workload may not be retried if setting the Admission field fails #241

Comments

ahg-g commented May 1, 2022

alculquicondor commented May 2, 2022

alculquicondor commented May 2, 2022

ahg-g commented May 3, 2022

ahg-g commented May 3, 2022

kerthcet commented May 3, 2022

alculquicondor commented May 3, 2022

alculquicondor commented May 3, 2022

alculquicondor commented May 3, 2022

kerthcet commented May 3, 2022

alculquicondor commented May 3, 2022

alculquicondor commented May 3, 2022

ahg-g commented May 3, 2022

ahg-g commented May 7, 2022