`decide_worker_rootish_queuing_disabled` assertion fails when retiring worker #7063

crusaderky · 2022-09-23T21:34:23Z

While stress-testing #7062, `test_RetireWorker_stress` failed once out of 162 runs. The test gracefully retires most of the cluster while a very heavy computation is running:

https://github.com/crusaderky/distributed/actions/runs/3114670981/jobs/5050785452#step:18:1674

2022-09-23 18:56:03,193 - distributed.scheduler - ERROR - (<WorkerState 'tcp://127.0.0.1:63881', name: 6, status: closing_gracefully, memory: 21, processing: 27>, {<WorkerState 'tcp://127.0.0.1:63869', name: 0, status: running, memory: 61, processing: 6>, <WorkerState 'tcp://127.0.0.1:63879', name: 5, status: running, memory: 59, processing: 14>, <WorkerState 'tcp://127.0.0.1:63885', name: 8, status: running, memory: 59, processing: 17>, <WorkerState 'tcp://127.0.0.1:63877', name: 4, status: running, memory: 58, processing: 5>, <WorkerState 'tcp://127.0.0.1:63887', name: 9, status: running, memory: 59, processing: 6>})

Traceback (most recent call last):
  File "d:\a\distributed\distributed\distributed\scheduler.py", line 2040, in transition_waiting_processing
    if not (ws := self.decide_worker_rootish_queuing_disabled(ts)):
  File "d:\a\distributed\distributed\distributed\scheduler.py", line 1901, in decide_worker_rootish_queuing_disabled
    assert ws in self.running, (ws, self.running)


gjoseph92 · 2022-09-23T23:43:04Z

Interesting. That assertion is actually "incorrect". This code path exists to keep scheduling equivalent to what it was before the queuing change. In the past we were okay with suggesting workers that weren't running, so we should be now too (even though it's a bit unreasonable). (It's actually essential for the co-assignment logic.)
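
To make the failure mode concrete, here is a toy model of it. This is a sketch only and not distributed's actual implementation: `ToyWorker`, `ToyScheduler`, `retire`, `batch_size`, and the `strict` flag are all hypothetical names. It assumes that co-assignment means "keep sending tasks from the same batch to the last worker chosen", and that a gracefully retiring worker leaves the `running` set right away.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    running = "running"
    closing_gracefully = "closing_gracefully"


@dataclass(eq=False)  # identity-based hashing, so instances can live in a set
class ToyWorker:
    name: int
    status: Status = Status.running


class ToyScheduler:
    """Hypothetical stand-in for the scheduler; only models co-assignment."""

    def __init__(self, workers, batch_size=2):
        self.running = set(workers)
        self.batch_size = batch_size
        self.last_worker = None  # worker remembered for the current batch
        self.tasks_left = 0      # tasks still owed to that worker

    def retire(self, ws):
        # Graceful retirement: the worker is removed from `running`
        # immediately, but the remembered `last_worker` is not reset.
        ws.status = Status.closing_gracefully
        self.running.discard(ws)

    def decide_worker(self, strict):
        if self.last_worker is not None and self.tasks_left > 0:
            # Co-assignment: reuse the remembered worker, whatever its status.
            ws = self.last_worker
        else:
            if not self.running:
                return None
            ws = min(self.running, key=lambda w: w.name)
            self.last_worker = ws
            self.tasks_left = self.batch_size
        self.tasks_left -= 1
        if strict:
            # Same shape as the assertion in the traceback above.
            assert ws in self.running, (ws, self.running)
        return ws


def demo(strict):
    sched = ToyScheduler([ToyWorker(i) for i in range(3)])
    first = sched.decide_worker(strict)  # picks worker 0, one task still owed
    sched.retire(first)                  # worker 0 starts closing gracefully
    try:
        second = sched.decide_worker(strict)
    except AssertionError:
        return "assertion failed"
    return f"reused worker {second.name} ({second.status.value})"
```

In this toy, `demo(strict=True)` hits the assertion because the remembered worker has left `running` between two picks, while `demo(strict=False)` simply suggests the retiring worker again, which is the pre-queuing behavior described above.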

crusaderky assigned gjoseph92 Sep 23, 2022

crusaderky mentioned this issue Sep 23, 2022

Make AMM memory measure configurable #7062

Merged

gjoseph92 mentioned this issue Sep 23, 2022

Fix decide_worker_rootish_queuing_disabled assert #7065

Merged


This was referenced Oct 26, 2022

Transition queued->memory causes AssertionError #7200

Closed

test_stress_creation_and_deletion flaky #5388

Closed

Fix test_stress_creation_and_deletion #7215

Merged

crusaderky closed this as completed in #7065 Oct 28, 2022

gjoseph92 mentioned this issue Oct 28, 2022

Turn on queuing by default #7213

Closed

