Eliminate partially-removed-worker state on scheduler (comms open, state removed) #6390

Open
Tracked by #6384
gjoseph92 opened this issue May 20, 2022 · 3 comments

@gjoseph92
Collaborator

`Scheduler.remove_worker` removes the scheduler's state for the worker (`self.workers[addr]`, `self.stream_comms[addr]`, etc.), but does not close the actual network connections to the worker. This is even codified in the `close=False` option, which removes the worker's state without telling the worker to shut down or to disconnect.

Keeping the network connections open (and listening on them) is essentially a half-removed state: the scheduler no longer knows about the worker, but if the worker sends updates over the open connection, the scheduler will still process them, potentially invoking handlers that assume the worker's state is still there.
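
As a simplified sketch of the shape described above (hypothetical code, not the actual `Scheduler.remove_worker`; `handle_worker` is the per-worker message loop mentioned below):

```python
# Hypothetical, simplified sketch of the pattern described above; not the
# actual Scheduler.remove_worker implementation.
async def remove_worker(self, address, *, close=True):
    self.workers.pop(address)               # scheduler-side state is removed ...
    bcomm = self.stream_comms.pop(address)
    if close:
        bcomm.send({"op": "close"})         # ... and the worker is asked to shut down.
    # With close=False the worker is not even told to disconnect.  In either
    # case the underlying comm is never closed or awaited here, so the
    # handle_worker() loop can keep receiving messages from a worker the
    # scheduler no longer knows about, and dispatch them to handlers that
    # assume self.workers[address] still exists.
```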

There are two things to figure out:

  • What does it mean for a worker to be "there" or "not there", from the scheduler's perspective?
    • i.e. is it only that self.workers[addr] exists? Or also self.stream_comms[addr], and other such fields? Is there a self.handle_worker coroutine running for that worker too?
    • Can there be a single point of truth for this? A single dict to check? Or method to call?
  • How can Scheduler.remove_worker ensure that:
    • after it returns, the worker is fully "not there"
    • if it yields control while it's running (via await), things are in a well-defined state (the worker is either "there", "not there", or perhaps an explicit "closing" state, but never the half-removed state we have currently)
    • if multiple remove_worker coroutines run concurrently, everything remains consistent
    • if multiple remove_worker coroutines run concurrently, the second one does not return until the worker is actually removed (i.e. the first coroutine has completed); see the sketch after this list
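
One possible shape for these guarantees, sketched with assumed names (`worker_is_there`, `_removals`, `_do_remove`). This is not a proposed implementation, just an illustration of a single point of truth plus one in-flight removal task per worker that concurrent callers all await:

```python
import asyncio

# Sketch only: hypothetical names, not distributed's actual scheduler code.
class Scheduler:
    def __init__(self):
        self.workers = {}        # address -> worker state
        self.stream_comms = {}   # address -> batched comm
        self._removals: dict[str, asyncio.Task] = {}  # in-flight removals

    def worker_is_there(self, address: str) -> bool:
        # Single point of truth: either all per-worker state exists, or none of it.
        return address in self.workers

    async def remove_worker(self, address: str) -> None:
        if address in self._removals:
            # Another remove_worker call is already running for this worker:
            # do not return until it has actually finished.
            await self._removals[address]
            return
        if not self.worker_is_there(address):
            return  # already fully removed; nothing to do
        task = asyncio.ensure_future(self._do_remove(address))
        self._removals[address] = task
        try:
            await task
        finally:
            del self._removals[address]

    async def _do_remove(self, address: str) -> None:
        comm = self.stream_comms.pop(address, None)
        self.workers.pop(address, None)   # drop the scheduler-side state ...
        if comm is not None:
            await comm.close()            # ... and actually close the connection
```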

Addresses #6354

gjoseph92 added a commit to gjoseph92/distributed that referenced this issue Jun 3, 2022
`Scheduler.restart` used to remove every worker without closing it. This was bad practice (dask#6390), as well as incorrect: it certainly seemed the intent was only to remove non-Nanny workers.

Then, Nanny workers are restarted via the `restart` RPC to the Nanny, not to the worker.
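
As a rough sketch of the split the commit message describes (hypothetical code: `ws.nanny`, `self.rpc(...)`, and the `remove_worker` signature are assumptions here, and this is not the actual `Scheduler.restart`):

```python
import asyncio

async def restart(self):
    # Workers without a nanny cannot come back on their own, so those are the
    # only ones to remove, and (per dask#6390) they should also be closed
    # rather than merely dropped from the scheduler's state:
    bare = [addr for addr, ws in self.workers.items() if ws.nanny is None]
    await asyncio.gather(
        *(self.remove_worker(address=addr, close=True) for addr in bare)
    )

    # Nanny-managed workers are not removed at all; each nanny is asked to
    # restart its worker process via the nanny's `restart` RPC (the exact RPC
    # plumbing is an assumption here).
    nannies = [ws.nanny for ws in self.workers.values() if ws.nanny is not None]
    await asyncio.gather(*(self.rpc(addr).restart() for addr in nannies))
```
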
@BitTheByte

Is there any ETA for this?

@crusaderky
Collaborator

crusaderky commented Feb 23, 2024

There's an extra layer of complexity added to this when `Scheduler.retire_workers` and its parameter flags come into play:

  • `close_workers=False, remove=False`: the worker sits forever in `status=closing_gracefully`.
  • `close_workers=True, remove=False`: calls `Scheduler.close_worker()`, which kindly asks the worker to shut itself down. This API makes no sense to me.
  • `close_workers=False, remove=True`: calls `Scheduler.remove_worker(close=False)`, as described above. This is the default behaviour of `Scheduler.retire_workers()`.
  • `close_workers=True, remove=True`: shuts the workers and the nannies down and removes them. This is the default behaviour of `Client.retire_workers()`, and it differs from the scheduler-side API.
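
To make the diverging defaults concrete, a hedged illustration (assuming a `Scheduler` instance `scheduler`, a `Client` instance `client`, and a worker address `addr` in scope; exact signatures may differ between versions):

```python
# Scheduler-side defaults (remove=True, close_workers=False): the worker's state
# is removed via Scheduler.remove_worker(close=False) and its comms stay open.
await scheduler.retire_workers(workers=[addr])

# Client-side defaults: the workers (and their nannies) are also shut down,
# equivalent to close_workers=True, remove=True above.
client.retire_workers(workers=[addr])
```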

@crusaderky
Collaborator

crusaderky commented Feb 23, 2024

I suspect the below is purely hypothetical, but I'll note it nonetheless.

At the moment, there is no simple API for a graceful worker restart. It would be useful, for example, for cleaning up a memory leak on a worker without losing the data it holds. Currently you can do

client.retire_workers([addr], close_workers=False, remove=False)
client.restart_workers([addr])

but with the removal of the flags from retire_workers, it would become impossible.
