
client.restart() may cause workers to shut down instead of restarting #6494

Closed
gjoseph92 opened this issue Jun 2, 2022 · 2 comments · Fixed by #6505 or #6504
Labels: bug (Something is broken), stability (Issue or feature related to cluster stability, e.g. deadlock)

Client.restart() immediately removes WorkerStates before the workers have fully shut down (#6390) by calling remove_worker. However, this doesn't flush the BatchedSend or wait for confirmation that the worker has received the message. So if the worker heartbeats in the interval after the WorkerState has been removed, but before the op: close has reached the worker, the scheduler will reply that the worker is unknown (missing), and the worker will shut itself down instead of restarting. Only after it starts to shut down will it receive the op: close, nanny=True message from the scheduler, which it will effectively ignore.
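As a rough illustration of that window, here is a toy asyncio model. This is not distributed's actual code: known_workers, to_worker, scheduler_restart, and worker_heartbeat are invented for the sketch, and the queue merely stands in for the unflushed BatchedSend buffer.

import asyncio


async def main() -> None:
    known_workers = {"worker-1"}
    to_worker: asyncio.Queue = asyncio.Queue()  # stand-in for the BatchedSend buffer

    async def scheduler_restart() -> None:
        # Bug being modeled: scheduler-side state is dropped immediately...
        known_workers.discard("worker-1")
        # ...but the close message is only buffered, not yet on the wire.
        to_worker.put_nowait({"op": "close", "nanny": True})
        await asyncio.sleep(0.05)  # the flush happens "later"

    async def worker_heartbeat() -> str:
        await asyncio.sleep(0.01)  # heartbeat fires inside the race window
        if "worker-1" not in known_workers:
            return "missing"  # the worker reacts by shutting itself down
        return "OK"

    _, reply = await asyncio.gather(scheduler_restart(), worker_heartbeat())
    print("heartbeat reply:", reply)  # prints "missing": shutdown instead of restart


asyncio.run(main())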

There are a few ways to address this:

@jrbourbeau discovered this, and we initially thought it was an issue with the fact that Worker.close awaits a bunch of things before turning off its heartbeats to the scheduler. We thought that the worker was partway through the closing process, but still heartbeating. It's true that this can happen, and it probably shouldn't. However, if an extraneous heartbeat does occur while closing, and the scheduler replies with missing, then the Worker.close() call in response to that will just jump on the bandwagon of the first close call that's already running, so it won't actually cause a shutdown if the first call was doing a restart.
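To illustrate that "jump on the bandwagon" behaviour, here is a minimal sketch. It is a toy, not Worker.close itself; ToyWorker and _close_task are made up, and nanny=False here loosely means "leave the nanny running so it can restart us". A second close() call simply awaits the close that is already in flight, so it cannot turn an in-progress restart into a full shutdown.

from __future__ import annotations

import asyncio


class ToyWorker:
    def __init__(self) -> None:
        self._close_task: asyncio.Task | None = None

    async def close(self, nanny: bool = True) -> None:
        if self._close_task is None:
            # The first caller decides the parameters (e.g. restart vs. full shutdown).
            self._close_task = asyncio.create_task(self._do_close(nanny))
        # Later callers just wait for the close that is already running.
        await self._close_task

    async def _do_close(self, nanny: bool) -> None:
        await asyncio.sleep(0.01)  # pretend to tear things down
        print("closed, nanny =", nanny)


async def main() -> None:
    w = ToyWorker()
    # A restart-style close and a "missing"-triggered full close racing each other;
    # the restart-style call started first, so its parameters win.
    await asyncio.gather(w.close(nanny=False), w.close(nanny=True))


asyncio.run(main())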

Therefore, I think this is purely about the race condition on the scheduler.

cc @fjetter

fjetter (Member) commented Jun 3, 2022

Wouldn't something like the diff below work as a very short-term fix?
I think we should get this immediate problem fixed ASAP since it's messing with our CI and is failing the coiled-runtime integration tests.

(untested)

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index d4a84c184..149fa0819 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -5095,10 +5095,10 @@ class Scheduler(SchedulerState, ServerNode):

         for addr in list(self.workers):
             try:
-                # Ask the worker to close if it doesn't have a nanny,
-                # otherwise the nanny will kill it anyway
-                await self.remove_worker(
-                    address=addr, close=addr not in nannies, stimulus_id=stimulus_id
+                await self.close_worker(
+                    worker=addr,
+                    stimulus_id=stimulus_id,
+                    safe=True,
                 )
             except Exception:
                 logger.info(
@@ -5141,6 +5141,13 @@ class Scheduler(SchedulerState, ServerNode):
                         exc_info=True,
                     )

+        for addr in list(self.workers):
+            await self.remove_worker(
+                address=addr,
+                stimulus_id=stimulus_id,
+                safe=True,
+                close=False,
+            )
         self.clear_task_state()

         with suppress(AttributeError):

If #6387 is a quick fix, that also works for me.

gjoseph92 (Collaborator, Author) commented

No, because close_worker just calls remove_worker internally. It's redundant with remove_worker(close=True) (and close=False should never be used):

async def close_worker(self, worker: str, stimulus_id: str, safe: bool = False):
    """Remove a worker from the cluster

    This both removes the worker from our local state and also sends a
    signal to the worker to shut down. This works regardless of whether or
    not the worker has a nanny process restarting it
    """
    logger.info("Closing worker %s", worker)
    self.log_event(worker, {"action": "close-worker"})
    self.worker_send(worker, {"op": "close"})  # TODO redundant with `remove_worker`
    await self.remove_worker(address=worker, safe=safe, stimulus_id=stimulus_id)

We should either remove close_worker and have just 1 method, or make close_worker do what it sounds like it should do.

But the approach you're suggesting here is correct. Having restart just tell workers to close, then flush and close their stream comms, and only then remove the state, would be a fix (a rough sketch of that ordering follows). That's basically what's proposed for #6390. That will be pretty quick, but it's not the quickest possible path.
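For illustration, a toy sketch of that ordering, with the same caveat as above: this is not the real Scheduler.restart, and FakeBatchedComm with its flush_and_close method is an invented stand-in for the batched stream comms.

import asyncio


class FakeBatchedComm:
    """Toy stand-in for a BatchedSend-style buffered comm."""

    def __init__(self) -> None:
        self._buffer = []

    def send(self, msg) -> None:
        self._buffer.append(msg)

    async def flush_and_close(self) -> None:
        await asyncio.sleep(0)  # pretend to write the buffer out to the network
        print("flushed:", self._buffer)
        self._buffer.clear()


async def restart(workers) -> None:
    # 1. Enqueue the close message for every worker.
    for comm in workers.values():
        comm.send({"op": "close", "nanny": True})
    # 2. Make sure the message has actually left before forgetting the worker.
    await asyncio.gather(*(comm.flush_and_close() for comm in workers.values()))
    # 3. Only now drop scheduler-side state, so a late heartbeat can no longer
    #    be answered with "missing".
    workers.clear()


asyncio.run(restart({"worker-1": FakeBatchedComm()}))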

I think the quickest fix would be

diff --git a/distributed/worker.py b/distributed/worker.py
index a8f4b6ba..1277a78b 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1245,7 +1245,8 @@ class Worker(ServerNode):
                 logger.error(
                     f"Scheduler was unaware of this worker {self.address!r}. Shutting down."
                 )
-                await self.close()
+                # Something is out of sync; have the nanny restart us if possible.
+                await self.close(nanny=False)
                 return
 
             self.scheduler_delay = response["time"] - middle

especially since we should do it anyway for #6387.
