
client.restart() may cause workers to shut down instead of restarting #6494

Closed
gjoseph92 opened this issue Jun 2, 2022 · 2 comments · Fixed by #6505 or #6504
Labels: bug (Something is broken), stability (Issue or feature related to cluster stability, e.g. deadlock)

Client.restart() immediately removes WorkerStates before the workers have fully shut down (#6390) by calling remove_worker. However, this doesn't flush the BatchedSend or wait for confirmation that the worker has received the message. So if the worker heartbeats in the interval after the WorkerState has been removed, but before the op: close has reached the worker, the scheduler will reply that the worker is unknown (missing), and the worker will shut itself down instead of restarting. Only after it starts to shut down will it receive the op: close, nanny=True message from the scheduler, which it will effectively ignore.
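As a rough illustration of that window, here is a toy asyncio model. This is not distributed's actual code: known_workers, to_worker, scheduler_restart, and worker_heartbeat are invented for the sketch, and the queue merely stands in for the unflushed BatchedSend buffer.

import asyncio


async def main() -> None:
    known_workers = {"worker-1"}
    to_worker: asyncio.Queue = asyncio.Queue()  # stand-in for the BatchedSend buffer

    async def scheduler_restart() -> None:
        # Bug being modeled: scheduler-side state is dropped immediately...
        known_workers.discard("worker-1")
        # ...but the close message is only buffered, not yet on the wire.
        to_worker.put_nowait({"op": "close", "nanny": True})
        await asyncio.sleep(0.05)  # the flush happens "later"

    async def worker_heartbeat() -> str:
        await asyncio.sleep(0.01)  # heartbeat fires inside the race window
        if "worker-1" not in known_workers:
            return "missing"  # the worker reacts by shutting itself down
        return "OK"

    _, reply = await asyncio.gather(scheduler_restart(), worker_heartbeat())
    print("heartbeat reply:", reply)  # prints "missing": shutdown instead of restart


asyncio.run(main())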

There are a few ways to address this:

@jrbourbeau discovered this, and we initially thought it was an issue with the fact that Worker.close awaits a bunch of things before turning off its heartbeats to the scheduler. We thought that the worker was partway through the closing process, but still heartbeating. It's true that this can happen, and it probably shouldn't. However, if an extraneous heartbeat does occur while closing, and the scheduler replies with missing, then the Worker.close() call in response to that will just jump on the bandwagon of the first close call that's already running, so it won't actually cause a shutdown if the first call was doing a restart.
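To illustrate that "jump on the bandwagon" behaviour, here is a minimal sketch. It is a toy, not Worker.close itself; ToyWorker and _close_task are made up, and nanny=False here loosely means "leave the nanny running so it can restart us". A second close() call simply awaits the close that is already in flight, so it cannot turn an in-progress restart into a full shutdown.

from __future__ import annotations

import asyncio


class ToyWorker:
    def __init__(self) -> None:
        self._close_task: asyncio.Task | None = None

    async def close(self, nanny: bool = True) -> None:
        if self._close_task is None:
            # The first caller decides the parameters (e.g. restart vs. full shutdown).
            self._close_task = asyncio.create_task(self._do_close(nanny))
        # Later callers just wait for the close that is already running.
        await self._close_task

    async def _do_close(self, nanny: bool) -> None:
        await asyncio.sleep(0.01)  # pretend to tear things down
        print("closed, nanny =", nanny)


async def main() -> None:
    w = ToyWorker()
    # A restart-style close and a "missing"-triggered full close racing each other;
    # the restart-style call started first, so its parameters win.
    await asyncio.gather(w.close(nanny=False), w.close(nanny=True))


asyncio.run(main())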

Therefore, I think this is purely about the race condition on the scheduler.

cc @fjetter

fjetter (Member) commented Jun 3, 2022

Wouldn't something like the diff below work as a very short-term fix?
I think we should get this immediate problem fixed ASAP since it's messing with our CI and is failing the coiled-runtime integration tests.

(untested)

diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index d4a84c184..149fa0819 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -5095,10 +5095,10 @@ class Scheduler(SchedulerState, ServerNode):

         for addr in list(self.workers):
             try:
-                # Ask the worker to close if it doesn't have a nanny,
-                # otherwise the nanny will kill it anyway
-                await self.remove_worker(
-                    address=addr, close=addr not in nannies, stimulus_id=stimulus_id
+                await self.close_worker(
+                    worker=addr,
+                    stimulus_id=stimulus_id,
+                    safe=True,
                 )
             except Exception:
                 logger.info(
@@ -5141,6 +5141,13 @@ class Scheduler(SchedulerState, ServerNode):
                         exc_info=True,
                     )

+        for addr in list(self.workers):
+            await self.remove_worker(
+                address=addr,
+                stimulus_id=stimulus_id,
+                safe=True,
+                close=False,
+            )
         self.clear_task_state()

         with suppress(AttributeError):

If #6387 is a quick fix, that also works for me.

gjoseph92 (Collaborator, Author) commented

No, because close_worker just calls remove_worker internally. It's redundant with remove_worker(close=True) (and close=False should never be used):

async def close_worker(self, worker: str, stimulus_id: str, safe: bool = False):
    """Remove a worker from the cluster

    This both removes the worker from our local state and also sends a
    signal to the worker to shut down. This works regardless of whether or
    not the worker has a nanny process restarting it
    """
    logger.info("Closing worker %s", worker)
    self.log_event(worker, {"action": "close-worker"})
    self.worker_send(worker, {"op": "close"})  # TODO redundant with `remove_worker`
    await self.remove_worker(address=worker, safe=safe, stimulus_id=stimulus_id)

We should either remove close_worker and have just 1 method, or make close_worker do what it sounds like it should do.

But the approach you're suggesting here is correct. Having restart just tell workers to close, then flush and close their stream comms, and only then remove the state, would be a fix (a rough sketch of that ordering follows). That's basically what's proposed for #6390. That will be pretty quick, but it's not the quickest possible path.
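For illustration, a toy sketch of that ordering, with the same caveat as above: this is not the real Scheduler.restart, and FakeBatchedComm with its flush_and_close method is an invented stand-in for the batched stream comms.

import asyncio


class FakeBatchedComm:
    """Toy stand-in for a BatchedSend-style buffered comm."""

    def __init__(self) -> None:
        self._buffer = []

    def send(self, msg) -> None:
        self._buffer.append(msg)

    async def flush_and_close(self) -> None:
        await asyncio.sleep(0)  # pretend to write the buffer out to the network
        print("flushed:", self._buffer)
        self._buffer.clear()


async def restart(workers) -> None:
    # 1. Enqueue the close message for every worker.
    for comm in workers.values():
        comm.send({"op": "close", "nanny": True})
    # 2. Make sure the message has actually left before forgetting the worker.
    await asyncio.gather(*(comm.flush_and_close() for comm in workers.values()))
    # 3. Only now drop scheduler-side state, so a late heartbeat can no longer
    #    be answered with "missing".
    workers.clear()


asyncio.run(restart({"worker-1": FakeBatchedComm()}))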

I think the quickest fix would be

diff --git a/distributed/worker.py b/distributed/worker.py
index a8f4b6ba..1277a78b 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1245,7 +1245,8 @@ class Worker(ServerNode):
                 logger.error(
                     f"Scheduler was unaware of this worker {self.address!r}. Shutting down."
                 )
-                await self.close()
+                # Something is out of sync; have the nanny restart us if possible.
+                await self.close(nanny=False)
                 return
 
             self.scheduler_delay = response["time"] - middle

especially since we should do it anyway for #6387.
