`client.restart()` may cause workers to shut down instead of restarting #6494
Wouldn't something like below be a very short-term fix? (untested)

```diff
diff --git a/distributed/scheduler.py b/distributed/scheduler.py
index d4a84c184..149fa0819 100644
--- a/distributed/scheduler.py
+++ b/distributed/scheduler.py
@@ -5095,10 +5095,10 @@ class Scheduler(SchedulerState, ServerNode):
         for addr in list(self.workers):
             try:
-                # Ask the worker to close if it doesn't have a nanny,
-                # otherwise the nanny will kill it anyway
-                await self.remove_worker(
-                    address=addr, close=addr not in nannies, stimulus_id=stimulus_id
+                await self.close_worker(
+                    worker=addr,
+                    stimulus_id=stimulus_id,
+                    safe=True,
                 )
             except Exception:
                 logger.info(
@@ -5141,6 +5141,13 @@ class Scheduler(SchedulerState, ServerNode):
                     exc_info=True,
                 )
+        for addr in list(self.workers):
+            await self.remove_worker(
+                address=addr,
+                stimulus_id=stimulus_id,
+                safe=True,
+                close=False,
+            )
         self.clear_task_state()
         with suppress(AttributeError):
```

If #6387 is a quick fix that also works for me.
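To make the two hunks easier to read together, here is a condensed restatement of the ordering the patch aims for. The call signatures are copied from the diff above, but exception handling and the rest of `restart()` are omitted, so treat it as a sketch rather than the actual method:

```python
# Sketch only: condenses the two hunks above into one place. Signatures follow
# the diff; exception handling and the surrounding restart() logic are omitted.
async def restart_workers(scheduler, stimulus_id):
    # Phase 1: ask every worker to close (safe=True) while its WorkerState
    # still exists, so the close message is sent to a known worker.
    for addr in list(scheduler.workers):
        await scheduler.close_worker(worker=addr, stimulus_id=stimulus_id, safe=True)

    # Phase 2: drop scheduler-side state for anything still registered, without
    # sending another close (close=False).
    for addr in list(scheduler.workers):
        await scheduler.remove_worker(
            address=addr, stimulus_id=stimulus_id, safe=True, close=False
        )
```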
No, because of distributed/scheduler.py, lines 3432 to 3442 at 6d85a85.
We should either remove […].

But the approach you're suggesting here is correct. Having […], I think the quickest fix would be:

```diff
diff --git a/distributed/worker.py b/distributed/worker.py
index a8f4b6ba..1277a78b 100644
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -1245,7 +1245,8 @@ class Worker(ServerNode):
             logger.error(
                 f"Scheduler was unaware of this worker {self.address!r}. Shutting down."
             )
-            await self.close()
+            # Something is out of sync; have the nanny restart us if possible.
+            await self.close(nanny=False)
             return
         self.scheduler_delay = response["time"] - middle
```
especially since we should do it anyway (#6387).
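For anyone who wants to poke at this locally, a rough reproduction sketch using only the public API is below. `LocalCluster`, `Client.restart`, and `Client.scheduler_info` are real entry points, but whether a given run actually hits the race is timing-dependent, so this is an illustration rather than a reliable test:

```python
# Hedged reproduction sketch: restart a small local cluster repeatedly and check
# how many workers come back each time. A worker that shut down instead of
# restarting shows up as a lower worker count (or a restart timeout).
from distributed import Client, LocalCluster

if __name__ == "__main__":
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            for i in range(10):
                client.restart()
                n_workers = len(client.scheduler_info()["workers"])
                print(f"after restart {i}: {n_workers} workers")
```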
`Client.restart()` immediately removes `WorkerState`s before the workers have fully shut down (#6390) by calling `remove_worker`. However, this doesn't flush the `BatchedSend` or wait for confirmation that the worker has received the message. So if the worker heartbeats in this interval, after the `WorkerState` has been removed but before the `op: close` has reached the worker, the scheduler will get mad at it and the worker will shut itself down instead of restarting. Only after it starts to shut down will it receive the `op: close, nanny=True` message from the scheduler, which it will effectively ignore.

There are a few ways to address this:

- […]
- Have the worker restart (via its nanny) when the scheduler answers a heartbeat with a `missing` message, instead of shutting down. This is reasonable to do anyway. (Restart worker via Nanny on connection failure #6387)

@jrbourbeau discovered this, and we initially thought it was an issue with the fact that `Worker.close` `await`s a bunch of things before turning off its heartbeats to the scheduler. We thought that the worker was partway through the closing process, but still heartbeating. It's true that this can happen, and it probably shouldn't. However, if an extraneous heartbeat does occur while closing and the scheduler replies with `missing`, the `Worker.close()` call made in response will just jump on the bandwagon of the first close call that's already running, so it won't actually cause a shutdown if the first call was doing a restart.

Therefore, I think this is purely about the race condition on the scheduler.
cc @fjetter
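The "jump on the bandwagon" behaviour described above can be illustrated with a toy model. This is not distributed's actual `Worker`, just a minimal sketch of an idempotent close in which later `close()` calls await the close already in flight, so their `nanny` argument has no effect:

```python
import asyncio


class ToyWorker:
    """Toy model (not distributed's Worker) of an idempotent close."""

    def __init__(self):
        self._close_task = None

    async def _do_close(self, nanny):
        await asyncio.sleep(0.1)  # stand-in for the real teardown work
        print(f"worker closed (nanny={nanny})")

    def close(self, nanny=True):
        # The first call starts the close; later calls reuse the same task,
        # so their `nanny` value is ignored.
        if self._close_task is None:
            self._close_task = asyncio.create_task(self._do_close(nanny))
        return self._close_task


async def main():
    w = ToyWorker()
    first = w.close(nanny=False)  # e.g. a close that is part of a restart
    second = w.close(nanny=True)  # e.g. triggered by a `missing` heartbeat reply
    assert first is second        # the second call changed nothing
    await first


asyncio.run(main())
```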