`Scheduler was unaware of this worker` message during worker shutdown #6961
@bnaul reported seeing frequent messages like

`Scheduler was unaware of this worker 'tcp://10.124.34.24:45585'. Shutting down`

on an adaptive cluster with 2k workers. I think this message is likely to show up somewhat often during normal worker shutdown, due to the inconsistency in how the scheduler defines closed workers: #6390. It's not actually a sign of anything being wrong in this case, but is noisy and misleading.

Here's a possible flow for a worker being closed:
1. The scheduler calls `remove_worker` on the worker and instantly removes its state tracking that worker's existence. It also queues an `{"op": "close"}` message to send to that worker (but does not send it yet).
2. Before the `{"op": "close"}` message has reached the worker, the worker sends another heartbeat (or potentially, a heartbeat was even already on the wire before the scheduler ran `remove_worker`).
3. Since the scheduler no longer tracks that worker, it replies to the heartbeat with `{"status": "missing"}`, and the worker logs the message above. The `{"status": "missing"}` response doesn't actually change anything worker-side (it still shuts down as usual).

Eliminating this message on the worker side would be easy.
On the scheduler side though, it's reflective of a larger problem around how we represent the state of a closed worker: #6390.
The problem is that the scheduler puts things in a "closed" state (by deleting `self.workers[address]`) while the worker may still be alive and connected. In the case of a clean shutdown, it would probably make sense if worker closure followed a request-response:

1. The scheduler asks the worker to close. It keeps `workers[address]` around, but the `WorkerState.status` is set to `closing`, and any tasks it's running or storing are transitioned off immediately (like currently happens).
2. Once the worker confirms it has closed, the scheduler deletes `workers[address]`.

This confirmation wouldn't have to be a message or RPC, per se—it could simply take the form of the worker closing the batched stream to the scheduler. The point is, though, that there needs to be some way of representing the gray area where we've asked a worker to close, but haven't confirmed that it's gone yet.
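As a sketch only (the names `SchedulerSketch`, `request_worker_close`, and `handle_worker_stream_closed` are invented for illustration and are not the existing scheduler API), the bookkeeping for this gray area could look roughly like:

```python
# Illustrative sketch of request-response worker closure on the scheduler side.
# This is not the real distributed.Scheduler API; the class and method names
# are assumptions made up for this example.
from enum import Enum


class WorkerStatus(Enum):
    running = "running"
    closing = "closing"  # asked to close, not yet confirmed gone


class WorkerState:
    def __init__(self, address):
        self.address = address
        self.status = WorkerStatus.running


class SchedulerSketch:
    def __init__(self):
        self.workers = {}  # address -> WorkerState

    async def request_worker_close(self, address):
        """Step 1: ask the worker to close, but keep its WorkerState around."""
        ws = self.workers[address]
        ws.status = WorkerStatus.closing
        self._reschedule_tasks(ws)  # transition its tasks off immediately
        await self._send_to_worker(address, {"op": "close"})

    def handle_worker_stream_closed(self, address):
        """Step 2: the worker closing its batched stream serves as confirmation."""
        self.workers.pop(address, None)  # only now is the worker truly gone

    def handle_heartbeat(self, address):
        ws = self.workers.get(address)
        if ws is None:
            # Genuinely unknown worker: the only case that should produce the
            # "missing" response (and the log message in question).
            return {"status": "missing"}
        # A worker in `closing` still has a WorkerState, so a late heartbeat
        # that races with the close request is no longer treated as "missing".
        return {"status": "OK"}

    def _reschedule_tasks(self, ws):
        ...  # placeholder: move the worker's queued/stored tasks elsewhere

    async def _send_to_worker(self, address, msg):
        ...  # placeholder: queue msg on the worker's batched stream
```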