Deadlock when emerging from closing_gracefully #6867
Comments
Let's take a step back and discuss the issue. How is the scheduler reverting something like this? I don't think that "reverting" a […]
We encoded the assumption that the server lifecycle is unidirectional in a couple of places. Neither `start` nor `close` behaves sanely if we allow for restarts.
I agree with @fjetter. I'm surprised that a worker in […]
If anything, I think we should review the […]
The state

Worker side, the status is a variant of:

distributed/distributed/worker.py, lines 929 to 946 in f43bc47

The two different statuses are necessary:

distributed/distributed/scheduler.py, lines 6219 to 6230 in f43bc47

This code path is thoroughly covered by unit tests.
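For readers without those permalinks open, here is a rough, illustrative sketch of the status values under discussion; the real enum lives in `distributed.core.Status` and has more members than shown here.

```python
from enum import Enum


class Status(Enum):
    # Illustrative subset of the worker status values discussed in this thread;
    # the actual distributed.core.Status enum has more members.
    running = "running"
    paused = "paused"
    closing_gracefully = "closing_gracefully"  # draining: scheduler is retiring the worker
    closing = "closing"                        # hard teardown has started
    closed = "closed"                          # teardown finished


# The distinction matters because closing_gracefully is the only "closing"
# status in which the worker is still alive, still holds task results in
# memory, and (per this issue) may even revert back to running.
DRAINING = {Status.closing_gracefully}
TERMINAL = {Status.closing, Status.closed}
```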
I think this points out a lot of issues, and there are a number of ways we could go here.
I know […]. But I think the fix first has to happen on the scheduler side, to prevent new tasks from being sent to draining workers. Otherwise, even if we handle the task well on the worker side (don't leave it in […])
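A minimal sketch of the scheduler-side guard suggested here; the helper name (`pick_worker_for`) and the `WorkerState` attributes used are assumptions for illustration, not the actual scheduler internals.

```python
from distributed.core import Status


def pick_worker_for(ts, workers):
    """Pick a worker for a ready task, never choosing a draining worker.

    Hypothetical helper: the point is only that workers whose status is not
    ``running`` (e.g. ``closing_gracefully``) are excluded up front, so no new
    tasks are sent to a worker that is being retired.
    """
    candidates = [ws for ws in workers if ws.status == Status.running]
    if not candidates:
        # No running workers: keep the task queued rather than sending it to a
        # draining worker that may close (or revert) underneath it.
        return None
    # Crude tie-break; the real scheduler weighs occupancy, restrictions, etc.
    return min(candidates, key=lambda ws: len(ws.processing))
```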
Since we do not want this situation to be too simple, there is yet another pseudo-retirement code path:

distributed/distributed/worker.py, lines 1585 to 1610 in f43bc47
This will close the worker for good. If that is the only broken code path, that's good news.

Taking a step back again: what clearly happens here is that people were looking at different areas of the code, we have different versions of the code around, and different versions in our memory. For instance, just browsing the code I can find...
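For context, a simplified paraphrase of that pseudo-retirement flow follows; the function shape and call names are assumptions based on the discussion, not the actual source at the permalink.

```python
from distributed.core import Status


async def close_gracefully(worker):
    # Simplified paraphrase, not the real Worker.close_gracefully:
    # 1. Flip the server status so the worker is treated as draining.
    worker.status = Status.closing_gracefully
    # 2. Ask the scheduler to replicate this worker's unique in-memory results
    #    onto its peers before it goes away.
    await worker.scheduler.retire_workers(workers=[worker.address], remove=False)
    # 3. Then close for good -- on this path there is no way back to running.
    await worker.close()
```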
I think the problem here is that we should clearly distinguish a server status from a state-machine (task) status.

Server statuses […]

State machine / task statuses […]

These two sets are entangled, but they have very different state diagrams: the server side is acyclic while the latter isn't. That problem already caused a lot of complexity in writing `Server.close` and was causing problems there; most notably, we had zombie workers in the past because these states were intertwined. I believe the very first step here is to break them apart.
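A minimal sketch of what "breaking them apart" could look like: one small, forward-only (acyclic) enum for the server lifecycle and a separate enum for the task state machine, which legitimately contains cycles. All names and transitions below are illustrative, not the actual distributed definitions.

```python
from enum import Enum, auto


class ServerStatus(Enum):
    init = auto()
    running = auto()
    closing_gracefully = auto()
    closing = auto()
    closed = auto()


# Forward-only transitions: once a status is left it is never re-entered,
# which is the "unidirectional lifecycle" assumption mentioned above.
SERVER_TRANSITIONS = {
    ServerStatus.init: {ServerStatus.running},
    ServerStatus.running: {ServerStatus.closing_gracefully, ServerStatus.closing},
    ServerStatus.closing_gracefully: {ServerStatus.closing},
    ServerStatus.closing: {ServerStatus.closed},
    ServerStatus.closed: set(),
}


class TaskState(Enum):
    released = auto()
    waiting = auto()
    ready = auto()
    executing = auto()
    memory = auto()
    error = auto()


# Task transitions are cyclic: a result in memory can be released and the task
# recomputed later, so the task diagram loops back while the server diagram
# never does.
```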
Agreed.
It is the other way around: first the scheduler asks the worker to run a task, and only then does it ask the worker to shut down. This happens when either the client or the adaptive scaler requests it, and the scheduler can predict neither.
What you describe is already happening, 3 lines above the snippet you posted:

distributed/distributed/scheduler.py, lines 6162 to 6166 in e38d3a9
This, however, means that […]
This would open a huge can of worms if the scheduler reverts the decision. Related: #3761
This line at the top of `Worker.execute` is wrong:

distributed/distributed/worker.py, lines 2144 to 2145 in e1b9e20

The problem is that `closing_gracefully` is reversible. This normally doesn't happen. However, there are legitimate use cases where `Scheduler.retire_workers` can give up and revert a worker from `closing_gracefully` back to running - namely, if there are no longer any peer workers that can accept its unique in-memory tasks.

This can cause a rather extreme race condition where a `{op: compute-task}` command is followed, within the same batched-send packet, by `{op: worker-status-change, status: closing_gracefully}`. This causes the `Worker.execute` asyncio task to be spawned and, as soon as it reaches its turn in the event loop, return None, leaving the task in `running` state forever. This is not a problem for `closing` and `closed`, as we're irreversibly tearing down everything anyways, but here the task is never recovered when the worker later receives `{op: worker-status-change, status: running}`.

The fix for this issue is trivial (just remove `closing_gracefully` from the line above); a deterministic reproducer is probably going to be very ugly.

This issue interacts with #3761.
FYI @fjetter
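To make the proposed fix concrete, here is a rough sketch of the kind of guard being discussed at the top of `Worker.execute` and how it would change; this is a simplified illustration, not the actual code at the permalink above.

```python
from distributed.core import Status


# Behaviour described in this issue: execute() bails out for closing_gracefully
# too, so a task dispatched just before the status change is silently dropped
# and stays stuck if the worker later reverts to running.
async def execute(self, key):
    if self.status in (Status.closing, Status.closed, Status.closing_gracefully):
        return None
    ...  # run the task


# Proposed fix: only bail out for the irreversible statuses.
async def execute_fixed(self, key):
    if self.status in (Status.closing, Status.closed):
        return None
    ...  # run the task
```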