Wait for workers to return in Client.restart
#6714
Conversation
Non-nanny workers no longer go gentle into that good night. This breaks `test_restart_some_nannies_some_not`, since it re-orders when plugins run and causes a TimeoutError there; that test can be simplified a lot. Also, because the client doesn't call restart on the scheduler as an RPC, but rather through a strange call-response pattern, errors from the scheduler aren't resurfaced to the client. If `restart` fails quickly on the scheduler, the client will still hang until its own internal timeout passes as well (2x the defined timeout). This is all a bit silly and should just switch to an RPC.
This lets us propagate errors, and is simpler anyway.
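To illustrate why the RPC shape matters for error propagation, here is a toy asyncio sketch (not distributed's actual comm layer; all names below are made up): with a fire-and-wait pattern the caller only notices a scheduler-side failure once its own timeout expires, while an awaited RPC re-raises the exception immediately.

```python
import asyncio

async def scheduler_restart():
    # Pretend the scheduler-side restart fails immediately.
    raise RuntimeError("restart failed on the scheduler")

async def restart_call_response(timeout=2):
    # Old-style pattern: send a message, then wait for a "restarted" event
    # that never arrives, because the error isn't sent back to us.
    restarted = asyncio.Event()

    async def run():
        try:
            await scheduler_restart()
            restarted.set()
        except Exception:
            pass  # error swallowed; the client never hears about it

    asyncio.create_task(run())
    await asyncio.wait_for(restarted.wait(), timeout)  # -> TimeoutError

async def restart_rpc():
    # RPC-style: awaiting the call propagates the scheduler's exception.
    await scheduler_restart()  # -> RuntimeError surfaces here

async def main():
    try:
        await restart_call_response()
    except asyncio.TimeoutError:
        print("call-response: hung until timeout, real error lost")
    try:
        await restart_rpc()
    except RuntimeError as e:
        print(f"rpc: error propagated immediately: {e}")

asyncio.run(main())
```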
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0, 15 suites ±0, 6h 46m 52s ⏱️ +14m 11s

For more details on these failures, see this check. Results for commit 9ead9c4. ± Comparison against base commit 04421e4.

♻️ This comment has been updated with latest results.
If multiple clients are connected, the ones that didn't call `restart` still need to release their keys.

Drive-by: refcounts were not being reset on clients that didn't call `restart`. So after restart, if a client reused a key that was referenced before restart, it would never be releasable.
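A minimal sketch of that refcount issue (hypothetical class and method names, not the real client internals): if only the client that initiated the restart resets its counts, another client that referenced a key before the restart keeps a stale count, and a reused key can never drop back to zero.

```python
from collections import defaultdict

class ToyClient:
    """Hypothetical client-side key refcounting, for illustration only."""

    def __init__(self) -> None:
        self.refcount: defaultdict[str, int] = defaultdict(int)

    def use_key(self, key: str) -> None:
        self.refcount[key] += 1

    def release_key(self, key: str) -> None:
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            print(f"{key} released")

    def on_restart(self) -> None:
        # The fix: every connected client resets its refcounts on restart,
        # not just the client that initiated it.
        self.refcount.clear()

other_client = ToyClient()
other_client.use_key("x")        # referenced before the restart

# Without on_restart(), the stale count of 1 lingers:
# reuse after restart -> 2, one release -> 1, never reaches 0.
other_client.on_restart()        # with the fix, the key starts fresh
other_client.use_key("x")
other_client.release_key("x")    # -> "x released"
```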
Update here: we're seeing something very consistently in CI. My guess is that this isn't a problem with this PR or #6637, but just uncovering an existing issue around the implementation of worker restart on nannies. We're seeing something like:
With the excessive logging statements I added in 5a388d1, that shows a couple interesting things:
Restarting is apparently very slow in CI. See if this actually fixes the failing tests.
I generally favor throwing an error when something goes wrong instead of silently swallowing it and reporting back as if everything were normal. I am now wondering whether we should handle the failure case more explicitly. If we fail to restart workers/nannies (and consequently to reset the scheduler state), it feels like a fatal error. IIUC, at that point, the state of the scheduler and the entire cluster is pretty much FUBAR. Instead of raising an unhandled TimeoutError, should we try to gracefully shut down what's left of the cluster and scheduler and return a more fatalistic error message to the user, like "Failed to restart the cluster after {timeout}s. This left the cluster in a non-recoverable state and it is shutting down."? Note that this would be a dramatically breaking change.
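If we went that route, the handling might look roughly like this (a hypothetical sketch of the proposed behavior, not existing code; `ClusterRestartError`, `restart_or_shutdown`, and the `scheduler` interface are made up):

```python
import asyncio

class ClusterRestartError(RuntimeError):
    """Hypothetical fatal error for a failed cluster restart."""

async def restart_or_shutdown(scheduler, timeout: float) -> None:
    # Sketch of the proposed "fail loudly and shut down" behavior.
    try:
        await asyncio.wait_for(scheduler.restart(), timeout)
    except asyncio.TimeoutError:
        # The cluster state is assumed unrecoverable at this point,
        # so tear everything down rather than limping along.
        await scheduler.close()
        raise ClusterRestartError(
            f"Failed to restart the cluster after {timeout}s. This left the "
            "cluster in a non-recoverable state and it is shutting down."
        ) from None
```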
I wouldn't go this far. We're not failing to reset scheduler state, though yes, if something goes wrong in here, that indicates a big problem. (This is the entirety of the "reset scheduler state" code: `distributed/scheduler.py`, lines 5133 to 5141 at e18ea37.)
I don't love raising this TimeoutError when not all workers come back. If you have 1000 workers in your cluster and 1 fails to restart properly, you probably don't care. But if 900 don't come back, you do. Both would show up as the same TimeoutError, though.
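One way to make that distinction visible without changing the exception type would be to put the counts in the message. A rough sketch with made-up names (the real code may differ):

```python
def check_workers_returned(
    expected: set[str], returned: set[str], timeout: float
) -> None:
    # Hypothetical helper: raise a single TimeoutError, but say how bad it was,
    # so "1 of 1000 missing" reads very differently from "900 of 1000 missing".
    missing = expected - returned
    if missing:
        raise TimeoutError(
            f"Waited {timeout}s for workers to return after restart, but "
            f"{len(missing)} out of {len(expected)} did not come back: "
            f"{sorted(missing)[:5]}..."
        )
```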
This is passing now besides:
I'm fine with having
Agreed that there is some issue here. I think there are two possible solutions here. For me, this means three things:
This reverts commit 868c0c2.
Good catch; fixed.
Also fixed. I think more broadly, what we'll say is that...
I implemented this, but I've now decided against it. I think that leaving non-restarted nannies alone makes for a simpler contract. With what I have above, the...
Going to add a flag to enable/disable waiting for workers, then I think we'll be good to go.
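Assuming the flag ends up spelled `wait_for_workers` (the exact name and defaults here are guesses), usage might look like:

```python
from distributed import Client

client = Client("tcp://scheduler:8786")

# Wait (up to the timeout) for all workers to come back before returning;
# raise if they don't.
client.restart(timeout=120, wait_for_workers=True)

# Fire off the restart and return as soon as the scheduler state is reset,
# without waiting for workers to reconnect.
client.restart(wait_for_workers=False)
```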
For clarity: does that ensure that we will have no nannies continuing with business as usual even though they had been asked to restart? That is what I would like for us to achieve here. If this is not the case, maybe add a follow-up ticket to change the semantics of restarting a nanny to something that ensures we will only keep talking to nannies that did in fact restart, but that also gives them enough time to do so. For example, should sending the restart request to a nanny fail for some reason, I do not want to keep that one around. At the same time, you have a point that the restart window might be too short for the nannies to act.
Separating whether a worker took too long to shut down vs. start up allows us to guarantee that all old workers are removed
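A toy sketch of that two-phase split (made-up names, not the actual implementation): time-box the shutdown and the startup separately, so that once the shutdown phase completes we know every old worker is gone, even if the replacements are slow to arrive.

```python
import asyncio

async def restart_workers(nannies, timeout: float) -> None:
    # Phase 1: make sure every old worker process is gone.
    # If this times out, old workers may still be around -> hard error.
    done, pending = await asyncio.wait(
        [asyncio.create_task(n.kill()) for n in nannies], timeout=timeout
    )
    if pending:
        raise TimeoutError(f"{len(pending)} workers did not shut down in {timeout}s")

    # Phase 2: wait for the replacements. Old workers are already guaranteed
    # to be removed, so a slow startup is a softer failure than a slow shutdown.
    done, pending = await asyncio.wait(
        [asyncio.create_task(n.instantiate()) for n in nannies], timeout=timeout
    )
    if pending:
        raise TimeoutError(f"{len(pending)} workers did not start up in {timeout}s")
```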
Updated to call
Done
I really like this approach to Client.restart. The contract feels clean and it eliminates another coroutine race on the nanny.
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
10ms is way too fast
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
@hendrikmakait I think this is ready, assuming CI passes?
Nice work, thanks!
Taking over #6637 since Florian and Hendrik are out.
This goes a step further and resolves the inconsistent treatment of nanny vs non-nanny workers. Now we wait for all workers to come back, even if they don't have nannies. This may not actually be a good idea; at the least, it's a breaking change.
It also refactors the calling of Scheduler.restart into an RPC, instead of a bulk-comms call-response (an odd pattern). Closes #6637
`pre-commit run --all-files`