Wait for workers to return in Client.restart #6714

Merged
30 commits merged on Jul 19, 2022

Conversation

gjoseph92
Collaborator

Taking over #6637 since Florian and Hendrik are out.

This goes a step further and resolves the inconsistent treatment of nanny vs non-nanny workers. Now we wait for all workers to come back, even if they don't have nannies. This may not actually be a good idea; at the least, it's a breaking change.

It also refactors the call to Scheduler.restart into a proper RPC, rather than the previous bulk-comms call-response (an odd pattern).
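For illustration, here's roughly what the client side of an RPC-style call looks like (a sketch only, assuming the usual `self.scheduler` RPC handle; not necessarily the exact code in this PR):

    # Sketch: one awaited RPC call on the scheduler, so any exception raised
    # inside Scheduler.restart propagates straight back to the client.
    async def _restart(self, timeout=None):
        await self.scheduler.restart(timeout=timeout)
        return self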

Closes #6637

  • Tests added / passed
  • Passes pre-commit run --all-files

fjetter and others added 4 commits July 1, 2022 18:36
Non-nanny workers no longer go gentle into that good night.

This breaks `test_restart_some_nannies_some_not` since it re-orders when plugins run, and causes a TimeoutError there. That test can be simplified a lot.

Also, because the client doesn't call restart on the scheduler as an RPC, but rather through a strange call-response pattern, errors from the scheduler aren't resurfaced to the client. If `restart` fails quickly on the scheduler, the client will hang until its own internal timeout passes as well (2x the defined timeout). This is all a bit silly and should just switch to an RPC.
This lets us propagate errors, and is simpler anyway.
@github-actions
Contributor

github-actions bot commented Jul 11, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   6h 46m 52s ⏱️ +14m 11s
2 980 tests +3   2 890 ✔️ +4   87 💤 ±0   3 failed −1
22 096 runs +24   21 057 ✔️ +25   1 036 💤 ±0   3 failed −1

For more details on these failures, see this check.

Results for commit 9ead9c4. ± Comparison against base commit 04421e4.

♻️ This comment has been updated with latest results.

If multiple clients are connected, the ones that didn't call `restart` still need to release their keys.

driveby: refcounts were not being reset on clients that didn't call `restart`. So after restart, if a client reused a key that was referenced before restart, it would never be releasable.
@gjoseph92
Collaborator Author

Update here: in CI, we're very consistently seeing test_restart_waits_for_new_workers and other tests fail with TimeoutError: 5 worker(s) did not restart within 10s. Typically none (or only a few) of the workers restart successfully, across all CI runs.

My guess is that this isn't a problem with this PR or #6637, but just uncovering an existing issue around the implementation of worker restart on nannies.

We're seeing something like

2022-07-14 19:23:24,949 - distributed.nanny - INFO - tcp://127.0.0.1:54494 - Starting worker process
2022-07-14 19:23:24,949 - distributed.nanny - INFO - tcp://127.0.0.1:54494 - Start called when already starting
2022-07-14 19:23:24,976 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:25,131 - distributed.nanny - INFO - tcp://127.0.0.1:54497 - Starting worker process
2022-07-14 19:23:25,517 - distributed.nanny - INFO - tcp://127.0.0.1:54513 - Starting worker process
2022-07-14 19:23:25,521 - distributed.nanny - INFO - tcp://127.0.0.1:54497 - Start called when already starting
2022-07-14 19:23:25,565 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:25,623 - distributed.nanny - INFO - tcp://127.0.0.1:54513 - Start called when already starting
2022-07-14 19:23:25,664 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:26,928 - distributed.nanny - INFO - tcp://127.0.0.1:54506 - Starting worker process
2022-07-14 19:23:26,963 - distributed.nanny - INFO - tcp://127.0.0.1:54494 - Worker process started
2022-07-14 19:23:26,963 - distributed.nanny - INFO - tcp://127.0.0.1:54497 - Worker process started
2022-07-14 19:23:28,043 - distributed.nanny - INFO - tcp://127.0.0.1:54483 - Starting worker process
2022-07-14 19:23:28,043 - distributed.nanny - INFO - tcp://127.0.0.1:54506 - Start called when already starting
2022-07-14 19:23:28,097 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:28,098 - distributed.nanny - INFO - tcp://127.0.0.1:54483 - Start called when already starting
2022-07-14 19:23:28,277 - distributed.nanny - INFO - tcp://127.0.0.1:54506 - Worker process started
2022-07-14 19:23:28,584 - distributed.nanny - INFO - tcp://127.0.0.1:54513 - Worker process started
2022-07-14 19:23:30,430 - distributed.nanny - INFO - tcp://127.0.0.1:54483 - Worker process started
2022-07-14 19:23:32,132 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,197 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,198 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,221 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,222 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,540 - distributed.core - ERROR - 5 worker(s) did not restart within 10s
Traceback (most recent call last):
...
asyncio.exceptions.TimeoutError: 5 worker(s) did not restart within 10s
2022-07-14 19:23:32,928 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:54538
2022-07-14 19:23:32,928 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:54538
2022-07-14 19:23:32,929 - distributed.worker - INFO -          dashboard at:            127.0.0.1:54539
2022-07-14 19:23:32,929 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:54462
2022-07-14 19:23:32,929 - distributed.worker - INFO - -------------------------------------------------

With the excessive logging statements I added in 5a388d1, that shows a couple interesting things:

  1. The workers are shutting down just fine. The ThreadPoolExecutors aren't blocked or anything. They're hanging in startup.

  2. The fact that we see both Starting worker process and Start called when already starting is suspicious. Turns out that's happening because of a race condition between Nanny.restart and the nanny's normal logic that restarts the worker process whenever it dies. The WARNING - Restarting worker makes this clear. We're calling instantiate at the same time here

    if self.status not in (
        Status.closing,
        Status.closed,
        Status.closing_gracefully,
    ):
        logger.warning("Restarting worker")
        await self.instantiate()

    and here

    async def restart(self, timeout=30):
        async def _():
            if self.process is not None:
                await self.kill()
            await self.instantiate()

    As concerning as an instantiate race sounds, it actually looks like it'll be fine, thanks to this and this. But it still might be a little silly.

  3. We're stuck in

    msg = await self._wait_until_connected(uid)
    logger.info(f"{self.worker_address} - Connected to worker")
    since we never see Connected to worker. This is just waiting for await worker, basically.

  4. Just a couple hundred ms after our timeout, the first worker starts successfully. So maybe there's actually no problem, and it's just that CI machines are really slow and we need a longer timeout? (Which might also be why it fails most commonly on Windows.) Still, it's surprising it would fail so consistently. (A sketch of the test-side fix follows this list.)
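If slow CI machines really are the culprit, the test-side fix is just more headroom; a sketch, assuming a gen_cluster-style test where `c` is the client and `s` is the scheduler, and that the restart timeout stays overridable per call (the value here is arbitrary):

    # Sketch: give slow CI machines more time before declaring failure.
    await c.restart(timeout=30)
    assert len(s.workers) == 5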

Restarting is apparently very slow in CI. See if this actually fixes the failing tests.
Member

@hendrikmakait hendrikmakait left a comment


I generally favor throwing an error when something goes wrong instead of silently swallowing it and reporting back as if everything were normal. I am now wondering whether we should handle the failure case more explicitly. If we fail to restart workers/nannies (and consequently to reset the scheduler state), it feels like a fatal error. IIUC, at that point, the state of the scheduler and the entire cluster is pretty much FUBAR. Instead of raising an unhandled TimeoutError, should we try to gracefully shut down what's left of the cluster and scheduler and return a more fatalistic error message to the user, like "Failed to restart the cluster after {timeout}s. This left the cluster in a non-recoverable state and it is shutting down."? Note that this would be a dramatically breaking change.

Inline review comment on distributed/scheduler.py (outdated, resolved)
@gjoseph92
Collaborator Author

If we fail to restart workers/nannies (and consequently to reset the scheduler state), it feels like a fatal error. IIUC, at that point, the state of the scheduler and the entire cluster is pretty much FUBAR

I wouldn't go this far. We're not failing to reset scheduler state—though yes, if something goes wrong in here, that indicates a big problem (this is the entirety of the "reset scheduler state" code):

logger.info("Releasing all requested keys")
for cs in self.clients.values():
self.client_releases_keys(
keys=[ts.key for ts in cs.wants_what],
client=cs.client_key,
stimulus_id=stimulus_id,
)
self.clear_task_state()

I guess similarly, remove_worker should never fail.

But calling the Nanny.restart() RPC, or waiting for N workers to return, could fail. That just indicates something was off with those particular workers. The scheduler state (and other workers) are still fine. We should leave it up to users whether to catch and ignore that error, or shut down the cluster themselves. But the cluster, overall, is fine, so I think preemptively killing it is overkill.

I don't love raising this TimeoutError when not all workers come back. If you have 1000 workers in your cluster and 1 fails to restart properly, you probably don't care. But if 900 don't come back, you do. Both would show up as the same TimeoutError, though. Exposing a wait_for_workers= argument might make the most sense, so you can control this behavior. Our default would probably still be to wait for 100% to come back, though.
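To make that concrete, here's the kind of workaround a caller is left with today (a sketch only; the 90% threshold, the `expected` placeholder, and catching asyncio.TimeoutError are my own choices, not part of the PR):

    import asyncio

    # Sketch: disambiguate "one straggler" from "most of the cluster is gone"
    # after a failed restart, since both surface as the same TimeoutError.
    expected = 1000  # placeholder for the pre-restart worker count
    try:
        client.restart()
    except asyncio.TimeoutError:
        n_back = len(client.scheduler_info()["workers"])
        if n_back < 0.9 * expected:  # arbitrary threshold
            raise  # most of the cluster is missing; treat as fatal
        # otherwise: a straggler or two didn't come back; carry on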

@gjoseph92
Collaborator Author

This is passing now besides:

@hendrikmakait
Member

hendrikmakait commented Jul 18, 2022

We should leave it up to users whether to catch and ignore that error, or shut down the cluster themselves. But the cluster, overall, is fine, so I think preemptively killing it is overkill.

I'm fine with having restart() work on a best-effort basis and leave it up to the deployment system as a last resort to correct the deployment state (i.e., spinning up missing workers).

I don't love raising this TimeoutError when not all workers come back. If you have 1000 workers in your cluster, and 1 fails to restart properly, you probably don't care. But if 900 don't come back, you do.

Agreed that there is some issue here. I think two possible solutions would be to add the wait_for_workers parameter, or to return the actual number of workers that managed to restart in time along with the TimeoutError (possibly subclassing to a custom RestartTimeoutError that carries that information). This would allow the client to handle different scenarios depending on whether 0.1%, 10%, or 90% of workers are missing.
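A minimal sketch of what such a subclass could carry (this was not implemented in the PR; the class and attribute names are made up for illustration):

    import asyncio

    class RestartTimeoutError(asyncio.TimeoutError):
        """Hypothetical: a restart timeout that reports how many workers returned."""

        def __init__(self, msg: str, n_restarted: int, n_expected: int):
            super().__init__(msg)
            self.n_restarted = n_restarted
            self.n_expected = n_expected

    # A caller could then branch on e.n_restarted / e.n_expected instead of
    # treating every timeout the same way.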

For restart() to work on a best-effort basis that attempts a restart and reports back how successful it was, we should structure it such that the state of the scheduler after restart() is a consistent state that is as close as possible to the desired one.

For me, this means three things:

  1. Scheduler state has been successfully reset. We need to ensure that we always run https://github.com/gjoseph92/distributed/blob/e18ea3759ae1b6ecdfb3c734d48a6eed2a94819e/distributed/scheduler.py#L5184-L5189. Currently, this code only gets executed if we do not time out while restarting the nannies.
  2. We ensure that non-nanny workers are properly removed. Effectively this means that https://github.com/gjoseph92/distributed/blob/e18ea3759ae1b6ecdfb3c734d48a6eed2a94819e/distributed/scheduler.py#L5158-L5164 should not be cancelled. While remove_worker() should not fail right now, we might time out if there are too many workers that we want to remove and the scheduler struggles. This would then cancel the remaining removal work. Future code changes might exacerbate the problem. Shielding the coroutine could be an easy solution that allows early reporting-back (see the sketch after this list).
  3. As discussed in Ensure client.restart waits for workers to leave and come back #6637 (comment), we should remove all nannies that failed to restart successfully before timing out. There is probably something off with them, and we want to eliminate them. IIUC, these should eventually spin down and be replaced by the deployment system.
  4. [BONUS]: If possible and readable, schedule the coroutines for restarting nannies and removing workers concurrently instead of sequentially, so that the timeout parameter more closely resembles the time restart() will take.
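Regarding point 2, a minimal sketch of what shielding the removal work could look like (the function names here are placeholders, not the PR's actual code):

    import asyncio

    async def remove_stale_workers() -> None:
        """Placeholder for the loop calling Scheduler.remove_worker() per worker."""
        ...

    async def restart_with_early_report(timeout: float) -> None:
        # Shield the removal task so hitting the timeout reports back early
        # without cancelling the removal, which keeps running to completion.
        removal = asyncio.ensure_future(remove_stale_workers())
        try:
            await asyncio.wait_for(asyncio.shield(removal), timeout)
        except asyncio.TimeoutError:
            pass  # tell the caller we timed out; removal continues in the background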

@gjoseph92
Collaborator Author

Currently, this code only gets executed if we do not time out while restarting the nannies

Good catch; fixed.

we might time out if there are too many workers that we want to remove and the scheduler struggles

Also fixed. I think more broadly, what we'll say is that "restart tells all workers to restart (or non-Nanny workers to shut down), then waits timeout seconds for them to come back". So the timeout doesn't apply to the shutdown process, only to the waiting for workers to return. This simplifies things a bit.

we should remove all nannies that failed to restart successfully before timing out

I implemented this, but I've now decided against it. I think that leaving non-restarted nannies alone makes for a simpler contract. With what I have above, the timeout is really just a convenience saving you a Client.wait_for_workers. Just because a nanny didn't restart within a user-specified window isn't necessarily an indication that anything's wrong with it. You could pass timeout=0.1 and none of your nannies would be able to restart fast enough, but that doesn't mean they're broken.
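Put differently, under this contract the two usages below end up roughly equivalent (a sketch; wait_for_workers= is the flag proposed earlier in the thread, and the final spelling may differ):

    # Option 1: let restart() wait, bounded by `timeout`.
    client.restart(timeout=20)

    # Option 2: skip the built-in wait and do it yourself.
    client.restart(wait_for_workers=False)
    client.wait_for_workers(n_workers=5, timeout=20)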

@gjoseph92
Collaborator Author

Going to add a flag to enable/disable waiting for workers, then I think we'll be good to go.

@hendrikmakait
Member

we should remove all nannies that failed to restart successfully before timing out

I implemented this, but I've now decided against it. I think that leaving non-restarted nannies alone makes for a simpler contract. With what I have above, the timeout is really just a convenience saving you a Client.wait_for_workers. Just because a nanny didn't restart within a user-specified window isn't necessarily an indication that anything's wrong with it. You could pass timeout=0.1 and none of your nannies would be able to restart fast enough, but that doesn't mean they're broken.

For clarity: Does that ensure that we will have no nannies continuing with business as usual even though they had been asked to restart? That is what I would like for us to achieve here.

If this is not the case, maybe add a follow-up ticket to change the semantics of restarting a nanny to something that ensures we only keep talking to nannies that did in fact restart, while also giving them enough time to do so. For example, should sending the restart request to a nanny fail for some reason, I do not want to keep that one around. At the same time, you have a point that the timeout might be too short for the nannies to act.

Separating whether a worker took too long to shut down vs. start up allows us to guarantee all old workers are removed
@gjoseph92
Collaborator Author

Does that ensure that we will have no nannies continuing with business as usual even though they had been asked to restart?

Updated to call Nanny.kill instead of Nanny.restart. This allows us to remove any workers that failed to shut down in time, guaranteeing that after a restart, there are no old workers connected.
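For context, the overall scheduler-side shape now looks something like this (heavily simplified sketch; the helper names are assumptions, not the merged code):

    import asyncio

    # Sketch of a Scheduler.restart method, shown standalone for brevity.
    async def restart(self, wait_for_workers: bool = True, timeout: float = 30):
        # 1. Tell every nanny to kill its worker process (and non-nanny workers
        #    to shut down); remove any worker that fails to shut down, so no
        #    old workers are left connected afterwards.
        await self._kill_or_remove_all_workers()  # assumed helper

        # 2. Reset scheduler state (release requested keys, clear task state)
        #    unconditionally.
        self._clear_state_after_restart()  # assumed helper

        # 3. Only the wait for workers to reconnect is bounded by `timeout`.
        if wait_for_workers:
            await asyncio.wait_for(self._wait_for_workers_to_return(), timeout)  # assumed helper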

Going to add a flag to enable/disable waiting for workers

Done

@gjoseph92 gjoseph92 marked this pull request as ready for review July 18, 2022 20:40
Member

@hendrikmakait hendrikmakait left a comment


I really like this approach to Client.restart. The contract feels clean and it eliminates another coroutine race on the nanny.

Inline review comments on distributed/scheduler.py and distributed/client.py (resolved; some outdated)
gjoseph92 and others added 4 commits July 19, 2022 10:28
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
@gjoseph92
Collaborator Author

@hendrikmakait I think this is ready, assuming CI passes?

Member

@hendrikmakait hendrikmakait left a comment


Nice work, thanks!

@gjoseph92 gjoseph92 self-assigned this Jul 19, 2022