Wait for workers to return in Client.restart #6714

Merged
30 commits merged on Jul 19, 2022

Conversation

gjoseph92
Collaborator

Taking over #6637 since Florian and Hendrik are out.

This goes a step further and resolves the inconsistent treatment of nanny vs non-nanny workers. Now we wait for all workers to come back, even if they don't have nannies. This may not actually be a good idea; at the least, it's a breaking change.

It also refactors the call to Scheduler.restart into a proper RPC, rather than the previous bulk-comms call-response (an odd pattern).
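For illustration, here's roughly what the client side of an RPC-style call looks like (a sketch only, assuming the usual `self.scheduler` RPC handle; not necessarily the exact code in this PR):

    # Sketch: one awaited RPC call on the scheduler, so any exception raised
    # inside Scheduler.restart propagates straight back to the client.
    async def _restart(self, timeout=None):
        await self.scheduler.restart(timeout=timeout)
        return self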

Closes #6637

  • Tests added / passed
  • Passes pre-commit run --all-files

fjetter and others added 4 commits July 1, 2022 18:36
Non-nanny workers no longer go gentle into that good night.

This breaks `test_restart_some_nannies_some_not` since it re-orders when plugins run, and causes a TimeoutError there. That test can be simplified a lot.

Also, because the client doesn't call restart on the scheduler as an RPC, but rather through a strange call-response pattern, errors from the scheduler aren't resurfaced to the client. If `restart` fails quickly on the scheduler, the client will hang until its own internal timeout passes as well (2x the defined timeout). This is all a bit silly and should just switch to an RPC.
This lets us propagate errors, and is simpler anyway.
@github-actions
Contributor

github-actions bot commented Jul 11, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0   15 suites ±0   6h 46m 52s ⏱️ +14m 11s
2 980 tests +3   2 890 ✔️ +4   87 💤 ±0   3 failed −1
22 096 runs +24   21 057 ✔️ +25   1 036 💤 ±0   3 failed −1

For more details on these failures, see this check.

Results for commit 9ead9c4. ± Comparison against base commit 04421e4.

♻️ This comment has been updated with latest results.

If multiple clients are connected, the ones that didn't call `restart` still need to release their keys.

driveby: refcounts were not being reset on clients that didn't call `restart`. So after restart, if a client reused a key that was referenced before restart, it would never be releasable.
@gjoseph92
Collaborator Author

Update here: in CI, we're very consistently seeing test_restart_waits_for_new_workers and other tests fail with TimeoutError: 5 worker(s) did not restart within 10s. Typically none (or only a few) of the workers restart successfully, across all CI runs.

My guess is that this isn't a problem with this PR or #6637, but just uncovering an existing issue around the implementation of worker restart on nannies.

We're seeing something like

2022-07-14 19:23:24,949 - distributed.nanny - INFO - tcp://127.0.0.1:54494 - Starting worker process
2022-07-14 19:23:24,949 - distributed.nanny - INFO - tcp://127.0.0.1:54494 - Start called when already starting
2022-07-14 19:23:24,976 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:25,131 - distributed.nanny - INFO - tcp://127.0.0.1:54497 - Starting worker process
2022-07-14 19:23:25,517 - distributed.nanny - INFO - tcp://127.0.0.1:54513 - Starting worker process
2022-07-14 19:23:25,521 - distributed.nanny - INFO - tcp://127.0.0.1:54497 - Start called when already starting
2022-07-14 19:23:25,565 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:25,623 - distributed.nanny - INFO - tcp://127.0.0.1:54513 - Start called when already starting
2022-07-14 19:23:25,664 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:26,928 - distributed.nanny - INFO - tcp://127.0.0.1:54506 - Starting worker process
2022-07-14 19:23:26,963 - distributed.nanny - INFO - tcp://127.0.0.1:54494 - Worker process started
2022-07-14 19:23:26,963 - distributed.nanny - INFO - tcp://127.0.0.1:54497 - Worker process started
2022-07-14 19:23:28,043 - distributed.nanny - INFO - tcp://127.0.0.1:54483 - Starting worker process
2022-07-14 19:23:28,043 - distributed.nanny - INFO - tcp://127.0.0.1:54506 - Start called when already starting
2022-07-14 19:23:28,097 - distributed.nanny - WARNING - Restarting worker
2022-07-14 19:23:28,098 - distributed.nanny - INFO - tcp://127.0.0.1:54483 - Start called when already starting
2022-07-14 19:23:28,277 - distributed.nanny - INFO - tcp://127.0.0.1:54506 - Worker process started
2022-07-14 19:23:28,584 - distributed.nanny - INFO - tcp://127.0.0.1:54513 - Worker process started
2022-07-14 19:23:30,430 - distributed.nanny - INFO - tcp://127.0.0.1:54483 - Worker process started
2022-07-14 19:23:32,132 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,197 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,198 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,221 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,222 - distributed.nanny - ERROR - Restart timed out after 8.0s; returning before finished
2022-07-14 19:23:32,540 - distributed.core - ERROR - 5 worker(s) did not restart within 10s
Traceback (most recent call last):
...
asyncio.exceptions.TimeoutError: 5 worker(s) did not restart within 10s
2022-07-14 19:23:32,928 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:54538
2022-07-14 19:23:32,928 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:54538
2022-07-14 19:23:32,929 - distributed.worker - INFO -          dashboard at:            127.0.0.1:54539
2022-07-14 19:23:32,929 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:54462
2022-07-14 19:23:32,929 - distributed.worker - INFO - -------------------------------------------------

With the excessive logging statements I added in 5a388d1, that shows a couple interesting things:

  1. The workers are shutting down just fine. The ThreadPoolExecutors aren't blocked or anything. They're hanging in startup.

  2. The fact that we see both Starting worker process and Start called when already starting is suspicious. Turns out that's happening because of a race condition between Nanny.restart and the nanny's normal logic that restarts the worker process whenever it dies. The WARNING - Restarting worker makes this clear. We're calling instantiate at the same time here

    if self.status not in (
        Status.closing,
        Status.closed,
        Status.closing_gracefully,
    ):
        logger.warning("Restarting worker")
        await self.instantiate()

    and here

    async def restart(self, timeout=30):
        async def _():
            if self.process is not None:
                await self.kill()
            await self.instantiate()

    As concerning as an instantiate race sounds, it actually looks like it'll be fine, thanks to this and this. But it still might be a little silly.

  3. We're stuck in

    msg = await self._wait_until_connected(uid)
    logger.info(f"{self.worker_address} - Connected to worker")
    since we never see Connected to worker. This is just waiting for await worker, basically.

  4. Just a couple hundred ms after our timeout, the first worker starts successfully. So maybe there's actually no problem, and it's just that CI machines are really slow and we need a longer timeout? (Which might also be why it fails most commonly on Windows.) Still, it's surprising it would fail so consistently. (A sketch of the test-side fix follows this list.)
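If slow CI machines really are the culprit, the test-side fix is just more headroom; a sketch, assuming a gen_cluster-style test where `c` is the client and `s` is the scheduler, and that the restart timeout stays overridable per call (the value here is arbitrary):

    # Sketch: give slow CI machines more time before declaring failure.
    await c.restart(timeout=30)
    assert len(s.workers) == 5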

Restarting is apparently very slow in CI. See if this actually fixes the failing tests.
Member

@hendrikmakait hendrikmakait left a comment


I generally favor throwing an error when something goes wrong instead of silently swallowing it and reporting back as if everything were normal. I am now wondering whether we should handle the failure case more explicitly. If we fail to restart workers/nannies (and consequently to reset the scheduler state), it feels like a fatal error. IIUC, at that point, the state of the scheduler and the entire cluster is pretty much FUBAR. Instead of raising an unhandled TimeoutError, should we try to gracefully shut down what's left of the cluster and scheduler and return a more fatalistic error message to the user, like "Failed to restart the cluster after {timeout}s. This left the cluster in a non-recoverable state and it is shutting down."? Note that this would be a dramatically breaking change.

Inline review comment on distributed/scheduler.py (outdated, resolved)
@gjoseph92
Collaborator Author

If we fail to restart workers/nannies (and consequently to reset the scheduler state), it feels like a fatal error. IIUC, at that point, the state of the scheduler and the entire cluster is pretty much FUBAR

I wouldn't go this far. We're not failing to reset scheduler state—though yes, if something goes wrong in here, that indicates a big problem (this is the entirety of the "reset scheduler state" code):

logger.info("Releasing all requested keys")
for cs in self.clients.values():
self.client_releases_keys(
keys=[ts.key for ts in cs.wants_what],
client=cs.client_key,
stimulus_id=stimulus_id,
)
self.clear_task_state()

I guess similarly, remove_worker should never fail.

But calling the Nanny.restart() RPC, or waiting for N workers to return, could fail. That just indicates something was off with those particular workers. The scheduler state (and other workers) are still fine. We should leave it up to users whether to catch and ignore that error, or shut down the cluster themselves. But the cluster, overall, is fine, so I think preemptively killing it is overkill.

I don't love raising this TimeoutError when not all workers come back. If you have 1000 workers in your cluster and 1 fails to restart properly, you probably don't care. But if 900 don't come back, you do. Both would show up as the same TimeoutError, though. Exposing a wait_for_workers= argument might make the most sense, so you can control this behavior. Our default would probably still be to wait for 100% to come back, though.
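To make that concrete, here's the kind of workaround a caller is left with today (a sketch only; the 90% threshold, the `expected` placeholder, and catching asyncio.TimeoutError are my own choices, not part of the PR):

    import asyncio

    # Sketch: disambiguate "one straggler" from "most of the cluster is gone"
    # after a failed restart, since both surface as the same TimeoutError.
    expected = 1000  # placeholder for the pre-restart worker count
    try:
        client.restart()
    except asyncio.TimeoutError:
        n_back = len(client.scheduler_info()["workers"])
        if n_back < 0.9 * expected:  # arbitrary threshold
            raise  # most of the cluster is missing; treat as fatal
        # otherwise: a straggler or two didn't come back; carry on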

@gjoseph92
Collaborator Author

This is passing now besides:

@hendrikmakait
Member

hendrikmakait commented Jul 18, 2022

We should leave it up to users whether to catch and ignore that error, or shut down the cluster themselves. But the cluster, overall, is fine, so I think preemptively killing it is overkill.

I'm fine with having restart() work on a best-effort basis and leave it up to the deployment system as a last resort to correct the deployment state (i.e., spinning up missing workers).

I don't love raising this TimeoutError when not all workers come back. If you have 1000 workers in your cluster, and 1 fails to restart properly, you probably don't care. But if 900 don't come back, you do.

Agreed that there is some issue here. I think two possible solutions would be to add the wait_for_workers parameter, or to return the actual number of workers that managed to restart in time along with the TimeoutError (possibly subclassing to a custom RestartTimeoutError that carries that information). This would allow the client to handle different scenarios depending on whether 0.1%, 10%, or 90% of workers are missing.
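A minimal sketch of what such a subclass could carry (this was not implemented in the PR; the class and attribute names are made up for illustration):

    import asyncio

    class RestartTimeoutError(asyncio.TimeoutError):
        """Hypothetical: a restart timeout that reports how many workers returned."""

        def __init__(self, msg: str, n_restarted: int, n_expected: int):
            super().__init__(msg)
            self.n_restarted = n_restarted
            self.n_expected = n_expected

    # A caller could then branch on e.n_restarted / e.n_expected instead of
    # treating every timeout the same way.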

For restart() to work on a best-effort basis that attempts a restart and reports back how successful it was, we should structure it such that the state of the scheduler after restart() is a consistent state that is as close as possible to the desired one.

For me, this means three things:

  1. Scheduler state has been successfully reset. We need to ensure that we always run https://github.com/gjoseph92/distributed/blob/e18ea3759ae1b6ecdfb3c734d48a6eed2a94819e/distributed/scheduler.py#L5184-L5189. Currently, this code only gets executed if we do not time out while restarting the nannies.
  2. We ensure that non-nanny workers are properly removed. Effectively this means that https://github.com/gjoseph92/distributed/blob/e18ea3759ae1b6ecdfb3c734d48a6eed2a94819e/distributed/scheduler.py#L5158-L5164 should not be cancelled. While remove_worker() should not fail right now, we might time out if there are too many workers that we want to remove and the scheduler struggles. This would then cancel the remaining removal work. Future code changes might exacerbate the problem. Shielding the coroutine could be an easy solution that allows early reporting-back (see the sketch after this list).
  3. As discussed in Ensure client.restart waits for workers to leave and come back #6637 (comment), we should remove all nannies that failed to restart successfully before timing out. There is probably something off with them, and we want to eliminate them. IIUC, these should eventually spin down and be replaced by the deployment system.
  4. [BONUS]: If possible and readable, schedule the coroutines for restarting nannies and removing workers concurrently instead of sequentially, so that the timeout parameter more closely resembles the time restart() will take.
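Regarding point 2, a minimal sketch of what shielding the removal work could look like (the function names here are placeholders, not the PR's actual code):

    import asyncio

    async def remove_stale_workers() -> None:
        """Placeholder for the loop calling Scheduler.remove_worker() per worker."""
        ...

    async def restart_with_early_report(timeout: float) -> None:
        # Shield the removal task so hitting the timeout reports back early
        # without cancelling the removal, which keeps running to completion.
        removal = asyncio.ensure_future(remove_stale_workers())
        try:
            await asyncio.wait_for(asyncio.shield(removal), timeout)
        except asyncio.TimeoutError:
            pass  # tell the caller we timed out; removal continues in the background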

@gjoseph92
Collaborator Author

Currently, this code only gets executed if we do not time out while restarting the nannies

Good catch; fixed.

we might time out if there are too many workers that we want to remove and the scheduler struggles

Also fixed. I think more broadly, what we'll say is that "restart tells all workers to restart (or non-Nanny workers to shut down), then waits timeout seconds for them to come back". So the timeout doesn't apply to the shutdown process, only to the waiting for workers to return. This simplifies things a bit.

we should remove all nannies that failed to restart successfully before timing out

I implemented this, but I've now decided against it. I think that leaving non-restarted nannies alone makes for a simpler contract. With what I have above, the timeout is really just a convenience saving you a Client.wait_for_workers. Just because a nanny didn't restart within a user-specified window isn't necessarily an indication that anything's wrong with it. You could pass timeout=0.1 and none of your nannies would be able to restart fast enough, but that doesn't mean they're broken.
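Put differently, under this contract the two usages below end up roughly equivalent (a sketch; wait_for_workers= is the flag proposed earlier in the thread, and the final spelling may differ):

    # Option 1: let restart() wait, bounded by `timeout`.
    client.restart(timeout=20)

    # Option 2: skip the built-in wait and do it yourself.
    client.restart(wait_for_workers=False)
    client.wait_for_workers(n_workers=5, timeout=20)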

@gjoseph92
Collaborator Author

Going to add a flag to enable/disable waiting for workers, then I think we'll be good to go.

@hendrikmakait
Member

we should remove all nannies that failed to restart successfully before timing out

I implemented this, but I've now decided against it. I think that leaving non-restarted nannies alone makes for a simpler contract. With what I have above, the timeout is really just a convenience saving you a Client.wait_for_workers. Just because a nanny didn't restart within a user-specified window isn't necessarily an indication that anything's wrong with it. You could pass timeout=0.1 and none of your nannies would be able to restart fast enough, but that doesn't mean they're broken.

For clarity: Does that ensure that we will have no nannies continuing with business as usual even though they had been asked to restart? That is what I would like for us to achieve here.

If this is not the case, maybe add a follow-up ticket to change the semantics of restarting a nanny to something that ensures we only keep talking to nannies that did in fact restart, while also giving them enough time to do so. For example, should sending the restart request to a nanny fail for some reason, I do not want to keep that one around. At the same time, you have a point that the timeout might be too short for the nannies to act.

Separating whether a worker took too long to shut down vs. start up allows us to guarantee all old workers are removed
@gjoseph92
Collaborator Author

Does that ensure that we will have no nannies continuing with business as usual even though they had been asked to restart?

Updated to call Nanny.kill instead of Nanny.restart. This allows us to remove any workers that failed to shut down in time, guaranteeing that after a restart, there are no old workers connected.
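For context, the overall scheduler-side shape now looks something like this (heavily simplified sketch; the helper names are assumptions, not the merged code):

    import asyncio

    # Sketch of a Scheduler.restart method, shown standalone for brevity.
    async def restart(self, wait_for_workers: bool = True, timeout: float = 30):
        # 1. Tell every nanny to kill its worker process (and non-nanny workers
        #    to shut down); remove any worker that fails to shut down, so no
        #    old workers are left connected afterwards.
        await self._kill_or_remove_all_workers()  # assumed helper

        # 2. Reset scheduler state (release requested keys, clear task state)
        #    unconditionally.
        self._clear_state_after_restart()  # assumed helper

        # 3. Only the wait for workers to reconnect is bounded by `timeout`.
        if wait_for_workers:
            await asyncio.wait_for(self._wait_for_workers_to_return(), timeout)  # assumed helper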

Going to add a flag to enable/disable waiting for workers

Done

@gjoseph92 gjoseph92 marked this pull request as ready for review July 18, 2022 20:40
Member

@hendrikmakait hendrikmakait left a comment


I really like this approach to Client.restart. The contract feels clean and it eliminates another coroutine race on the nanny.

Inline review comments on distributed/scheduler.py and distributed/client.py (resolved; some outdated)
gjoseph92 and others added 4 commits July 19, 2022 10:28
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
Co-authored-by: Hendrik Makait <hendrik.makait@gmail.com>
@gjoseph92
Collaborator Author

@hendrikmakait I think this is ready, assuming CI passes?

Member

@hendrikmakait hendrikmakait left a comment


Nice work, thanks!

@gjoseph92 gjoseph92 self-assigned this Jul 19, 2022