Add Client.restart_workers method #7154
Conversation
info = self.scheduler_info()
for worker in workers:
    if info["workers"][worker]["nanny"] is None:
        raise ValueError(
            f"Restarting workers requires a nanny to be used. Worker {worker} has type {info['workers'][worker]['type']}."
        )
This works, but is a little clunky. If in the future we want similar behavior elsewhere, we might consider pushing this sort of logic down into Scheduler.broadcast directly. I've held off on doing so for the time being.
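For illustration, the check being discussed could live in a small scheduler-side helper along these lines (a sketch only; the helper name and its placement inside Scheduler.broadcast are assumptions, not part of this PR):

```python
# Hypothetical helper sketching what "pushing this logic down into
# Scheduler.broadcast" might look like. Name and signature are assumptions.
def validate_nannies(workers: list[str], info: dict) -> None:
    """Raise if any requested worker is not managed by a nanny."""
    missing = [w for w in workers if info["workers"][w]["nanny"] is None]
    if missing:
        raise ValueError(
            f"Restarting workers requires a nanny to be used. "
            f"Workers without one: {missing}"
        )
```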
What happens if you do not handle this error here? I would expect broadcast to raise if we are selecting nannies but there are None
Today I think it just ignores non-Nanny workers when nanny=True:

distributed/distributed/scheduler.py, Lines 5864 to 5866 in ec00cd5:
results = await All(
    [send_message(address) for address in addresses if address is not None]
)
In ^ that snippet, address=None for non-Nanny workers.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0  15 suites ±0  6h 38m 10s ⏱️ +33m 53s

For more details on these failures, see this check.

Results for commit 1aea53f. ± Comparison against base commit 621994e.

♻️ This comment has been updated with latest results.
Neat! This need comes up a lot. cc @pentschev @quasiben (who may find this of interest)
Ah, grand. I'd be curious to hear why this comes up in a RAPIDS context -- at least I assume it's RAPIDS-related, based on the Ben / Peter pings : )
See Also
--------
Client.restart
I suggest cross-referencing this new method in Client.restart as well. I would also appreciate a sentence about the differences between the two methods.
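For example, the two docstrings could point at each other along these lines (a sketch; the exact wording is only a suggestion):

```python
class Client:
    def restart(self, timeout=None, wait_for_workers=True):
        """Restart all workers and reset local state on the cluster
        (e.g. all keys are released).

        See Also
        --------
        Client.restart_workers : restart a subset of workers without
            resetting cluster-wide state
        """

    def restart_workers(self, workers):
        """Restart the given workers. Unlike ``Client.restart``, this only
        affects the listed workers; cluster state such as computed keys is
        left untouched.

        See Also
        --------
        Client.restart
        """
```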
distributed/client.py (Outdated):
return self.sync(
    self.scheduler.broadcast, msg={"op": "restart"}, workers=workers, nanny=True
)
Nanny.restart can return "timed out", which should be handled here. I don't think we should return the output of the broadcast. I doubt this is useful / a good user-facing API.
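A minimal sketch of what that handling could look like, assuming each nanny replies with the string "timed out" on failure (the response shape is taken from the comment above, not verified against Nanny.restart):

```python
async def restart_and_check(scheduler_rpc, workers, timeout):
    """Broadcast a restart to the nannies and raise on any "timed out"
    reply, instead of returning the raw broadcast output to the caller."""
    # `scheduler_rpc.broadcast` mirrors `self.scheduler.broadcast` from the
    # diff above; the "timed out" response string is an assumption taken
    # from the review comment.
    resps = await scheduler_rpc.broadcast(
        msg={"op": "restart", "timeout": timeout}, workers=workers, nanny=True
    )
    timed_out = [w for w, resp in resps.items() if resp == "timed out"]
    if timed_out:
        raise TimeoutError(
            f"Workers {timed_out} did not restart within {timeout}s"
        )
    # Deliberately return None: the broadcast output is not a useful
    # user-facing API.
```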
distributed/client.py (Outdated):
@@ -3455,6 +3455,44 @@ def restart(self, timeout=no_default, wait_for_workers=True):
            self._restart, timeout=timeout, wait_for_workers=wait_for_workers
        )

    def restart_workers(self, workers: list[str]):
Nanny.restart takes a timeout argument. We should allow this to be provided here.
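Concretely, the client method could grow a timeout parameter and forward it in the broadcast message, something like the following (a sketch of a Client method; how Nanny.restart receives the value is assumed):

```python
def restart_workers(self, workers: list[str], timeout: int | None = None):
    """Restart the given workers, giving each nanny up to ``timeout``
    seconds to bring its worker process back."""
    # Forwarding the timeout inside the broadcast message is an assumption;
    # the rest mirrors the diff above.
    return self.sync(
        self.scheduler.broadcast,
        msg={"op": "restart", "timeout": timeout},
        workers=workers,
        nanny=True,
    )
```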
This looks really useful!
Thanks @jrbourbeau for the work on this and @jakirkham for the ping. This is something that I remember coming up a couple of times, mostly from people wanting to have a CPU worker that eventually turns into a GPU worker or vice-versa. Jacob, you also had some use for that, didn't you? I can't remember what the exact use case was now, though.
Kinda. I have some plans around restarting the worker with different config options, which is adjacent to this change. I'm interested in something more along the lines of re-deploying the cluster, but without losing HPC/cloud resources that have already been allocated. See the dask-agent experimental repo I was playing with a while ago.
Thanks for the review @fjetter -- just pushed a commit that should handle your suggestions.
There are also times when workers become unusable or need to be refreshed. Currently we tell people to restart the full cluster, but maybe a less drastic option (like this one) would be useful.
Is there a specific reason we do not reuse the restarting logic within Scheduler.restart and instead implement Client.restart_workers with (what seem to be) slightly different post-conditions? I'd rather have the scheduler be responsible for restarting workers and avoid two functions that might drift even further apart in semantics in the future.
Restarting workers within Scheduler.restart:
distributed/distributed/scheduler.py, Lines 5722 to 5804 in 6afce9c:
n_workers = len(self.workers)
nanny_workers = {
    addr: ws.nanny for addr, ws in self.workers.items() if ws.nanny
}
# Close non-Nanny workers. We have no way to restart them, so we just let them go,
# and assume a deployment system is going to restart them for us.
await asyncio.gather(
    *(
        self.remove_worker(address=addr, stimulus_id=stimulus_id)
        for addr in self.workers
        if addr not in nanny_workers
    )
)

logger.debug("Send kill signal to nannies: %s", nanny_workers)
async with contextlib.AsyncExitStack() as stack:
    nannies = await asyncio.gather(
        *(
            stack.enter_async_context(
                rpc(nanny_address, connection_args=self.connection_args)
            )
            for nanny_address in nanny_workers.values()
        )
    )
    start = monotonic()
    resps = await asyncio.gather(
        *(
            asyncio.wait_for(
                # FIXME does not raise if the process fails to shut down,
                # see https://github.com/dask/distributed/pull/6427/files#r894917424
                # NOTE: Nanny will automatically restart worker process when it's killed
                nanny.kill(timeout=timeout),
                timeout,
            )
            for nanny in nannies
        ),
        return_exceptions=True,
    )
    # NOTE: the `WorkerState` entries for these workers will be removed
    # naturally when they disconnect from the scheduler.

    # Remove any workers that failed to shut down, so we can guarantee
    # that after `restart`, there are no old workers around.
    bad_nannies = [
        addr for addr, resp in zip(nanny_workers, resps) if resp is not None
    ]
    if bad_nannies:
        await asyncio.gather(
            *(
                self.remove_worker(addr, stimulus_id=stimulus_id)
                for addr in bad_nannies
            )
        )
        raise TimeoutError(
            f"{len(bad_nannies)}/{len(nannies)} nanny worker(s) did not shut down within {timeout}s"
        )

self.log_event([client, "all"], {"action": "restart", "client": client})

if wait_for_workers:
    while len(self.workers) < n_workers:
        # NOTE: if new (unrelated) workers join while we're waiting, we may return before
        # our shut-down workers have come back up. That's fine; workers are interchangeable.
        if monotonic() < start + timeout:
            await asyncio.sleep(0.2)
        else:
            msg = (
                f"Waited for {n_workers} worker(s) to reconnect after restarting, "
                f"but after {timeout}s, only {len(self.workers)} have returned. "
                "Consider a longer timeout, or `wait_for_workers=False`."
            )
            if (n_nanny := len(nanny_workers)) < n_workers:
                msg += (
                    f" The {n_workers - n_nanny} worker(s) not using Nannies were just shut "
                    "down instead of restarted (restart is only possible with Nannies). If "
                    "your deployment system does not automatically re-launch terminated "
                    "processes, then those workers will never come back, and `Client.restart` "
                    "will always time out. Do not use `Client.restart` in that case."
                )
            raise TimeoutError(msg) from None
Post-conditions on Scheduler.restart:
distributed/distributed/scheduler.py, Lines 5678 to 5685 in 6afce9c:
Workers without nannies are shut down, hoping an external deployment system
will restart them. Therefore, if not using nannies and your deployment system
does not automatically restart workers, ``restart`` will just shut down all
workers, then time out!

After `restart`, all connected workers are new, regardless of whether `TimeoutError`
was raised. Any workers that failed to shut down in time are removed, and
may or may not shut down on their own in the future.
It looks like …

I was referring to the snippet from …
One conceptual issue I see with this PR is that tasks running on restarting workers will fail (or increment their suspicious count).

Reproducer:

from distributed import Event, Nanny
from distributed.utils_test import gen_cluster


@gen_cluster(
    client=True,
    Worker=Nanny,
    nthreads=[("", 1)],
    config={"distributed.scheduler.allowed-failures": 0},
)
async def test_restart_workers_fails_executing_task(c, s, a):
    ev_start = Event()
    ev_block = Event()

    def clog(ev_start, ev_block):
        ev_start.set()
        ev_block.wait()

    fut = c.submit(
        clog,
        ev_start=ev_start,
        ev_block=ev_block,
        key="wait",
    )
    await ev_start.wait()

    await c.restart_workers(workers=[a.worker_address])
    assert await fut.result()
@jrbourbeau: In #7184, I have started working on underlying changes to …
Thanks @hendrikmakait -- I've included a note about this in the docstring.
LGTM, thanks @jrbourbeau. It looks like all the points from @fjetter are addressed as well.
distributed/client.py (Outdated):
will restart all workers and also reset local state on the cluster
(e.g. all keys are released).

Additionally, this method makes no safety guarantees for tasks that are
Suggested change:
- Additionally, this method makes no safety guarantees for tasks that are
+ Additionally, this method does not gracefully handle tasks that are
Thanks all for the feedback and @fjetter @hendrikmakait for reviewing! Merging this one in -- CI failures are unrelated.
Sometimes users want to restart individual workers / a subset of workers in their cluster without restarting the entire cluster (with client.restart()). This PR adds a Client.restart_workers method for this (similar to the existing Client.retire_workers method we provide).

Noting that there is a relatively high level of engagement on #1823.
Closes #1823
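For example, usage might look like this (a hypothetical session; the scheduler and worker addresses are illustrative):

```python
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # workers must run under nannies

# Restart just the misbehaving workers; unlike client.restart(), the rest of
# the cluster and any computed keys are left untouched.
client.restart_workers(workers=["tcp://192.0.2.10:40607"])
```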