Ensure client.restart waits for workers to leave and come back #6637
distributed/scheduler.py
```diff
@@ -5083,11 +5083,15 @@ def clear_task_state(self):
         for collection in self._task_state_collections:
             collection.clear()

+    def _get_worker_ids(self) -> set[str]:
+        return set({ws.server_id for ws in self.workers.values()})
+
     @log_errors
     async def restart(self, client=None, timeout=30):
         """Restart all workers. Reset local state."""
         stimulus_id = f"restart-{time()}"
-        n_workers = len(self.workers)
+        initial_workers = self._get_worker_ids()
+        n_workers = len(initial_workers)

         logger.info("Send lost future signal to clients")
         for cs in self.clients.values():
```
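As background for the `server_id` discussion in the thread below: each Dask server generates a fresh UUID-based id when its process starts (the thread points at distributed/core.py line 341). Here is a minimal sketch of that idea, using a hypothetical `ServerSketch` class rather than the real `distributed.core.Server`:

```python
import uuid


class ServerSketch:
    """Illustrative stand-in for the id generation in distributed.core.Server."""

    def __init__(self) -> None:
        # A fresh UUID per instance means a restarted worker process can never
        # report the same server_id as its predecessor, even if it rebinds to
        # the same address. This is what makes server_id a usable marker for
        # "the old worker has actually left".
        self.id = f"{type(self).__name__}-{uuid.uuid4()}"


assert ServerSketch().id != ServerSketch().id  # ids never repeat across instances
```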
```diff
@@ -5161,7 +5165,9 @@ async def restart(self, client=None, timeout=30):

         self.log_event([client, "all"], {"action": "restart", "client": client})
         start = time()
-        while time() < start + 10 and len(self.workers) < n_workers:
+        while time() < start + 10 and (
+            len(self.workers) < n_workers or initial_workers & self._get_worker_ids()
+        ):
             await asyncio.sleep(0.01)

         self.report({"op": "restart"})
```

Review thread on the new `while` condition:

Comment: See this comment: https://github.com/dask/distributed/pull/6637/files#diff-bbcf2e505bf2f9dd0dc25de4582115ee4ed4a6e80997affc7b22122912cc6591R5108-R5109. Workers without nannies will be closed and will not come back, so we shouldn't wait for them. We should just be waiting for …

Reply: The way I understood the intent of that comment and the code, we purposefully rely on the deployment system to bring the workers back up, and we therefore want to wait for that to happen. Otherwise, in a non-nanny environment, …

Reply: We have no guarantee that the deployment system will bring them back up; it's entirely possible it's just a … With the 10s timeout, I guess it's okay to wait for all of them, even if they'll never come back. But it feels like an odd assumption to make.

Reply: This is related to Florian's comment here: #6637 (comment)

Reply: Isn't a restarted worker going to have a new `server_id`? (See distributed/core.py line 341 and distributed/worker.py line 1113, both at d88c1d2.) So I don't know why we'd even expect those ids to come back; indeed, if they did come back, that would be highly concerning. I think this might be reasonable (though obviously not tolerant to new workers joining during the restart):

Suggested change: …

Reply: Is there a scenario where, due to some race condition, an old worker might somehow still be in `self.workers`?

Reply: Yes, that's possible if we hit the restart timeout in distributed/nanny.py (lines 479 to 491 at f7f6501). However, in that case I think we'd want to ignore that worker in our count of workers we expect to come back. Since we didn't successfully shut it down, it's still around, and if/when it does shut down in the future, it's quite possible it won't come back. We can disregard the fact that we tried to restart that one; it's a lost cause. Honestly, I wonder if we should call … But if we don't do that, then I think I'd be torn on something like:

```python
if n_restart_failed := sum(resp != "OK" for resp in resps):
    logger.error(
        "Not all workers responded positively: %s",
        resps,
        exc_info=True,
    )
    n_workers -= n_restart_failed
```

Then checking …
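To make the semantics of the new loop concrete, here is a small self-contained simulation (hypothetical helper names; not Dask code) of the two conditions the scheduler now polls on: the worker count must be restored, and every pre-restart server id must have disappeared:

```python
import asyncio
import time
import uuid


def new_ids(n: int) -> set[str]:
    # Stand-ins for worker server_ids, which are fresh UUIDs per process.
    return {f"Worker-{uuid.uuid4()}" for _ in range(n)}


async def wait_for_restart(current: set[str], initial: set[str], n_workers: int) -> None:
    # Mirrors the PR's loop: keep polling while the cluster is short of workers
    # OR any pre-restart id is still registered, up to a 10-second deadline.
    start = time.time()
    while time.time() < start + 10 and (len(current) < n_workers or initial & current):
        await asyncio.sleep(0.01)


async def main() -> None:
    initial = new_ids(3)
    current = set(initial)  # before the restart, the old workers are registered
    n_workers = len(initial)

    async def simulate_restart() -> None:
        await asyncio.sleep(0.05)
        current.clear()                # old workers leave...
        current |= new_ids(n_workers)  # ...and replacements with new ids join

    await asyncio.gather(wait_for_restart(current, initial, n_workers), simulate_restart())
    assert len(current) == n_workers and not initial & current


asyncio.run(main())
```

Note that the `initial & current` check is what distinguishes "the old workers are still here" from "replacement workers have arrived": before the old workers leave, the count alone already equals `n_workers`, so the previous condition would let `restart` return immediately.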
distributed/tests/test_client.py
```diff
@@ -3493,6 +3493,16 @@ async def test_Client_clears_references_after_restart(c, s, a, b):
     assert key not in c.refcount


+@pytest.mark.slow
+@gen_cluster(Worker=Nanny, client=True, nthreads=[("", 1)] * 5)
+async def test_restart_waits_for_new_workers(c, s, *workers):
+    initial_workers = set(s.workers)
+    await c.restart()
+
+    assert len(s.workers) == len(initial_workers)
+    for w in workers:
+        assert w.address not in s.workers
+
 @gen_cluster(Worker=Nanny, client=True)
 async def test_restart_timeout_is_logged(c, s, a, b):
     with captured_logger(logging.getLogger("distributed.client")) as logger:
```
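For orientation, a rough client-side sketch of the behavior this test pins down; the scheduler address is a placeholder, and the cluster is assumed to run workers under nannies (or a deployment system that restarts them):

```python
from distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder address, not from the PR

before = set(client.scheduler_info()["workers"])
client.restart()  # with this PR: waits for old workers to leave and replacements to join
after = set(client.scheduler_info()["workers"])

assert len(after) == len(before)  # cluster size restored by fresh worker processes
```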
Comment on `_get_worker_ids`: This seems kinda unnecessary to me as a Scheduler method. Where else would it be used?
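For comparison, a runnable sketch of the inline alternative the comment seems to be hinting at, using hypothetical `WorkerStateSketch`/`SchedulerSketch` stand-ins rather than the real scheduler classes:

```python
from dataclasses import dataclass, field


@dataclass
class WorkerStateSketch:
    server_id: str


@dataclass
class SchedulerSketch:
    workers: dict[str, WorkerStateSketch] = field(default_factory=dict)


s = SchedulerSketch({"tcp://10.0.0.1:1234": WorkerStateSketch("Worker-abc")})

# Inlining the snapshot as a set comprehension avoids a one-off Scheduler method:
initial_workers = {ws.server_id for ws in s.workers.values()}
assert initial_workers == {"Worker-abc"}
```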