Ensure client.restart waits for workers to leave and come back #6637
Conversation
Unit Test Results
See the test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.
15 files ±0   15 suites ±0   6h 49m 26s ⏱️ +10m 41s
For more details on these failures and errors, see this check.
Results for commit bd61588. ± Comparison against base commit 18a4642.
♻️ This comment has been updated with latest results.
    @gen_cluster(Worker=Nanny, client=True, nthreads=[("", 1)])
    async def test_restart_waits_for_new_workers(c, s, a):
        initial_workers = set(s.workers)
        await c.restart()
Suggested change:

    -await c.restart()
    +await c.restart()
    +assert len(s.workers) == len(initial_workers)
distributed/tests/test_client.py (outdated diff)

    async def test_restart_waits_for_new_workers(c, s, a):
        initial_workers = set(s.workers)
        await c.restart()
        assert set(s.workers) != initial_workers
Should we be more cautious and check that there is no overlap in the IDs at all (as we do in the implementation)?
Yes, the challenge here is not to test the implementation. We can, however, check that the workers we started with are no longer known to the scheduler.
I agree that we generally should not test the implementation, but I think in this case, it simply turns out that we can implement the functionality using the actual conditions we would like to test, so this should be fine.
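For illustration, a sketch of a test combining both checks discussed in this thread (based on the test shown above; the no-overlap assertion assumes restarted workers re-register under new addresses):

    from distributed import Nanny
    from distributed.utils_test import gen_cluster


    @gen_cluster(Worker=Nanny, client=True, nthreads=[("", 1)])
    async def test_restart_waits_for_new_workers(c, s, a):
        initial_workers = set(s.workers)
        await c.restart()
        # The same number of workers came back before restart() returned ...
        assert len(s.workers) == len(initial_workers)
        # ... and none of the original workers are still known to the scheduler.
        assert not initial_workers & set(s.workers)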
distributed/scheduler.py (outdated diff)

    @@ -5140,7 +5144,11 @@ async def restart(self, client=None, timeout=30):
             self.log_event([client, "all"], {"action": "restart", "client": client})
             start = time()
    -        while time() < start + 10 and len(self.workers) < n_workers:
    +        while (
    +            time() < start + 10
Do we still need this? I can only imagine that this was added to give the scheduler time to start dropping workers, for which we added conditions below.
Well, the "correct" way would be to time out if this condition is not met instead of silently returning. That would be a breaking change, and I'm not convinced it's worth it.
I'm not sure we're talking about the same thing. To be more specific, I was referring only to time() < start + 10.
Yes, and I think what we do is bad practice. We simply wait for 10 seconds and then return, regardless of whether the cluster is in the state we want it to be in. Instead, we should raise a TimeoutError, but that would be a breaking change.
My bad, I read that as an OR condition, which felt weird. Your explanation totally makes sense for an AND.
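For context, a minimal sketch of what failing loudly could look like, written as a standalone helper that polls the scheduler (wait_for_workers is an illustrative name, not an existing method; the actual code keeps the silent 10-second wait precisely because raising would be a breaking change):

    import asyncio
    from time import time


    async def wait_for_workers(scheduler, n_workers: int, timeout: float = 10) -> None:
        """Poll until n_workers are connected; raise instead of returning silently."""
        deadline = time() + timeout
        while len(scheduler.workers) < n_workers:
            if time() > deadline:
                raise TimeoutError(
                    f"Only {len(scheduler.workers)} out of {n_workers} workers "
                    f"reconnected within {timeout}s of restart"
                )
            await asyncio.sleep(0.01)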
The new test appears to be broken in 2 CI runs. Is that related, or is it some CI GC issue?
There are some weird failures that may be fixed by #6651.
Waiting to merge this until that one is merged and I can rebase this PR.
Force-pushed from 90931f0 to bd61588.
The new test is unfortunately not solid. It appears that the workers are timing out.
    @@ -5083,11 +5083,15 @@ def clear_task_state(self):
             for collection in self._task_state_collections:
                 collection.clear()

    +    def _get_worker_ids(self) -> set[str]:
This seems kinda unnecessary to me as a Scheduler method; where else would it be used?
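If restart stays the only caller, the set could presumably be built inline. A sketch, under the assumption that the helper merely collects a per-worker server ID from the scheduler's WorkerState objects (the helper's body is not shown in this hunk, so the attribute name below is a guess):

    # Hypothetical inline equivalent inside Scheduler.restart; `server_id` is an
    # assumed WorkerState attribute, not confirmed by the quoted diff.
    worker_ids = {ws.server_id for ws in self.workers.values()}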
    @@ -5161,7 +5165,9 @@ async def restart(self, client=None, timeout=30):
             self.log_event([client, "all"], {"action": "restart", "client": client})
             start = time()
    -        while time() < start + 10 and len(self.workers) < n_workers:
    +        while time() < start + 10 and (
    +            len(self.workers) < n_workers or initial_workers & self._get_worker_ids()
See this comment: https://github.com/dask/distributed/pull/6637/files#diff-bbcf2e505bf2f9dd0dc25de4582115ee4ed4a6e80997affc7b22122912cc6591R5108-R5109
Workers without nannies will be closed and not come back. So we shouldn't wait for them. We should just be waiting for nanny_workers.
The way I understood the intent of that comment and the code, we purposefully rely on the deployment system to bring the workers back up, and we therefore want to wait for that to happen. Otherwise, in a non-nanny environment, Scheduler.restart would just be a Scheduler.shutdown with extra steps. I might be wrong though... and we should definitely improve the way we handle the case where the deployment system does not restart workers.
We have no guarantee that the deployment system will bring them back up. It's entirely possible it's just a Scheduler.shutdown!
With the 10s timeout, I guess it's okay to wait for all of them, even if they'll never come back. But it feels like a kind of odd assumption to make.
This is related to Florian's comment here: #6637 (comment)
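To make the trade-off in this thread concrete, here is a sketch of the nanny-only variant, as it might look inside Scheduler.restart (it assumes WorkerState.nanny holds the nanny address and is empty for workers started without one, and it reuses the existing time/asyncio imports of scheduler.py):

    # Sketch only: wait just for workers that run under a nanny and are
    # therefore expected to come back on their own. Workers without a nanny
    # are closed and will not reconnect unless the deployment system
    # restarts them. nanny_workers is computed before the workers are told
    # to restart; the wait loop runs afterwards.
    nanny_workers = {
        addr: ws.nanny for addr, ws in self.workers.items() if ws.nanny
    }
    ...
    start = time()
    while time() < start + 10 and len(self.workers) < len(nanny_workers):
        await asyncio.sleep(0.01)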
    @@ -5161,7 +5165,9 @@ async def restart(self, client=None, timeout=30):
             self.log_event([client, "all"], {"action": "restart", "client": client})
             start = time()
    -        while time() < start + 10 and len(self.workers) < n_workers:
    +        while time() < start + 10 and (
    +            len(self.workers) < n_workers or initial_workers & self._get_worker_ids()
Isn't a restarted worker going to have a new server_id? The ID is coming from the Worker, not the Nanny instance:

distributed/core.py, line 341 in d88c1d2:

    self.id = type(self).__name__ + "-" + str(uuid.uuid4())

distributed/worker.py, line 1113 in d88c1d2:

    server_id=self.id,
So I don't know why we'd even expect those IDs to come back. Indeed, if they did come back that would be highly concerning.
I think this might be reasonable (though obviously not tolerant to new workers joining during the restart):
Suggested change:

    -len(self.workers) < n_workers or initial_workers & self._get_worker_ids()
    +len(self.running) < len(nanny_workers)
Is there a scenario where, due to some race condition, an old worker might somehow still be in self.workers or reconnect before shutting down? I think that's what initial_workers & self._get_worker_ids() is intended to catch.
Yes, that's possible if we hit the not all(resp == "OK" for resp in resps) branch when telling Nanny workers to restart. If the nanny times out trying to kill the worker subprocess, that worker may still be connected.

distributed/nanny.py, lines 479 to 491 in f7f6501:

    async def restart(self, timeout=30):
        async def _():
            if self.process is not None:
                await self.kill()
            await self.instantiate()

        try:
            await asyncio.wait_for(_(), timeout)
        except TimeoutError:
            logger.error(
                f"Restart timed out after {timeout}s; returning before finished"
            )
            return "timed out"
However, in that case, I think we'd want to ignore that worker in our count of workers we expect to come back. Since we didn't successfully shut it down, it's still around, and if/when it does shut down in the future, it's quite possible it won't come back. We can disregard the fact that we tried to restart that one; it's a lost cause.
Honestly, I wonder if we should call remove_worker on the workers that didn't respond "OK". Their state is a bit undefined, but it's probably kinda broken. Maybe we don't want to talk to them anymore. If we remove them, the scheduler will reject them the next time they try to talk to it, so it's safe.
But if we don't do that, then I think initial_workers & self._get_worker_ids() is too strict, because the only reason some initial_workers would still be connected is that their nannies failed to shut down. If they failed to shut down, then they may never restart, so we shouldn't wait for them.
I'd then be torn on len(self.workers) < n_workers vs len(self.running) < n_workers. It's quite possible these broken workers aren't running anymore, and never will be. So just checking len(self.workers) means they'll still be counted. OTOH, it's nice to wait for new workers to be fully running instead of just connected. We could do:
    if n_restart_failed := sum(resp != "OK" for resp in resps):
        logger.error(
            "Not all workers responded positively: %s",
            resps,
            exc_info=True,
        )
        n_workers -= n_restart_failed
Then checking len(self.running) < n_workers will ignore the workers we don't expect to restart. But I don't know if it's necessary.
The client should block until the workers have been cycled through. At the very least since #6504, we're no longer removing workers immediately from the internal state tracking, and therefore client.restart would return immediately, before any workers were removed.

This has been an issue at least since #6504, possibly before. I have a strong suspicion that it is responsible for coiled/benchmarks#166.

I believe that in cases where individual nanny restarts time out, this was a problem even before #6504.
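For illustration, the user-facing expectation after this change, sketched against a placeholder cluster address (the scheduler address below is hypothetical, and the sketch assumes all workers run under nannies so they are expected to come back):

    from distributed import Client

    client = Client("tcp://scheduler-address:8786")  # placeholder address

    n_before = len(client.scheduler_info()["workers"])
    client.restart()
    # After this PR, restart() should only return once the old workers have
    # left and the nanny-managed workers have re-registered with the scheduler.
    n_after = len(client.scheduler_info()["workers"])
    assert n_after == n_before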