SpecCluster resilience to broken workers #8233

crusaderky · 2023-10-04T12:25:27Z

Closes Flaky test_broken_worker #8230

I can't understand how this test passes most of the time in CI. It hangs deterministically for me when I run it by hand.

distributed/deploy/spec.py

github-actions · 2023-10-04T13:27:22Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    21 files   +1          21 suites  +1        10h 22m 55s ⏱️  +8m 48s
 3 836 tests   ±0       3 723 ✔️  +5          107 💤  -10     6 ❌  +5
36 030 runs    +1 345  34 279 ✔️  +1 317    1 745 💤  +23     6 ❌  +5

For more details on these failures, see this check.

Results for commit b13d4cd. ± Comparison against base commit de3f755.

♻️ This comment has been updated with latest results.

graingert · 2023-10-04T14:52:51Z

distributed/deploy/spec.py

+        try:
+            await self


await self and self._correct_state() should both be updated to call await self.close() on Exception; then you won't need the code here.

To my understanding, _correct_state is used for all scale-up and scale-down operations. So if, for example, you fail to increase the cluster size from 100 to 200 workers because you only have 150 hosts available, it should not shut down the 100 healthy workers.

In that case, should the try and await self lines be swapped?

If I put await self outside of the try block, a wealth of tests designed to fail to start the cluster crash with `some RPCs left active by test`.

OK, that's a bug. We can keep the await self in the try for now, and later try fixing exceptions in await self not closing the cluster.
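The pattern discussed in this thread, closing the cluster when start-up raises, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in distributed/deploy/spec.py: ToyCluster, start_or_close and the status strings are hypothetical names invented for the example.

```python
import asyncio


class ToyCluster:
    """Hypothetical stand-in for a cluster object; not the real SpecCluster."""

    def __init__(self, fail_on_start: bool) -> None:
        self.fail_on_start = fail_on_start
        self.status = "created"

    async def start(self) -> None:
        self.status = "starting"
        if self.fail_on_start:
            raise OSError("worker failed to start")
        self.status = "running"

    async def close(self) -> None:
        self.status = "closed"


async def start_or_close(cluster: ToyCluster) -> None:
    """Start the cluster; on any failure, close it and re-raise."""
    try:
        await cluster.start()
    except Exception:
        # Release resources before the error reaches the caller, so a
        # failed start does not leave a half-started cluster behind.
        await cluster.close()
        raise


async def demo() -> tuple[str, str]:
    good = ToyCluster(fail_on_start=False)
    await start_or_close(good)

    bad = ToyCluster(fail_on_start=True)
    try:
        await start_or_close(bad)
    except OSError:
        pass  # the original error still propagates to the caller
    return good.status, bad.status


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Note that this close-everything approach only suits initial start-up. As pointed out above, applying it to every scale operation would also tear down workers that are already healthy.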

distributed/deploy/spec.py

hendrikmakait

CI looks pretty broken, please address before merging.

revert stress test

This reverts commit 9e757f2.

This reverts commit e806f90.

crusaderky · 2023-10-05T15:55:31Z

Ready for final review pass and merge

graingert · 2023-10-05T16:14:32Z

distributed/utils_test.py

-) -> Generator[None, None, None]:
-    """Contextmanager to assert that a certain exception with cause was raised
+    *more_causes: type[BaseException] | tuple[type[BaseException], ...] | str | None,
+) -> Iterator[None]:


type should be Generator as contextmanager calls .throw

It's a bug that contextmanager accepts Iterator functions

Today I learned something new :)

contextlib.contextmanager allows Iterator[T] -- should probably be Generator[T, None, None]? python/typeshed#2772
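To illustrate why the Generator annotation is the accurate one, here is a small sketch. The function swallow_value_error is a made-up example and not part of distributed/utils_test.py. It relies on the exception from the with body being re-raised at the yield, which contextmanager does by calling .throw() on the generator, a method that Iterator does not define.

```python
from collections.abc import Generator
from contextlib import contextmanager


@contextmanager
def swallow_value_error(log: list[str]) -> Generator[None, None, None]:
    """Record and suppress a ValueError raised inside the ``with`` body."""
    try:
        yield
    except ValueError as exc:
        # contextmanager delivers the body's exception here by calling
        # .throw() on the generator; a plain Iterator has no such method.
        log.append(f"caught: {exc}")


log: list[str] = []
with swallow_value_error(log):
    raise ValueError("boom")

print(log)
```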

Prior to dask#8233, when correcting state after calling Cluster.scale, we would wait until all futures had completed before updating the mapping of workers that we knew about. This meant that a failure to boot a worker would propagate from a message on the worker side to an exception on the cluster side. With dask#8233 this order was changed, so that the workers we know about are updated before checking whether each worker booted successfully. As a result, exceptions are no longer propagated from the worker to the cluster, and so we cannot easily tell whether scaling the cluster was successful. While _correct_state has issues (see dask#5919), until we can fix this properly, at least restore the old behaviour of propagating any raised exceptions to the cluster.
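The difference between the two orderings can be shown with a toy model. This is only a sketch and does not reproduce _correct_state_internal: boot, both scale_* functions and the status strings are hypothetical, and the way the failure gets lost here (collecting results with return_exceptions=True) is an assumption chosen for illustration.

```python
import asyncio


async def boot(name: str) -> str:
    """Hypothetical worker start-up; names starting with 'bad' fail."""
    await asyncio.sleep(0)
    if name.startswith("bad"):
        raise RuntimeError(f"{name} failed to boot")
    return name


async def scale_await_then_register(workers: dict[str, str], names: list[str]) -> None:
    """Await every start-up first; a failure raises before anything is registered."""
    started = await asyncio.gather(*(boot(n) for n in names))
    workers.update({n: "running" for n in started})


async def scale_register_then_check(workers: dict[str, str], names: list[str]) -> None:
    """Register first, then collect results without re-raising failures."""
    workers.update({n: "pending" for n in names})
    results = await asyncio.gather(*(boot(n) for n in names), return_exceptions=True)
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            del workers[name]  # the failure is dropped silently
        else:
            workers[name] = "running"


async def demo() -> tuple[str, dict[str, str], dict[str, str]]:
    old: dict[str, str] = {}
    try:
        await scale_await_then_register(old, ["w0", "bad1"])
        outcome = "no error"
    except RuntimeError as exc:
        outcome = str(exc)

    new: dict[str, str] = {}
    await scale_register_then_check(new, ["w0", "bad1"])
    return outcome, old, new


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the first variant the caller sees the boot failure as an exception. In the second the call returns normally, and the only trace of the failure is the missing worker in the mapping.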

crusaderky force-pushed the test_broken_worker branch from f85f503 to 531b1b4 Compare October 4, 2023 12:29

crusaderky commented Oct 4, 2023

distributed/deploy/spec.py (outdated)

crusaderky force-pushed the test_broken_worker branch from 531b1b4 to d81aab4 Compare October 4, 2023 12:38

crusaderky changed the title ~~Fix flaky test_broken_worker~~ SpecCluster resilience to broken workers Oct 4, 2023

crusaderky self-assigned this Oct 4, 2023

crusaderky force-pushed the test_broken_worker branch 2 times, most recently from 18fd000 to 5819d9a Compare October 4, 2023 12:47

crusaderky marked this pull request as ready for review October 4, 2023 13:38

crusaderky requested review from jacobtomlinson and fjetter as code owners October 4, 2023 13:38

graingert reviewed Oct 4, 2023

distributed/deploy/spec.py

graingert approved these changes Oct 4, 2023

hendrikmakait requested changes Oct 5, 2023

crusaderky marked this pull request as draft October 5, 2023 10:24

crusaderky force-pushed the test_broken_worker branch from 32f5677 to 9e757f2 Compare October 5, 2023 10:47

crusaderky added 7 commits October 5, 2023 14:50

Fix flaky test_broken_worker

7811c7e

revert stress test

Remove redundant gen.coroutine hacks

0a3253a

Don't wrap Server start exceptions in RuntimeError

99d58a8

Revert "Don't wrap Server start exceptions in RuntimeError"

5b3db95

This reverts commit 9e757f2.

fix failing tests

0486542

try..except only around correct_state

a9ceed9

Revert "try..except only around correct_state"

b13d4cd

This reverts commit e806f90.

crusaderky force-pushed the test_broken_worker branch from e806f90 to b13d4cd Compare October 5, 2023 13:51

crusaderky marked this pull request as ready for review October 5, 2023 15:55

graingert reviewed Oct 5, 2023

graingert approved these changes Oct 5, 2023

annotate with Generator

a4c3273

crusaderky merged commit 9a8b380 into dask:main Oct 6, 2023
18 of 23 checks passed

crusaderky deleted the test_broken_worker branch October 6, 2023 10:56

wence- mentioned this pull request Oct 27, 2023

Errors when scaling up cluster no longer propagate to client side #8309

Open

wence- mentioned this pull request Oct 30, 2023

Restore ordering of worker update in _correct_state_internal #8314

Open

2 tasks

fjetter mentioned this pull request Nov 2, 2023

Ensure adaptive properties work as expected for SpecCluster #8324

Open
