Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpecCluster resilience to broken workers #8233

Merged
merged 8 commits into from
Oct 6, 2023

Conversation

crusaderky
Copy link
Collaborator

@crusaderky crusaderky commented Oct 4, 2023

I can't understand how this test works most times in CI. It hangs for me deterministically when I run it by hand.

distributed/deploy/spec.py Outdated Show resolved Hide resolved
@crusaderky crusaderky changed the title Fix flaky test_broken_worker SpecCluster resilience to broken workers Oct 4, 2023
@crusaderky crusaderky self-assigned this Oct 4, 2023
@crusaderky crusaderky force-pushed the test_broken_worker branch 2 times, most recently from 18fd000 to 5819d9a Compare October 4, 2023 12:47
@github-actions
Copy link
Contributor

github-actions bot commented Oct 4, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       21 files  +       1         21 suites  +1   10h 22m 55s ⏱️ + 8m 48s
  3 836 tests ±       0    3 723 ✔️ +       5     107 💤  - 10  6 +5 
36 030 runs  +1 345  34 279 ✔️ +1 317  1 745 💤 +23  6 +5 

For more details on these failures, see this check.

Results for commit b13d4cd. ± Comparison against base commit de3f755.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky marked this pull request as ready for review October 4, 2023 13:38
Comment on lines +474 to +473
try:
await self
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

await self and self._correct_state() should be updated to both call await self.close() on Exception then you won't need the code here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding _correct_state is used for all enlarge/shrink operations. So if you e.g. fail to increase the cluster size from 100 to 200 workers because you only have 150 hosts available, it should not shut down the healthy 100 workers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case the try/await self lines should be swapped?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I put await self outside of the try block, a wealth of tests designed to fail to start the cluster crash with `some RPCs left active by test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that's a bug, can keep the await self in the try for now and we can try fixing exceptions in await self not closing later

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI looks pretty broken, please address before merging.

@crusaderky crusaderky marked this pull request as draft October 5, 2023 10:24
@crusaderky crusaderky marked this pull request as ready for review October 5, 2023 15:55
@crusaderky
Copy link
Collaborator Author

Ready for final review pass and merge

) -> Generator[None, None, None]:
"""Contextmanager to assert that a certain exception with cause was raised
*more_causes: type[BaseException] | tuple[type[BaseException], ...] | str | None,
) -> Iterator[None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type should be Generator as contextmanager calls .throw

It's a bug that contextmanager accepts Iterator functions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crusaderky crusaderky merged commit 9a8b380 into dask:main Oct 6, 2023
18 of 23 checks passed
@crusaderky crusaderky deleted the test_broken_worker branch October 6, 2023 10:56
wence- added a commit to wence-/distributed that referenced this pull request Oct 30, 2023
Prior to dask#8233, when correcting state after calling Cluster.scale, we
would wait until all futures had completed before updating the mapping
of workers that we knew about. This meant that failure to boot a
worker would propagate from a message on the worker side to an
exception on the cluster side. With dask#8233 this order was changed, so
that the workers we know about are updated before checking if the
worker successfully booted. With this change, any exception is not
propagated from the worker to the cluster, and so we cannot easily
tell if scaling our cluster was successful. While _correct_state has
issues (see dask#5919) until we can fix this properly, at least restore
the old behaviour of propagating any raised exceptions to the cluster.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test_broken_worker
3 participants