SpecCluster resilience to broken workers #8233
Conversation
Force-pushed from f85f503 to 531b1b4
Force-pushed from 531b1b4 to d81aab4
Force-pushed from 18fd000 to 5819d9a
Unit Test Results: see the test report for an extended history of previous test failures; this is useful for diagnosing flaky tests. 21 files +1, 21 suites +1, 10h 22m 55s ⏱️ +8m 48s. For more details on these failures, see this check. Results for commit b13d4cd. ± Comparison against base commit de3f755. ♻️ This comment has been updated with latest results.
try:
    await self
`await self` and `self._correct_state()` should be updated to both call `await self.close()` on Exception; then you won't need the code here.
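A minimal sketch of what that suggestion could look like; `MiniCluster`, `_launch_workers`, and the rest below are hypothetical stand-ins, not the real SpecCluster API:

```python
import asyncio


class MiniCluster:
    """Toy stand-in for SpecCluster, only to illustrate the suggestion."""

    async def _start(self) -> None:
        try:
            await self._launch_workers()  # hypothetical helper
        except Exception:
            # Close on any startup failure, so callers awaiting the cluster
            # do not need their own try/except cleanup.
            await self.close()
            raise

    async def _launch_workers(self) -> None:
        raise RuntimeError("worker failed to boot")

    async def close(self) -> None:
        print("cluster closed")


async def main() -> None:
    try:
        await MiniCluster()._start()
    except RuntimeError as exc:
        print(f"startup failed: {exc}")


asyncio.run(main())
```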
To my understanding, `_correct_state` is used for all enlarge/shrink operations. So if you e.g. fail to increase the cluster size from 100 to 200 workers because you only have 150 hosts available, it should not shut down the 100 healthy workers.
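As an illustration of that concern, here is a hypothetical scale-up helper (not the real `_correct_state`) that keeps the healthy workers when some new boots fail:

```python
import asyncio


async def scale_up(workers: dict[str, str], new_names: list[str]) -> None:
    """Toy model of an enlarge operation: boot new workers alongside an
    existing healthy set, keeping the healthy ones if some boots fail."""

    async def boot(name: str) -> str:
        if name.endswith("-bad"):  # pretend we ran out of hosts
            raise RuntimeError(f"no host available for {name}")
        return f"addr-of-{name}"

    results = await asyncio.gather(
        *(boot(n) for n in new_names), return_exceptions=True
    )
    failures = []
    for name, result in zip(new_names, results):
        if isinstance(result, BaseException):
            failures.append((name, result))
        else:
            workers[name] = result

    # The existing workers stay untouched; only report the partial failure.
    if failures:
        print(f"kept {len(workers)} workers, {len(failures)} failed to boot")


workers = {"w0": "addr-of-w0"}
asyncio.run(scale_up(workers, ["w1", "w2-bad"]))
```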
In that case, should the `try`/`await self` lines be swapped?
If I put `await self` outside of the try block, a wealth of tests designed to fail to start the cluster crash with `some RPCs left active by test`.
OK, that's a bug. We can keep the `await self` in the try block for now and try fixing the fact that exceptions in `await self` don't close the cluster later.
CI looks pretty broken, please address before merging.
Force-pushed from 32f5677 to 9e757f2
Force-pushed from e806f90 to b13d4cd
Ready for final review pass and merge
distributed/utils_test.py (outdated)
) -> Generator[None, None, None]:
    """Contextmanager to assert that a certain exception with cause was raised
    *more_causes: type[BaseException] | tuple[type[BaseException], ...] | str | None,
) -> Iterator[None]:
The type should be `Generator`, as `contextmanager` calls `.throw()`. It's a bug that `contextmanager` accepts `Iterator` functions.
Today I learned something new :)
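For context, a small sketch of why `Generator` is the more accurate annotation: `contextlib.contextmanager` delivers exceptions raised in the `with` body into the generator via `.throw()`, which is part of the `Generator` protocol but not `Iterator`. The function name below is illustrative, not the one in `utils_test.py`:

```python
from collections.abc import Generator
from contextlib import contextmanager


@contextmanager
def expect_failure(msg: str) -> Generator[None, None, None]:
    """Assert that the wrapped block raises; illustrative only."""
    try:
        yield
    except Exception as exc:
        # contextmanager delivered the exception here via generator.throw()
        print(f"{msg}: caught {exc!r}")
    else:
        raise AssertionError("block did not raise")


with expect_failure("demo"):
    raise ValueError("boom")
```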
Prior to dask#8233, when correcting state after calling Cluster.scale, we would wait until all futures had completed before updating the mapping of workers that we knew about. This meant that failure to boot a worker would propagate from a message on the worker side to an exception on the cluster side. With dask#8233 this order was changed, so that the workers we know about are updated before checking whether each worker successfully booted. With this change, exceptions are no longer propagated from the worker to the cluster, so we cannot easily tell whether scaling the cluster was successful. While _correct_state has issues (see dask#5919), until we can fix this properly, at least restore the old behaviour of propagating any raised exceptions to the cluster.
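A rough sketch of the behaviour that commit message describes: update the known-worker mapping first, then still surface any boot failure to the caller. The helper below is illustrative, not the real `_correct_state`:

```python
import asyncio


async def correct_state(
    workers: dict[str, object], pending: dict[str, asyncio.Task]
) -> None:
    """Record the attempted workers first, then propagate any boot failure
    to the caller instead of swallowing it. Illustrative only."""
    # Record the attempted workers (here just the task objects) up front.
    workers.update(pending)

    results = await asyncio.gather(*pending.values(), return_exceptions=True)
    errors = [r for r in results if isinstance(r, BaseException)]
    if errors:
        # Propagate the first failure so callers of Cluster.scale see it.
        raise errors[0]


async def main() -> None:
    async def boot_ok() -> str:
        return "ok"

    async def boot_bad() -> str:
        raise RuntimeError("worker failed to boot")

    pending = {
        "w1": asyncio.ensure_future(boot_ok()),
        "w2": asyncio.ensure_future(boot_bad()),
    }
    try:
        await correct_state({}, pending)
    except RuntimeError as exc:
        print(f"scale failed: {exc}")


asyncio.run(main())
```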
I can't understand how this test passes most of the time in CI. It hangs for me deterministically when I run it by hand.