Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore ordering of worker update in _correct_state_internal #8314

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wence-
Copy link
Contributor

@wence- wence- commented Oct 30, 2023

Prior to #8233, when correcting state after calling Cluster.scale, we would wait until all futures had completed before updating the mapping of workers that we knew about. This meant that failure to boot a worker would propagate from a message on the worker side to an exception on the cluster side. With #8233 this order was changed, so that the workers we know about are updated before checking if the worker successfully booted. With this change, any exception is not propagated from the worker to the cluster, and so we cannot easily tell if scaling our cluster was successful. While _correct_state has issues (see #5919) until we can fix this properly, at least restore the old behaviour of propagating any raised exceptions to the cluster.

This partially addresses #8309 in that it restores the old behaviour, though a more principled fix is desired long-term.

  • Tests added / passed
  • Passes pre-commit run --all-files

Prior to dask#8233, when correcting state after calling Cluster.scale, we
would wait until all futures had completed before updating the mapping
of workers that we knew about. This meant that failure to boot a
worker would propagate from a message on the worker side to an
exception on the cluster side. With dask#8233 this order was changed, so
that the workers we know about are updated before checking if the
worker successfully booted. With this change, any exception is not
propagated from the worker to the cluster, and so we cannot easily
tell if scaling our cluster was successful. While _correct_state has
issues (see dask#5919) until we can fix this properly, at least restore
the old behaviour of propagating any raised exceptions to the cluster.
@github-actions
Copy link
Contributor

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       27 files  ±  0         27 suites  ±0   15h 52m 20s ⏱️ + 5m 52s
  3 944 tests +  1    3 819 ✔️  - 2     117 💤 ±0    7 +  2  1 🔥 +1 
49 542 runs  +16  47 146 ✔️  - 7  2 367 💤 ±0  26 +20  3 🔥 +3 

For more details on these failures and errors, see this check.

Results for commit cfe5906. ± Comparison against base commit 6f5109c.

@fjetter
Copy link
Member

fjetter commented Nov 2, 2023

I'm struggling a little to understand how your fix related to the exception being raised. IIUC you would expect await asyncio.gather(*worker_futs) to raise an exception. How is the update of workers related to that?


Turns out that the current logic is broken in a way that affects adaptive clusters. I added a test covering this in #8324
This is explicitly incompatible with your proposed fix and I'm inclined to go for #8324 since this is a public API regression while the _correct_state is a private method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants