Restore ordering of worker update in _correct_state_internal #8314

wence- · 2023-10-30T18:12:32Z

Prior to #8233, when correcting state after calling Cluster.scale, we waited until all worker futures had completed before updating the mapping of workers that we knew about. This meant that a failure to boot a worker would propagate from a message on the worker side to an exception on the cluster side. #8233 changed this order, so that the mapping of known workers is updated before checking whether each worker booted successfully. With that change, exceptions are no longer propagated from the worker to the cluster, so we cannot easily tell whether scaling the cluster was successful. _correct_state has known issues (see #5919), but until we can fix those properly, this PR at least restores the old behaviour of propagating any raised exceptions to the cluster.
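The two orderings can be compared in isolation with a toy model. This is only a sketch and not the code in distributed: `boot`, `correct_state`, `demo`, and `BootError` are hypothetical names made up for illustration. It shows one observable difference, namely whether a worker that failed to boot ends up recorded in the mapping of known workers. It does not by itself demonstrate why the exception is lost in the real cluster.

```python
import asyncio


class BootError(Exception):
    """Raised by the toy worker when start-up fails (hypothetical)."""


async def boot(name, fail):
    # Hypothetical stand-in for starting a worker.
    await asyncio.sleep(0)
    if fail:
        raise BootError(f"worker {name} failed to boot")
    return name


async def correct_state(known, to_open, failing, update_first):
    """Toy reconciliation step: start each worker in to_open, record it in known."""
    futs = {n: asyncio.ensure_future(boot(n, n in failing)) for n in to_open}
    # Wait for every start-up attempt to finish; this does not raise.
    await asyncio.wait(futs.values())
    if update_first:
        # Ordering described for #8233: record workers, then surface errors.
        known.update(futs)
        await asyncio.gather(*futs.values())
    else:
        # Ordering this PR restores: surface errors, then record workers.
        await asyncio.gather(*futs.values())
        known.update(futs)


async def demo(update_first):
    known = {}
    try:
        await correct_state(known, ["a", "b"], {"b"}, update_first)
        raised = False
    except BootError:
        raised = True
    return raised, sorted(known)
```

In this toy, `asyncio.run(demo(True))` leaves both "a" and "b" in the mapping even though "b" failed, while `asyncio.run(demo(False))` leaves the mapping empty. Any later pass that decides what to start by inspecting the mapping would therefore behave differently under the two orderings.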

This partially addresses #8309 in that it restores the old behaviour, though a more principled fix is desired long-term.

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2023-10-30T19:32:31Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0 | 27 suites ±0 | 15h 52m 20s ⏱️ +5m 52s
3 944 tests +1 | 3 819 ✔️ -2 | 117 💤 ±0 | 7 ❌ +2 | 1 🔥 +1
49 542 runs +16 | 47 146 ✔️ -7 | 2 367 💤 ±0 | 26 ❌ +20 | 3 🔥 +3

For more details on these failures and errors, see this check.

Results for commit cfe5906. ± Comparison against base commit 6f5109c.

fjetter · 2023-11-02T10:52:01Z

I'm struggling a little to understand how your fix relates to the exception being raised. IIUC you would expect await asyncio.gather(*worker_futs) to raise an exception. How is the update of workers related to that?

It turns out that the current logic is broken in a way that affects adaptive clusters. I added a test covering this in #8324.
That test is explicitly incompatible with your proposed fix, and I'm inclined to go for #8324 since it addresses a public API regression, while _correct_state is a private method.
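For reference on the asyncio.gather question above, here is a minimal standalone sketch of its documented behaviour, independent of distributed. The coroutine names `ok`, `broken`, and `main` are hypothetical. By default the first exception is re-raised at the await; with return_exceptions=True it is returned as a value instead.

```python
import asyncio


async def ok():
    return "started"


async def broken():
    raise RuntimeError("boot failed")


async def main():
    futs = [asyncio.ensure_future(ok()), asyncio.ensure_future(broken())]
    # Default behaviour: the first exception is re-raised at the await.
    try:
        await asyncio.gather(*futs)
        raised = None
    except RuntimeError as exc:
        raised = str(exc)
    # With return_exceptions=True the exception comes back as a result.
    results = await asyncio.gather(*futs, return_exceptions=True)
    return raised, results
```

Running `asyncio.run(main())` shows that the await itself raises, regardless of any bookkeeping done before it.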

wence- requested review from jacobtomlinson and fjetter as code owners October 30, 2023 18:12

wence- mentioned this pull request Oct 30, 2023

Errors when scaling up cluster no longer propagate to client side #8309

Open

fjetter mentioned this pull request Nov 2, 2023

Ensure adaptive properties work as expected for SpecCluster #8324

Open

wence- mentioned this pull request Jan 12, 2024

[BUG] Using Client.wait_for_workers Does Not Properly Wait for Workers rapidsai/cugraph#4082

Open
