Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace test_(do_not_)steal_communication_heavy_tasks tests with more robust versions #7243

Merged

Conversation

hendrikmakait
Copy link
Member

@hendrikmakait hendrikmakait commented Nov 2, 2022

This PR drops flaky test_steal_communication_heavy_tasks_ since it contradicts test_do_not_steal_communication_heavy_tasks and should outdated after the latest changes to work-stealing.

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait hendrikmakait self-assigned this Nov 2, 2022
@github-actions
Copy link
Contributor

github-actions bot commented Nov 2, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

       15 files  ±  0         15 suites  ±0   6h 19m 22s ⏱️ - 17m 42s
  3 172 tests +  3    3 088 ✔️ +  3    83 💤 ±0  1 ±0 
23 472 runs  +24  22 569 ✔️ +25  900 💤  - 1  3 ±0 

For more details on these failures, see this check.

Results for commit 2510468. ± Comparison against base commit aa1c6d8.

♻️ This comment has been updated with latest results.

@fjetter
Copy link
Member

fjetter commented Nov 3, 2022

If they are contradicting tests, why is it flaky? Shouldn't it reliably fail? This still sounds like there is a bug somewhere

@hendrikmakait
Copy link
Member Author

The entire test makes fairly little sense at the moment. I think it should have been dropped in #7075 and yet I missed it. We could spend time cleaning up the test, but the way it is written right now, I think it is prone to a bunch of race conditions and does not test anything useful. Also, any useful behavior should be tested in other places (e.g. test_do_not_steal_communication_heavy_tasks, test_steal_expensive_data_slow_computation)

@@ -825,44 +825,6 @@ def block_reduce(x, y, event):
assert not b.data
Copy link
Collaborator

@crusaderky crusaderky Nov 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_do_not_steal_communication_heavy_tasks doesn't look right.

Firstly, there's a race condition. The test waits for x and y to enter running state on the worker, polling every 100ms.
The first iteration of while not a.state.tasks will always fail (because we never yielded the event loop since submit).
The second iteration, 100ms later, most times will find that a and b are in memory and two block_reduce are executing. However there's nothing guaranteeing it, so when you call steal.balance() on the next line you might have x or y still running, or in memory but without the scheduler knowing yet. Unlikely, but possible.

The test implicitly relies on tasks of unknown duration to be stolen (#5572). It should be changed not to rely on this specific use case.

Finally, why bother calling steal.stop()? If the block_reduce tasks can't be stolen, it should be inconsequential.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overhauled test in #7250. This PR can be merged as is.

@crusaderky
Copy link
Collaborator

crusaderky commented Nov 3, 2022

If they are contradicting tests, why is it flaky? Shouldn't it reliably fail? This still sounds like there is a bug somewhere

It doesn't fail because it doesn't actually test where the tasks end up being executed.
Unsure why it's flaky - @hendrikmakait any insight? Do you have a log?

@hendrikmakait hendrikmakait changed the title Drop test_steal_communication_heavy_tasks since it contradicts test_d… Replace test_(do_not_)steal_communication_heavy_tasks tests with more robust versions Nov 3, 2022
@hendrikmakait
Copy link
Member Author

Unsure why it's flaky - @hendrikmakait any insight? Do you have a log?

I've run it a couple thousand times locally without being able to reproduce the flake. Since the implementation of test_steal_communication_heavy_tasks does not do anything useful (and is weird), I would not want to spend more time debugging it.

test_do_not_steal_communication_heavy_tasks doesn't look right.

You have a good point there, I have dropped that one as well and added three other tests that should cover the functionality that we initially wanted to test here. I like your use of wait_for_state in #7250 and will create a follow-up PR that introduces that to the assert_balance and _run_dependency_test helpers to get rid of the awkward looping.

Finally, why bother calling steal.stop()? If the block_reduce tasks can't be stolen, it should be inconsequential.

steal.stop() functions as a glorified wait() that blocks until all stealing requests have been dealt with and are no longer in_flight.

@crusaderky crusaderky merged commit d88d175 into dask:main Nov 8, 2022
@@ -1446,6 +1446,57 @@ def func(*args):
assert (ntasks_per_worker < ideal * 1.5).all(), (ideal, ntasks_per_worker)


def test_balance_steal_communication_heavy_tasks():
dependencies = {"a": 10, "b": 10}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I change this line to {"a": 1e-6, "b": 1e-6} the test remains green, and if go lower than that I instead get "ValueError: Expected a value larger than 16 integer but got 10."
Are we sure we're actually testing anything here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test breaks if we double the cost of the dependencies to {"a": 20, "b": 20} as the tasks have become too expensive to move in this setup. By reducing the cost of the dependencies, we are checking whether we would move tasks that are cheap to move which is the desired behavior. Reducing the cost below 1e-6 creates trouble since we are calculating the actual size of the dependencies as the product of their specified cost and the available bandwidth.

@crusaderky
Copy link
Collaborator

I think I was too hasty in approving.

  1. See comment above
  2. You forgot to actually delete test_steal_communication_heavy_tasks
  3. Isn't test_do_not_steal_communication_heavy_tasks redundant now?

@hendrikmakait
Copy link
Member Author

  1. You forgot to actually delete test_steal_communication_heavy_tasks
  2. Isn't test_do_not_steal_communication_heavy_tasks redundant now?

Yikes, looks like merging main went wrong and brought those back in, sorry for missing that! See #7269 for a follow-up that fixes that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants