Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dump cluster state on all test failures #5674

Merged
merged 5 commits into from
Jan 21, 2022

Conversation

crusaderky
Copy link
Collaborator

Change @gen_cluster to perform a state dump on all test failures (notably, including xfails - this should be negligible in terms of extra cost) instead of just timeouts.

Shorten the stack trace in case of @gen_cluster timeout.

@crusaderky crusaderky self-assigned this Jan 20, 2022
@gjoseph92 gjoseph92 self-requested a review January 20, 2022 17:04
Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I could see not wanting dumps from xfails in the future, but I assume you're thinking that at the moment, there could be useful information in there?

@crusaderky
Copy link
Collaborator Author

LGTM. I could see not wanting dumps from xfails in the future, but I assume you're thinking that at the moment, there could be useful information in there?

Not really - it's just not straightforward to detect xfails from where I'm dumping and there's very little if any harm done by dumping a few extra kilobytes

crusaderky and others added 2 commits January 20, 2022 18:38
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @crusaderky! Will merge after CI finishes up

@crusaderky
Copy link
Collaborator Author

@jrbourbeau I'd like to verify if the trick to avoid dumping xfailed tests works first

@jrbourbeau
Copy link
Member

Sure, sounds good. FWIW I ran distributed/tests/test_scheduler.py::test_non_existent_worker

@pytest.mark.xfail()
@gen_cluster(client=True, nthreads=[])
async def test_non_existent_worker(c, s):
with dask.config.set({"distributed.comm.timeouts.connect": "100ms"}):
await s.add_worker(
address="127.0.0.1:5738",
status="running",
nthreads=2,
nbytes={},
host_info={},
)
futures = c.map(inc, range(10))
await asyncio.sleep(0.300)
assert not s.workers
assert all(ts.state == "no-worker" for ts in s.tasks.values())

locally a test_cluster_dump/test_non_existent_worker.yaml file was created

@crusaderky
Copy link
Collaborator Author

Trying again

@crusaderky
Copy link
Collaborator Author

I've updated a wealth of tests that require a parameterized gen_cluster. One of them was xfailed and was simply outputting "test.yaml" as a dump. Please review again.

Known issues

  • If a parameterized test fails more than once, you'll only get the output from the last failure
  • The xfail marker is not detected when it's applied to individual parameters of @pytest.mark.parametrize

test()
await wait(fut)
assert s.tasks
assert all(ts.annotations == {"foo": "bar"} for ts in s.tasks.values())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed race condition where the test would potentially find an empty s.tasks dict

@@ -140,14 +140,16 @@ def test_decide_worker_coschedule_order_neighbors(ndeps, nthreads):
nthreads=nthreads,
config={"distributed.scheduler.work-stealing": False},
)
async def test(c, s, *workers):
async def test_decide_worker_coschedule_order_neighbors_(c, s, *workers):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure unique dump file name

@@ -732,19 +733,15 @@ async def assert_balanced(inp, expected, c, s, *workers):
],
)
def test_balance(inp, expected):
async def test(*args, **kwargs):
async def test_balance_(*args, **kwargs):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ensure unique dump file name

@@ -678,7 +678,7 @@ def cluster(
for worker in workers:
worker["address"] = worker["queue"].get(timeout=5)
except queue.Empty:
raise pytest.xfail.Exception("Worker failed to start in test")
pytest.xfail("Worker failed to start in test")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they're the same - this looks cleaner imho

@crusaderky
Copy link
Collaborator Author

Test output and artifacts inspected - ready for final review and merge

crusaderky added a commit to crusaderky/distributed that referenced this pull request Jan 21, 2022
Comment on lines +248 to +249
while "x" in a.tasks:
await asyncio.sleep(0.01)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this addition was needed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because nothing tests for the existence of the release_key method in the plugin until a key is released

@crusaderky crusaderky merged commit 0595052 into dask:main Jan 21, 2022
@crusaderky crusaderky deleted the dump_on_all_errors branch January 21, 2022 17:25
crusaderky added a commit to crusaderky/distributed that referenced this pull request Jan 28, 2022
commit 14bddba1482de035ca416368f918d94b115d3660
Merge: df079fc 9a66b71
Author: crusaderky <crusaderky@gmail.com>
Date:   Tue Jan 25 17:16:58 2022 +0000

    Merge branch 'main' into AMM/RetireWorker

commit df079fc
Merge: 9676934 5835ce3
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 15:08:51 2022 +0000

    Merge branch 'AMM/staging' into AMM/RetireWorker

commit 5835ce3
Merge: ee68b02 b0b8e95
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 15:07:27 2022 +0000

    Merge branch 'AMM/test_close_gracefully' into AMM/staging

commit ee68b02
Merge: 7d4e6ee ccad288
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 15:06:50 2022 +0000

    Merge branch 'main' into AMM/staging

commit b0b8e95
Author: crusaderky <crusaderky@gmail.com>
Date:   Mon Jan 24 14:32:29 2022 +0000

    Code review

commit 9676934
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:50:53 2022 +0000

    AMM to manage retire_workers()

commit 7d4e6ee
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:39:41 2022 +0000

    Fix flaky test_close_gracefully and test_lifetime (dask#5677)

commit 7faab51
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:39:41 2022 +0000

    harden test

commit aef3b71
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 14:29:10 2022 +0000

    Increase resilience on slow CI

commit af84e40
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:39:18 2022 +0000

    Dump cluster state on all test failures (dask#5674)

commit 5054c19
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:38:38 2022 +0000

    Paused workers shouldn't steal tasks (dask#5665)

commit eadb35f
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:37:48 2022 +0000

    transition flight to missing if no who_has (dask#5653)

commit 581aee8
Merge: 940bb45 60c0d60
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:36:09 2022 +0000

    Merge branch 'main' into AMM/test_close_gracefully

commit 940bb45
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:20:10 2022 +0000

    tweak comment

commit 731d132
Author: crusaderky <crusaderky@gmail.com>
Date:   Fri Jan 21 12:12:03 2022 +0000

    Fix flaky test_close_gracefully and test_lifetime
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants