-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dump cluster state on all test failures #5674
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I could see not wanting dumps from xfails in the future, but I assume you're thinking that at the moment, there could be useful information in there?
Not really - it's just not straightforward to detect xfails from where I'm dumping and there's very little if any harm done by dumping a few extra kilobytes |
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @crusaderky! Will merge after CI finishes up
@jrbourbeau I'd like to verify if the trick to avoid dumping xfailed tests works first |
Sure, sounds good. FWIW I ran distributed/distributed/tests/test_scheduler.py Lines 1321 to 1335 in 7ebeda4
locally a |
Trying again |
I've updated a wealth of tests that require a parameterized gen_cluster. One of them was xfailed and was simply outputting "test.yaml" as a dump. Please review again. Known issues
|
test() | ||
await wait(fut) | ||
assert s.tasks | ||
assert all(ts.annotations == {"foo": "bar"} for ts in s.tasks.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed race condition where the test would potentially find an empty s.tasks dict
@@ -140,14 +140,16 @@ def test_decide_worker_coschedule_order_neighbors(ndeps, nthreads): | |||
nthreads=nthreads, | |||
config={"distributed.scheduler.work-stealing": False}, | |||
) | |||
async def test(c, s, *workers): | |||
async def test_decide_worker_coschedule_order_neighbors_(c, s, *workers): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure unique dump file name
@@ -732,19 +733,15 @@ async def assert_balanced(inp, expected, c, s, *workers): | |||
], | |||
) | |||
def test_balance(inp, expected): | |||
async def test(*args, **kwargs): | |||
async def test_balance_(*args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ensure unique dump file name
@@ -678,7 +678,7 @@ def cluster( | |||
for worker in workers: | |||
worker["address"] = worker["queue"].get(timeout=5) | |||
except queue.Empty: | |||
raise pytest.xfail.Exception("Worker failed to start in test") | |||
pytest.xfail("Worker failed to start in test") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they're the same - this looks cleaner imho
Test output and artifacts inspected - ready for final review and merge |
while "x" in a.tasks: | ||
await asyncio.sleep(0.01) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why this addition was needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because nothing tests for the existence of the release_key method in the plugin until a key is released
commit 14bddba1482de035ca416368f918d94b115d3660 Merge: df079fc 9a66b71 Author: crusaderky <crusaderky@gmail.com> Date: Tue Jan 25 17:16:58 2022 +0000 Merge branch 'main' into AMM/RetireWorker commit df079fc Merge: 9676934 5835ce3 Author: crusaderky <crusaderky@gmail.com> Date: Mon Jan 24 15:08:51 2022 +0000 Merge branch 'AMM/staging' into AMM/RetireWorker commit 5835ce3 Merge: ee68b02 b0b8e95 Author: crusaderky <crusaderky@gmail.com> Date: Mon Jan 24 15:07:27 2022 +0000 Merge branch 'AMM/test_close_gracefully' into AMM/staging commit ee68b02 Merge: 7d4e6ee ccad288 Author: crusaderky <crusaderky@gmail.com> Date: Mon Jan 24 15:06:50 2022 +0000 Merge branch 'main' into AMM/staging commit b0b8e95 Author: crusaderky <crusaderky@gmail.com> Date: Mon Jan 24 14:32:29 2022 +0000 Code review commit 9676934 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 14:50:53 2022 +0000 AMM to manage retire_workers() commit 7d4e6ee Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 14:39:41 2022 +0000 Fix flaky test_close_gracefully and test_lifetime (dask#5677) commit 7faab51 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 14:39:41 2022 +0000 harden test commit aef3b71 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 14:29:10 2022 +0000 Increase resilience on slow CI commit af84e40 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 12:39:18 2022 +0000 Dump cluster state on all test failures (dask#5674) commit 5054c19 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 12:38:38 2022 +0000 Paused workers shouldn't steal tasks (dask#5665) commit eadb35f Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 12:37:48 2022 +0000 transition flight to missing if no who_has (dask#5653) commit 581aee8 Merge: 940bb45 60c0d60 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 12:36:09 2022 +0000 Merge branch 'main' into AMM/test_close_gracefully commit 940bb45 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 12:20:10 2022 +0000 tweak comment commit 731d132 Author: crusaderky <crusaderky@gmail.com> Date: Fri Jan 21 12:12:03 2022 +0000 Fix flaky test_close_gracefully and test_lifetime
Change
@gen_cluster
to perform a state dump on all test failures (notably, including xfails - this should be negligible in terms of extra cost) instead of just timeouts.Shorten the stack trace in case of
@gen_cluster
timeout.