
Only set 5s connect timeout in gen_cluster tests #6822

Merged
12 commits merged into dask:main from fix-flaky-5s-timeouts on Aug 5, 2022

Conversation

@gjoseph92 (Collaborator)

This subtly refactors (and renames) the use of the `_reconfigure` contextmanager, which sets some dask config defaults for tests. Rather than calling `_reconfigure` within `clean`, we separate it out (since its job is different) and call it manually in the `gen_cluster`, `gen_test`, and `loop` fixtures.

We also remove the `"distributed.comm.timeouts.connect": "5s"` default timeout, except in `gen_cluster`, since that's the only test setup where we can be confident the scheduler isn't running in a subprocess (which could take >5s to start). I'm not even sure what the value is in keeping this timeout for `gen_cluster`, but it shouldn't hurt either, so I left it for now.

See explanation in #6731 (comment).

  • Passes `pre-commit run --all-files`

Closes #6731
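
For illustration, here is a minimal sketch of the shape this refactor takes: a config contextmanager that the fixtures call directly, with `gen_cluster` alone opting into the short connect timeout. The config keys and values shown are placeholders, not the exact contents of `utils_test.py`.

```python
from contextlib import contextmanager

import dask.config


@contextmanager
def config_for_cluster_tests(**extra_config):
    """Set dask config defaults for cluster tests (illustrative values only)."""
    with dask.config.set(
        {
            "distributed.scheduler.validate": True,  # placeholder default
            "distributed.admin.log-length": 1000,    # placeholder default
            **extra_config,
        }
    ):
        yield


# Only gen_cluster passes the short connect timeout, since it's the only
# setup where the scheduler is guaranteed not to start in a subprocess:
# with config_for_cluster_tests(**{"distributed.comm.timeouts.connect": "5s"}):
#     ...
```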

This is basically option 3 in dask#6731 (comment). I can't think of a justification for setting this timeout globally. All the other settings in there are necessary to make things run reasonably in tests. The timeout is the opposite: there's nothing about CI that should make us think connections will be faster.
@gjoseph92 gjoseph92 self-assigned this Aug 3, 2022
When no timeout was given to `restart`, it used 4x `Client.timeout`, which is set to `distributed.comm.timeouts.connect` 🤦. So what used to be a 20s timeout became a 2min timeout. And that timeout is passed down into `Worker.close`, so it gives the ThreadPoolExecutor a longer time to wait.
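
For concreteness, the arithmetic behind that, assuming the production connect default of 30s now applies where the old test override used to:

```python
# Rough numbers only; the 4x multiplier is per the comment above.
old_connect_timeout = 5    # seconds: the former test-wide override
new_connect_timeout = 30   # seconds: the production default

print(4 * old_connect_timeout)  # 20   -> old effective restart timeout
print(4 * new_connect_timeout)  # 120  -> ~2 minutes now
```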
github-actions bot (Contributor) commented Aug 3, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 6h 40m 30s ⏱️ +6m 23s
2 989 tests +1 · 2 899 passed ✔️ +1 · 88 skipped 💤 ±0 · 2 failed ±0
22 165 runs +8 · 21 118 passed ✔️ +7 · 1 044 skipped 💤 ±0 · 3 failed +1

For more details on these failures, see this check.

Results for commit 5303f90. ± Comparison against base commit 4f6960a.

♻️ This comment has been updated with latest results.

@hendrikmakait (Member) left a comment

LGTM, great detective work!

```diff
@@ -1880,16 +1882,16 @@ def check_instances():
 
 
 @contextmanager
-def _reconfigure():
+def default_test_config(**extra_config):
```
Member

nit: I'm not a fan of this name, since it sounds like it would be the default for basically all tests, but I currently lack a better one. If you have an idea for a more descriptive name, I'd be happy; otherwise, leave it as is.

@gjoseph92 (Collaborator Author)

@hendrikmakait I renamed it to `config_for_cluster_tests` and added a short docstring; what do you think?

@gjoseph92 (Collaborator Author)

I'm looking into `test_submit_after_failed_worker_sync`. The test itself seems to be passing, but it's failing when exiting the `cluster` contextmanager.

@fjetter (Member) commented Aug 4, 2022

FYI, I was working on a similar change before without reaching a final conclusion: #5791. I originally proposed to just lower the timeout for local test execution.

We took this as an opportunity to test the extreme case of large timeouts to configure tests better. I never finished this, though.

@gjoseph92 (Collaborator Author)

These changes have made `test_submit_after_failed_worker_sync` flaky. I think they would do the same for `test_gather_after_failed_worker`.

The problem is that the RPC's timeout is now longer than the `asyncio.wait_for` timeout in:

```python
async def do_disconnect():
    logger.info(f"Disconnecting {addr}")
    async with rpc(addr, **rpc_kwargs) as w:
        logger.info(f"Disconnecting {addr} - RPC connected")
        # If the worker was killed hard (e.g. sigterm) during test runtime,
        # we do not know at this point and may not be able to connect
        with suppress(EnvironmentError, CommClosedError):
            # Do not request a reply since comms will be closed by the
            # worker before a reply can be made and we will always trigger
            # the timeout
            await w.terminate(reply=False)
        logger.info(f"Disconnecting {addr} - sent terminate")

await asyncio.wait_for(do_disconnect(), timeout=timeout)
```

Before, the RPC would time out after 5s and be ignored by the `suppress(CommClosedError)`. Now, the `wait_for` times out after 20s, but the RPC is still trying to connect for 30s. So an `asyncio.TimeoutError` comes out of the `wait_for`, failing the test.
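
A standalone sketch of that race (`fake_rpc_connect` is a made-up stand-in for the RPC connect attempt; the 30s/20s values mirror the ones above):

```python
import asyncio
from contextlib import suppress


async def fake_rpc_connect(timeout):
    # Stand-in for the RPC retrying against a dead worker until its own
    # timeout elapses, then failing with a connection error.
    await asyncio.sleep(timeout)
    raise OSError("connection attempt timed out")


async def do_disconnect():
    # Mirrors the real code: the eventual connection error would be suppressed...
    with suppress(OSError):
        await fake_rpc_connect(30)  # inner (RPC) timeout: 30s


async def main():
    # ...but the outer wait_for fires first, so asyncio.TimeoutError escapes.
    await asyncio.wait_for(do_disconnect(), timeout=20)


# asyncio.run(main())  # raises asyncio.TimeoutError after ~20s
```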

I could fix this in this PR by just passing the timeout into the RPC:

```diff
diff --git a/distributed/utils_test.py b/distributed/utils_test.py
index 8c7ddd27..50c3044c 100644
--- a/distributed/utils_test.py
+++ b/distributed/utils_test.py
@@ -788,7 +788,7 @@ async def disconnect(addr, timeout=3, rpc_kwargs=None):
 
     async def do_disconnect():
         logger.info(f"Disconnecting {addr}")
-        async with rpc(addr, **rpc_kwargs) as w:
+        async with rpc(addr, timeout=timeout * 0.9, **rpc_kwargs) as w:
             logger.info(f"Disconnecting {addr} - RPC connected")
             # If the worker was killed hard (e.g. sigterm) during test runtime,
             # we do not know at this point and may not be able to connect
```

However, IMO trying to open an RPC while the subprocess may be shutting down is silly, when we have POSIX signals at our disposal. I'm removing the RPC approach entirely in #6829.

(With the debug logs I added, you can see that cluster is trying to shut down both workers, including the one that just got SIGTERM'd. The RPC keeps trying to connect to the dead worker, but the asyncio.wait_for times out before the RPC times out.)
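
To sketch the direction #6829 takes (the helper name and process handling here are hypothetical, based only on the comment above, not the actual PR):

```python
from multiprocessing import Process


def terminate_worker_process(proc: Process, grace: float = 5.0) -> None:
    """Shut down a worker subprocess with signals instead of an RPC call."""
    proc.terminate()           # delivers SIGTERM on POSIX
    proc.join(grace)           # give it a grace period to exit cleanly
    if proc.exitcode is None:  # still alive: escalate
        proc.kill()            # SIGKILL (Python 3.7+)
        proc.join()
```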


So (after removing some debug code) I'd like to merge this, then #6829 separately (in any order).

@gjoseph92 (Collaborator Author)

Looking at `test_worker_who_has_clears_after_failed_connection` now.

@gjoseph92 (Collaborator Author)

That test has a race condition, fixed in #6832 and unrelated to this PR.

I think this is ready to go. Any final 👍 👎 ?

@fjetter (Member) left a comment

Thanks @gjoseph92! I hope this will do the trick! 🤞

@fjetter fjetter merged commit d6160c8 into dask:main Aug 5, 2022
@gjoseph92 gjoseph92 deleted the fix-flaky-5s-timeouts branch August 5, 2022 17:28
```diff
@@ -1905,8 +1908,7 @@ def clean(threads=True, instances=True, processes=True):
     with check_thread_leak() if threads else nullcontext():
         with check_process_leak(check=processes):
             with check_instances() if instances else nullcontext():
-                with _reconfigure():
```
Member

There are tests that depended on `def test_example(cleanup):` also calling `_reconfigure`.

What should these tests do instead?
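
One possible answer, sketched under the assumption that the renamed contextmanager is importable from `distributed.utils_test`: such a test can apply the config explicitly instead of relying on `cleanup`.

```python
from distributed.utils_test import config_for_cluster_tests


def test_example(cleanup):
    # cleanup still handles thread/process/instance checks; the config
    # defaults are now applied explicitly by the test itself.
    with config_for_cluster_tests():
        ...  # test body that needs the cluster-test config defaults
```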

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022