Only set 5s connect timeout in gen_cluster tests #6822

Conversation
This is basically option 3 in dask#6731 (comment). I can't think of a justification for why this timeout should be set globally. All the other settings in there are necessary to make things run more reasonably in tests. The timeout is the opposite; there's nothing about CI that should make us think connections will be faster.
When no timeout was given to `restart`, it used 4x `Client.timeout`, which is set to `distributed.comm.timeouts.connect` 🤦. So what used to be a 20s timeout became a 2min timeout. And that timeout is passed down into `Worker.close`, so it gives the ThreadPoolExecutor a longer time to wait.
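The chain of defaults described above can be sketched as follows. This is a hypothetical illustration, not the real distributed source: `parse_timedelta` and `default_restart_timeout` here are simplified stand-ins.

```python
# Hypothetical sketch of the timeout chaining described above:
# Client.timeout defaults to the configured connect timeout, and
# restart() without an explicit timeout uses 4x Client.timeout.

def parse_timedelta(s: str) -> float:
    """Minimal stand-in for dask.utils.parse_timedelta (seconds only)."""
    assert s.endswith("s")
    return float(s[:-1])

def default_restart_timeout(connect_timeout: str) -> float:
    client_timeout = parse_timedelta(connect_timeout)  # Client.timeout
    return client_timeout * 4  # restart()'s default: 4x Client.timeout

# With the old test-wide 5s connect default, restart() waited 20s:
assert default_restart_timeout("5s") == 20
# With distributed's normal 30s connect default, that becomes 2 minutes:
assert default_restart_timeout("30s") == 120
```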
Unit Test Results

See the test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0  15 suites ±0  6h 40m 30s ⏱️ +6m 23s

For more details on these failures, see this check.

Results for commit 5303f90. ± Comparison against base commit 4f6960a.

♻️ This comment has been updated with latest results.
LGTM, great detective work!
distributed/utils_test.py (outdated)

```diff
@@ -1880,16 +1882,16 @@ def check_instances():
 @contextmanager
-def _reconfigure():
+def default_test_config(**extra_config):
```
nit: I'm not a fan of this name since it sounds like this would be the default for basically all tests, but then again I currently lack a better one. If you happen to have an idea for a more descriptive name, I'd be happy, otherwise, leave it as is.
@hendrikmakait I renamed it to `config_for_cluster_tests` and added a little docstring, what do you think?
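As a rough illustration of what a helper like `config_for_cluster_tests` could look like, here is a hedged sketch assuming a plain dict in place of `dask.config`; the default values shown are illustrative, not the real test defaults.

```python
# Sketch: a contextmanager that layers per-test overrides on top of
# shared test defaults, restoring the previous config on exit.
from contextlib import contextmanager

TEST_DEFAULTS = {"distributed.admin.tick.interval": "500ms"}  # illustrative only

current_config: dict = {}

@contextmanager
def config_for_cluster_tests(**extra_config):
    """Set standard testing configuration, plus per-test overrides."""
    global current_config
    previous = current_config
    current_config = {**TEST_DEFAULTS, **extra_config}
    try:
        yield current_config
    finally:
        current_config = previous  # restore even if the test raises

with config_for_cluster_tests(foo="bar") as cfg:
    assert cfg["foo"] == "bar"
    assert "distributed.admin.tick.interval" in cfg
assert current_config == {}  # restored after the block
```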
I'm looking into it.
FYI, I was working on a similar change before without reaching a final conclusion: #5791. I originally proposed just lowering the timeout for local test execution. We took this as an opportunity to test the extreme case of large timeouts to configure tests better. I never finished this, though.
These changes have made some tests flaky. The problem is that the RPC's timeout is now longer than the timeout used by `disconnect`:

distributed/utils_test.py, lines 789 to 802 @ 1db1d9f

Before, the RPC would time out after 5s and be ignored. I could fix this in this PR by just passing the timeout into the RPC:

```diff
diff --git a/distributed/utils_test.py b/distributed/utils_test.py
index 8c7ddd27..50c3044c 100644
--- a/distributed/utils_test.py
+++ b/distributed/utils_test.py
@@ -788,7 +788,7 @@ async def disconnect(addr, timeout=3, rpc_kwargs=None):
     async def do_disconnect():
         logger.info(f"Disconnecting {addr}")
-        async with rpc(addr, **rpc_kwargs) as w:
+        async with rpc(addr, timeout=timeout * 0.9, **rpc_kwargs) as w:
             logger.info(f"Disconnecting {addr} - RPC connected")
             # If the worker was killed hard (e.g. sigterm) during test runtime,
             # we do not know at this point and may not be able to connect
```

However, IMO trying to open an RPC while the subprocess may be shutting down is silly, when we have POSIX signals at our disposal. I'm removing the RPC approach entirely in #6829. (With the debug logs I added, you can see this happening.)

So (after removing some debug code) I'd like to merge this, then #6829 separately (in any order).
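The signal-based alternative mentioned above can be sketched like this. The sleeping child process is a stand-in for a worker subprocess; this assumes a POSIX platform, and is not the actual #6829 implementation.

```python
# Sketch: instead of opening an RPC to a subprocess that may already be
# shutting down, send it a POSIX signal and wait for it to exit.
import signal
import subprocess
import sys

# Stand-in for a worker process that would otherwise run indefinitely
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

proc.send_signal(signal.SIGTERM)  # ask the child to shut down
proc.wait(timeout=5)              # no RPC, no connect timeout involved

# On POSIX, a negative returncode means the process was killed by that signal
assert proc.returncode == -signal.SIGTERM
```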
This reverts commit 1db1d9f.
Looking at the failing test now.
That test has a race condition, fixed in #6832 and unrelated to this PR. I think this is ready to go. Any final 👍 👎?
Thanks @gjoseph92! I hope this will do the trick! 🤞
```diff
@@ -1905,8 +1908,7 @@ def clean(threads=True, instances=True, processes=True):
     with check_thread_leak() if threads else nullcontext():
         with check_process_leak(check=processes):
             with check_instances() if instances else nullcontext():
-                with _reconfigure():
```
There are tests that depended on `def test_example(cleanup):` also calling `_reconfigure`. What should these tests do instead?
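For reference, the `with X() if cond else nullcontext():` pattern used by `clean()` in the diff above can be illustrated in isolation; `check_thread_leak` here is a hypothetical stand-in that just records when it runs.

```python
# Minimal illustration of conditionally applying a contextmanager with
# nullcontext(): when the flag is off, no setup/teardown happens at all.
from contextlib import contextmanager, nullcontext

events = []

@contextmanager
def check_thread_leak():
    events.append("enter")
    try:
        yield
    finally:
        events.append("exit")

def clean(threads=True):
    with check_thread_leak() if threads else nullcontext():
        events.append("body")

clean(threads=False)  # nullcontext: no enter/exit recorded
clean(threads=True)
assert events == ["body", "enter", "body", "exit"]
```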
This subtly refactors (and renames) the use of the `_reconfigure` contextmanager, which is used to set some dask config defaults. Rather than calling `_reconfigure` within `clean`, we separate it out (since its job is different) and call it manually in the `gen_cluster`, `gen_test`, and `loop` fixtures.

We also remove the `"distributed.comm.timeouts.connect": "5s"` default timeout, except for in `gen_cluster`, since that's the only test setup where we can be confident the scheduler isn't running in a subprocess (which could take >5s to start). I'm not even sure what the value is in keeping this timeout for `gen_cluster`, but it shouldn't hurt either, so I left it for now.

See explanation in #6731 (comment).
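The split described above can be sketched as follows. Names and values here are assumptions for illustration, not the real `utils_test.py`: only the cluster-test path layers the 5s connect timeout on top of the shared defaults.

```python
# Sketch: gen_cluster-style tests get the aggressive 5s connect timeout;
# tests that may start subprocesses keep the normal connect timeout.
SHARED_TEST_DEFAULTS = {
    "distributed.admin.tick.limit": "5s",  # illustrative default only
}

def config_for_gen_cluster(**extra_config) -> dict:
    """Config for tests where no component runs in a subprocess."""
    return {
        **SHARED_TEST_DEFAULTS,
        "distributed.comm.timeouts.connect": "5s",
        **extra_config,
    }

def config_for_subprocess_tests(**extra_config) -> dict:
    """Subprocess startup can take >5s, so don't shorten the connect timeout."""
    return {**SHARED_TEST_DEFAULTS, **extra_config}

assert "distributed.comm.timeouts.connect" in config_for_gen_cluster()
assert "distributed.comm.timeouts.connect" not in config_for_subprocess_tests()
```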
- Passes `pre-commit run --all-files`
Closes #6731

- `test_dashboard` #6751
- `test_dashboard_non_standard_ports` #6752
- `distributed/cli/tests/test_dask_scheduler.py::test_dashboard_port_zero` #6395
- `test_defaults` #6753
- `test_multiple_workers` #6754
- `test_multiple_workers_2` #6755
- `test_signal_handling[Signals.SIGTERM]` #6759
- `test_signal_handling[Signals.SIGTERM]` #6760
- `test_error_during_startup[--nanny]` #6761
- `test_error_during_startup[--no-nanny]` #6762
- `test_scheduler_address_env` #6763
- `test_nanny` #6764
- `test_separate_key_cert` #6765
- `test_queue_in_task` #6773