
Only set 5s connect timeout in gen_cluster tests #6822

Merged
12 commits merged into dask:main from fix-flaky-5s-timeouts on Aug 5, 2022

Conversation

@gjoseph92 (Collaborator)

This subtly refactors (and renames) the use of the `_reconfigure` contextmanager, which sets some dask config defaults for tests. Rather than calling `_reconfigure` within `clean`, we separate it out (since its job is different) and call it manually in the `gen_cluster`, `gen_test`, and `loop` fixtures.

We also remove the `"distributed.comm.timeouts.connect": "5s"` default timeout, except in `gen_cluster`, since that's the only test setup where we can be confident the scheduler isn't running in a subprocess (which could take >5s to start). I'm not even sure what the value is in keeping this timeout for `gen_cluster`, but it shouldn't hurt either, so I left it for now.

See explanation in #6731 (comment).

  • Passes `pre-commit run --all-files`

Closes #6731
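
For illustration, here is a minimal sketch of the shape this refactor takes: a config contextmanager that the fixtures call directly, with `gen_cluster` alone opting into the short connect timeout. The config keys and values shown are placeholders, not the exact contents of `utils_test.py`.

```python
from contextlib import contextmanager

import dask.config


@contextmanager
def config_for_cluster_tests(**extra_config):
    """Set dask config defaults for cluster tests (illustrative values only)."""
    with dask.config.set(
        {
            "distributed.scheduler.validate": True,  # placeholder default
            "distributed.admin.log-length": 1000,    # placeholder default
            **extra_config,
        }
    ):
        yield


# Only gen_cluster passes the short connect timeout, since it's the only
# setup where the scheduler is guaranteed not to start in a subprocess:
# with config_for_cluster_tests(**{"distributed.comm.timeouts.connect": "5s"}):
#     ...
```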

This is basically option 3 in dask#6731 (comment). I can't think of a justification for setting this timeout globally. All the other settings in there are necessary to make things run reasonably in tests. The timeout is the opposite: there's nothing about CI that should make us think connections will be faster.
@gjoseph92 gjoseph92 self-assigned this Aug 3, 2022
When no timeout was given to `restart`, it used 4x `Client.timeout`, which is set to `distributed.comm.timeouts.connect` 🤦. So what used to be a 20s timeout became a 2min timeout. And that timeout is passed down into `Worker.close`, so it gives the ThreadPoolExecutor a longer time to wait.
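
For concreteness, the arithmetic behind that, assuming the production connect default of 30s now applies where the old test override used to:

```python
# Rough numbers only; the 4x multiplier is per the comment above.
old_connect_timeout = 5    # seconds: the former test-wide override
new_connect_timeout = 30   # seconds: the production default

print(4 * old_connect_timeout)  # 20   -> old effective restart timeout
print(4 * new_connect_timeout)  # 120  -> ~2 minutes now
```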
github-actions bot (Contributor) commented Aug 3, 2022

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

15 files ±0 · 15 suites ±0 · 6h 40m 30s ⏱️ +6m 23s
2 989 tests +1 · 2 899 passed ✔️ +1 · 88 skipped 💤 ±0 · 2 failed ±0
22 165 runs +8 · 21 118 passed ✔️ +7 · 1 044 skipped 💤 ±0 · 3 failed +1

For more details on these failures, see this check.

Results for commit 5303f90. ± Comparison against base commit 4f6960a.

♻️ This comment has been updated with latest results.

@hendrikmakait (Member) left a comment

LGTM, great detective work!

```diff
@@ -1880,16 +1882,16 @@ def check_instances():
 
 
 @contextmanager
-def _reconfigure():
+def default_test_config(**extra_config):
```
Member

nit: I'm not a fan of this name, since it sounds like it would be the default for basically all tests, but I currently lack a better one. If you have an idea for a more descriptive name, I'd be happy; otherwise, leave it as is.

@gjoseph92 (Collaborator Author)

@hendrikmakait I renamed it to `config_for_cluster_tests` and added a short docstring; what do you think?

@gjoseph92 (Collaborator Author)

I'm looking into `test_submit_after_failed_worker_sync`. The test itself seems to be passing, but it's failing when exiting the `cluster` contextmanager.

@fjetter (Member) commented Aug 4, 2022

FYI, I was working on a similar change before without reaching a final conclusion: #5791. I originally proposed to just lower the timeout for local test execution.

We took this as an opportunity to test the extreme case of large timeouts to configure tests better. I never finished this, though.

@gjoseph92 (Collaborator Author)

These changes have made `test_submit_after_failed_worker_sync` flaky. I think they would do the same for `test_gather_after_failed_worker`.

The problem is that the RPC's timeout is now longer than the `asyncio.wait_for` timeout in:

```python
async def do_disconnect():
    logger.info(f"Disconnecting {addr}")
    async with rpc(addr, **rpc_kwargs) as w:
        logger.info(f"Disconnecting {addr} - RPC connected")
        # If the worker was killed hard (e.g. sigterm) during test runtime,
        # we do not know at this point and may not be able to connect
        with suppress(EnvironmentError, CommClosedError):
            # Do not request a reply since comms will be closed by the
            # worker before a reply can be made and we will always trigger
            # the timeout
            await w.terminate(reply=False)
        logger.info(f"Disconnecting {addr} - sent terminate")

await asyncio.wait_for(do_disconnect(), timeout=timeout)
```

Before, the RPC would time out after 5s and be ignored by the `suppress(CommClosedError)`. Now, the `wait_for` times out after 20s, but the RPC is still trying to connect for 30s. So an `asyncio.TimeoutError` comes out of the `wait_for`, failing the test.
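
A standalone sketch of that race (`fake_rpc_connect` is a made-up stand-in for the RPC connect attempt; the 30s/20s values mirror the ones above):

```python
import asyncio
from contextlib import suppress


async def fake_rpc_connect(timeout):
    # Stand-in for the RPC retrying against a dead worker until its own
    # timeout elapses, then failing with a connection error.
    await asyncio.sleep(timeout)
    raise OSError("connection attempt timed out")


async def do_disconnect():
    # Mirrors the real code: the eventual connection error would be suppressed...
    with suppress(OSError):
        await fake_rpc_connect(30)  # inner (RPC) timeout: 30s


async def main():
    # ...but the outer wait_for fires first, so asyncio.TimeoutError escapes.
    await asyncio.wait_for(do_disconnect(), timeout=20)


# asyncio.run(main())  # raises asyncio.TimeoutError after ~20s
```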

I could fix this in this PR by just passing the timeout into the RPC:

```diff
diff --git a/distributed/utils_test.py b/distributed/utils_test.py
index 8c7ddd27..50c3044c 100644
--- a/distributed/utils_test.py
+++ b/distributed/utils_test.py
@@ -788,7 +788,7 @@ async def disconnect(addr, timeout=3, rpc_kwargs=None):
 
     async def do_disconnect():
         logger.info(f"Disconnecting {addr}")
-        async with rpc(addr, **rpc_kwargs) as w:
+        async with rpc(addr, timeout=timeout * 0.9, **rpc_kwargs) as w:
             logger.info(f"Disconnecting {addr} - RPC connected")
             # If the worker was killed hard (e.g. sigterm) during test runtime,
             # we do not know at this point and may not be able to connect
```

However, IMO trying to open an RPC while the subprocess may be shutting down is silly, when we have POSIX signals at our disposal. I'm removing the RPC approach entirely in #6829.

(With the debug logs I added, you can see that cluster is trying to shut down both workers, including the one that just got SIGTERM'd. The RPC keeps trying to connect to the dead worker, but the asyncio.wait_for times out before the RPC times out.)
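
To sketch the direction #6829 takes (the helper name and process handling here are hypothetical, based only on the comment above, not the actual PR):

```python
from multiprocessing import Process


def terminate_worker_process(proc: Process, grace: float = 5.0) -> None:
    """Shut down a worker subprocess with signals instead of an RPC call."""
    proc.terminate()           # delivers SIGTERM on POSIX
    proc.join(grace)           # give it a grace period to exit cleanly
    if proc.exitcode is None:  # still alive: escalate
        proc.kill()            # SIGKILL (Python 3.7+)
        proc.join()
```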


So (after removing some debug code) I'd like to merge this, then #6829 separately (in any order).

@gjoseph92 (Collaborator Author)

Looking at `test_worker_who_has_clears_after_failed_connection` now.

@gjoseph92 (Collaborator Author)

That test has a race condition, fixed in #6832 and unrelated to this PR.

I think this is ready to go. Any final 👍 👎 ?

@fjetter (Member) left a comment

Thanks @gjoseph92! I hope this will do the trick! 🤞

@fjetter fjetter merged commit d6160c8 into dask:main Aug 5, 2022
@gjoseph92 gjoseph92 deleted the fix-flaky-5s-timeouts branch August 5, 2022 17:28
```diff
@@ -1905,8 +1908,7 @@ def clean(threads=True, instances=True, processes=True):
     with check_thread_leak() if threads else nullcontext():
         with check_process_leak(check=processes):
             with check_instances() if instances else nullcontext():
-                with _reconfigure():
```
Member

There are tests that depended on `def test_example(cleanup):` also calling `_reconfigure`.

What should these tests do instead?
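
One possible answer, sketched under the assumption that the renamed contextmanager is importable from `distributed.utils_test`: such a test can apply the config explicitly instead of relying on `cleanup`.

```python
from distributed.utils_test import config_for_cluster_tests


def test_example(cleanup):
    # cleanup still handles thread/process/instance checks; the config
    # defaults are now applied explicitly by the test itself.
    with config_for_cluster_tests():
        ...  # test body that needs the cluster-test config defaults
```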

gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022