
Flaky tests: OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s #6731

Closed
gjoseph92 opened this issue Jul 15, 2022 · 8 comments · Fixed by #6822
Labels: flaky test (Intermittent failures on CI.)

@gjoseph92 (Collaborator)

Catch-all issue for tests failing like OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s while the client is trying to connect to the scheduler.

TODO: link other flaky-test issues related to this.

gjoseph92 added the "flaky test" label (Intermittent failures on CI.) on Jul 15, 2022

@gjoseph92 (Collaborator, Author)

This should have been obvious, but I just realized it now: every one of these tests involves running the scheduler in a subprocess with some form of popen(["dask-scheduler"]) and connecting to it. The majority of them are in test_dask_scheduler.py. Those also tend to run at the very beginning of the test suite, so if they were leaking subprocesses or ports or something, they could be affecting most of the subsequent tests.

Also, the stderr output of most failing tests shows distributed._signals - INFO - Received signal SIGINT (2), and according to the timestamps it's often within less than 5s of the scheduler starting. I don't think that's the cause of the problem; rather, an exception is probably being raised inside the popen contextmanager in the test code (possibly the client connection error), which causes popen to send SIGINT to the subprocess.
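
For reference, a minimal sketch of that pattern (the test name is made up; popen is assumed to be the subprocess helper from distributed.utils_test):

from distributed import Client
from distributed.utils_test import popen


def test_scheduler_subprocess_sketch(loop, requires_default_ports):
    # Start dask-scheduler as a subprocess; on exit (or when an exception is
    # raised inside this block) the context manager signals the subprocess.
    with popen(["dask-scheduler"]):
        # If this connection takes longer than the configured
        # distributed.comm.timeouts.connect (5s under the loop fixture),
        # we get the OSError above and the test fails.
        with Client("tcp://127.0.0.1:8786", loop=loop):
            pass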

@gjoseph92 (Collaborator, Author)

Current theory is that this is actually really simple:

  1. importing pandas (or something else) is especially slow on the macOS runners for some reason
  2. it takes more than 5s for the dask-scheduler subprocess to be available for a connection
  3. the client times out and fails the test

We can reproduce a similar-looking error by trying to launch a Client when there's no scheduler available at all:

def test_wat(loop, requires_default_ports):
    with Client(f"127.0.0.1:{Scheduler.default_port}", loop=loop) as c:
        pass
...
>           raise OSError(
                f"Timed out trying to connect to {addr} after {timeout} s"
            ) from active_exception
E           OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s

distributed/comm/core.py:317: OSError
================================================================================== slowest 20 durations ==================================================================================
5.00s call     distributed/cli/tests/test_dask_scheduler.py::test_wat
0.01s teardown distributed/cli/tests/test_dask_scheduler.py::test_wat

(1 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================================ short test summary info =================================================================================
FAILED distributed/cli/tests/test_dask_scheduler.py::test_wat - OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s

We can see from the test duration that it actually does wait the full 5s; this isn't one of those cases where the error message claims a timeout happened, but it was actually a CancelledError or something that got converted into a TimeoutError.

FYI, the reason for the 5s timeout, rather than the default 30s connect timeout, is that the loop fixture depends on the clean fixture, which uses the clean contextmanager, which calls _reconfigure, which sets "distributed.comm.timeouts.connect": "5s".

So in theory, just using a longer timeout would probably make these tests pass most of the time.
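
For example, a sketch of what that could look like for a single test (same assumptions as the sketch above): dask.config.set, used as a context manager, temporarily overrides the fixture-level 5s value.

import dask
from distributed import Client
from distributed.utils_test import popen


def test_with_longer_connect_timeout(loop, requires_default_ports):
    # Override the 5s timeout that _reconfigure sets via the loop/clean fixtures;
    # the previous value is restored when the context manager exits.
    with dask.config.set({"distributed.comm.timeouts.connect": "30s"}):
        with popen(["dask-scheduler"]):
            with Client("tcp://127.0.0.1:8786", loop=loop):
                pass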

But I'm going to try to figure out why scheduler startup is so slow first. Ideally they would not all take 10s each, even on CI runners.

gjoseph92 self-assigned this on Jul 26, 2022
@gjoseph92 (Collaborator, Author)

In gjoseph92#4, I ran just the test_dask_scheduler.py tests on macOS, launched a py-spy profiler to record a number of known offenders, and uploaded the profiles as a test artifact. (I could see this being useful to do again in the future.)

Comparing profiles between flaky tests that failed and ones that didn't on a particular run, there wasn't an obvious difference. I expected something major to stand out, but they looked pretty similar.

[Screenshot: py-spy profile of a failed run ("Screen Shot 2022-08-02 at 1 09 03 PM")]

[Screenshot: py-spy profile of a successful run ("Screen Shot 2022-08-02 at 1 10 05 PM")]

In the second one, you can see the add_client (I added a sleep in there to ensure it would show up better with the sampling profiler). In the first, it doesn't happen because the client times out. In the successful case, the scheduler was done with imports after ~3.6s. In the failed case, it was done after ~4.1s.

So it seems like slow imports are actually the issue, and they're taking so long that normal variability is putting us right up against our 5s timeout.

Slow imports

Overall:

  • 1.90 seconds of imports at startup (before dask-scheduler.main starts running)
  • 2.09 seconds to import distributed.dashboard.scheduler (1s for pandas, 1s for pure bokeh)

So with 4s of imports before useful work can happen, a 5s connection timeout indeed seems likely to fail sometimes 😁

Some imports that stood out:

  1. bokeh ~2s
    1. pandas ~1s
  2. numpy (via distributed) ~260ms
  3. worker_state_machine.py ~200ms making dataclasses
  4. urllib3 ~170ms
  5. jinja2 (via dask) ~170ms
  6. cytoolz (via dask) ~140ms
  7. loading yaml (via dask) ~140ms
  8. yaml (via dask) ~130ms
  9. fsspec (via cluster_dump.py) ~110ms

None of these seem particularly avoidable (numpy will always be there #5729, urllib3/jinja/toolz/yaml/etc probably have to happen). A couple small things:

  • Defer the fsspec import in cluster_dump.py until the dump actually needs to be written (a rough sketch of this pattern follows this list)
  • Switch from dataclasses to msgspec structs in worker state machine?
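
The deferred fsspec import would look roughly like this (illustrative names and serialization, not the actual cluster_dump.py code):

import pickle


async def write_state(get_state, url: str) -> None:
    # Deferred import: fsspec is only loaded when a cluster dump is actually
    # written, keeping it off the scheduler's startup import path.
    import fsspec

    state = await get_state()
    with fsspec.open(url, mode="wb") as f:
        f.write(pickle.dumps(state))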

A larger change that @graingert proposed was to wait to set up the dashboard (aka import bokeh and pandas) until after we'd started listening for client connections. That way, we can be doing both at once (ish). The issue is that a number of the tests are assuming that once a client is connected, the dashboard must be up. We could refactor this, but I don't really want to deal with it. I don't imagine that it would make much difference for users in real life anyway.


What should we do

After all that, I think we should just make distributed.comm.timeouts.connect longer when we're connecting to a scheduler run in a subprocess 😄

The 5s connect timeout arguably makes sense for async or gen_cluster tests. No subprocesses, everything should be really quick. But when launching a scheduler from scratch in a new process, it's not a good idea to have expectations for the speed of that on a CI machine.

I did some archaeology and "distributed.comm.timeouts.connect": "5s" actually used to only be set in gen_cluster. Then in #2687, it was moved from there to the clean fixture, where it's been since.

The main question to me is, should we:

  1. Manually dask.config.set({"distributed.comm.timeouts.connect": "10s"}) in all the tests where we make a dask-scheduler subprocess and connect a client to it.
    • pros: well-contained
    • cons: tests added in the future (or that we miss) might forget to do this
  2. Just change the value to 10s in the _reconfigure contextmanager—updating it everywhere
    • pros: we know it'll cover every case, including future tests. A reasonable compromise, since it's still a relatively short timeout.
    • cons: we don't know what else it might affect. Or if it'll allow tests to pass/flake that should be failing consistently. Or if tests will still rarely, occasionally, need >10s to connect on a bad CI day.
  3. Move the "distributed.comm.timeouts.connect": "5s" setting from _reconfigure (where it gets set everywhere the loop fixture is used) to just gen_cluster, maybe cluster, maaaaybe gen_test?
    • pros: now only applying the short timeout to places we're certain it's appropriate
    • cons: wouldn't apply to manually-created clusters or some other esoteric cases where it could be appropriate—though not having it doesn't make the tests wrong.
  4. Remove the distributed.comm.timeouts.connect setting from _reconfigure entirely, and fall back on the default value of 30s.
    • pros: the justification for a short timeout in CI is actually a little dubious. We've been learning the hard way not to do anything that relies on timing in CI jobs. I don't think a short connect timeout is actually testing anything about correctness—I imagine its main purpose is making bad tests fail faster, for better DX. This probably applies mostly to running tests locally. It's not that helpful in CI though. If the tests don't actually need it to be correct, and it's causing problems, why have it at all?
    • cons: actually bad tests are slower to fail during local development. And if something systematically starts making cluster connection take 2-3x longer, we might not notice.

@gjoseph92 (Collaborator, Author)

Waaait I think this flakiness was introduced in #6231 @graingert.

Prior to that commit, the loop fixture (which all of these flaky tests use) didn't call or depend on clean.

That change added the cleanup fixture to loop.

Prior to that, fixtures that set the 5s timeout (by calling clean directly) were:

  • cluster
  • gen_test
  • gen_cluster
  • cleanup

loop and loop_in_thread did not used to call clean and therefore didn't have a 5s timeout. It's good that they call clean now, but maybe not appropriate that they have the timeout.


Another way to say that is: should this _reconfigure function actually be a part of clean? It doesn't seem related to the job of check_thread_leak, check_process_leak, and check_instances.

@contextmanager
def _reconfigure():
    reset_config()

    with dask.config.set(
        {
            "local_directory": tempfile.gettempdir(),
            "distributed.comm.timeouts.connect": "5s",
            "distributed.admin.tick.interval": "500 ms",
            "distributed.worker.profile.enabled": False,
        }
    ):
        # Restore default logging levels
        # XXX use pytest hooks/fixtures instead?
        for name, level in logging_levels.items():
            logging.getLogger(name).setLevel(level)

        yield


@contextmanager
def clean(threads=True, instances=True, processes=True):
    asyncio.set_event_loop(None)
    with check_thread_leak() if threads else nullcontext():
        with check_process_leak(check=processes):
            with check_instances() if instances else nullcontext():
                with _reconfigure():
                    yield
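
If the timeout were pulled out of _reconfigure, one rough sketch of where it could live instead (not necessarily what #6822 ended up doing) is a small helper that only the in-process cluster fixtures wrap around:

from contextlib import contextmanager

import dask


@contextmanager
def _short_connect_timeout():
    # In-process schedulers should accept connections almost immediately, so a
    # 5s bound is reasonable here; subprocess-based tests would no longer
    # inherit it just by using the loop fixture.
    with dask.config.set({"distributed.comm.timeouts.connect": "5s"}):
        yield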

cc @hendrikmakait

gjoseph92 added a commit to gjoseph92/distributed that referenced this issue Aug 3, 2022
This is basically option 3 in dask#6731 (comment). I can't think of a justification for setting this timeout globally. All the other things in there are necessary to make things run more reasonably in tests. The timeout is the opposite; there's nothing about CI that should make us think connections will be faster.