
Flaky tests: OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s #6731

Closed
gjoseph92 opened this issue Jul 15, 2022 · 8 comments · Fixed by #6822
Labels: flaky test (Intermittent failures on CI.)

@gjoseph92 (Collaborator)

Catch-all issue for tests failing like OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s while the client is trying to connect to the scheduler.

TODO: link other flaky-test issues related to this.

gjoseph92 added the "flaky test" label (Intermittent failures on CI.) on Jul 15, 2022

@gjoseph92 (Collaborator, Author)

This should have been obvious, but I just realized it now: every one of these tests involves running the scheduler in a subprocess with some form of popen(["dask-scheduler"]) and connecting to it. The majority of them are in test_dask_scheduler.py. Those also tend to run at the very beginning of the test suite, so if they were leaking subprocesses or ports or something, they could be affecting most of the subsequent tests.

Also, the stderr output of most failing tests shows distributed._signals - INFO - Received signal SIGINT (2), and according to the timestamps it's often within less than 5s of the scheduler starting. I don't think that's the cause of the problem; rather, an exception is probably being raised inside the popen contextmanager in the test code (possibly the client connection error), which causes popen to send SIGINT to the subprocess.
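
For reference, a minimal sketch of that pattern (the test name is made up; popen is assumed to be the subprocess helper from distributed.utils_test):

from distributed import Client
from distributed.utils_test import popen


def test_scheduler_subprocess_sketch(loop, requires_default_ports):
    # Start dask-scheduler as a subprocess; on exit (or when an exception is
    # raised inside this block) the context manager signals the subprocess.
    with popen(["dask-scheduler"]):
        # If this connection takes longer than the configured
        # distributed.comm.timeouts.connect (5s under the loop fixture),
        # we get the OSError above and the test fails.
        with Client("tcp://127.0.0.1:8786", loop=loop):
            pass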

@gjoseph92 (Collaborator, Author)

Current theory is that this is actually really simple:

  1. importing pandas (or something else) is especially slow on the macOS runners for some reason
  2. it takes more than 5s for the dask-scheduler subprocess to be available for a connection
  3. the client times out and fails the test

We can reproduce a similar-looking error by trying to launch a Client when there's no scheduler available at all:

def test_wat(loop, requires_default_ports):
    with Client(f"127.0.0.1:{Scheduler.default_port}", loop=loop) as c:
        pass
...
>           raise OSError(
                f"Timed out trying to connect to {addr} after {timeout} s"
            ) from active_exception
E           OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s

distributed/comm/core.py:317: OSError
================================================================================== slowest 20 durations ==================================================================================
5.00s call     distributed/cli/tests/test_dask_scheduler.py::test_wat
0.01s teardown distributed/cli/tests/test_dask_scheduler.py::test_wat

(1 durations < 0.005s hidden.  Use -vv to show these durations.)
================================================================================ short test summary info =================================================================================
FAILED distributed/cli/tests/test_dask_scheduler.py::test_wat - OSError: Timed out trying to connect to tcp://127.0.0.1:8786 after 5 s

We can see from the test duration that it actually does wait the full 5s; this isn't one of those cases where the error message claims a timeout happened, but it was actually a CancelledError or something that got converted into a TimeoutError.

FYI, the reason for the 5s timeout, rather than the default 30s connect timeout, is that the loop fixture depends on the clean fixture, which uses the clean contextmanager, which calls _reconfigure, which sets "distributed.comm.timeouts.connect": "5s".

So in theory, just using a longer timeout would probably make these tests pass most of the time.
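
For example, a sketch of what that could look like for a single test (same assumptions as the sketch above): dask.config.set, used as a context manager, temporarily overrides the fixture-level 5s value.

import dask
from distributed import Client
from distributed.utils_test import popen


def test_with_longer_connect_timeout(loop, requires_default_ports):
    # Override the 5s timeout that _reconfigure sets via the loop/clean fixtures;
    # the previous value is restored when the context manager exits.
    with dask.config.set({"distributed.comm.timeouts.connect": "30s"}):
        with popen(["dask-scheduler"]):
            with Client("tcp://127.0.0.1:8786", loop=loop):
                pass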

But I'm going to try to figure out why scheduler startup is so slow first. Ideally they would not all take 10s each, even on CI runners.

gjoseph92 self-assigned this on Jul 26, 2022
@gjoseph92 (Collaborator, Author)

In gjoseph92#4, I ran just the test_dask_scheduler.py tests on macOS, launched a py-spy profiler to record a number of known offenders, and uploaded the profiles as a test artifact. (I could see this being useful to do again in the future.)

Comparing profiles between flaky tests that failed and ones that didn't on a particular run, there wasn't an obvious difference. I expected something major to stand out, but they looked pretty similar.

[Screenshot: py-spy profile of a failed run ("Screen Shot 2022-08-02 at 1 09 03 PM")]

[Screenshot: py-spy profile of a successful run ("Screen Shot 2022-08-02 at 1 10 05 PM")]

In the second one, you can see the add_client (I added a sleep in there to ensure it would show up better with the sampling profiler). In the first, it doesn't happen because the client times out. In the successful case, the scheduler was done with imports after ~3.6s. In the failed case, it was done after ~4.1s.

So it seems like slow imports are actually the issue, and they're taking so long that normal variability is putting us right up against our 5s timeout.

Slow imports

Overall:

  • 1.90 seconds of imports at startup (before dask-scheduler.main starts running)
  • 2.09 seconds to import distributed.dashboard.scheduler (1s for pandas, 1s for pure bokeh)

So with 4s of imports before useful work can happen, a 5s connection timeout indeed seems likely to fail sometimes 😁

Some imports that stood out:

  1. bokeh ~2s
    1. pandas ~1s
  2. numpy (via distributed) ~260ms
  3. worker_state_machine.py ~200ms making dataclasses
  4. urllib3 ~170ms
  5. jinja2 (via dask) ~170ms
  6. cytoolz (via dask) ~140ms
  7. loading yaml (via dask) ~140ms
  8. yaml (via dask) ~130ms
  9. fsspec (via cluster_dump.py) ~110ms

None of these seem particularly avoidable (numpy will always be there #5729, urllib3/jinja/toolz/yaml/etc probably have to happen). A couple small things:

  • Defer the fsspec import in cluster_dump.py until the dump actually needs to be written (a rough sketch of this pattern follows this list)
  • Switch from dataclasses to msgspec structs in worker state machine?
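
The deferred fsspec import would look roughly like this (illustrative names and serialization, not the actual cluster_dump.py code):

import pickle


async def write_state(get_state, url: str) -> None:
    # Deferred import: fsspec is only loaded when a cluster dump is actually
    # written, keeping it off the scheduler's startup import path.
    import fsspec

    state = await get_state()
    with fsspec.open(url, mode="wb") as f:
        f.write(pickle.dumps(state))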

A larger change that @graingert proposed was to wait to set up the dashboard (aka import bokeh and pandas) until after we'd started listening for client connections. That way, we can be doing both at once (ish). The issue is that a number of the tests are assuming that once a client is connected, the dashboard must be up. We could refactor this, but I don't really want to deal with it. I don't imagine that it would make much difference for users in real life anyway.


What should we do

After all that, I think we should just make distributed.comm.timeouts.connect longer when we're connecting to a scheduler run in a subprocess 😄

The 5s connect timeout arguably makes sense for async or gen_cluster tests. No subprocesses, everything should be really quick. But when launching a scheduler from scratch in a new process, it's not a good idea to have expectations for the speed of that on a CI machine.

I did some archaeology and "distributed.comm.timeouts.connect": "5s" actually used to only be set in gen_cluster. Then in #2687, it was moved from there to the clean fixture, where it's been since.

The main question to me is, should we:

  1. Manually dask.config.set({"distributed.comm.timeouts.connect": "10s"}) in all the tests where we make a dask-scheduler subprocess and connect a client to it.
    • pros: well-contained
    • cons: tests added in the future (or that we miss) might forget to do this
  2. Just change the value to 10s in the _reconfigure contextmanager—updating it everywhere
    • pros: we know it'll cover every case, including future tests. A reasonable compromise, since it's still a relatively short timeout.
    • cons: we don't know what else it might affect. Or if it'll allow tests to pass/flake that should be failing consistently. Or if tests will still rarely, occasionally, need >10s to connect on a bad CI day.
  3. Move the "distributed.comm.timeouts.connect": "5s" setting from _reconfigure (where it gets set everywhere the loop fixture is used) to just gen_cluster, maybe cluster, maaaaybe gen_test?
    • pros: now only applying the short timeout to places we're certain it's appropriate
    • cons: wouldn't apply to manually-created clusters or some other esoteric cases where it could be appropriate—though not having it doesn't make the tests wrong.
  4. Remove the distributed.comm.timeouts.connect setting from _reconfigure entirely, and fall back on the default value of 30s.
    • pros: the justification for a short timeout in CI is actually a little dubious. We've been learning the hard way not to do anything that relies on timing in CI jobs. I don't think a short connect timeout is actually testing anything about correctness—I imagine its main purpose is making bad tests fail faster, for better DX. This probably applies mostly to running tests locally. It's not that helpful in CI though. If the tests don't actually need it to be correct, and it's causing problems, why have it at all?
    • cons: actually bad tests are slower to fail during local development. And if something systematically starts making cluster connection take 2-3x longer, we might not notice.

@gjoseph92 (Collaborator, Author)

Waaait I think this flakiness was introduced in #6231 @graingert.

Prior to that commit, the loop fixture (which all of these flaky tests use) didn't call or depend on clean.

That change added the cleanup fixture to loop.

Prior to that, fixtures that set the 5s timeout (by calling clean directly) were:

  • cluster
  • gen_test
  • gen_cluster
  • cleanup

loop and loop_in_thread did not used to call clean and therefore didn't have a 5s timeout. It's good that they call clean now, but maybe not appropriate that they have the timeout.


Another way to say that is: should this _reconfigure function actually be a part of clean? It doesn't seem related to the job of check_thread_leak, check_process_leak, and check_instances.

@contextmanager
def _reconfigure():
    reset_config()

    with dask.config.set(
        {
            "local_directory": tempfile.gettempdir(),
            "distributed.comm.timeouts.connect": "5s",
            "distributed.admin.tick.interval": "500 ms",
            "distributed.worker.profile.enabled": False,
        }
    ):
        # Restore default logging levels
        # XXX use pytest hooks/fixtures instead?
        for name, level in logging_levels.items():
            logging.getLogger(name).setLevel(level)

        yield


@contextmanager
def clean(threads=True, instances=True, processes=True):
    asyncio.set_event_loop(None)
    with check_thread_leak() if threads else nullcontext():
        with check_process_leak(check=processes):
            with check_instances() if instances else nullcontext():
                with _reconfigure():
                    yield
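
If the timeout were pulled out of _reconfigure, one rough sketch of where it could live instead (not necessarily what #6822 ended up doing) is a small helper that only the in-process cluster fixtures wrap around:

from contextlib import contextmanager

import dask


@contextmanager
def _short_connect_timeout():
    # In-process schedulers should accept connections almost immediately, so a
    # 5s bound is reasonable here; subprocess-based tests would no longer
    # inherit it just by using the loop fixture.
    with dask.config.set({"distributed.comm.timeouts.connect": "5s"}):
        yield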

cc @hendrikmakait

gjoseph92 added a commit to gjoseph92/distributed that referenced this issue Aug 3, 2022
This is basically option 3 in dask#6731 (comment). I can't think of a justification for setting this timeout globally. All the other things in there are necessary to make things run more reasonably in tests. The timeout is the opposite; there's nothing about CI that should make us think connections will be faster.