Flaky test_dashboard_non_standard_ports #6752

Closed
gjoseph92 opened this issue Jul 20, 2022 · 1 comment · Fixed by #6822
Labels
flaky test: Intermittent failures on CI.

Comments

@gjoseph92 (Collaborator)

______________________ test_dashboard_non_standard_ports _______________________
ConnectionRefusedError: [Errno 61] Connection refused
The above exception was the direct cause of the following exception:
addr = 'tcp://127.0.0.1:49506', timeout = 5, deserialize = True
handshake_overrides = None
connection_args = {'extra_conn_args': {}, 'require_encryption': False, 'ssl_context': None}
scheme = 'tcp', loc = '127.0.0.1:49506'
backend = <distributed.comm.tcp.TCPBackend object at 0x10ba27dc0>
connector = <distributed.comm.tcp.TCPConnector object at 0x13713d840>
comm = None, time_left = <function connect.<locals>.time_left at 0x1371d32e0>
backoff_base = 0.01
    async def connect(
        addr, timeout=None, deserialize=True, handshake_overrides=None, **connection_args
    ):
        """
        Connect to the given address (a URI such as ``tcp://127.0.0.1:1234``)
        and yield a ``Comm`` object.  If the connection attempt fails, it is
        retried until the *timeout* is expired.
        """
        if timeout is None:
            timeout = dask.config.get("distributed.comm.timeouts.connect")
        timeout = parse_timedelta(timeout, default="seconds")
        scheme, loc = parse_address(addr)
        backend = registry.get_backend(scheme)
        connector = backend.get_connector()
        comm = None
        start = time()
        def time_left():
            deadline = start + timeout
            return max(0, deadline - time())
        backoff_base = 0.01
        attempt = 0
        # Prefer multiple small attempts over one long attempt. This should protect
        # primarily from DNS race conditions
        # gh3104, gh4176, gh4167
        intermediate_cap = timeout / 5
        active_exception = None
        while time_left() > 0:
            try:
>               comm = await asyncio.wait_for(
                    connector.connect(loc, deserialize=deserialize, **connection_args),
                    timeout=min(intermediate_cap, time_left()),
                )
distributed/comm/core.py:291: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
fut = <Task finished name='Task-181' coro=<BaseTCPConnector.connect() done, defined at /Users/runner/work/distributed/distri...'in <distributed.comm.tcp.TCPConnector object at 0x13713d840>: ConnectionRefusedError: [Errno 61] Connection refused')>
timeout = 0.32024097442626953
    async def wait_for(fut, timeout):
        """Wait for the single Future or coroutine to complete, with timeout.
        Coroutine will be wrapped in Task.
        Returns result of the Future or coroutine.  When a timeout occurs,
        it cancels the task and raises TimeoutError.  To avoid the task
        cancellation, wrap it in shield().
        If the wait is cancelled, the task is also cancelled.
        This function is a coroutine.
        """
        loop = events.get_running_loop()
        if timeout is None:
            return await fut
        if timeout <= 0:
            fut = ensure_future(fut, loop=loop)
            if fut.done():
                return fut.result()
            await _cancel_and_wait(fut, loop=loop)
            try:
                return fut.result()
            except exceptions.CancelledError as exc:
                raise exceptions.TimeoutError() from exc
        waiter = loop.create_future()
        timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
        cb = functools.partial(_release_waiter, waiter)
        fut = ensure_future(fut, loop=loop)
        fut.add_done_callback(cb)
        try:
            # wait until the future completes or the timeout
            try:
                await waiter
            except exceptions.CancelledError:
                if fut.done():
                    return fut.result()
                else:
                    fut.remove_done_callback(cb)
                    # We must ensure that the task is not running
                    # after wait_for() returns.
                    # See https://bugs.python.org/issue32751
                    await _cancel_and_wait(fut, loop=loop)
                    raise
            if fut.done():
>               return fut.result()
../../../miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py:445: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
self = <distributed.comm.tcp.TCPConnector object at 0x13713d840>
address = '127.0.0.1:49506', deserialize = True
connection_args = {'extra_conn_args': {}, 'require_encryption': False, 'ssl_context': None}
ip = '127.0.0.1', port = 49506, kwargs = {}
    async def connect(self, address, deserialize=True, **connection_args):
        self._check_encryption(address, connection_args)
        ip, port = parse_host_port(address)
        kwargs = self._get_connect_args(**connection_args)
        try:
            stream = await self.client.connect(
                ip, port, max_buffer_size=MAX_BUFFER_SIZE, **kwargs
            )
            # Under certain circumstances tornado will have a closed connection with an
            # error and not raise a StreamClosedError.
            #
            # This occurs with tornado 5.x and openssl 1.1+
            if stream.closed() and stream.error:
                raise StreamClosedError(stream.error)
        except StreamClosedError as e:
            # The socket connect() call failed
>           convert_stream_closed_error(self, e)
distributed/comm/tcp.py:461: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
obj = <distributed.comm.tcp.TCPConnector object at 0x13713d840>
exc = ConnectionRefusedError(61, 'Connection refused')
    def convert_stream_closed_error(obj, exc):
        """
        Re-raise StreamClosedError as CommClosedError.
        """
        if exc.real_error is not None:
            # The stream was closed because of an underlying OS error
            exc = exc.real_error
            if isinstance(exc, ssl.SSLError):
                if "UNKNOWN_CA" in exc.reason:
                    raise FatalCommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}")
>           raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
E           distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x13713d840>: ConnectionRefusedError: [Errno 61] Connection refused
distributed/comm/tcp.py:142: CommClosedError
The above exception was the direct cause of the following exception:
loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x13713f580>
    def test_dashboard_non_standard_ports(loop):
        pytest.importorskip("bokeh")
        port1 = open_port()
        port2 = open_port()
        with popen(
            [
                "dask-scheduler",
                f"--port={port1}",
                f"--dashboard-address=:{port2}",
            ]
        ) as proc:
>           with Client(f"127.0.0.1:{port1}", loop=loop) as c:
distributed/cli/tests/test_dask_scheduler.py:118: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
distributed/client.py:940: in __init__
    self.start(timeout=timeout)
distributed/client.py:1098: in start
    sync(self.loop, self._start, **kwargs)
distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
distributed/utils.py:378: in f
    result = yield future
../../../miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/gen.py:762: in run
    value = future.result()
distributed/client.py:1178: in _start
    await self._ensure_connected(timeout=timeout)
distributed/client.py:1241: in _ensure_connected
    comm = await connect(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
addr = 'tcp://127.0.0.1:49506', timeout = 5, deserialize = True
handshake_overrides = None
connection_args = {'extra_conn_args': {}, 'require_encryption': False, 'ssl_context': None}
scheme = 'tcp', loc = '127.0.0.1:49506'
backend = <distributed.comm.tcp.TCPBackend object at 0x10ba27dc0>
connector = <distributed.comm.tcp.TCPConnector object at 0x13713d840>
comm = None, time_left = <function connect.<locals>.time_left at 0x1371d32e0>
backoff_base = 0.01
    async def connect(
        addr, timeout=None, deserialize=True, handshake_overrides=None, **connection_args
    ):
        """
        Connect to the given address (a URI such as ``tcp://127.0.0.1:1234``)
        and yield a ``Comm`` object.  If the connection attempt fails, it is
        retried until the *timeout* is expired.
        """
        if timeout is None:
            timeout = dask.config.get("distributed.comm.timeouts.connect")
        timeout = parse_timedelta(timeout, default="seconds")
        scheme, loc = parse_address(addr)
        backend = registry.get_backend(scheme)
        connector = backend.get_connector()
        comm = None
        start = time()
        def time_left():
            deadline = start + timeout
            return max(0, deadline - time())
        backoff_base = 0.01
        attempt = 0
        # Prefer multiple small attempts over one long attempt. This should protect
        # primarily from DNS race conditions
        # gh3104, gh4176, gh4167
        intermediate_cap = timeout / 5
        active_exception = None
        while time_left() > 0:
            try:
                comm = await asyncio.wait_for(
                    connector.connect(loc, deserialize=deserialize, **connection_args),
                    timeout=min(intermediate_cap, time_left()),
                )
                break
            except FatalCommClosedError:
                raise
            # Note: CommClosed inherits from OSError
            except (asyncio.TimeoutError, OSError) as exc:
                active_exception = exc
                # As described above, the intermediate timeout is used to distribute
                # initial, bulk connect attempts homogeneously. In particular with
                # the jitter upon retries we should not be worried about overloading
                # any more DNS servers
                intermediate_cap = timeout
                # FullJitter see https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
                upper_cap = min(time_left(), backoff_base * (2**attempt))
                backoff = random.uniform(0, upper_cap)
                attempt += 1
                logger.debug(
                    "Could not connect to %s, waiting for %s before retrying", loc, backoff
                )
                await asyncio.sleep(backoff)
        else:
>           raise OSError(
                f"Timed out trying to connect to {addr} after {timeout} s"
            ) from active_exception
E           OSError: Timed out trying to connect to tcp://127.0.0.1:49506 after 5 s
distributed/comm/core.py:317: OSError
----------------------------- Captured stderr call -----------------------------
2022-07-18 18:18:32,721 - distributed.scheduler - INFO - -----------------------------------------------
2022-07-18 18:18:35,009 - distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2022-07-18 18:18:35,325 - distributed.scheduler - INFO - State start
2022-07-18 18:18:35,338 - distributed.scheduler - INFO - -----------------------------------------------
2022-07-18 18:18:35,338 - distributed.scheduler - INFO - Clear task state
2022-07-18 18:18:35,339 - distributed.scheduler - INFO -   Scheduler at:    tcp://10.79.7.43:49506
2022-07-18 18:18:35,339 - distributed.scheduler - INFO -   dashboard at:                    :49507
2022-07-18 18:18:35,397 - distributed._signals - INFO - Received signal SIGINT (2)
2022-07-18 18:18:35,409 - distributed.scheduler - INFO - Scheduler closing...
2022-07-18 18:18:35,410 - distributed.scheduler - INFO - Scheduler closing all comms
2022-07-18 18:18:35,415 - distributed.scheduler - INFO - Stopped scheduler at 'tcp://10.79.7.43:49506'
2022-07-18 18:18:35,420 - distributed.scheduler - INFO - End scheduler

https://github.com/dask/distributed/runs/7395221705?check_suite_focus=true#step:11:2156
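
For context, the retry loop in `connect` above is capped exponential backoff with full jitter, per the AWS article referenced in the source comment: each failed attempt doubles an upper bound, and the actual sleep is drawn uniformly from [0, bound]. A minimal standalone sketch of that pattern (the `try_connect` callable and parameter names here are illustrative, not distributed's API):

```python
import asyncio
import random


async def connect_with_backoff(try_connect, timeout=5.0, backoff_base=0.01):
    """Retry `try_connect` until `timeout` expires, backing off with full jitter.

    `try_connect` is a hypothetical zero-argument coroutine function; this
    mirrors the structure of the loop in distributed/comm/core.py, not its API.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    attempt = 0
    active_exception = None
    while (time_left := deadline - loop.time()) > 0:
        try:
            # Bound each attempt by the remaining budget so a hung attempt
            # cannot outlive the overall deadline.
            return await asyncio.wait_for(try_connect(), timeout=time_left)
        except (asyncio.TimeoutError, OSError) as exc:
            active_exception = exc
            # Full jitter: sleep a uniform amount in [0, backoff_base * 2**attempt],
            # capped by the time remaining until the deadline.
            upper_cap = min(max(0.0, deadline - loop.time()), backoff_base * 2**attempt)
            attempt += 1
            await asyncio.sleep(random.uniform(0, upper_cap))
    raise OSError(f"Timed out trying to connect after {timeout} s") from active_exception
```

The jitter keeps many clients that start retrying at the same moment from hammering the target in synchronized waves, which matters when a test spawns a scheduler subprocess and the client begins connecting before the port is listening.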

@gjoseph92 (Collaborator, Author)

2022-09-08 00:33:56,409 - distributed.scheduler - INFO - State start
2022-09-08 00:33:56,450 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:50113
2022-09-08 00:33:56,450 - distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
2022-09-08 00:33:56,463 - distributed.scheduler - INFO - Receive client connection: Client-ec564c4a-2f0d-11ed-b1b4-00505688f765
2022-09-08 00:33:56,464 - distributed.core - INFO - Starting established connection
2022-09-08 00:34:34,616 - distributed.dask_worker - INFO - End worker

Aborted!
2022-09-08 00:35:36,615 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 68.92s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2022-09-08 00:35:47,265 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 11.45s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2022-09-08 00:35:53,705 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 6.44s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2022-09-08 00:36:03,869 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 9.38s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2022-09-08 00:36:07,336 - distributed.scheduler - INFO - Remove client Client-ec564c4a-2f0d-11ed-b1b4-00505688f765
2022-09-08 00:36:11,010 - distributed.scheduler - INFO - Remove client Client-ec564c4a-2f0d-11ed-b1b4-00505688f765
2022-09-08 00:36:11,013 - distributed.core - INFO - Event loop was unresponsive in Scheduler for 7.93s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
2022-09-08 00:36:11,072 - distributed.scheduler - INFO - Close client connection: Client-ec564c4a-2f0d-11ed-b1b4-00505688f765
2022-09-08 00:36:11,357 - distributed.scheduler - INFO - Scheduler closing...
2022-09-08 00:36:11,365 - distributed.scheduler - INFO - Scheduler closing all comms

https://github.com/dask/distributed/runs/8239632429?check_suite_focus=true#step:18:2034

68 seconds?!
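
Those "Event loop was unresponsive" warnings come from a loop monitor inside distributed; the general technique is to schedule a periodic wake-up and measure how late it fires. A rough sketch of that idea (an assumption-level illustration, not distributed's actual monitor):

```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)


async def watch_event_loop(interval: float = 0.5, threshold: float = 3.0) -> None:
    """Warn when the event loop stalls for longer than `threshold` seconds.

    If asyncio.sleep(interval) returns much later than scheduled, something
    blocked the loop in between (e.g. a long GIL-holding call). This mirrors
    the idea behind distributed's warning, not its actual implementation.
    """
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        stalled = now - last  # wall time since the last wake-up
        if stalled - interval > threshold:
            logger.warning("Event loop was unresponsive for %.2fs.", stalled)
        last = now
```

Run as a background task (`asyncio.create_task(watch_event_loop())`) alongside the server; a stall like the 68.92 s one above would show up as a single late wake-up.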
