
@gen_cluster with many nannies causes flakiness #5662

Open
crusaderky opened this issue Jan 14, 2022 · 0 comments
Any test decorated as follows:

@gen_cluster(client=True, nthreads=[("", 1)] * 10, Worker=Nanny)

is extremely flaky on CI, regardless of environment; the message is

distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Nanny failed to start in 15 seconds

This issue is glaring with 10 nannies, but I suspect it may also be happening, more sporadically, with fewer of them.
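
For reference, a minimal sketch of a test in this shape (the test name and body are made up for illustration; only the decorator arguments come from the report):

from distributed import Nanny
from distributed.utils_test import gen_cluster

@gen_cluster(client=True, nthreads=[("", 1)] * 10, Worker=Nanny)
async def test_ten_nannies(c, s, *workers):
    # With client=True the decorator passes a Client, the Scheduler, and the
    # ten nanny-managed workers into the test; spawning ten worker processes
    # is what intermittently exceeds the 15-second nanny start timeout on CI.
    assert len(workers) == 10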

See for example https://github.com/dask/distributed/runs/4805666018?check_suite_focus=true :

E           tornado.util.TimeoutError: Operation timed out after 60 seconds

../../../miniconda3/envs/dask-distributed/lib/python3.9/site-packages/tornado/ioloop.py:529: TimeoutError
----------------------------- Captured stderr call -----------------------------
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Nanny failed to start in 15 seconds; retrying
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 468, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 264, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/utils_test.py", line 951, in coro
    s, ws = await start_cluster(
  File "/Users/runner/work/distributed/distributed/distributed/utils_test.py", line 852, in start_cluster
    await asyncio.gather(*workers)
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 692, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 268, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 15 seconds
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Nanny failed to start in 15 seconds; retrying
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 468, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 264, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/utils_test.py", line 951, in coro
    s, ws = await start_cluster(
  File "/Users/runner/work/distributed/distributed/distributed/utils_test.py", line 852, in start_cluster
    await asyncio.gather(*workers)
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 692, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 268, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 15 seconds
distributed.worker - INFO - Stopping worker
distributed.worker - INFO - Closed worker has not yet started: Status.undefined
distributed.worker - INFO - Stopping worker
distributed.worker - INFO - Closed worker has not yet started: Status.undefined
distributed.worker - INFO - Stopping worker
distributed.worker - INFO - Closed worker has not yet started: Status.undefined
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Nanny failed to start in 15 seconds; retrying
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 338, in start
    response = await self.instantiate()
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 407, in instantiate
    result = await asyncio.wait_for(
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 468, in wait_for
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 264, in _
    await asyncio.wait_for(self.start(), timeout=timeout)
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/utils_test.py", line 951, in coro
    s, ws = await start_cluster(
  File "/Users/runner/work/distributed/distributed/distributed/utils_test.py", line 852, in start_cluster
    await asyncio.gather(*workers)
  File "/Users/runner/miniconda3/envs/dask-distributed/lib/python3.9/asyncio/tasks.py", line 692, in _wrap_awaitable
    return (yield from awaitable.__await__())
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 268, in _
    raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 15 seconds