Update CI stability #6625

Open
fjetter opened this issue Jun 24, 2022 · 6 comments

fjetter (Member) commented Jun 24, 2022

Following a couple of recent merges, I yesterday triggered another "CI stress test" that runs our suite several times in a row (this time 10).

See https://github.com/fjetter/distributed/tree/stress_ci, which is based on dc019ed with fjetter@68689f0 on top.

The results of this test run can be seen at https://github.com/fjetter/distributed/runs/7029246894?check_suite_focus=true

Summary

Overall we had 80 jobs spread across the different OSs and Python versions, of which 32 failed.

| OS | success | failure | total | success rate |
| --- | --- | --- | --- | --- |
| Ubuntu | 28 | 2 | 30 | 93% |
| Windows | 13 | 17 | 30 | 43% |
| OSX | 7 | 13 | 20 | 35% |

If we look at an entire test run, i.e. a full test matrix for a given run number, not a single one of the ten runs would have been fully successful.

Looking at the kinds of test failures, we see that three jobs on Windows failed due to the GH Actions job timeout of 120 min, and two further test runs were cancelled by GitHub without further information, also on Windows.
The timed-out runs do not have anything obvious in common. In fact, one of the three appears to have finished running the pytest suite but still timed out.

A modified test report is available here: https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093

crusaderky (Collaborator) commented Jun 24, 2022

Does this test include #6591?

1.

I see that you ran on fjetter/distributed.
I have a feeling that whenever I run on crusaderky/distributed I get a lot more random failures than on dask/distributed - likely because on the free tier you have a higher chance of getting older/cheaper hardware?

2.

I strongly suspect a lot of failures may be related to #6271.
How many of the failed tests were using nannies?
How many were using more than 2 nannies? CI hosts have 2 CPUs each.

2b.

I think we could figure out a straightforward way to mass-skip all tests decorated with @gen_cluster(Worker=Nanny) and see how many failures remain.
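
As a rough illustration, a mass-skip could be done via a local `conftest.py` hack. The sketch below is a made-up heuristic, not part of the test suite: it assumes the `@gen_cluster(Worker=Nanny, ...)` decorator is visible in each test function's source and skips anything that matches.

```python
# conftest.py (hypothetical): skip every collected test whose source contains
# "Worker=Nanny", as a crude stand-in for "decorated with @gen_cluster(Worker=Nanny)".
import inspect

import pytest


def pytest_collection_modifyitems(config, items):
    skip_nanny = pytest.mark.skip(reason="temporarily skipping nanny-based tests")
    for item in items:
        func = getattr(item, "function", None)
        if func is None:
            continue
        try:
            source = inspect.getsource(func)
        except (OSError, TypeError):
            # e.g. dynamically generated tests without retrievable source
            continue
        if "Worker=Nanny" in source:
            item.add_marker(skip_nanny)
```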

fjetter (Member, Author) commented Jun 24, 2022

> Does this test include #6591?

yes

> likely because if you're in the free tier you have a higher chance of getting older/cheaper hardware?

the dask project is also on the free tier, isn't it?

fjetter (Member, Author) commented Jun 24, 2022

Most of the Windows tests are failing because of a disk permission problem during cleanup. @graingert suggested that using the pytest fixtures instead of tempfile would help with this.
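
For illustration, a minimal sketch of the suggested pattern (the test name is made up, not an actual test from the suite): pytest's built-in `tmp_path` fixture owns the directory lifetime and handles cleanup, instead of a hand-rolled `tempfile.TemporaryDirectory` that has to be removed while Windows may still hold file handles open.

```python
import pathlib


def test_spill_directory_cleanup(tmp_path: pathlib.Path) -> None:
    # tmp_path is a unique per-test directory created and cleaned up by pytest
    spill_dir = tmp_path / "spill"
    spill_dir.mkdir()
    (spill_dir / "chunk-0.bin").write_bytes(b"\x00" * 1024)
    assert (spill_dir / "chunk-0.bin").stat().st_size == 1024
    # no explicit shutil.rmtree / TemporaryDirectory.cleanup() needed here
```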

fjetter (Member, Author) commented Jun 29, 2022

New update based on a8eb3b2

using branch https://github.com/fjetter/distributed/tree/stress_ci

again running 10 iterations, see fjetter@3fe45b4

Test reports and the summary were generated with the code available here: fjetter@6be7790

The test report is again available at https://gistpreview.github.io/?ecc2cdddf651df9ee0c7966e210c9093/a8eb3b23b8fe91f52758db155e7151e3d516cbdc.html

| (OS, Python) | cancelled | failure | success | total | success_rate |
| --- | --- | --- | --- | --- | --- |
| ('macos-latest', '3.10') | 0 | 7 | 3 | 10 | 0.3 |
| ('macos-latest', '3.8') | 0 | 10 | 0 | 10 | 0 |
| ('ubuntu-latest', '3.10') | 0 | 0 | 10 | 10 | 1 |
| ('ubuntu-latest', '3.8') | 0 | 2 | 8 | 10 | 0.8 |
| ('ubuntu-latest', '3.9') | 0 | 0 | 10 | 10 | 1 |
| ('windows-latest', '3.10') | 0 | 5 | 5 | 10 | 0.5 |
| ('windows-latest', '3.8') | 1 | 1 | 8 | 10 | 0.8 |
| ('windows-latest', '3.9') | 3 | 2 | 5 | 10 | 0.5 |

On average, every full test matrix included 2.7 failures. Not a single full test matrix was successful.

We can observe, again, a couple of systematic problems:

  • CLI tests are timing out. They appear to try to connect to a scheduler that is not coming up; judging from the logs, the scheduler is stuck somewhere in init
  • A couple of attempts to schedule new coroutines after our new AsyncTaskGroup has already closed, e.g. distributed.deploy.tests.test_local.test_close_twice
  • A couple of unclosed comms or unclosed clusters (likely a missing `async with` in the individual tests; see the sketch after this list)
  • A couple of leaked threads
  • Nanny startup failures due to Windows disk errors (e.g. test_chaos_rechunk, see #6641 "Flaky test_chaos_rechunk due to windows disk errors")
  • Very likely still more that I missed
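
As a reminder of the cleanup pattern referred to in the third bullet, here is a minimal, self-contained sketch. The body is made up; it only illustrates that `async with` closes the client and cluster even when an assertion fails, which avoids unclosed-comm and unclosed-cluster warnings:

```python
import asyncio

from distributed import Client, LocalCluster


async def main() -> None:
    # both objects are closed on exit, even if the assertion below fails
    async with LocalCluster(n_workers=1, processes=False, asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            result = await client.submit(lambda x: x + 1, 1)
            assert result == 2


if __name__ == "__main__":
    asyncio.run(main())
```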

fjetter (Member, Author) commented Aug 11, 2022

New update based on e1b9e20

using branch https://github.com/fjetter/distributed/tree/stress_ci

test_report.html.zip

High-level summary

| (OS, Python) | failure | success | cancelled | total | success_rate |
| --- | --- | --- | --- | --- | --- |
| ('macos-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
| ('macos-latest', '3.8') | 10 | 0 | 0 | 10 | 0 |
| ('ubuntu-latest', '3.10') | 3 | 7 | 0 | 10 | 0.7 |
| ('ubuntu-latest', '3.8') | 4 | 6 | 0 | 10 | 0.6 |
| ('ubuntu-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |
| ('windows-latest', '3.10') | 0 | 10 | 0 | 10 | 1 |
| ('windows-latest', '3.8') | 0 | 10 | 0 | 10 | 1 |
| ('windows-latest', '3.9') | 2 | 8 | 0 | 10 | 0.8 |

We're already performing much better, with the exception of OSX py3.8, where not a single run was successful. Below are a few detailed reports about the kinds of errors encountered.

Detailed error report

This is a groupby on the truncated error message, used as a proxy for a fuzzy match:

| message_trunc | test | PR with possible fix |
| --- | --- | --- |
| AssertionError: assert 3 == 1 | ['test_avoid_churn'] | |
| AssertionError: assert False | ['test_restart_waits_for_new_workers'] | |
| assert not b"Future excep | ['test_quiet_close_process[False]'] | #6857 |
| asyncio.exceptions.TimeoutErro | ['test_reconnect' 'test_shutdown_localcluster' 'test_wait_for_scheduler'] | |
| failed on teardown with " | ['test_broken_worker' 'test_local_cluster_redundant_kwarg[True]'] | #6865 or #6863 |
| pytest.PytestUnraisableExcepti | ['test_local_client_warning' 'test_release_retry' 'test_client_cluster_synchronous' 'test_run_spec' 'test_dont_select_closed_worker'] | #6865 or #6863 |
Full error messages
message_trunc message test
AssertionError: assert 3 == 1 ['AssertionError: assert 3 == 1\n + where 3 = len(deque([(1660136972.0109015, {'n': 1, 'status': 'up'}), (1660136972.6130333, {'status': 'down', 'workers': [0]}), (1660136972.6989691, {'n': 1, 'status': 'up'})]))\n + where deque([(1660136972.0109015, {'n': 1, 'status': 'up'}), (1660136972.6130333, {'status': 'down', 'workers': [0]}), (1660136972.6989691, {'n': 1, 'status': 'up'})]) = <distributed.deploy.adaptive.Adaptive object at 0x000002420E9220A0>.log'] ['test_avoid_churn']
AssertionError: assert False ['AssertionError: assert False\n + where False = <bound method Set.isdisjoint of SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}))>(dict_keys(['tcp://127.0.0.1:33567', 'tcp://127.0.0.1:37071', 'tcp://127.0.0.1:38955', 'tcp://127.0.0.1:39951', 'tcp://127.0.0.1:46081']))\n + where <bound method Set.isdisjoint of SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}))> = SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})).isdisjoint\n + where SortedKeysView(SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})) = <bound method SortedDict.keys of SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})>()\n + where <bound method SortedDict.keys of SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 
'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>})> = SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}).keys\n + where SortedDict({'tcp://127.0.0.1:38573': <WorkerState 'tcp://127.0.0.1:38573', name: 0, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 3, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:39271': <WorkerState 'tcp://127.0.0.1:39271', name: 2, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:41387': <WorkerState 'tcp://127.0.0.1:41387', name: 4, status: running, memory: 0, processing: 0>, 'tcp://127.0.0.1:46443': <WorkerState 'tcp://127.0.0.1:46443', name: 1, status: running, memory: 0, processing: 0>}) = <Scheduler 'tcp://127.0.0.1:37253', workers: 5, cores: 5, tasks: 0>.workers\n + and dict_keys(['tcp://127.0.0.1:33567', 'tcp://127.0.0.1:37071', 'tcp://127.0.0.1:38955', 'tcp://127.0.0.1:39951', 'tcp://127.0.0.1:46081']) = <built-in method keys of dict object at 0x7f139cf1f640>()\n + where <built-in method keys of dict object at 0x7f139cf1f640> = {'tcp://127.0.0.1:33567': <WorkerState 'tcp://127.0.0.1:33567', name: 2, status: closed, memory: 0, processing: 0>, 'tcp://127.0.0.1:37071': <WorkerState 'tcp://127.0.0.1:37071', name: 3, status: closed, memory: 0, processing: 0>, 'tcp://127.0.0.1:38955': <WorkerState 'tcp://127.0.0.1:38955', name: 1, status: closed, memory: 0, processing: 0>, 'tcp://127.0.0.1:39951': <WorkerState 'tcp://127.0.0.1:39951', name: 0, status: closed, memory: 0, processing: 0>, ...}.keys'] ['test_restart_waits_for_new_workers']
assert not b"Future excep ['assert not b"Future exception was never retrieved\nfuture: <Future finished exception=TimeoutError('Timeout')>\ntornado.util.TimeoutError: Timeout\n"'] ['test_quiet_close_process[False]']
asyncio.exceptions.TimeoutErro ['asyncio.exceptions.TimeoutError'] ['test_reconnect' 'test_shutdown_localcluster' 'test_wait_for_scheduler']
failed on teardown with " ['failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object TCP.write at 0x7fe0dce7d0e0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'TCP.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f47a9344240>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f9e60358b40>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f139614b940>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited"' 'failed on teardown with "pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object TCP.write at 0x7f3f7afa7c40>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'TCP.write' was never awaited"'] ['test_broken_worker' 'test_local_cluster_redundant_kwarg[True]']
pytest.PytestUnraisableExcepti ['pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f776c0c2c00>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7ff32edb3b50>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f3c97bdccc0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object TCP.write at 0x7f3c988d12c0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'TCP.write' was never awaited' 'pytest.PytestUnraisableExceptionWarning: Exception ignored in: <coroutine object InProc.write at 0x7f37f936aac0>\n\nTraceback (most recent call last):\n File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.8/warnings.py", line 506, in _warn_unawaited_coroutine\n warn(msg, category=RuntimeWarning, stacklevel=2, source=coro)\nRuntimeWarning: coroutine 'InProc.write' was never awaited'] ['test_local_client_warning' 'test_release_retry' 'test_client_cluster_synchronous' 'test_run_spec' 'test_dont_select_closed_worker']
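
For context, the grouping above can be reproduced with a small pandas groupby on a truncated message column. The snippet below is only a sketch with made-up column names and example rows; the actual report code is in fjetter@6be7790.

```python
import pandas as pd

# toy failure records standing in for the parsed CI report
failures = pd.DataFrame(
    {
        "test": ["test_reconnect", "test_wait_for_scheduler", "test_avoid_churn"],
        "message": [
            "asyncio.exceptions.TimeoutError",
            "asyncio.exceptions.TimeoutError",
            "AssertionError: assert 3 == 1\n + where 3 = len(...)",
        ],
    }
)

# truncate the message to its first 30 characters as a cheap fuzzy match
failures["message_trunc"] = failures["message"].str.slice(0, 30)
summary = (
    failures.groupby("message_trunc")["test"]
    .apply(lambda tests: sorted(set(tests)))
    .reset_index()
)
print(summary)
```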

fjetter (Member, Author) commented Aug 11, 2022

The timeout errors in row three appear to raise a couple of `distributed.core.AsyncTaskGroupClosedError: Cannot schedule a new coroutine function as the group is already closed` errors.
