test_bad_disk flaky #7208

Closed
jrbourbeau opened this issue Oct 27, 2022 · 1 comment · Fixed by #7300
Labels
flaky test (Intermittent failures on CI), tests (Unit tests and/or continuous integration)

Comments

@jrbourbeau
Member

distributed/shuffle/tests/test_shuffle.py::test_bad_disk has started failing on main with the traceback below. See this CI run for an example.

________________________________ test_bad_disk _________________________________
1 thread(s) were leaked from test

------ Call stack of leaked thread 1/1: <Thread(ThreadPoolExecutor-69_0, started 140170015799040)> ------
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 937, in _bootstrap
	self._bootstrap_inner()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 980, in _bootstrap_inner
	self.run()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/threading.py", line 917, in run
	self._target(*self._args, **self._kwargs)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/concurrent/futures/thread.py", line 81, in _worker
	work_item = work_queue.get(block=True)
----------------------------- Captured stderr call -----------------------------
2022-10-27 14:34:53,950 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 7)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 7, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,951 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 1)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 1, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,955 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 0)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 0, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,987 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 2)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 2, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,987 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 5)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 5, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,987 - distributed.worker - WARNING - Compute Failed
Key:       ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 3)
Function:  shuffle_unpack
args:      ('3110a8a90a5b642409b0a20f83b03722', 3, None)
kwargs:    {}
Exception: "FileNotFoundError(2, 'No such file or directory')"

2022-10-27 14:34:53,990 - distributed.worker - ERROR - Exception during execution of task ('shuffle-p2p-4651c82ee05b6682b3f73b2b18ad74e5', 4).
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2341, in _prepare_args_for_execution
    data[k] = self.data[k]
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.9/site-packages/zict/buffer.py", line 108, in __getitem__
    raise KeyError(key)
KeyError: 'shuffle-barrier-3110a8a90a5b642409b0a20f83b03722'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2239, in execute
    args2, kwargs2 = self._prepare_args_for_execution(ts, args, kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2345, in _prepare_args_for_execution
    data[k] = Actor(type(self.state.actors[k]), self.address, k, self)
KeyError: 'shuffle-barrier-3110a8a90a5b642409b0a20f83b03722'
2022-10-27 14:34:53,996 - distributed.diskutils - ERROR - Failed to remove '/tmp/dask-worker-space/worker-dt9g6wgx' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/tmp/dask-worker-space/worker-dt9g6wgx'
2022-10-27 14:34:53,996 - distributed.diskutils - ERROR - Failed to remove '/tmp/dask-worker-space/worker-ihvksve8' (failed in <built-in function lstat>): [Errno 2] No such file or directory: '/tmp/dask-worker-space/worker-ihvksve8'
------------------------------ Captured log call -------------------------------
ERROR    asyncio:base_events.py:1753 Task exception was never retrieved
future: <Task finished name='Task-65302' coro=<Shuffle.receive() done, defined at /home/runner/work/distributed/distributed/distributed/shuffle/_shuffle_extension.py:142> exception=FileNotFoundError(2, 'No such file or directory')>
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_shuffle_extension.py", line 148, in receive
    raise self._exception
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_shuffle_extension.py", line 172, in receive
    await self.multi_file.put(groups)
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_multi_file.py", line 124, in put
    raise self._exception
  File "/home/runner/work/distributed/distributed/distributed/shuffle/_multi_file.py", line 202, in process
    with open(
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/dask-worker-space/worker-ihvksve8/shuffle-3110a8a90a5b642409b0a20f83b03722/1'
- generated xml file: /home/runner/work/distributed/distributed/reports/pytest.xml -
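
For readers unfamiliar with the last traceback: it ends in an `open(...)` call on a path under a worker directory that no longer exists. The sketch below reproduces only that generic error using the standard library. It does not use `distributed`, and it makes no claim about why the directory disappears in the test. The names `append_shard` and `demo` and the `shuffle-example` directory are hypothetical, chosen for illustration.

```python
import errno
import os
import shutil
import tempfile


def append_shard(directory: str, shard_id: int, payload: bytes) -> None:
    """Append bytes to a per-shard file inside ``directory`` (hypothetical helper)."""
    with open(os.path.join(directory, str(shard_id)), mode="ab") as f:
        f.write(payload)


def demo() -> int:
    """Return the errno raised when the target directory vanishes between writes."""
    base = tempfile.mkdtemp()
    shuffle_dir = os.path.join(base, "shuffle-example")
    os.mkdir(shuffle_dir)

    # First write succeeds because the directory exists.
    append_shard(shuffle_dir, 1, b"first batch")

    # Stand-in for the worker directory being removed while writes are pending.
    shutil.rmtree(base)

    try:
        # Opening a file under a missing directory raises FileNotFoundError.
        append_shard(shuffle_dir, 1, b"second batch")
    except FileNotFoundError as exc:
        return exc.errno
    return 0


if __name__ == "__main__":
    print(demo() == errno.ENOENT)
```

Opening in append mode creates a missing file but never a missing parent directory, so any write that races with directory cleanup fails this way.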

cc @fjetter as I know you've made some shuffle-related changes recently (not sure if they're related though)

jrbourbeau added the "flaky test" and "tests" labels on Oct 27, 2022
@fjetter
Member

fjetter commented Oct 28, 2022

Thank you. I am aware of this and am working on it.
