Validation error in transition_flight_missing for test_chaos_rechunk #6535

Open

gjoseph92 opened this issue Jun 8, 2022 · 3 comments

Labels: bug (Something is broken), flaky test (Intermittent failures on CI.)

gjoseph92 (Collaborator) commented:

It looks like test_chaos_rechunk started failing for the first time today: https://dask.github.io/distributed/test_report.html, https://github.com/dask/distributed/actions/runs/2461256827. The failure is a validation error in transition_flight_missing:

    def transition_flight_missing(
        self, ts: TaskState, *, stimulus_id: str
    ) -> RecsInstrs:
>       assert ts.done
E       AssertionError

I also had this fail in CI for my PR in the same way: https://github.com/dask/distributed/runs/6798289115
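
For anyone trying to reproduce locally, here's a rough sketch of a repro loop. It assumes the test lives at distributed/tests/test_chaos.py (adjust if it's elsewhere); since this is a race, it may take many runs to hit:

# Rough, hypothetical repro loop -- the test module path is an assumption.
import pytest

for attempt in range(50):
    exit_code = pytest.main(
        ["-x", "distributed/tests/test_chaos.py::test_chaos_rechunk"]
    )
    if exit_code != 0:
        print(f"reproduced the failure on attempt {attempt}")
        break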

Here's stderr from one of the tests:

    def transition_flight_missing(
        self, ts: TaskState, *, stimulus_id: str
    ) -> RecsInstrs:
>       assert ts.done
E       AssertionError
distributed/worker.py:2098: AssertionError
----------------------------- Captured stdout call -----------------------------
Failed worker tcp://127.0.0.1:43415
----------------------------- Captured stderr call -----------------------------
2022-06-08 12:07:20,441 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:37429
2022-06-08 12:07:20,442 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:37429
2022-06-08 12:07:20,442 - distributed.worker - INFO -          dashboard at:            127.0.0.1:41491
2022-06-08 12:07:20,442 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:20,442 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,442 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:20,442 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:20,442 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-gz003weh
2022-06-08 12:07:20,442 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,555 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:36149
2022-06-08 12:07:20,555 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:36149
2022-06-08 12:07:20,555 - distributed.worker - INFO -          dashboard at:            127.0.0.1:46843
2022-06-08 12:07:20,555 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:20,555 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,555 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:20,555 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:20,555 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-b5f1xbmq
2022-06-08 12:07:20,555 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,588 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:40519
2022-06-08 12:07:20,588 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:40519
2022-06-08 12:07:20,588 - distributed.worker - INFO -          dashboard at:            127.0.0.1:39555
2022-06-08 12:07:20,588 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:20,588 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,588 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:20,588 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:20,588 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-pjuqkoxn
2022-06-08 12:07:20,588 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,607 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:35611
2022-06-08 12:07:20,607 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:35611
2022-06-08 12:07:20,607 - distributed.worker - INFO -          dashboard at:            127.0.0.1:34351
2022-06-08 12:07:20,607 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:20,607 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,607 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:20,608 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:20,608 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-6f46yon2
2022-06-08 12:07:20,608 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,644 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:33785
2022-06-08 12:07:20,645 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:33785
2022-06-08 12:07:20,645 - distributed.worker - INFO -          dashboard at:            127.0.0.1:45091
2022-06-08 12:07:20,645 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:20,645 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,645 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:20,645 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:20,645 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-5tt6nwgi
2022-06-08 12:07:20,645 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,652 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:43415
2022-06-08 12:07:20,652 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:43415
2022-06-08 12:07:20,652 - distributed.worker - INFO -          dashboard at:            127.0.0.1:45945
2022-06-08 12:07:20,652 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:20,652 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,652 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:20,652 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:20,652 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-szyisuy3
2022-06-08 12:07:20,652 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,585 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:21,585 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,586 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,593 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:21,594 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,594 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,614 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:21,615 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,615 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,642 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:21,642 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,643 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,685 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:21,686 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,687 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,689 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:21,690 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,690 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,716 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,716 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,718 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,718 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,720 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,720 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:22,608 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:23,195 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:23,837 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:24,763 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:26,070 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:26,145 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:26,811 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:33785 -> tcp://127.0.0.1:35611
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
    response = await comm.read(deserializers=serializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:33785 remote=tcp://127.0.0.1:51330>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,813 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:35611
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 236, in read
    n = await stream.read_into(chunk)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
    response = await send_recv(
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 150, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57702 remote=tcp://127.0.0.1:35611>: Stream is closed
2022-06-08 12:07:26,810 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43415 -> tcp://127.0.0.1:35611
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
    response = await comm.read(deserializers=serializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53242>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,816 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:53242'
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
    result = await result
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
    response = await comm.read(deserializers=serializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53242>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,817 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:51330'
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
    result = await result
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
    response = await comm.read(deserializers=serializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:33785 remote=tcp://127.0.0.1:51330>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,828 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:35611
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
    response = await send_recv(
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57706 remote=tcp://127.0.0.1:35611>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,870 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36149
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
    response = await send_recv(
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:43080 remote=tcp://127.0.0.1:36149>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,870 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36149
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
    response = await send_recv(
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:43078 remote=tcp://127.0.0.1:36149>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:27,229 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:27,262 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:27,364 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:41963
2022-06-08 12:07:27,364 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:41963
2022-06-08 12:07:27,364 - distributed.worker - INFO -          dashboard at:            127.0.0.1:39987
2022-06-08 12:07:27,364 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:27,364 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:27,364 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:27,364 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:27,364 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-tgv53yl3
2022-06-08 12:07:27,365 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:28,775 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:44643
2022-06-08 12:07:28,778 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:44643
2022-06-08 12:07:28,778 - distributed.worker - INFO -          dashboard at:            127.0.0.1:38389
2022-06-08 12:07:28,778 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:28,778 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:28,778 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:28,778 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:28,778 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-msra9yft
2022-06-08 12:07:28,778 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:29,003 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:29,003 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:29,004 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:29,004 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:30,115 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:30,115 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:30,115 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:30,116 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:30,865 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:31,085 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:42643
2022-06-08 12:07:31,085 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:42643
2022-06-08 12:07:31,085 - distributed.worker - INFO -          dashboard at:            127.0.0.1:35729
2022-06-08 12:07:31,085 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:31,086 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,086 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:31,086 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:31,086 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-u_1l8ay9
2022-06-08 12:07:31,086 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,094 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:43767
2022-06-08 12:07:31,094 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:43767
2022-06-08 12:07:31,095 - distributed.worker - INFO -          dashboard at:            127.0.0.1:34993
2022-06-08 12:07:31,095 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:31,095 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,095 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:31,095 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:31,095 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-kh1fk7cc
2022-06-08 12:07:31,095 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,287 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43415 -> tcp://127.0.0.1:33785
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
    response = await comm.read(deserializers=serializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53240>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,287 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:53240'
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
    result = await result
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
    response = await comm.read(deserializers=serializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53240>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,288 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
    response = await send_recv(
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51332 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,289 - distributed.worker - ERROR - 
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
    return await func(*args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3393, in gather_dep
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2840, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2773, in _transition
    recs, instructions = func(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2098, in transition_flight_missing
    assert ts.done
AssertionError
2022-06-08 12:07:31,292 - distributed.worker - ERROR - 
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 193, in wrapper
    return await method(self, *args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
    return await func(*args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3393, in gather_dep
    self.transitions(recommendations, stimulus_id=stimulus_id)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2840, in transitions
    a_recs, a_instructions = self._transition(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2773, in _transition
    recs, instructions = func(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2098, in transition_flight_missing
    assert ts.done
AssertionError
2022-06-08 12:07:31,292 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43415
2022-06-08 12:07:31,292 - distributed.worker - INFO - Not waiting on executor to close
2022-06-08 12:07:31,295 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
    response = await send_recv(
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51338 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,297 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
    bytes_read = self.read_from_fd(buf)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
    return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
    response = await send_recv(
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
    convert_stream_closed_error(self, e)
  File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51336 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
    return await func(*args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1532, in close
    _, pending = await asyncio.wait(self._async_instructions, timeout=timeout)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,410 - distributed.worker - ERROR - 
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
    return await func(*args, **kwargs)
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
    response = await get_data_from_worker(
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
    return await retry(
  File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
    return await coro()
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4588, in _get_data
    comm = await rpc.connect(worker)
  File "/home/runner/work/distributed/distributed/distributed/core.py", line 1193, in connect
    await done.wait()
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,410 - distributed.worker - CRITICAL - Error trying close worker in response to broken internal state. Forcibly exiting worker NOW
Traceback (most recent call last):
  File "/home/runner/work/distributed/distributed/distributed/worker.py", line 225, in _force_close
    await asyncio.wait_for(self.close(nanny=False, executor_wait=False), 30)
  File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 432, in wait_for
    await waiter
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,433 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:31,497 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:31,878 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:32,147 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:32,429 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:32,430 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:32,430 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:32,430 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:33,546 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:33,546 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:43603
2022-06-08 12:07:33,546 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:33,546 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:33,610 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43767
2022-06-08 12:07:33,611 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-dd7d2c45-b24a-4c24-befd-357a020e0609 Address tcp://127.0.0.1:43767 Status: Status.closing
2022-06-08 12:07:33,650 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42643
2022-06-08 12:07:33,652 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-45028c9a-7a2f-4bd0-8bfb-f735f60d8d1c Address tcp://127.0.0.1:42643 Status: Status.closing
2022-06-08 12:07:33,906 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41963
2022-06-08 12:07:33,934 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-b0e2dbe7-7ea9-4457-be3e-5eaf0ad02333 Address tcp://127.0.0.1:41963 Status: Status.closing
2022-06-08 12:07:34,555 - distributed.worker - INFO - Stopping worker
2022-06-08 12:07:34,556 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-06-08 12:07:34,569 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:46495
2022-06-08 12:07:34,569 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:46495
2022-06-08 12:07:34,569 - distributed.worker - INFO -          dashboard at:            127.0.0.1:45525
2022-06-08 12:07:34,569 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:34,569 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,569 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:34,569 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:34,569 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-9em9f50m
2022-06-08 12:07:34,569 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,630 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:43227
2022-06-08 12:07:34,630 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:43227
2022-06-08 12:07:34,630 - distributed.worker - INFO -          dashboard at:            127.0.0.1:38507
2022-06-08 12:07:34,630 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:43603
2022-06-08 12:07:34,630 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,630 - distributed.worker - INFO -               Threads:                          2
2022-06-08 12:07:34,630 - distributed.worker - INFO -                Memory:                   6.78 GiB
2022-06-08 12:07:34,630 - distributed.worker - INFO -       Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-vptuzvhr
2022-06-08 12:07:34,630 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,633 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43227
2022-06-08 12:07:34,634 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-06-08 12:07:34,704 - distributed.worker - INFO - Stopping worker
2022-06-08 12:07:34,704 - distributed.worker - INFO - Closed worker has not yet started: Status.init

@crusaderky @fjetter what might have landed recently that could have affected this?

gjoseph92 added the bug and flaky test labels on Jun 8, 2022
mrocklin (Member) commented Jun 8, 2022 via email

crusaderky (Collaborator) commented:

You will fail that assertion if you have a task that is in worker.has_what and has state=flight, but is not in in_flight_workers[worker].

This has been tripped:

except OSError:
    logger.exception("Worker stream died during communication: %s", worker)
    has_what = self.has_what.pop(worker)
    self.data_needed_per_worker.pop(worker)
    self.log.append(
        ("receive-dep-failed", worker, has_what, stimulus_id, time())
    )
    for d in has_what:
        ts = self.tasks[d]
        ts.who_has.remove(worker)
        if not ts.who_has and ts.state in (
            "fetch",
            "flight",
            "resumed",
            "cancelled",
        ):
            recommendations[ts] = "missing"

but not this:
for d in self.in_flight_workers.pop(worker):
    ts = self.tasks[d]
    ts.done = True

which in turn got you to fail the assertion here:
def transition_flight_missing(
    self, ts: TaskState, *, stimulus_id: str
) -> RecsInstrs:
    assert ts.done
    return self.transition_generic_missing(ts, stimulus_id=stimulus_id)

I think the culprit is this, although I'm fuzzy about the details:

def transition_cancelled_fetch(
    self, ts: TaskState, *, stimulus_id: str
) -> RecsInstrs:
    if ts.done:
        return {ts: "released"}, []
    elif ts._previous == "flight":
        ts.state = ts._previous
        return {}, []
    else:
        assert ts._previous == "executing"
        ts.state = "resumed"
        ts._next = "fetch"
        return {}, []

I'm not aware of any recent changes to that bit of machinery, but I might be wrong.
Do you have a cluster dump?

CC @fjetter

gjoseph92 (Collaborator, Author) commented:

> Do you have a cluster dump?

I would have expected the CI failure to capture one, but I don't see anything in the logs about it: https://github.com/dask/distributed/runs/6796844522?check_suite_focus=true#step:11:1834. I don't actually know where the dumps from CI end up.
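
For a local repro, though, a dump can be captured directly with Client.dump_cluster_state. A minimal sketch, assuming a locally running scheduler (the address and filename below are placeholders):

# Minimal sketch: capture a cluster dump from a local repro for later inspection.
from distributed import Client

client = Client("tcp://127.0.0.1:8786")  # placeholder scheduler address
# ... run the rechunk workload here until workers start dying ...
client.dump_cluster_state("chaos-rechunk-dump")  # writes chaos-rechunk-dump.msgpack.gz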
