Validation error in transition_flight_missing for test_chaos_rechunk #6535
This could also be random chance. I'm not sure that this test has ever been 100% solid.

On Wed, Jun 8, 2022 at 3:32 PM Gabe Joseph wrote:
It looks like test_chaos_rechunk started failing for the first time today: https://dask.github.io/distributed/test_report.html, https://github.com/dask/distributed/actions/runs/2461256827. The failure is a validation in transition_flight_missing:

def transition_flight_missing(
self, ts: TaskState, *, stimulus_id: str
) -> RecsInstrs:
> assert ts.done
E AssertionError
I also had this fail in CI for my PR in the same way: https://github.com/dask/distributed/runs/6798289115
Here's stderr from one of the tests:
def transition_flight_missing(
self, ts: TaskState, *, stimulus_id: str
) -> RecsInstrs:
> assert ts.done
E AssertionError
distributed/worker.py:2098: AssertionError
----------------------------- Captured stdout call -----------------------------
Failed worker tcp://127.0.0.1:43415
----------------------------- Captured stderr call -----------------------------
2022-06-08 12:07:20,441 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:37429
2022-06-08 12:07:20,442 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:37429
2022-06-08 12:07:20,442 - distributed.worker - INFO - dashboard at: 127.0.0.1:41491
2022-06-08 12:07:20,442 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,442 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,442 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,442 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,442 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-gz003weh
2022-06-08 12:07:20,442 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,555 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:36149
2022-06-08 12:07:20,555 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:36149
2022-06-08 12:07:20,555 - distributed.worker - INFO - dashboard at: 127.0.0.1:46843
2022-06-08 12:07:20,555 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,555 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,555 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,555 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,555 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-b5f1xbmq
2022-06-08 12:07:20,555 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,588 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:40519
2022-06-08 12:07:20,588 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:40519
2022-06-08 12:07:20,588 - distributed.worker - INFO - dashboard at: 127.0.0.1:39555
2022-06-08 12:07:20,588 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,588 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,588 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,588 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,588 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-pjuqkoxn
2022-06-08 12:07:20,588 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,607 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:35611
2022-06-08 12:07:20,607 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:35611
2022-06-08 12:07:20,607 - distributed.worker - INFO - dashboard at: 127.0.0.1:34351
2022-06-08 12:07:20,607 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,607 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,607 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,608 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,608 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-6f46yon2
2022-06-08 12:07:20,608 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,644 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:33785
2022-06-08 12:07:20,645 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:33785
2022-06-08 12:07:20,645 - distributed.worker - INFO - dashboard at: 127.0.0.1:45091
2022-06-08 12:07:20,645 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,645 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,645 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,645 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,645 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-5tt6nwgi
2022-06-08 12:07:20,645 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,652 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:43415
2022-06-08 12:07:20,652 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:43415
2022-06-08 12:07:20,652 - distributed.worker - INFO - dashboard at: 127.0.0.1:45945
2022-06-08 12:07:20,652 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:20,652 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:20,652 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:20,652 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:20,652 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-szyisuy3
2022-06-08 12:07:20,652 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,585 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,585 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,586 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,593 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,594 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,594 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,614 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,615 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,615 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,642 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,642 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,643 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,685 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,686 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,687 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,689 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:21,690 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:21,690 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:21,716 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,716 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,718 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,718 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,720 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:21,720 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:22,608 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:23,195 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:23,837 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:24,763 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:26,070 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:26,145 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:26,811 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:33785 -> tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:33785 remote=tcp://127.0.0.1:51330>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,813 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 236, in read
n = await stream.read_into(chunk)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 150, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57702 remote=tcp://127.0.0.1:35611>: Stream is closed
2022-06-08 12:07:26,810 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43415 -> tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53242>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,816 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:53242'
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
result = await result
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53242>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,817 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:51330'
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
result = await result
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:33785 remote=tcp://127.0.0.1:51330>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,828 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:35611
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:57706 remote=tcp://127.0.0.1:35611>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,870 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36149
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:43080 remote=tcp://127.0.0.1:36149>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:26,870 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36149
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:43078 remote=tcp://127.0.0.1:36149>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:27,229 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:27,262 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:27,364 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:41963
2022-06-08 12:07:27,364 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:41963
2022-06-08 12:07:27,364 - distributed.worker - INFO - dashboard at: 127.0.0.1:39987
2022-06-08 12:07:27,364 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:27,364 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:27,364 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:27,364 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:27,364 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-tgv53yl3
2022-06-08 12:07:27,365 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:28,775 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:44643
2022-06-08 12:07:28,778 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:44643
2022-06-08 12:07:28,778 - distributed.worker - INFO - dashboard at: 127.0.0.1:38389
2022-06-08 12:07:28,778 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:28,778 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:28,778 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:28,778 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:28,778 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-msra9yft
2022-06-08 12:07:28,778 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:29,003 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:29,003 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:29,004 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:29,004 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:30,115 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:30,115 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:30,115 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:30,116 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:30,865 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:31,085 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:42643
2022-06-08 12:07:31,085 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:42643
2022-06-08 12:07:31,085 - distributed.worker - INFO - dashboard at: 127.0.0.1:35729
2022-06-08 12:07:31,085 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:31,086 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,086 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:31,086 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:31,086 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-u_1l8ay9
2022-06-08 12:07:31,086 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,094 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:43767
2022-06-08 12:07:31,094 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:43767
2022-06-08 12:07:31,095 - distributed.worker - INFO - dashboard at: 127.0.0.1:34993
2022-06-08 12:07:31,095 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:31,095 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,095 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:31,095 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:31,095 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-kh1fk7cc
2022-06-08 12:07:31,095 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:31,287 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:43415 -> tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53240>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,287 - distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:53240'
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/core.py", line 597, in handle_comm
result = await result
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1746, in get_data
response = await comm.read(deserializers=serializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) local=tcp://127.0.0.1:43415 remote=tcp://127.0.0.1:53240>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,288 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51332 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,289 - distributed.worker - ERROR -
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3393, in gather_dep
self.transitions(recommendations, stimulus_id=stimulus_id)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2840, in transitions
a_recs, a_instructions = self._transition(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2773, in _transition
recs, instructions = func(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2098, in transition_flight_missing
assert ts.done
AssertionError
2022-06-08 12:07:31,292 - distributed.worker - ERROR -
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 193, in wrapper
return await method(self, *args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3393, in gather_dep
self.transitions(recommendations, stimulus_id=stimulus_id)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2840, in transitions
a_recs, a_instructions = self._transition(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2773, in _transition
recs, instructions = func(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 2098, in transition_flight_missing
assert ts.done
AssertionError
2022-06-08 12:07:31,292 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43415
2022-06-08 12:07:31,292 - distributed.worker - INFO - Not waiting on executor to close
2022-06-08 12:07:31,295 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51338 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
2022-06-08 12:07:31,297 - distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33785
Traceback (most recent call last):
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 867, in _read_to_buffer
bytes_read = self.read_from_fd(buf)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/site-packages/tornado/iostream.py", line 1140, in read_from_fd
return self.socket.recv_into(buf, len(buf))
ConnectionResetError: [Errno 104] Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4591, in _get_data
response = await send_recv(
File "/home/runner/work/distributed/distributed/distributed/core.py", line 748, in send_recv
response = await comm.read(deserializers=deserializers)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 242, in read
convert_stream_closed_error(self, e)
File "/home/runner/work/distributed/distributed/distributed/comm/tcp.py", line 148, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) Ephemeral Worker->Worker for gather local=tcp://127.0.0.1:51336 remote=tcp://127.0.0.1:33785>: ConnectionResetError: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 1532, in close
_, pending = await asyncio.wait(self._async_instructions, timeout=timeout)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 384, in wait
return await _wait(fs, timeout, return_when, loop)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 491, in _wait
await waiter
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,410 - distributed.worker - ERROR -
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/utils.py", line 761, in wrapper
return await func(*args, **kwargs)
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 3290, in gather_dep
response = await get_data_from_worker(
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4611, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 381, in retry_operation
return await retry(
File "/home/runner/work/distributed/distributed/distributed/utils_comm.py", line 366, in retry
return await coro()
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 4588, in _get_data
comm = await rpc.connect(worker)
File "/home/runner/work/distributed/distributed/distributed/core.py", line 1193, in connect
await done.wait()
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,410 - distributed.worker - CRITICAL - Error trying close worker in response to broken internal state. Forcibly exiting worker NOW
Traceback (most recent call last):
File "/home/runner/work/distributed/distributed/distributed/worker.py", line 225, in _force_close
await asyncio.wait_for(self.close(nanny=False, executor_wait=False), 30)
File "/usr/share/miniconda3/envs/dask-distributed/lib/python3.10/asyncio/tasks.py", line 432, in wait_for
await waiter
asyncio.exceptions.CancelledError
2022-06-08 12:07:31,433 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:31,497 - distributed.nanny - ERROR - Worker process died unexpectedly
2022-06-08 12:07:31,878 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:32,147 - distributed.nanny - WARNING - Restarting worker
2022-06-08 12:07:32,429 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:32,430 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:32,430 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:32,430 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:33,546 - distributed.worker - INFO - Starting Worker plugin kill
2022-06-08 12:07:33,546 - distributed.worker - INFO - Registered to: tcp://127.0.0.1:43603
2022-06-08 12:07:33,546 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:33,546 - distributed.core - INFO - Starting established connection
2022-06-08 12:07:33,610 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43767
2022-06-08 12:07:33,611 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-dd7d2c45-b24a-4c24-befd-357a020e0609 Address tcp://127.0.0.1:43767 Status: Status.closing
2022-06-08 12:07:33,650 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:42643
2022-06-08 12:07:33,652 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-45028c9a-7a2f-4bd0-8bfb-f735f60d8d1c Address tcp://127.0.0.1:42643 Status: Status.closing
2022-06-08 12:07:33,906 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:41963
2022-06-08 12:07:33,934 - distributed.worker - INFO - Connection to scheduler broken. Closing without reporting. ID: Worker-b0e2dbe7-7ea9-4457-be3e-5eaf0ad02333 Address tcp://127.0.0.1:41963 Status: Status.closing
2022-06-08 12:07:34,555 - distributed.worker - INFO - Stopping worker
2022-06-08 12:07:34,556 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-06-08 12:07:34,569 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:46495
2022-06-08 12:07:34,569 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:46495
2022-06-08 12:07:34,569 - distributed.worker - INFO - dashboard at: 127.0.0.1:45525
2022-06-08 12:07:34,569 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:34,569 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,569 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:34,569 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:34,569 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-9em9f50m
2022-06-08 12:07:34,569 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,630 - distributed.worker - INFO - Start worker at: tcp://127.0.0.1:43227
2022-06-08 12:07:34,630 - distributed.worker - INFO - Listening to: tcp://127.0.0.1:43227
2022-06-08 12:07:34,630 - distributed.worker - INFO - dashboard at: 127.0.0.1:38507
2022-06-08 12:07:34,630 - distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:43603
2022-06-08 12:07:34,630 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,630 - distributed.worker - INFO - Threads: 2
2022-06-08 12:07:34,630 - distributed.worker - INFO - Memory: 6.78 GiB
2022-06-08 12:07:34,630 - distributed.worker - INFO - Local Directory: /tmp/tmpvge9p9co/dask-worker-space/worker-vptuzvhr
2022-06-08 12:07:34,630 - distributed.worker - INFO - -------------------------------------------------
2022-06-08 12:07:34,633 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43227
2022-06-08 12:07:34,634 - distributed.worker - INFO - Closed worker has not yet started: Status.init
2022-06-08 12:07:34,704 - distributed.worker - INFO - Stopping worker
2022-06-08 12:07:34,704 - distributed.worker - INFO - Closed worker has not yet started: Status.init
@crusaderky @fjetter what might have landed recently that could have affected this?
You will fail that assertion if you have a task that is in worker.has_what and has state=flight, but is not in in_flight_workers[worker]. This has been tripped (distributed/worker.py, lines 3310-3326 at 9e4e3ab), but not this (distributed/worker.py, lines 3360-3362 at 9e4e3ab), which in turn got you to fail the assertion here (distributed/worker.py, lines 2095-2099 at 9e4e3ab). I think the culprit is this, although I'm fuzzy about the details: distributed/worker.py, lines 2420-2432 at 9e4e3ab.

I'm not aware of any recent changes to that bit of machinery, but I might be wrong. CC @fjetter
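For anyone skimming, here is a minimal standalone sketch of the situation described above. The classes and method bodies are stand-ins invented for illustration, not the real distributed/worker.py code; they only mirror the "in flight but not tracked in in_flight_workers[worker]" condition that leaves ts.done unset:

```python
from dataclasses import dataclass, field


@dataclass
class TaskStateSketch:
    key: str
    state: str = "flight"
    done: bool = False  # flipped once gather_dep has finished with this key


@dataclass
class WorkerSketch:
    tasks: dict = field(default_factory=dict)
    in_flight_workers: dict = field(default_factory=dict)  # addr -> set of keys

    def gather_dep_failed(self, addr: str, keys_requested: set) -> dict:
        # Mimics the cleanup after a failed fetch: only keys actually tracked
        # in in_flight_workers[addr] get ts.done = True before the
        # flight -> missing recommendation is issued.
        tracked = self.in_flight_workers.pop(addr, set())
        recommendations = {}
        for key in keys_requested:
            ts = self.tasks[key]
            if key in tracked:
                ts.done = True
            recommendations[key] = "missing"
        return recommendations

    def transition_flight_missing(self, ts: TaskStateSketch) -> None:
        # The validation that trips in the report above.
        assert ts.done, f"{ts.key}: flight -> missing before gather_dep finished"
        ts.state = "missing"


w = WorkerSketch()
w.tasks["x"] = TaskStateSketch("x")                  # state=flight
w.in_flight_workers["tcp://127.0.0.1:1234"] = set()  # but "x" is not tracked here
recs = w.gather_dep_failed("tcp://127.0.0.1:1234", {"x"})
try:
    for key, state in recs.items():
        if state == "missing":
            w.transition_flight_missing(w.tasks[key])
except AssertionError as exc:
    print("validation error:", exc)                  # same AssertionError as in CI
```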
I would have expected the CI failure to capture one, but I don't see anything in the logs about it: https://github.com/dask/distributed/runs/6796844522?check_suite_focus=true#step:11:1834. I don't actually know where the dumps from CI end up.
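If someone wants to chase this outside CI, a dump can be written explicitly. This is only a sketch, assuming Client.dump_cluster_state is available in the version under test; the two-worker setup and the filename are arbitrary placeholders, not taken from the failing job:

```python
import asyncio

from distributed import Client, LocalCluster


async def main() -> None:
    # Small local cluster; run the failing workload, then write a cluster
    # state dump that can be inspected after the fact.
    async with LocalCluster(n_workers=2, asynchronous=True) as cluster:
        async with Client(cluster, asynchronous=True) as client:
            # ... run the rechunk workload that trips the validation here ...
            await client.dump_cluster_state("chaos-rechunk-dump")  # placeholder name


asyncio.run(main())
```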