Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AssertionError in WorkerState._transition_cancelled_error #6877

Closed
hendrikmakait opened this issue Aug 11, 2022 · 3 comments · Fixed by #6699
Closed

AssertionError in WorkerState._transition_cancelled_error #6877

hendrikmakait opened this issue Aug 11, 2022 · 3 comments · Fixed by #6699
Assignees
Labels
bug Something is broken

Comments

@hendrikmakait
Copy link
Member

hendrikmakait commented Aug 11, 2022

Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Exception in callback Worker._handle_stimulus_from_task(<Task finishe... to SSL")\n')>) at /opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py:1844
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: handle: <Handle Worker._handle_stimulus_from_task(<Task finishe... to SSL")\n')>) at /opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py:1844>
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/asyncio/events.py", line 80, in _run
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     self._context.run(self._callback, *self._args)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 193, in wrapper
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return method(self, *args, **kwargs)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1854, in _handle_stimulus_from_task
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     super()._handle_stimulus_from_task(task)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 3341, in _handle_stimulus_from_task
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     self.handle_stimulus(stim)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 193, in wrapper
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return method(self, *args, **kwargs)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1866, in handle_stimulus
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     super().handle_stimulus(*stims)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 3354, in handle_stimulus
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     instructions = self.state.handle_stimulus(*stims)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 1279, in handle_stimulus
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     instructions += self._transitions(recs, stimulus_id=stim.stimulus_id)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 2531, in _transitions
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     process_recs(recommendations.copy())
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 2525, in process_recs
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     a_recs, a_instructions = self._transition(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 2443, in _transition
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     recs, instructions = func(self, ts, *args, stimulus_id=stimulus_id)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker_state_machine.py", line 1921, in _transition_cancelled_error
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     assert ts.previous in ("executing", "long-running")
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: AssertionError

(https://cloud.coiled.io/dask-engineering/clusters/48635/details on hendrik-debug-worker-057e4c65bd)

Notably, the following log messages directly preceded the error:

Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,215 - distributed.core - INFO - Event loop was unresponsive in Worker for 45.99s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,225 - distributed.worker - INFO - Stopping worker at tls://10.0.3.173:42801
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,240 - distributed.core - ERROR - Exception while handling op get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 223, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     frames_nbytes = await stream.read_bytes(fmt_size)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: tornado.iostream.StreamClosedError: Stream is closed
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: During handling of the above exception, another exception occurred:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 769, in _handle_comm
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     result = await result
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     response = await comm.read(deserializers=serializers)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     convert_stream_closed_error(self, e)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     if "UNKNOWN_CA" in exc.reason:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: TypeError: argument of type 'NoneType' is not iterable
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: 2022-08-11 10:00:05,268 - distributed.worker - ERROR - IOStream is not idle; cannot convert to SSL
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 1362, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     await done.wait()
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/asyncio/locks.py", line 214, in wait
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     await fut
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: asyncio.exceptions.CancelledError
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: During handling of the above exception, another exception occurred:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1983, in gather_dep
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     response = await get_data_from_worker(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 2725, in get_data_from_worker
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return await retry_operation(_get_data, operation="get_data_from_worker")
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/utils_comm.py", line 383, in retry_operation
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return await retry(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/utils_comm.py", line 368, in retry
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return await coro()
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 2702, in _get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     comm = await rpc.connect(worker)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 1367, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     await connect_attempt
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 1307, in _connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     comm = await connect(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/core.py", line 291, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     comm = await asyncio.wait_for(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     return fut.result()
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 449, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     stream = await self.client.connect(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/tornado/tcpclient.py", line 288, in connect
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     stream = await stream.start_tls(
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/tornado/iostream.py", line 1270, in start_tls
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     raise ValueError("IOStream is not idle; cannot convert to SSL")
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: ValueError: IOStream is not idle; cannot convert to SSL
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Task exception was never retrieved
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: future: <Task finished name='Task-338' coro=<Server._handle_comm() done, defined at /opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py:675> exception=TypeError("argument of type 'NoneType' is not iterable")>
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 223, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     frames_nbytes = await stream.read_bytes(fmt_size)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: tornado.iostream.StreamClosedError: Stream is closed
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: During handling of the above exception, another exception occurred:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: Traceback (most recent call last):
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/core.py", line 769, in _handle_comm
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     result = await result
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/worker.py", line 1692, in get_data
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     response = await comm.read(deserializers=serializers)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 239, in read
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     convert_stream_closed_error(self, e)
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:   File "/opt/conda/envs/coiled/lib/python3.10/site-packages/distributed/comm/tcp.py", line 140, in convert_stream_closed_error
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]:     if "UNKNOWN_CA" in exc.reason:
Aug 11 10:00:05 ip-10-0-3-173 cloud-init[1264]: TypeError: argument of type 'NoneType' is not iterable
@hendrikmakait hendrikmakait added the bug Something is broken label Aug 11, 2022
@hendrikmakait
Copy link
Member Author

cc @crusaderky

@crusaderky
Copy link
Collaborator

ts.previous == "flight".
This is tightly related to #6705:
The network stack raised, which caused gather_dep to return GatherDepFailureEvent, which is quite uncommon in real life and virtually untested.

@crusaderky
Copy link
Collaborator

Note that the scope of this issue is specifically about the state machine.
The issue in the network stack is tracked separately, in #6875.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is broken
Projects
None yet
2 participants