Restart worker on CommClosedError #4884

Closed
chrisroat opened this issue Jun 5, 2021 · 5 comments
Labels
bug (Something is broken), needs info (Needs further information from the user)

Comments

@chrisroat
Contributor

On a dask-gateway GKE cluster, I have a few workers that stop processing; their logs show some combination of CommClosedError, missing-dependency warnings, and garbage-collection messages. The scheduler still seems happy with the worker, as its "Last seen" time remains up to date.

I've attached an example log. After 30 minutes, this worker was still in "processing" on its current task, which I believe should take under a second. I killed the worker, the graph backed up to redo the lost work, and the computation eventually completed.

I have written dask-cluster-manager jobs that restart schedulers which are leaking memory. I see there is a client.get_scheduler_logs(), which could be parsed to detect this. Is there some way to detect this condition and restart the worker (besides searching the GKE dashboard and doing it manually)? Is there a setting that can mitigate it? The lifetime.{duration,stagger,restart} settings seem like a last resort.
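Roughly what I have in mind is a small watchdog like the sketch below. This is untested and purely illustrative: the helper name is mine, it assumes an already-connected Client, and it assumes the gateway brings up a replacement when a worker is retired.

from distributed import Client

def retire_workers_with_comm_errors(client: Client):
    """Retire any worker whose recent logs mention CommClosedError.

    get_worker_logs() returns {worker_address: [(level, message), ...]}.
    Retiring relies on the cluster manager (dask-gateway here) to bring up
    a replacement; the scheduler recomputes whatever work was lost.
    """
    logs = client.get_worker_logs(n=100)  # last ~100 log entries per worker
    stuck = [
        addr
        for addr, entries in logs.items()
        if any("CommClosedError" in msg for _, msg in entries)
    ]
    if stuck:
        client.retire_workers(workers=stuck, close_workers=True)
    return stuck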

Logs
2021-06-05T22:10:08.218039578Z + '[' '' ']'
2021-06-05T22:10:08.218953428Z + '[' -e /opt/app/environment.yml ']'
2021-06-05T22:10:08.219179401Z no environment.yml
2021-06-05T22:10:08.219232124Z + echo 'no environment.yml'
2021-06-05T22:10:08.219250341Z + '[' '' ']'
2021-06-05T22:10:08.219255783Z + '[' '' ']'
2021-06-05T22:10:08.219260917Z + exec dask-worker tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786 --dashboard-address :8787 --name dask-worker-e85d926daa1646c2a782e2cf4ea00f25-kp6lp --nthreads 1 --memory-limit 7516192768
2021-06-05T22:10:09.406322052Z distributed.nanny - INFO -         Start Nanny at: 'tls://10.4.91.2:44355'
2021-06-05T22:10:11.295639846Z distributed.worker - INFO -       Start worker at:      tls://10.4.91.2:46259
2021-06-05T22:10:11.296092345Z distributed.worker - INFO -          Listening to:      tls://10.4.91.2:46259
2021-06-05T22:10:11.296168743Z distributed.worker - INFO -          dashboard at:             10.4.91.2:8787
2021-06-05T22:10:11.296178429Z distributed.worker - INFO - Waiting to connect to: tls://dask-e85d926daa1646c2a782e2cf4ea00f25.starmap:8786
2021-06-05T22:10:11.296184092Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.296189147Z distributed.worker - INFO -               Threads:                          1
2021-06-05T22:10:11.296222187Z distributed.worker - INFO -                Memory:                   7.00 GiB
2021-06-05T22:10:11.296260851Z distributed.worker - INFO -       Local Directory: /workdir/dask-worker-space/worker-jrhd2vn0
2021-06-05T22:10:11.296284668Z distributed.worker - INFO - -------------------------------------------------
2021-06-05T22:10:11.344733302Z distributed.worker - INFO - Starting Worker plugin : Stream is closed
2021-06-05T22:14:45.191741414Z distributed.worker - INFO - Can't find dependencies for key ('rechunk-merge-transpose-654fb163d718a0167f4124e19d50f10c', 5, 2, 3, 0)
2021-06-05T22:14:45.202056511Z distributed.worker - INFO - Can't find dependencies for key ('rechunk-split-1661c99969731fa78455440409ca42b2', 5746)
2021-06-05T22:14:46.034760748Z distributed.worker - INFO - Dependent not found: ('getitem-421b304eb1d88a03fa4a18a7ad28ada7', 1, 0, 12, 15) 0 .  Asking scheduler
2021-06-05T22:15:14.977635691Z distributed.utils_perf - INFO - full garbage collection released 60.92 MiB from 667 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:16.577678892Z distributed.worker - ERROR - Worker stream died during communication: tls://10.4.86.2:44013
2021-06-05T22:15:16.577729765Z OSError: [Errno 113] No route to host

2021-06-05T22:15:16.577771666Z The above exception was the direct cause of the following exception:
2021-06-05T22:15:16.577776322Z 
2021-06-05T22:15:16.577781272Z Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 285, in connect
    comm = await asyncio.wait_for(
  File "/opt/conda/lib/python3.8/asyncio/tasks.py", line 494, in wait_for
    return fut.result()
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 391, in connect
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error
    raise CommClosedError(
distributed.comm.core.CommClosedError: in : OSError: [Errno 113] No route to host
2021-06-05T22:15:16.577837957Z 
2021-06-05T22:15:16.577842845Z The above exception was the direct cause of the following exception:
2021-06-05T22:15:16.577850215Z 
2021-06-05T22:15:16.577860139Z Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2189, in gather_dep
    response = await get_data_from_worker(
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3475, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3452, in _get_data
    comm = await rpc.connect(worker)
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1010, in connect
    comm = await connect(
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 309, in connect
    raise IOError(
OSError: Timed out trying to connect to tls://10.4.86.2:44013 after 10 s
2021-06-05T22:15:24.435351939Z distributed.utils_perf - INFO - full garbage collection released 229.12 MiB from 278 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:41.568841350Z distributed.utils_perf - INFO - full garbage collection released 66.56 MiB from 373 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:15:52.363204788Z distributed.utils_perf - INFO - full garbage collection released 143.41 MiB from 114 reference cycles (threshold: 9.54 MiB)
2021-06-05T22:16:00.234860246Z distributed.comm.tcp - WARNING - Listener on 'tls://10.4.91.2:46259': TLS handshake failed with remote 'tls://10.4.95.4:41278': EOF occurred in violation of protocol (_ssl.c:1131)

@fjetter
Member

fjetter commented Jun 7, 2021

I am currently investigating a deadlock issue which might be connected to this. Can you reproduce this?

See #4784 (the issue is long and grew in scope)

@chrisroat
Contributor Author

Reproduction can be tricky. I see this sporadically on a GKE cluster. It'll take me a day or two to find time to test this usefully.

@fjetter
Member

fjetter commented Jun 8, 2021

Reproduction can be tricky.

That's what I thought, so don't worry. I still wanted to ask in case this was my lucky day :)

If you can afford it without slowing your system down too much, debug logs might turn up one or two useful things. Either way, I'm working on getting #4784 merged, which fixes two deadlock situations. Before you dive into any sophisticated testing, I would suggest waiting for that to be released.
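For example, something along these lines should turn on debug logging for the relevant loggers. It's only a rough sketch of the config shape: as far as I know the logging config is read when distributed is first imported, so on a gateway cluster it really needs to be in the worker environment (e.g. DASK_LOGGING__DISTRIBUTED=debug on the worker pods, or a dask config file baked into the image).

import dask

# Equivalent Python form of the logging config; in practice set it via the
# worker image / environment so it is in place before distributed starts.
dask.config.set(
    {
        "logging": {
            "distributed": "debug",          # all distributed.* loggers
            "distributed.worker": "debug",   # dependency gathering, task states
            "distributed.comm": "debug",     # connection setup/teardown
        }
    }
)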

@fjetter
Member

fjetter commented Jun 18, 2021

We've recently merged an important PR addressing a few error-handling edge cases that caused unrecoverable deadlocks.
These deadlocks were associated with failing workers, connection failures, or host-co-located workers. All of these issues are connected to fetching dependencies, so dense, highly connected task graphs were more likely to be affected. Ultimately, the deadlocks were caused by subtle race conditions, which made them hard to reproduce, and some of them cannot be correlated with any user-facing logs, which is why I cannot say for certain whether your issue is fixed.
I would encourage you to try out the latest changes on main and/or wait for the upcoming release later today. Feedback on whether your issue is resolved would be highly appreciated!

Deadlock fix #4784
Upcoming release dask/community#165

@fjetter fjetter added the bug (Something is broken) and needs info (Needs further information from the user) labels on Jun 18, 2021
@chrisroat
Contributor Author

Just wanted to say that things seem to have gotten better! I upgraded to the latest version, and haven't seen this issue yet! Fingers crossed.

Thank you!
