Identify lack of scalability in gwas_linear_regression #390

Open
eric-czech opened this issue Nov 17, 2020 · 60 comments

Comments

@eric-czech
Collaborator

eric-czech commented Nov 17, 2020

It appears that this function does not scale well when run on a cluster.

Notes from my most recent attempt:

CPU utilization across worker VMs
Screen Shot 2020-11-17 at 1 10 08 PM

Status Page
Screen Shot 2020-11-17 at 12 35 39 PM

  • Drilling into one of the workers that is running all the tasks, the only task I see that is not obviously parallelizable is "solve-triangular":

Screen Shot 2020-11-17 at 1 36 57 PM

Full Task List


The job ultimately failed with the error "ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs".

Several of the workers show log messages like this:

distributed.worker - INFO - Dependent not found: ('rechunk-merge-66cac011d34e1c66cde96678a9e011b5', 0, 21) 0 . Asking scheduler

Perhaps this is what happens when one node unexpectedly becomes unreachable? I'm not sure.

I will run this again on a smaller dataset, one that did not fail, to get a performance report and a task graph screenshot (the screenshot doesn't work on this data because the UI won't render so many nodes).

@eric-czech
Collaborator Author

Notes from a more detailed performance report, produced by running this on a smaller dataset (one that succeeds):

(renamed to .txt to avoid github attachment filter)
gwas-height-performance-report.html.txt

This zoomed-out view of the task stream in the report doesn't strike me as very healthy:

Screen Shot 2020-11-17 at 2 07 57 PM

Task graph:

Screen Shot 2020-11-17 at 1 54 52 PM

@mrocklin (cc: @ravwojdyla) do you have any suggestions on how to identify why the work isn't being distributed well on larger datasets for this workflow?

@mrocklin

Thank you for producing the performance report. If you want to publish these in the future then you may also want to look into gist.github.com and https://raw.githack.com/ .

I've only looked very briefly at it, but the thing that stands out the most is the 760s transfer times leading up to sum tasks. At expected bandwidths, these would be 150GB payloads, which I'm assuming is higher than you're expecting. It's also odd how synchronized these transfers are: they all end within a few seconds of each other.
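As a rough check of that estimate, the sketch below works out the bandwidth implied by the two figures quoted above (760 s and 150 GB). The 100 MB chunk size used at the end is a hypothetical value for illustration, not a measurement from the report.

```python
# Back-of-the-envelope check of the transfer-size estimate quoted above.
transfer_seconds = 760    # observed transfer duration, from the comment
payload_bytes = 150e9     # payload size quoted in the comment (150 GB)

# Sustained bandwidth implied by those two numbers, in bytes per second.
implied_bandwidth = payload_bytes / transfer_seconds


def transfer_time(nbytes: float, bandwidth: float) -> float:
    """Seconds needed to move `nbytes` at a sustained `bandwidth` (bytes/s)."""
    return nbytes / bandwidth


# Hypothetical 100 MB chunk moved at the same implied bandwidth.
chunk_seconds = transfer_time(100e6, implied_bandwidth)

print(f"implied bandwidth: {implied_bandwidth / 1e6:.0f} MB/s")
print(f"100 MB chunk at that rate: {chunk_seconds:.2f} s")
```

If individual chunks are on the order of 100 MB, a healthy transfer should take well under a second at that rate, which is why multi-minute transfers look suspicious.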

@mrocklin

> ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs

I'm curious, do worker logs report anything strange?

Also I'm curious, which version of distributed are you running? (adding this to the performance report here: dask/distributed#4249)

cc'ing @quasiben, who cares a bit about this space and has a lot of experience tracking down similar performance problems.

@mrocklin

mrocklin commented Nov 17, 2020 via email

@quasiben

When I looked, I saw some fairly lengthy disk reads/writes, which I assume to be Dask spilling to disk. That might correspond to the workers being under memory pressure.

Screen Shot 2020-11-17 at 3 32 36 PM

@mrocklin

mrocklin commented Nov 17, 2020 via email

@eric-czech
Collaborator Author

eric-czech commented Nov 17, 2020

> Thank you for producing the performance report. If you want to publish these in the future then you may also want to look into gist.github.com and https://raw.githack.com/ .

👍

> I'm curious, do worker logs report anything strange?

I see messages like "Worker stream died during communication", so I'm fairly sure a couple of workers in the cluster died. By the end I was down to 18 workers instead of 20. The full log from one worker is here, just in case.

> When I looked, I saw some fairly lengthy disk reads/writes, which I assume to be Dask spilling to disk. That might correspond to the workers being under memory pressure.

Thanks Ben, what do you make of the "transfer-sub" tasks (the long red bars)? Do you have any intuition for what's happening in those?

Memory pressure does seem to be part of the problem -- I rechunked my input to 1/16th of the original chunk size and the whole job has progressed further. Parallel utilization is still pretty disappointing across the whole cluster:

CPU utilization across workers after rechunking input to 1/16th of original
Screen Shot 2020-11-17 at 3 42 33 PM
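For a sense of what a 1/16th chunk size means, here is a minimal sketch of the arithmetic. The chunk shape and dtype are hypothetical (the thread does not state them); the point is only that dividing both dimensions of a 2-D chunk by 4 gives 16 times fewer bytes per chunk and 16 times more chunks.

```python
from math import prod


def chunk_nbytes(shape, itemsize):
    """In-memory size of one dense chunk, in bytes."""
    return prod(shape) * itemsize


# Hypothetical values: the real chunk shape and dtype are not given in the thread.
original_shape = (4096, 4096)
itemsize = 4  # e.g. float32, an assumption

# Dividing each of the two dimensions by 4 leaves 1/16th of the elements.
smaller_shape = tuple(n // 4 for n in original_shape)

original_bytes = chunk_nbytes(original_shape, itemsize)
smaller_bytes = chunk_nbytes(smaller_shape, itemsize)
ratio = original_bytes // smaller_bytes

print(f"original chunk: {original_bytes / 2**20:.0f} MiB")
print(f"smaller chunk:  {smaller_bytes / 2**20:.0f} MiB")
print(f"chunks (and tasks) per array grow by a factor of {ratio}")
```

Smaller chunks lower peak memory per task, at the cost of a proportionally larger task graph for the scheduler to manage.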

@eric-czech
Collaborator Author

> Looking at the worker/thread ratio, I wonder if it would make more sense to
> have far more workers with fewer threads each. Perhaps try four threads
> per worker?

Alright, I can try that. Hey @quasiben, how do you set the number of workers per VM in Cloud Provider?

@mrocklin

mrocklin commented Nov 17, 2020 via email

@quasiben

quasiben commented Nov 17, 2020

> Alright, I can try that. Hey @quasiben, how do you set the number of workers per VM in Cloud Provider?

That's a good question. I don't think this is supported, but it shouldn't be too hard to add. You can control the number of threads with worker_options={"nthreads": 2}. As @mrocklin suggests, this is often tuned by using smaller VMs. Can I ask you to file an issue on dask-cloudprovider to support multiple workers per VM?

@eric-czech
Collaborator Author

> Can I ask you to file an issue on dask-cloudprovider to support multiple workers per VM?

You bet! dask/dask-cloudprovider#173

@eric-czech
Collaborator Author

eric-czech commented Nov 18, 2020

The larger dataset with smaller chunks did ultimately finish with no errors. Here are a couple readouts:

Screen Shot 2020-11-18 at 6 56 28 AM

Performance report (26M): https://drive.google.com/file/d/1feWLKNrjQkslKDIZ7T39fPCNQQDBrQFs/view?usp=sharing

It doesn't seem like any of network, disk, or cpu are even close to being saturated so I assume there is some room for improvement.

As a very rough estimate, this task takes about 5 hours on a single 64 vCPU VM and 3 hours on a cluster of 20 8-vCPU VMs (160 vCPUs). Perfect scaling would imply ~2 hours on the cluster (5 hours × 64 / 160), so that is roughly the best that improvements could achieve.
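The estimate above can be reproduced with a few lines of arithmetic. This sketch uses only the figures quoted in this comment and assumes the work is perfectly divisible across vCPUs, which is the idealization behind the ~2 hour number.

```python
# Figures quoted in the comment above.
single_vm_hours = 5.0
single_vm_vcpus = 64
cluster_hours = 3.0
cluster_vcpus = 20 * 8  # 20 VMs with 8 vCPUs each

# Total work, assuming it divides perfectly across vCPUs.
vcpu_hours = single_vm_hours * single_vm_vcpus

# Runtime on the cluster if scaling were perfect.
ideal_cluster_hours = vcpu_hours / cluster_vcpus

# Observed speedup versus the speedup the extra vCPUs could ideally give.
speedup = single_vm_hours / cluster_hours
ideal_speedup = cluster_vcpus / single_vm_vcpus
efficiency = speedup / ideal_speedup

print(f"ideal cluster runtime: {ideal_cluster_hours:.1f} h")
print(f"observed speedup: {speedup:.2f}x of an ideal {ideal_speedup:.1f}x")
print(f"parallel efficiency relative to the single VM: {efficiency:.0%}")
```

By this measure the cluster run reaches about two thirds of the ideal speedup relative to the single-VM run.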

Log: dask_gwas_chr21_log.txt

I will try again with smaller VMs and see if there are any major differences.

@eric-czech
Collaborator Author

FYI @ravwojdyla and I have been talking a bit about some similar observations on a much simpler workflow in https://github.com/related-sciences/data-team/issues/38 (private). One conclusion there was that the individual objects on GCS are so small that API requests aren't efficient, yet bigger chunk sizes would start to make the workloads fail. The GCS objects for a chunk are roughly 2MiB on disk but >100MB in memory. Another potential explanation for this behavior is that the GCS objects aren't being loaded asynchronously. We're still investigating both.

@mrocklin

I don't know if you're aware of the work done by @martindurant on nicer async IO for remote storage and zarr, but he might be good to talk to.

It looks like you're still primarily blocked by a few oddly long transfers. They're less prominent now, which is good, but probably worth investigating further.

@martindurant

(sorry, wrong thread)

@mrocklin

mrocklin commented Nov 18, 2020 via email

@martindurant

Indeed - but I did want to comment here too.

Yes, gcsfs allows fetching many objects from the store with a single call; they are processed concurrently and returned as a set of bytes objects. If these are to be expanded in memory, you would of course have to deal with the blocks of bytes one at a time so as not to exceed RAM. For zarr, this does mean transiently higher memory usage during decompression, depending on the number of storage chunks per dask partition. The best tradeoff would be workload-dependent.
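The pattern described above (fetch compressed objects concurrently, then expand them one at a time) can be sketched with the standard library alone. This is a simulation, not the gcsfs or zarr API: `store`, `fetch`, `fetch_all` and `decompressed_sizes` are hypothetical names, the in-memory dict stands in for a remote bucket, and zlib stands in for whatever compressor the real data uses.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a remote object store: key -> compressed bytes.
# Hypothetical data; a real store would be reached over the network.
raw_size = 1_000_000
store = {f"chunk/{i}": zlib.compress(bytes(raw_size)) for i in range(8)}


def fetch(key: str) -> bytes:
    """Pretend network request returning one compressed object."""
    return store[key]


def fetch_all(keys):
    """Fetch every key concurrently; compressed payloads come back in key order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(fetch, keys))


def decompressed_sizes(keys):
    """Expand one payload at a time so only one decompressed block is alive."""
    sizes = []
    for blob in fetch_all(keys):
        block = zlib.decompress(blob)
        sizes.append(len(block))  # placeholder for real per-block work
        del block                 # release the expanded block before the next one
    return sizes


keys = sorted(store)
sizes = decompressed_sizes(keys)
compressed_total = sum(len(blob) for blob in store.values())

print(f"compressed bytes held at once: {compressed_total}")
print(f"largest expanded block: {max(sizes)} bytes")
```

Holding all compressed payloads at once is cheap when the compression ratio is high, while peak memory is set by how many expanded blocks are alive together.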

@eric-czech
Collaborator Author

eric-czech commented Nov 18, 2020

When I tried this on a cluster of 40 4-vCPU machines (instead of 20 8-vCPU machines), the workflow failed with errors similar to those from my first attempt in this issue. It doesn't look like memory pressure was an issue this time, but I'm not certain. I didn't see log messages about it anyhow, and this was with chunks 1/16th the size of the original, or ~6MB in memory. The client-side error I hit was again:

ValueError: Could not find dependent ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 130, 105). Check worker logs

Here are some worker logs:

Worker 1

distributed.worker - INFO - Start worker at: tcp://10.142.15.198:37407

distributed.worker - INFO - Listening to: tcp://10.142.15.198:37407

distributed.worker - INFO - dashboard at: 10.142.15.198:38769

distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Threads: 4

distributed.worker - INFO - Memory: 27.34 GB

distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-jst4aey2

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data with tcp://10.142.15.198:37407 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 0, 13)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 0, 13) 0 . Asking scheduler

Worker 2

distributed.worker - INFO - Start worker at: tcp://10.142.15.201:43081

distributed.worker - INFO - Listening to: tcp://10.142.15.201:43081

distributed.worker - INFO - dashboard at: 10.142.15.201:43701

distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Threads: 4

distributed.worker - INFO - Memory: 27.34 GB

distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-9plktoms

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 33, 40)

distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 988, in _handle_write num_bytes = self.write_to_fd(self._write_buffer.peek(size)) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1169, in write_to_fd return self.socket.send(data) # type: ignore ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1281, in get_data compressed = await comm.write(msg, serializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 256, in write convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 32, 40)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 34, 40)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 32, 40) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 33, 40) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 35, 40)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 34, 40) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 35, 40) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 137, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 136, 22)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 136, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 172, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 138, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 173, 22)

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 173, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 175, 22)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 148, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 149, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 225, 32)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 174, 22) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 225, 32) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 174, 22)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 150, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 244, 94)

distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 151, 94)

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 151, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 244, 94) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 150, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - INFO - Stopping worker at tcp://10.142.15.201:43081

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 150, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File 
"/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File 
"/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() 
didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return 
await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File 
"/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time

distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler

Logs for most of the workers looked like "Worker 1" above and I didn't notice anything particularly noteworthy in perusing a bunch of them. What is odd about this run is that I still had 40 nodes in the cluster at the end. It appears that one of them became temporarily unavailable or was otherwise unreachable long enough to crash the job. @mrocklin what should Dask do in a scenario where one worker is unreachable? Does it try to reschedule the work elsewhere or fail the whole job?

@eric-czech
Collaborator Author

Yes, gcsfs allows fetching of many objects from the store with a single call, and they will be processed concurrently, returning back the set of bytes objects

Thanks @martindurant. Does dask need to do anything in particular to use that (presumably what was in zarr-developers/zarr-python#536)? @ravwojdyla mentioned that you need at least zarr 2.5.0, but we weren't sure if there was also more that needs to be done in dask or xarray to integrate it.

@martindurant

Zarr 2.5 is enough - but you need more than one zarr block per dask task, else you see no benefit.

@eric-czech
Collaborator Author

I see, thanks @martindurant. Well I'm down to <10MB chunks being necessary to make this workflow run without OOM errors so this seems like an important point of contention we're likely to run into again. Even if I wrote the zarr chunks to be small enough such that multiple of them would fit in one dask chunk, I can't imagine that parallel reading of <100k chunks (~1M in memory) would provide much of a benefit.

I'll see if the workload will tolerate large but uneven chunks (i.e. tall-skinny, short-fat). There are a lot of multiplications in it and given https://github.com/pystatgen/sgkit/issues/375, it stands to reason that we should have to rethink chunking in every workflow as a function of the number of columns involved.

@mrocklin

mrocklin commented Nov 18, 2020 via email

@mrocklin

mrocklin commented Nov 18, 2020 via email

@eric-czech
Collaborator Author

Eric, if you're able to reproduce this same problem but with a bit less of
the machinery here that would also make it easier for some of the other
performance experts to try things out on their own and weigh in

Hey @mrocklin, here is a notebook that isolates the dask code being used here: https://gist.github.com/eric-czech/daae30d54a5c96fd09f13ffa58a3bafe.

I'm fairly certain the problem is https://stackoverflow.com/questions/64774771/does-blockwise-allow-iteration-over-out-of-core-arrays, or rather the lack of scalability of matrix multiplication in dask. I was able to get this workflow to run on the 40 node cluster by reducing the chunk size in the variants dimension to something far smaller, since variant_chunk_size x n_samples arrays are being loaded into memory by blockwise. At my original chunking (5216 variants, 5792 samples), these arrays should have been about 5216 variants * 365941 samples * 4 bytes = 7.5 GB which is bigger than the 6.5 GB of RAM available per vCPU on n1-highmem-* instances.
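The estimate above can be reproduced with a few lines of Python. This is only a sketch: the helper name `blockwise_array_bytes` is made up for illustration, and it assumes 4-byte elements and that a full `variant_chunk_size x n_samples` array is materialised at once, as described above.

```python
def blockwise_array_bytes(variant_chunk_size, n_samples, itemsize=4):
    """Bytes needed to hold one variant chunk across ALL samples (assumed 4-byte elements)."""
    return variant_chunk_size * n_samples * itemsize

N_SAMPLES = 365_941        # sample count quoted above
RAM_PER_VCPU = 6.5e9       # bytes per vCPU on n1-highmem-*, as quoted above

size = blockwise_array_bytes(5216, N_SAMPLES)
print(f"{size / 1e9:.2f} GB needed vs {RAM_PER_VCPU / 1e9:.1f} GB available per vCPU")
print("fits" if size <= RAM_PER_VCPU else "does not fit")
```

Under these assumptions the array comes out slightly above the rounded figure quoted above and does not fit in the memory available to a single vCPU.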

@eric-czech
Collaborator Author

Also, I don't think using a different worker/core ratio changed much. The job was a good bit slower on 40 nodes instead of 20 (3hr 20m vs 2h 50m) but that may be more attributable to the different chunking needed. Either way, a few GCP monitoring readouts looked like this for the cluster (about the same as before):

Screen Shot 2020-11-18 at 5 55 54 PM

@mrocklin

mrocklin commented Nov 19, 2020 via email

@eric-czech
Collaborator Author

I think it's ok to publish it now. This workflow only uses lstsq and matrix multiplication, not any compiled functions or blockwise like the pairwise functions we were talking about in the context of array reductions on the call.

@eric-czech
Collaborator Author

Wow, rechunking to very short-fat chunks and re-running on a cluster of 40 n1-highmem-8 instances resulted in the whole workflow finishing in ~10 minutes (as opposed to 2 hours on a 20 node cluster) and I saw utilization like this:

Screen Shot 2020-11-19 at 3 25 00 PM

Performance report: https://drive.google.com/file/d/1vLZEwY0xea6Jc3VT_mS9HiZrv6NXeF2X/view?usp=sharing

This seems to summarize the differences fairly well:

Screen Shot 2020-11-19 at 4 43 13 PM

Based on only the performance reports though, I'm not sure how we could have known there was so much room for improvement. Nothing else jumps out to me as being predictive of that yet.

@eric-czech
Collaborator Author

As a negative control and because I was a little incredulous about this latest improvement, I reran this once again with small square chunks and saw memory usage swell followed by a severe slowdown in task processing rates. There really is something magical about the short-fat chunking. To summarize what I've tried (w.r.t chunking) and the results so far:

| input chunks | result |
| --- | --- |
| (5216, 5792) - original | OOM |
| (1304, 1448) - "1/16th" as big | works but takes hours |
| (5216, 724) - tall-skinny | OOM |
| (652, 5792) - short-fat | works, takes minutes |
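To put the chunk shapes above side by side, here is a small sketch that computes the in-memory size of a single chunk for each configuration. The 4-byte element size is an assumption carried over from the earlier estimate in this thread, and `chunk_bytes` is a hypothetical helper, not part of sgkit or Dask.

```python
ITEMSIZE = 4  # bytes per element; an assumption, not confirmed for every array in the workflow

def chunk_bytes(chunk_shape, itemsize=ITEMSIZE):
    """In-memory size of one (variants, samples) chunk."""
    n_variants, n_samples = chunk_shape
    return n_variants * n_samples * itemsize

# Chunk shapes and outcomes as reported in the table above
results = {
    (5216, 5792): "OOM",
    (1304, 1448): "works but takes hours",
    (5216, 724): "OOM",
    (652, 5792): "works, takes minutes",
}

for shape, outcome in results.items():
    print(f"{shape}: {chunk_bytes(shape) / 1e6:.1f} MB per chunk -> {outcome}")
```

Under this assumption the tall-skinny and short-fat chunks are exactly the same size in bytes, yet one runs out of memory and the other finishes in minutes, so chunk size alone does not explain the difference; the orientation of the chunks matters.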

@tomwhite
Collaborator

For the perf numbers in https://github.com/pystatgen/sgkit/issues/390#issuecomment-758708221, I was inadvertently running the client in a different GCP region to the cluster. When I re-ran with everything in the same region and zone, the white space almost completely disappears from the performance report (so the pauses were presumably due to the client-scheduler communication).

1000GB disk

Performance report
Duration: 81.65 s
Tasks Information
number of tasks: 13805
compute time: 460.41 s
deserialize time: 363.37 ms
disk-read time: 5.82 ms

(For this run I also switched to using Dask 2.30 from 2020.12.0 (and similarly for distributed), since the latter seems to be less stable for this workload as Eric mentioned earlier in this issue. So it's possible that the change is down to that difference. Still, we have a configuration where task utilization seems to be good.)

@ravwojdyla
Collaborator

@tomwhite nice. I can see that in the most recent report there isn't much spilling happening anymore. Do we still need the 1TB disk (assuming network shuffle and no spilling)?

@tomwhite
Collaborator

Do we still need the 1TB disk (assuming network shuffle and no spilling)?

No, as it turns out! I ran it again with the smaller disk (with everything in the same zone) and it ran fine with full task utilization. Sorry for the wild goose chase.

50GB disk
Performance report
Duration: 78.85 s
Tasks Information
number of tasks: 13805
compute time: 462.74 s
deserialize time: 325.14 ms
disk-read time: 383.54 ms

Is it fair to say that the larger workload that @eric-czech ran was spilling to disk and was suffering from slow disk performance? Although perhaps the goal is to tune it to avoid spilling (as far as possible) using the techniques outlined by @ravwojdyla in https://github.com/pystatgen/sgkit/issues/437#issuecomment-762395287. If spilling is not avoidable, however, then having more performant disks should help though.

@ravwojdyla
Collaborator

ravwojdyla commented Jan 19, 2021

@tomwhite nice, good to see that.

Is it fair to say that the larger workload that @eric-czech ran was spilling to disk and was suffering from slow disk performance?

That isn't entirely true, AFAIU looking at the performance reports above (provided by @eric-czech):

  • https://github.com/pystatgen/sgkit/issues/390#issuecomment-730660134 (pulling in the image below):

    image

    compares "square" to "short-fat" chunks. Notice that in the square-chunk screenshot we definitely see the impact of spilling (disk-{read/write} time), again because of a suboptimal chunking scheme. After switching to "short-fat" chunks @eric-czech observes a significant improvement in wall time, and in that screenshot we do NOT see much spilling. In the same screenshots notice the impact of the "transfer time", which is certainly something we could optimise further (more on that below). So the suboptimal chunking scheme in those two cases leads to spilling, which adds an extra cumulative ~1.2H (!). In the case of "square" chunks we certainly see the impact of spilling and disk size, BUT in the case of the more optimal "short-fat" chunks the total overhead of disk usage is about 6 seconds (which means in that case having larger disks would not help, and would only cost more).

Some other points:

  • in https://github.com/pystatgen/sgkit/issues/390#issuecomment-730469023 @eric-czech you say:

    Also, I don't think using a different worker/core ratio changed much.

    @eric-czech correct me if I'm wrong, but based on the comment I believe you are adjusting the number of workers (and VMs) but not the worker-thread? As pointed out in https://github.com/pystatgen/sgkit/issues/437#issuecomment-758849588, in GCE/GCP the worker/core ratio doesn't change much apart from adding more overall processing power (if you utilize max threads per worker, like cloud-provider would do by default), and in some cases it might make things worse by increasing the "transfer time" (since there are more VMs/workers). A more interesting tuning point would be the worker-thread/core ratio, which is more akin to the way you would control memory per "executor" in the MR/Spark realm.

  • looking at the report in https://github.com/pystatgen/sgkit/issues/390#issuecomment-730660134 "short-fat" chunks reduce spilling, but there is still quite a bit of cumulative "transfer time" 2.3hrs (~3x the computation time(!)). I would say this is the next point that we should optimize, and it would likely require searching for the chunking scheme that minimizes the matmul communication. A hypothesis I would explore: tune the size of chunks in a way that reduces the transfer needed in the "contraction"/reduction axis. If the chunks need to be larger, which might lead to less communication ("transfer time"), it might require adjusting the worker-thread/core ratio to accommodate for extra memory usage per thread (which without the adjustment would lead to spilling, see more context/options here: https://github.com/pystatgen/sgkit/issues/437#issuecomment-758849588). And I would use the new matmul implementation for these tests, since it allows you to have better control over the memory usage. I also wonder if some of the white space we see in the reports can be due to the communication overhead, where a task would wait for all the necessary upstream data.

  • @eric-czech was there a performance report for this workflow: https://github.com/pystatgen/sgkit/issues/390#issuecomment-748205731 ?

Although perhaps the goal is to tune it to avoid spilling (as far as possible) using the techniques outlined by @ravwojdyla in #437 (comment). If spilling is not avoidable, however, then having more performant disks should help though.

As you know, spilling is a very expensive operation (IO+serde roundtrip), and I would argue that for an important pipeline (like this one) we should optimise the data + pipeline + chunking scheme + cluster spec to avoid spilling.

@tomwhite what do you think? And also please let me know if I can help in any way.

@eric-czech
Collaborator Author

eric-czech commented Jan 20, 2021

@ravwojdyla

@eric-czech correct me if I'm wrong, but based on the comment I believe you are adjusting the number of workers (and VMs) but not the worker-thread?

That's right, I was only increasing or decreasing the number of workers for Dask CP. I never explicitly set the number of threads per worker.

@eric-czech was there a performance report for this workflow: #390 (comment) ?

Unfortunately no, I disabled the report generation since so many individual jobs were being run. In retrospect though, I wish I had saved them all so we could at least look at the last two.

@ravwojdyla
Collaborator

ravwojdyla commented Jan 20, 2021

That's right, I was only increasing or decreasing the number of workers for Dask CP. I never explicitly set the number of threads per worker.

Thanks for the confirmation @eric-czech. https://github.com/pystatgen/sgkit/issues/437#issuecomment-762395287 shows how to adjust that for a vanilla Dask distributed Client (which is trivial). In the dask-cloudprovider realm it's a bit more complicated. dask/dask-cloudprovider#173 is about support for multiple workers per VM, but that's actually not what we need: since most of the computation is numpy based, it should be fine (and even more optimal) to have a single worker per VM, unless we believe the GIL is a problem. What we need is a way to adjust how many worker-threads a single worker has. Here's how you can do that for dask-cloudprovider's GCP cluster:

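from dask_cloudprovider.gcp import GCPCluster

cluster = GCPCluster(
    projectid="foobar",
    n_workers=1,
    # here you could also disable spilling etc if you are in the "tuning mode"
    # nthreads: worker-threads per worker; memory_limit=1.0: use all of the VM's memory
    worker_options={"nthreads": 2, "memory_limit": 1.0},
    machine_type="n1-standard-4",
)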

and now instead of 1 worker with 4 threads (avg 3.75GB per thread), you get 1 worker with 2 threads and avg 7.5GB per thread. "memory_limit": 1.0 is paramount (without that option you would get 2 threads with avg 3.75GB per thread). You thus get more memory per thread, and can potentially use larger chunks (less task overhead) or avoid spilling. You might also want to adjust the numpy threads at this point: dask/dask-cloudprovider#230
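The per-thread figures above can be reproduced with a short sketch. The helper `worker_memory_per_thread_gb` is hypothetical; it models my understanding of the worker's default ("auto") memory limit as total memory scaled by `nthreads / cores`, and treats a float `memory_limit` as a fraction of total memory. It assumes an n1-standard-4 VM has 4 vCPUs and 15 GB of memory.

```python
def worker_memory_per_thread_gb(vm_memory_gb, vm_cores, nthreads, memory_limit=None):
    """Average memory per worker thread, in GB.

    memory_limit=None models the assumed "auto" default (memory scaled by nthreads / cores);
    a float <= 1.0 is treated as a fraction of the VM's total memory.
    """
    if memory_limit is None:
        limit_gb = vm_memory_gb * min(1.0, nthreads / vm_cores)
    else:
        limit_gb = vm_memory_gb * memory_limit
    return limit_gb / nthreads

# n1-standard-4: 4 vCPUs, 15 GB (assumed machine spec)
default = worker_memory_per_thread_gb(15, 4, nthreads=4)
two_threads_auto = worker_memory_per_thread_gb(15, 4, nthreads=2)
two_threads_all = worker_memory_per_thread_gb(15, 4, nthreads=2, memory_limit=1.0)

print(default, two_threads_auto, two_threads_all)
```

Halving the thread count without also raising the memory limit leaves the per-thread memory unchanged in this model, which is why the `memory_limit` setting matters.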

Here I would also reiterate that https://github.com/pystatgen/sgkit/issues/437#issuecomment-758849588 has other options to adjust "memory limits": worker resources and more recently layer annotations dask/dask#6701 (but I haven't tried them with GCPCluster).

@ravwojdyla
Collaborator

Some summary and takeaways from the meeting today:

  • "square" chunks {variant: 5216, sample: 5792} result in spilling and poor performance
  • "short and fat" chunks {variant: 652, sample: 5792} mitigate the spilling but we still see significant transfer time overhead
  • in this workflow matmul does contraction in the sample axis
  • I'm starting to develop this heuristic: in matmul, chunking of the contraction axis (sample in our case) will have a major impact on the amount of communication required, and in our case variant size/chunking will influence memory. I ran a couple of matmul tests to validate that, you can see reports here:

A hypothesis I would explore: tune the size of chunks in a way that reduces the transfer needed in the "contraction"/reduction axis.

To be more concrete, we saw the benefit of going from 5216 -> 652 in the variant axis, specifically no spilling (less memory overhead), and increasing the chunk size in the sample axis should reduce the communication/transfer time. So, to be more precise, I believe we should try to:

  • increase the chunk size in the sample axis as much as practical
  • and if we start spilling reduce the chunk size in the variant axis
  • if we need to change the worker-thread (WT)/memory ratio, previous comments describe how
  • so a "degenerated" case would be chunking: {variant: 1, sample: MAX}.
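The steps in the list above could be sketched as a small helper. Everything here is hypothetical: `pick_chunks`, the halving strategy, the starting variant chunk of 5216, the 4-byte element size and the per-thread memory budget are illustrative assumptions, and whether spilling actually occurs depends on more than the size of a single chunk.

```python
def pick_chunks(n_samples, budget_bytes, itemsize=4, start_variant_chunk=5216):
    """Pick {variant, sample} chunk sizes following the heuristic above.

    1. Use the full sample axis as one chunk (no chunking on the contraction axis).
    2. Halve the variant chunk until one chunk fits within the memory budget.
    """
    sample_chunk = n_samples
    variant_chunk = start_variant_chunk
    while variant_chunk > 1 and variant_chunk * sample_chunk * itemsize > budget_bytes:
        variant_chunk //= 2
    return {"variant": variant_chunk, "sample": sample_chunk}

# Illustrative budget of 1 GB per chunk with the sample count from this thread
print(pick_chunks(n_samples=365_941, budget_bytes=1e9))

# A tiny budget collapses to the "degenerated" case {variant: 1, sample: MAX}
print(pick_chunks(n_samples=365_941, budget_bytes=1))
```

With the 1 GB budget this sketch lands on a variant chunk of 652, the same value used for the "short and fat" run, although that agreement depends entirely on the assumed budget and starting point.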

Looking at https://github.com/pystatgen/sgkit/issues/390#issuecomment-748205731, I annotated the moment we switch from chr. 21 to chr. 11 and observe a slowdown:

spilling

  • notice that it leads to spilling
  • chr. 11 has 530k variants compared to 140k for chr. 21
  • again size/chunking in the variant axis had influence on the memory required, and led to spilling and slow downs
  • so for a "constant" cluster, maybe there exists a chunk size that is optimal for all chromosomes, but we might end up with chunking that depends on the chromosome, fortunately we have all the information to make that decision and it would be just a matter of coming up with a function that goes from chromosome size to chunking

@tomwhite
Collaborator

Thanks @ravwojdyla - that's very interesting and useful.

I have now reproduced the slowdown using @eric-czech's simulated data from #438. Running on a cluster of 8 n1-standard-8 workers I get the following execution times for multiples of the XY dataset:

| Multiple | Time (s) | Perf report |
| --- | --- | --- |
| 1 | 30.91 | pr_8444_365941_25_dist_1_gs.html.txt |
| 2 | 40.35 | pr_16888_365941_25_dist_1_gs.html.txt |
| 4 | 80.84 | pr_33776_365941_25_dist_1_gs.html.txt |
| 8 | 696.48 | pr_67552_365941_25_dist_1_gs.html.txt |

Notice that going from 4x to 8x produces a disproportionate increase in time taken. And this correlates with spilling, as shown on the GCP monitoring snapshot (spilling starts at 12:32). It also matches the whitespace in the taskstream.
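To make the disproportionate jump explicit, the timings above can be turned into step-to-step ratios. This is just arithmetic on the reported numbers; the variable names are illustrative.

```python
# Execution times (seconds) per multiple of the XY dataset, as reported above
timings = {1: 30.91, 2: 40.35, 4: 80.84, 8: 696.48}

multiples = sorted(timings)
step_ratios = {}
for prev, cur in zip(multiples, multiples[1:]):
    # Each step doubles the data; linear scaling would roughly double the time
    step_ratios[(prev, cur)] = timings[cur] / timings[prev]
    print(f"{prev}x -> {cur}x: data x{cur / prev:.0f}, time x{step_ratios[(prev, cur)]:.2f}")
```

The 2x to 4x step roughly doubles the time, as linear scaling would predict, while the 4x to 8x step multiplies it by more than 8.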

cluster_perf

This is using @ravwojdyla's new matmul implementation. The code I used to run the benchmark is here: https://github.com/tomwhite/gwas-benchmark

I think the next thing to try is different chunk sizes as suggested in Rafal's last comment.

@tomwhite
Collaborator

I had a look at the shapes and chunk sizes of the intermediate variables in the gwas function. To do this I broke out each computation into a separate cell in this "chunk report" notebook, so you can get an idea of what each operation is doing.

One thing that stuck out was the simple fact that the total memory needed for the computation (for 8x data) exceeds the cluster memory (see the bottom of the notebook) - but it doesn't for 4x data. So this explains why disk spilling is inevitable, and why we see a big slowdown from 4x to 8x. When I doubled the cluster size to 16 nodes, there was barely any spilling, and the 8x computation took ~200s as opposed to ~700s (on the 8 node cluster). A good improvement, but still lots of whitespace in the task stream. (See the performance report)

Next, I tried matching the output of the first, "outer" matmul (XC @ LS) to have same chunk sizes as the XL array, which it is later combined with. If XC is not chunked in the first dimension, then XC @ LS will not be chunked in the first dimension, and when XL - XC @ LS is computed there is a lot of data transfer since XL has roughly square chunks, while XC @ LS has tall skinny chunks. (It's easier to see this graphically in the chunk report notebook.) I think this is why there are a lot of "transfer-sum" bars in the task stream that has lots of whitespace.
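A minimal sketch of why the chunks end up mismatched: in a blocked matmul the output's row chunks follow the left operand's row chunks and its column chunks follow the right operand's column chunks. The helper `matmul_out_chunks` and the tiny array sizes below are made up for illustration (Dask-style chunk tuples, 8 samples, 2 covariates, 6 variants), not taken from the benchmark.

```python
def matmul_out_chunks(a_chunks, b_chunks):
    """Chunks of A @ B: rows follow A's row chunks, columns follow B's column chunks."""
    a_rows, a_inner = a_chunks
    b_inner, b_cols = b_chunks
    assert sum(a_inner) == sum(b_inner), "inner dimensions must match"
    return (a_rows, b_cols)

xl_chunks = ((4, 4), (3, 3))   # XL: samples x variants, roughly square chunks
ls_chunks = ((2,), (3, 3))     # LS: covariates x variants

# XC (samples x covariates) with a single chunk along the sample axis
xc_unchunked = ((8,), (2,))
print(matmul_out_chunks(xc_unchunked, ls_chunks))   # tall-skinny chunks, unlike XL

# XC rechunked along the sample axis to match XL
xc_rechunked = ((4, 4), (2,))
print(matmul_out_chunks(xc_rechunked, ls_chunks))   # same chunks as XL
```

When the product's chunks line up with XL's, the subtraction XL - XC @ LS can proceed block by block without regrouping data across workers.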

This helped a lot - the time is now down to 110s, so only 2.75x slower than the 4x data case from before. The task graph has very little whitespace now, which is a significant improvement. (See the performance report)

(Side note: I get a Dask warning saying 'PerformanceWarning: Increasing number of chunks by factor of 64' which actually is a good thing in this case!)

The notebook I used for the computation is here: https://nbviewer.jupyter.org/github/tomwhite/gwas-benchmark/blob/1a73ff865d100da7ea51c73456aa1a8526ff29b7/gwas_simulation.ipynb

Summary:

  • Use a cluster large enough to ensure all of the intermediate arrays can fit in cluster memory to avoid (very expensive) disk spilling.
  • Be careful about "outer" matmul operations since the output chunking probably isn't what you expect.

@tomwhite
Collaborator

I ran the benchmark on 16x data (135,104 variants) using the XC rechunking change from the previous comment. It took 185s on a 24 node n1-standard-8 cluster.

This data is comparable in size to chr21 (141,910 variants), and according to related-sciences/ukb-gwas-pipeline-nealelab#32, it took 150s on a 60 node n1-highmem-16 cluster.

  • 150 s * 960 cores / phenotype = 144,000 core seconds/phenotype
  • 185 s * 192 cores / phenotype = 35,520 core seconds/phenotype

So this is a ~4x speedup.
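The comparison above can be written out as a short calculation. It assumes 16 vCPUs per n1-highmem-16 node and 8 vCPUs per n1-standard-8 node, which is what the core counts above imply; `core_seconds` is an illustrative helper.

```python
def core_seconds(seconds, nodes, cores_per_node):
    """Total core-seconds consumed per phenotype."""
    return seconds * nodes * cores_per_node

before = core_seconds(150, nodes=60, cores_per_node=16)   # earlier chr21 run
after = core_seconds(185, nodes=24, cores_per_node=8)     # 16x simulated data run

speedup = before / after
print(before, after, round(speedup, 2))
```

Note that this compares core-seconds only; it does not account for the price difference between highmem and standard instances.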

@hammer
Contributor

hammer commented Jan 27, 2021

What is the cost improvement?

What would it cost if we used preemptible instances? I know Eric had problems with our workloads running reliably on preemptible instances, so I'm curious if that's even an option for us.

@tomwhite
Collaborator

tomwhite commented Jan 27, 2021

Cost improvement is probably a bit better than 4x since I wasn't using highmem instances (more like 5x maybe) - but these are all estimates, so there's likely to be some variation. (Also we don't know if the scaling is linear all the way up to chr1 size data.)

I'm not sure about how well preemptible instances work with Dask in general, or for our workload.

There is still potentially a lot of scope for reducing transfer times (e.g. for the second perf report in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568, the compute time is ~5000s and the transfer time is ~3000s), by improving the chunking further. My change looks like it fixed one of the more egregious cases, but there are likely other things we can do along the lines that @ravwojdyla has suggested. This would involve going through the chunk report in more detail.

@tomwhite
Collaborator

tomwhite commented Feb 9, 2021

To summarize the state of this issue, I think the main ways to improve performance have been identified (and fixed in a couple of cases):

  1. Improve scalability of Dask matmul (fixed in Rewrite matmul as blockwise without concatenated contraction dask/dask#7000)
  2. Rechunking the covariates array (fixed in Adjust chunksize in gwas_linear_regression to reduce data transfer between workers #454), ~4x speedup
  3. Persisting the input dataset in cluster memory (Investigate persisting input dataset in cluster memory on GWAS performance #449), ~1.5x speedup
  4. Using preemptible instances (Investigate use of preemptible GCP instances for GWAS #453), up to ~5x cost saving
  5. Using standard linear regression, not a mixed model Using map_blocks to do independent linear regressions (Understand Hail GWAS regression implementation #448), ~1.4x speedup
  6. Further rechunking improvements (Investigate further chunking improvements for better GWAS performance #461), unknown speedup

@mrocklin

mrocklin commented Feb 10, 2021 via email

@tomwhite
Collaborator

Another summary of the latest state. I think these three changes should solve the problem for the time being:

  1. Improve scalability of Dask matmul (fixed in dask/dask#7000)
  2. Rechunking so there is no chunking in the samples dimension (https://github.com/pystatgen/sgkit/issues/448#issuecomment-780655217), ~7x speedup
  3. Using preemptible instances (Investigate use of preemptible GCP instances for GWAS #453), up to ~5x cost saving

The next step would be to run on a suitable subset of the original UKB data to see if they work well enough in practice. (Note that we also have map_blocks as a fallback if 2 above doesn't scale well enough for our purposes.)

@ravwojdyla
Collaborator

ravwojdyla commented Feb 19, 2021

@tomwhite thanks for the summary and hard work! If I may, I would also like to document some potential experiments to further improve the performance (when we get back to this issue):

  • the https://github.com/pystatgen/sgkit/issues/448#issuecomment-780655217 uses chunking {variants: 16, samples: -1}. -1/MAX reduces the transfer time, that said cumulative transfer time is still roughly 25% of the computation time. As mentioned in https://github.com/pystatgen/sgkit/issues/390#issuecomment-764950336, at this time we could explore maximising variant size (without triggering spilling!). In a sense if we freeze some parameters like cluster type/size etc, we could frame this problem as an optimisation of a function of variant and sample chunk sizes with some standard techniques and an educated starting point, but at this point it might be enough to just continue manual optimisation (amusingly, it also reminded me of Herodotou, Herodotos, et al. "Starfish: A Self-tuning System for Big Data Analytics." Cidr. Vol. 11. No. 2011. 2011). So this is one area to explore.
  • zarr/IO still takes a significant amount of time, which is another area to experiment with; see the knowledge dump in Genetics data IO performance stats/doc #437. The first step here should be to capture more detailed metrics (including profiles of this step; for example, I would be curious to see how much time we spend decompressing data) and then make an educated decision. This could range from something as simple as experimenting with the native zarr chunk size or changing compression parameters, all the way to exploring alternative layouts or formats that could better serve the specific properties of genetic data or computation (like high compressibility, limited data types and values etc), or using a cache/hot storage layer (which btw could also be useful with preemptible VMs if used in a creative way).
