-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Identify lack of scalability in gwas_linear_regression #390
Comments
Notes from a more detailed performance report resulting from running this for a smaller dataset (that succeeds): (renamed to .txt to avoid github attachment filter) This zoomed-out view of the task stream in the report doesn't strike me as very healthy: Task graph: @mrocklin (cc: @ravwojdyla) do you have any suggestions on how to identify why the work isn't being distributed well on larger datasets for this workflow? |
Thank you for producing the performance report. If you want to publish these in the future then you may also want to look into gist.github.com and https://raw.githack.com/ . I've only looked very briefly at it, but the thing that stands out the most is the 760s transfer times leading up to |
I'm curious, do worker logs report anything strange? Also I'm curious, which version of distributed are you running? (adding this to the performance report here: dask/distributed#4249) cc'ing @quasiben , who cares a bit about this space and has a lot of experience tracking down similar performance problems. |
Also it looks like some of the workers are under memory pressure. I wonder
if maybe they're struggling to accept a large-ish payload.
…On Tue, Nov 17, 2020 at 11:13 AM Eric Czech ***@***.***> wrote:
Notes from a more detailed performance report resulting from running this
for a smaller dataset (that succeeds):
(renamed to .txt to avoid github attachment filter)
gwas-height-performance-report.html.txt
<https://github.com/pystatgen/sgkit/files/5555640/gwas-height-performance-report.html.txt>
This zoomed-out view of the task stream in the report doesn't strike me as
very healthy:
[image: Screen Shot 2020-11-17 at 2 07 57 PM]
<https://user-images.githubusercontent.com/6130352/99435626-62c72080-28de-11eb-9627-560ead8131e8.png>
Task graph:
[image: Screen Shot 2020-11-17 at 1 54 52 PM]
<https://user-images.githubusercontent.com/6130352/99435949-d406d380-28de-11eb-8097-2d3ba8e9bdaf.png>
@mrocklin <https://github.com/mrocklin> (cc: @ravwojdyla
<https://github.com/ravwojdyla>) do you have any suggestions on how to
identify why the work isn't being distributed well on larger datasets for
this workflow?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/390#issuecomment-729142413>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTEPOAG7V7AMMB4M3GTSQLDPHANCNFSM4TY6UJFA>
.
|
Two minutes is a long time for a disk read/write. My initial guess is that
there is some other external factor causing contention. That's just a wild
guess though.
Looking at the worker/thread ratio, I wonder if it would make more sense to
have far more workers with fewer threads each. Perhaps try four threads
per worker?
…On Tue, Nov 17, 2020 at 12:39 PM Benjamin Zaitlen ***@***.***> wrote:
When I looked I saw some fairly lengthy disk-read/writes which I assume to
be dask spilling. Which might correspond to the workers under memory
pressure
[image: Screen Shot 2020-11-17 at 3 32 36 PM]
<https://user-images.githubusercontent.com/1403768/99444378-20a3dc00-28ea-11eb-8890-fe2cf70240fd.png>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/390#issuecomment-729193609>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTGNBX6YOZINCQP2AZDSQLNO3ANCNFSM4TY6UJFA>
.
|
👍
I see messages like "Worker stream died during communication" so I'm sure a couple workers in the cluster had died. At the end I was down to 18 instead of 20. Full log from one worker is here jic.
Thanks Ben, what do you make of the "transfer-sub" tasks (the long red bars)? Do you have any intuition for what's happening in those? Memory pressure does seem to be part of the problem -- I rechunked my input to 1/16th of the original chunk size and the whole job has progressed further. Parallel utilization is still pretty disappointing across the whole cluster: CPU utilization across workers after rechunking input to 1/16th of original |
Alright, I can try that. Hey @quasiben, how do you set the number of workers per VM in Cloud Provider? |
Maybe you can choose smaller VMs?
…On Tue, Nov 17, 2020, 1:05 PM Eric Czech ***@***.***> wrote:
Looking at the worker/thread ratio, I wonder if it would make more sense to
have far more workers with fewer threads each. Perhaps try four threads
per worker?
Alright, I can try that. Hey @quasiben <https://github.com/quasiben>, how
do you set the number of workers per VM in Cloud Provider?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/390#issuecomment-729213000>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTDB4HWUPL3D2RSO7C3SQLQQJANCNFSM4TY6UJFA>
.
|
That's a good question. I don't think this is supported but shouldn't be too hard. You can control the number of threads with |
You bet! dask/dask-cloudprovider#173 |
The larger dataset with smaller chunks did ultimately finish with no errors. Here are a couple readouts: Performance report (26M): https://drive.google.com/file/d/1feWLKNrjQkslKDIZ7T39fPCNQQDBrQFs/view?usp=sharing It doesn't seem like any of network, disk, or cpu are even close to being saturated so I assume there is some room for improvement. As a very rough estimate, this task takes about 5 hrs on a single 64 vCPU VM and 3 hours on a cluster of 20 8 vCPU VMs (160 vCPUs). That would imply ~2 hours with perfect scaling so this should be an approximate ceiling for improvements. I will try again with smaller VMs and see if there are any major differences. |
FYI @ravwojdyla and I have been talking a bit about some similar observations on a much simpler workflow in https://github.com/related-sciences/data-team/issues/38 (private). One conclusion there was that the individual objects on GCS are so small that API requests aren't efficient yet bigger chunk sizes would start to make the workloads fail. The GCS objects for a chunk are roughly 2MiB on disk but >100MB in memory. Another potential explanation for this behavior is that the GCS objects aren't being loaded asynchronously. We're still investigating both. |
I don't know if you're aware of the work done by @martindurant on nicer async IO for remote storage and zarr, but he might be good to talk to. It looks like you're still primarily blocked by a few oddly long transfers. They're less prominent now, which is good, but probably worth investigating further. |
(sorry, wrong thread) |
@martindurant I suspect that this is not the issue for which you intended
this comment :)
…On Wed, Nov 18, 2020 at 8:14 AM Martin Durant ***@***.***> wrote:
I had tracebacks off for bisect, but now it luckily failed on the third
run :)
(py37) /Users/mdurant/code/distributed ((dcb46d07...)|BISECTING)> pytest --count=20 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation -x
========================================================================================== test session starts ===========================================================================================
platform darwin -- Python 3.7.8, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /Users/mdurant/code/distributed, configfile: setup.cfg
plugins: repeat-0.8.0, flaky-3.7.0, hypothesis-5.26.0, cov-2.10.1, profiling-1.7.0
collected 20 items
distributed/tests/test_failed_workers.py ..F
================================================================================================ FAILURES ================================================================================================
______________________________________________________________________________ test_broken_worker_during_computation[3-20] _______________________________________________________________________________
def test_func():
result = None
workers = []
with clean(timeout=active_rpc_timeout, **clean_kwargs) as loop:
async def coro():
with dask.config.set(config):
s = False
for i in range(5):
try:
s, ws = await start_cluster(
nthreads,
scheduler,
loop,
security=security,
Worker=Worker,
scheduler_kwargs=scheduler_kwargs,
worker_kwargs=worker_kwargs,
)
except Exception as e:
logger.error(
"Failed to start gen_cluster, retrying",
exc_info=True,
)
await asyncio.sleep(1)
else:
workers[:] = ws
args = [s] + workers
break
if s is False:
raise Exception("Could not start cluster")
if client:
c = await Client(
s.address,
loop=loop,
security=security,
asynchronous=True,
**client_kwargs,
)
args = [c] + args
try:
future = func(*args)
if timeout:
future = asyncio.wait_for(future, timeout)
result = await future
if s.validate:
s.validate_state()
finally:
if client and c.status not in ("closing", "closed"):
await c._close(fast=s.status == Status.closed)
await end_cluster(s, workers)
await asyncio.wait_for(cleanup_global_workers(), 1)
try:
c = await default_client()
except ValueError:
pass
else:
await c._close(fast=True)
def get_unclosed():
return [c for c in Comm._instances if not c.closed()] + [
c
for c in _global_clients.values()
if c.status != "closed"
]
try:
start = time()
while time() < start + 5:
gc.collect()
if not get_unclosed():
break
await asyncio.sleep(0.05)
else:
if allow_unclosed:
print(f"Unclosed Comms: {get_unclosed()}")
else:
raise RuntimeError("Unclosed Comms", get_unclosed())
finally:
Comm._instances.clear()
_global_clients.clear()
return result
result = loop.run_sync(
> coro, timeout=timeout * 2 if timeout else timeout
)
distributed/utils_test.py:954:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../conda/envs/py37/lib/python3.7/site-packages/tornado/ioloop.py:532: in run_sync
return future_cell[0].result()
distributed/utils_test.py:912: in coro
result = await future
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
fut = <Task cancelled coro=<test_broken_worker_during_computation() done, defined at /Users/mdurant/code/distributed/distributed/tests/test_failed_workers.py:315>>, timeout = 60
async def wait_for(fut, timeout, *, loop=None):
"""Wait for the single Future or coroutine to complete, with timeout.
Coroutine will be wrapped in Task.
Returns result of the Future or coroutine. When a timeout occurs,
it cancels the task and raises TimeoutError. To avoid the task
cancellation, wrap it in shield().
If the wait is cancelled, the task is also cancelled.
This function is a coroutine.
"""
if loop is None:
loop = events.get_event_loop()
if timeout is None:
return await fut
if timeout <= 0:
fut = ensure_future(fut, loop=loop)
if fut.done():
return fut.result()
fut.cancel()
raise futures.TimeoutError()
waiter = loop.create_future()
timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
cb = functools.partial(_release_waiter, waiter)
fut = ensure_future(fut, loop=loop)
fut.add_done_callback(cb)
try:
# wait until the future completes or the timeout
try:
await waiter
except futures.CancelledError:
fut.remove_done_callback(cb)
fut.cancel()
raise
if fut.done():
return fut.result()
else:
fut.remove_done_callback(cb)
# We must ensure that the task is not running
# after wait_for() returns.
# See https://bugs.python.org/issue32751
await _cancel_and_wait(fut, loop=loop)
> raise futures.TimeoutError()
E concurrent.futures._base.TimeoutError
../../conda/envs/py37/lib/python3.7/asyncio/tasks.py:449: TimeoutError
------------------------------------------------------------------------------------------ Captured stderr call ------------------------------------------------------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:52676
distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52677
distributed.worker - INFO - Listening to: tcp://127.0.0.1:52677
distributed.worker - INFO - dashboard at: 127.0.0.1:52678
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 17.18 GB
distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/worker-_1zsf66b
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52679
distributed.worker - INFO - Listening to: tcp://127.0.0.1:52679
distributed.worker - INFO - dashboard at: 127.0.0.1:52680
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 17.18 GB
distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/worker-cuz87qov
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52677', name: 0, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52677
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52679', name: 1, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52679
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Receive client connection: Client-b80cc33a-29b8-11eb-9578-acde48001122
distributed.core - INFO - Starting established connection
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:52685'
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52686
distributed.worker - INFO - Listening to: tcp://127.0.0.1:52686
distributed.worker - INFO - dashboard at: 127.0.0.1:52687
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 2.86 GB
distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/dask-worker-space/worker-322ii8r5
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52686', name: tcp://127.0.0.1:52686, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52686
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Run out-of-band function '_exit'
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52686', name: tcp://127.0.0.1:52686, memory: 10, processing: 80>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:52686
distributed.core - INFO - Lost connection to 'tcp://127.0.0.1:52684': in <closed TCP>: Stream is closed
distributed.nanny - INFO - Worker process 71091 exited with status 1
distributed.nanny - WARNING - Restarting worker
distributed.worker - INFO - Start worker at: tcp://127.0.0.1:52714
distributed.worker - INFO - Listening to: tcp://127.0.0.1:52714
distributed.worker - INFO - dashboard at: 127.0.0.1:52716
distributed.worker - INFO - Waiting to connect to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 2
distributed.worker - INFO - Memory: 2.86 GB
distributed.worker - INFO - Local Directory: /Users/mdurant/code/distributed/dask-worker-space/dask-worker-space/worker-nlpop5co
distributed.worker - INFO - -------------------------------------------------
distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:52714', name: tcp://127.0.0.1:52714, memory: 0, processing: 0>
distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:52714
distributed.core - INFO - Starting established connection
distributed.worker - INFO - Registered to: tcp://127.0.0.1:52676
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:52686
Traceback (most recent call last):
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 322, in connect
_raise(error)
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b03d250>: ConnectionRefusedError: [Errno 61] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mdurant/code/distributed/distributed/worker.py", line 1993, in gather_dep
self.rpc, deps, worker, who=self.address
File "/Users/mdurant/code/distributed/distributed/worker.py", line 3196, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 390, in retry_operation
operation=operation,
File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/Users/mdurant/code/distributed/distributed/worker.py", line 3173, in _get_data
comm = await rpc.connect(worker)
File "/Users/mdurant/code/distributed/distributed/core.py", line 1035, in connect
**self.connection_args,
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 334, in connect
_raise(error)
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b03d250>: ConnectionRefusedError: [Errno 61] Connection refused
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:52686
Traceback (most recent call last):
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 322, in connect
_raise(error)
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b36e210>: ConnectionRefusedError: [Errno 61] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mdurant/code/distributed/distributed/worker.py", line 1993, in gather_dep
self.rpc, deps, worker, who=self.address
File "/Users/mdurant/code/distributed/distributed/worker.py", line 3196, in get_data_from_worker
return await retry_operation(_get_data, operation="get_data_from_worker")
File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 390, in retry_operation
operation=operation,
File "/Users/mdurant/code/distributed/distributed/utils_comm.py", line 370, in retry
return await coro()
File "/Users/mdurant/code/distributed/distributed/worker.py", line 3173, in _get_data
comm = await rpc.connect(worker)
File "/Users/mdurant/code/distributed/distributed/core.py", line 1035, in connect
**self.connection_args,
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 334, in connect
_raise(error)
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x17b36e210>: ConnectionRefusedError: [Errno 61] Connection refused
distributed.core - ERROR - Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x111a7e550>: ConnectionRefusedError: [Errno 61] Connection refused
Traceback (most recent call last):
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 322, in connect
_raise(error)
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x111a7e550>: ConnectionRefusedError: [Errno 61] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/mdurant/code/distributed/distributed/core.py", line 528, in handle_comm
result = await result
File "/Users/mdurant/code/distributed/distributed/scheduler.py", line 3040, in broadcast
[send_message(address) for address in addresses if address is not None]
File "/Users/mdurant/code/distributed/distributed/utils.py", line 230, in All
result = await tasks.next()
File "/Users/mdurant/code/distributed/distributed/scheduler.py", line 3031, in send_message
comm = await self.rpc.connect(addr)
File "/Users/mdurant/code/distributed/distributed/core.py", line 1035, in connect
**self.connection_args,
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 334, in connect
_raise(error)
File "/Users/mdurant/code/distributed/distributed/comm/core.py", line 275, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: Timed out trying to connect to 'tcp://127.0.0.1:52686' after 5 s: in <distributed.comm.tcp.TCPConnector object at 0x111a7e550>: ConnectionRefusedError: [Errno 61] Connection refused
distributed.scheduler - INFO - Remove client Client-b80cc33a-29b8-11eb-9578-acde48001122
distributed.scheduler - INFO - Remove client Client-b80cc33a-29b8-11eb-9578-acde48001122
distributed.scheduler - INFO - Close client connection: Client-b80cc33a-29b8-11eb-9578-acde48001122
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52677
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52679
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52677', name: 0, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:52677
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52679', name: 1, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:52679
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:52714', name: tcp://127.0.0.1:52714, memory: 0, processing: 0>
distributed.core - INFO - Removing comms to tcp://127.0.0.1:52714
distributed.scheduler - INFO - Lost all workers
distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:52714
distributed.nanny - INFO - Worker closed
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:52685'
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/390#issuecomment-729785585>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTD2GWBR2GGYIOVSBRTSQPXF3ANCNFSM4TY6UJFA>
.
|
Indeed - but I did want to comment here too. Yes, gcsfs allows fetching of many objects from the store with a single call, and they will be processed concurrently, returning back the set of bytes objects. In the case that these are to be expanded in memory, you would of course have to deal with the blocks of bytes one at a time, to not exceed RAM. For zarr, this does mean transiently higher memory usage during decompression, depending on the number of storage chunks per dask partition - the best tradeoff would be workload-dependent. |
In trying this on a cluster of 40 4 vCPU machines (instead of 20 8 vCPUs), the workflow failed with some similar errors to my first attempt on this issue. It doesn't look like memory pressure was an issue this time but I'm not certain. I didn't see log messages about it anyhow, and this was with chunks 1/16th the size of the original, or ~6MB in memory. The client-side error I hit was again:
Here are some worker logs: Worker 1distributed.worker - INFO - Start worker at: tcp://10.142.15.198:37407distributed.worker - INFO - Listening to: tcp://10.142.15.198:37407 distributed.worker - INFO - dashboard at: 10.142.15.198:38769 distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 4 distributed.worker - INFO - Memory: 27.34 GB distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-jst4aey2 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786 distributed.worker - INFO - ------------------------------------------------- distributed.worker - ERROR - failed during get data with tcp://10.142.15.198:37407 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 0, 13) distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 0, 13) 0 . Asking scheduler Worker 2distributed.worker - INFO - Start worker at: tcp://10.142.15.201:43081distributed.worker - INFO - Listening to: tcp://10.142.15.201:43081 distributed.worker - INFO - dashboard at: 10.142.15.201:43701 distributed.worker - INFO - Waiting to connect to: tcp://10.142.0.13:8786 distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Threads: 4 distributed.worker - INFO - Memory: 27.34 GB distributed.worker - INFO - Local Directory: /dask-worker-space/dask-worker-space/worker-9plktoms distributed.worker - INFO - ------------------------------------------------- distributed.worker - INFO - Registered to: tcp://10.142.0.13:8786 distributed.worker - INFO - ------------------------------------------------- distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.207:45671 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1282, in get_data response = await comm.read(deserializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer distributed.worker - ERROR - Worker stream died during communication: tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 882, in _read_to_buffer bytes_read = self.read_from_fd(buf) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1158, in read_from_fd return self.socket.recv_into(buf, len(buf)) ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1979, in gather_dep response = await get_data_from_worker( File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3255, in get_data_from_worker return await retry_operation(_get_data, operation="get_data_from_worker") File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 3235, in _get_data response = await send_recv( File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 666, in send_recv response = await comm.read(deserializers=deserializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 201, in read convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 33, 40) distributed.worker - ERROR - failed during get data with tcp://10.142.15.201:43081 -> tcp://10.142.15.192:44791 Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 988, in _handle_write num_bytes = self.write_to_fd(self._write_buffer.peek(size)) File "/opt/conda/lib/python3.8/site-packages/tornado/iostream.py", line 1169, in write_to_fd return self.socket.send(data) # type: ignore ConnectionResetError: [Errno 104] Connection reset by peer The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1281, in get_data compressed = await comm.write(msg, serializers=serializers) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 256, in write convert_stream_closed_error(self, e) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 121, in convert_stream_closed_error raise CommClosedError( distributed.comm.core.CommClosedError: in : ConnectionResetError: [Errno 104] Connection reset by peer distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 32, 40) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 34, 40) distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 32, 40) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 33, 40) 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 35, 40) distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 34, 40) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 35, 40) 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 137, 22) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 136, 22) distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 136, 22) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 172, 22) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 138, 22) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 173, 22) distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 173, 22) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 175, 22) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 148, 94) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 149, 94) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 225, 32) distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 174, 22) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 225, 32) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 174, 22) distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 150, 94) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 244, 94) distributed.worker - INFO - Can't find dependencies for key ('sub-898e395113aba3815870801c86a3e2c0', 151, 94) distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 151, 94) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 244, 94) 0 . Asking scheduler distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 150, 94) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler distributed.worker - INFO - Stopping worker at tcp://10.142.15.201:43081 distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 150, 94) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 148, 94) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 137, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 172, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 138, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 149, 94) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler distributed.worker - ERROR - Handle missing dep failed, retrying Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 322, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 2119, in handle_missing_dep who_has = await retry_operation(self.scheduler.who_has, keys=list(deps)) File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 385, in retry_operation return await retry( File "/opt/conda/lib/python3.8/site-packages/distributed/utils_comm.py", line 370, in retry return await coro() File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 880, in send_recv_from_rpc comm = await self.pool.connect(self.addr) File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 1031, in connect comm = await connect( File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 334, in connect _raise(error) File "/opt/conda/lib/python3.8/site-packages/distributed/comm/core.py", line 275, in _raise raise IOError(msg) OSError: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: Timed out trying to connect to 'tcp://10.142.0.13:8786' after 10 s: connect() didn't finish in time distributed.worker - INFO - Dependent not found: ('rechunk-merge-368a2d33a31a45529317897615ed51b0', 175, 22) 0 . Asking scheduler Logs for most of the workers looked like "Worker 1" above and I didn't notice anything particularly noteworthy in perusing a bunch of them. What is odd about this run is that I still had 40 nodes in the cluster at the end. It appears that one of them become temporarily unavailable or was otherwise unreachable long enough to crash the job. @mrocklin what should Dask do in a scenario where one worker is unreachable? Does it try to reschedule the work elsewhere or fail the whole job? |
Thanks @martindurant. Does dask need to do anything in particular to use that (presumably what was in zarr-developers/zarr-python#536)? @ravwojdyla mentioned that you need at least zarr 2.5.0, but we weren't sure if there was also more that needs to be done in dask or xarray to integrate it. |
Zarr 2.5 is enough - but you need more than one zarr block per dask task, else you see no benefit. |
I see, thanks @martindurant. Well I'm down to <10MB chunks being necessary to make this workflow run without OOM errors so this seems like an important point of contention we're likely to run into again. Even if I wrote the zarr chunks to be small enough such that multiple of them would fit in one dask chunk, I can't imagine that parallel reading of <100k chunks (~1M in memory) would provide much of a benefit. I'll see if the workload will tolerate large but uneven chunks (i.e. tall-skinny, short-fat). There are a lot of multiplications in it and given https://github.com/pystatgen/sgkit/issues/375, it stands to reason that we should have to rethink chunking in every workflow as a function of the number of columns involved. |
Eric, if you're able to reproduce this same problem but with a bit less of
the machinery here that would also make it easier for some of the other
performance experts to try things out on their own and weigh in. I'm not
sure if that's easy for you, but if you can recreate this with a random
dataset and some tensordot calls that would be ideal.
…On Wed, Nov 18, 2020 at 10:36 AM Eric Czech ***@***.***> wrote:
I see, thanks @martindurant <https://github.com/martindurant>. Well I'm
down to <10MB chunks being necessary to make this workflow run without OOM
errors so this seems like an important point of contention we're likely to
run into again. Even if I wrote the zarr chunks to be small enough such
that multiple of them would fit in one dask chunk, I can't imagine that
parallel reading of <100k chunks (~1M in memory) would provide much of a
benefit.
I'll see if the workload will tolerate large but uneven chunks (i.e.
tall-skinny, short-fat). There are a lot of multiplications in it and given
#375 <https://github.com/pystatgen/sgkit/issues/375>, it stands to reason
that we should have to rethink chunking in every workflow as a function of
the number of columns involved.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/390#issuecomment-729875174>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTEVUBEEAAIBFFGTRY3SQQH2RANCNFSM4TY6UJFA>
.
|
We could even turn it into a performance case study and try to get a few
folks interested in the problem
…On Wed, Nov 18, 2020 at 10:54 AM Matthew Rocklin ***@***.***> wrote:
Eric, if you're able to reproduce this same problem but with a bit less of
the machinery here that would also make it easier for some of the other
performance experts to try things out on their own and weigh in. I'm not
sure if that's easy for you, but if you can recreate this with a random
dataset and some tensordot calls that would be ideal.
On Wed, Nov 18, 2020 at 10:36 AM Eric Czech ***@***.***>
wrote:
> I see, thanks @martindurant <https://github.com/martindurant>. Well I'm
> down to <10MB chunks being necessary to make this workflow run without OOM
> errors so this seems like an important point of contention we're likely to
> run into again. Even if I wrote the zarr chunks to be small enough such
> that multiple of them would fit in one dask chunk, I can't imagine that
> parallel reading of <100k chunks (~1M in memory) would provide much of a
> benefit.
>
> I'll see if the workload will tolerate large but uneven chunks (i.e.
> tall-skinny, short-fat). There are a lot of multiplications in it and given
> #375 <https://github.com/pystatgen/sgkit/issues/375>, it stands to
> reason that we should have to rethink chunking in every workflow as a
> function of the number of columns involved.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/pystatgen/sgkit/issues/390#issuecomment-729875174>,
> or unsubscribe
> <https://github.com/notifications/unsubscribe-auth/AACKZTEVUBEEAAIBFFGTRY3SQQH2RANCNFSM4TY6UJFA>
> .
>
|
Hey @mrocklin, here is a notebook that isolates the dask code being used here: https://gist.github.com/eric-czech/daae30d54a5c96fd09f13ffa58a3bafe. I'm fairly certain the problem is https://stackoverflow.com/questions/64774771/does-blockwise-allow-iteration-over-out-of-core-arrays, or rather the lack of scalability of matrix multiplication in dask. I was able to get this workflow to run on the 40 node cluster by reducing the chunk size in the variants dimension to something far smaller, since variant_chunk_size x n_samples arrays are being loaded into memory by blockwise. At my original chunking (5216 variants, 5792 samples), these arrays should have been about 5216 variants * 365941 samples * 4 bytes = 7.5 GB which is bigger than the 6.5 GB of RAM available per vCPU on n1-highmem-* instances. |
Also, I don't think using a different worker/core ratio changed much. The job was a good bit slower on 40 nodes instead of 20 (3hr 20m vs 2h 50m) but that may be more attributable to the different chunking needed. Either way, a few GCP monitoring readouts looked like this for the cluster (about the same as before): |
Eric, should I wait on publishing the problem as a dask issue until after
you've had a chance to digest dask array reductions ? Or was the
performance here a different issue?
…On Thu, Nov 19, 2020 at 7:57 AM Eric Czech ***@***.***> wrote:
Also, I don't think using a different worker/core ratio changed much. The
job was a good bit slower on 40 nodes instead of 20 (3hr 20m vs 2h 50m) but
that may be more attributable to the different chunking needed. Either way,
a few GCP monitoring readouts looked like this for the cluster (about the
same as before):
[image: Screen Shot 2020-11-18 at 5 55 54 PM]
<https://user-images.githubusercontent.com/6130352/99690145-cf186000-2a55-11eb-911a-25fb00795194.png>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/390#issuecomment-730469023>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTC76UYRYV3CEVTQOI3SQU55VANCNFSM4TY6UJFA>
.
|
I think it's ok to publish it now. This workflow only uses lstsq and matrix multiplication, not any compiled functions or blockwise like the pairwise functions we were talking about in the context of array reductions on the call. |
Wow, rechunking to very short-fat chunks and re-running on a cluster of 40 n1-highmem-8 instances resulted in the whole workflow finishing in ~10 minutes (as opposed to 2 hours on a 20 node cluster) and I saw utilization like this: Performance report: https://drive.google.com/file/d/1vLZEwY0xea6Jc3VT_mS9HiZrv6NXeF2X/view?usp=sharing This seems to summarize the differences fairly well: Based on only the performance reports though, I'm not sure how we could have known there was so much room for improvement. Nothing else jumps out to me as being predictive of that yet. |
As a negative control and because I was a little incredulous about this latest improvement, I reran this once again with small square chunks and saw memory usage swell followed by a severe slowdown in task processing rates. There really is something magical about the short-fat chunking. To summarize what I've tried (w.r.t chunking) and the results so far:
|
For the perf numbers in https://github.com/pystatgen/sgkit/issues/390#issuecomment-758708221, I was inadvertently running the client in a different GCP region to the cluster. When I re-ran with everything in the same region and zone, the white space almost completely disappears from the performance report (so the pauses were presumably due to the client-scheduler communication). 1000GB diskPerformance report (For this run I also switched to using Dask 2.30 from 2020.12.0 (and similarly for distributed), since the latter seems to be less stable for this workload as Eric mentioned earlier in this issue. So it's possible that the change is down to that difference. Still, we have a configuration where task utilization seems to be good.) |
@tomwhite nice. I can see that in the most recent report there isn't much spilling happening anymore. Do we still need the 1TB disk (assuming network shuffle and no spilling)? |
No, as it turns out! I ran it again with the smaller disk (with everything in the same zone) and it ran fine with full task utilization. Sorry for the wild goose chase. 50GB disk Is it fair to say that the larger workload that @eric-czech ran was spilling to disk and was suffering from slow disk performance? Although perhaps the goal is to tune it to avoid spilling (as far as possible) using the techniques outlined by @ravwojdyla in https://github.com/pystatgen/sgkit/issues/437#issuecomment-762395287. If spilling is not avoidable, however, then having more performant disks should help though. |
@tomwhite nice, good to see that.
That isn't entirely true, AFAIU looking at the performance reports above (provided by @eric-czech):
Some other points:
As you know, spilling is a very expensive operation (IO+serde roundtrip), and I would argue for important pipeline (like this one), we should optimise the data + pipeline + chunking scheme + cluster spec to avoid spilling. @tomwhite what do you think? And also please let me know if I can help in any way. |
That's right, I was only increasing or decreasing the number of workers for Dask CP. I never explicitly set the number of threads per worker.
Unfortunately no, I disabled the report generation since so many individual jobs were being run. In retrospect though, I wish I had saved them all so we could at least look at the last two. |
Thanks for the confirmation @eric-czech. https://github.com/pystatgen/sgkit/issues/437#issuecomment-762395287 shows how to adjust that for vanilla Dask distributed Client (which is trivial). In the dask-cloudprovider realm it's a bit more complicated, dask/dask-cloudprovider#173 is about support for multiple workers per VM, but that's actually not the case we need (since most computation is numpy based, it should be fine (and even more optimal) to have a single worker per VM, unless we believe GIL is a problem), what we need is a way to adjust how many worker-threads a single worker has, here's how you can do that for dask-cloudprovider's GCP cluster: GCPCluster(projectid="foobar",
n_workers=1,
# here you could also disable spilling etc if you are in the "tuning mode"
worker_options={"nthreads": 2, "memory_limit": 1.0},
machine_type="n1-standard-4") and now instead of 1 worker with 4 threads (avg 3.75GB per thread), you get 1 worker with avg 7.5GB per thread. Here I would also reiterate that https://github.com/pystatgen/sgkit/issues/437#issuecomment-758849588 has other options to adjust "memory limits": worker resources and more recently layer annotations dask/dask#6701 (but I haven't tried them with |
Some summary and takeaways from the meeting today:
To be more concrete, we saw benefit of going from 5216 -> 652 in the variant axis, specifically no spilling (less memory overhead), and increasing the size of samples should reduce the communication/transfer time. So overall to be more precise, I believe we should try to:
Looking at the https://github.com/pystatgen/sgkit/issues/390#issuecomment-748205731, I annotate the moment we switch from chr. 21 to chr. 11 and observe slow down:
|
Thanks @ravwojdyla - that's very interesting and useful. I have now reproduced the slowdown using @eric-czech's simulated data from #438. Running on a cluster of 8
Notice that going from 4x to 8x produces a disproportionate increase in time taken. And this correlates with spilling, as shown on the GCP monitoring snapshot (spilling starts at 12:32). It also matches the whitespace in the taskstream. This is using @ravwojdyla's new matmul implementation. The code I used to run the benchmark is here: https://github.com/tomwhite/gwas-benchmark I think the next thing to try is different chunk sizes as suggested in Rafal's last comment. |
I had a look at the shapes and chunk sizes of the intermediate variables in the One thing that stuck out was the simple fact that the total memory needed for the computation (for 8x data) exceeds the cluster memory (see the bottom of the notebook) - but it doesn't for 4x data. So this explains why disk spilling is inevitable, and why we see a big slowdown from 4x to 8x. When I doubled the cluster size to 16 nodes, there was barely any spilling, and the 8x computation took ~200s as opposed to ~700s (on the 8 node cluster). A good improvement, but still lots of whitespace in the task stream. (See the performance report) Next, I tried matching the output of the first, "outer" This helped a lot - the time is now down to 110s, so only 2.75x slower than the 4x data case from before. The task graph has very little whitespace now, which is a significant improvement. (See the performance report) (Side note: I get a Dask warning saying 'PerformanceWarning: Increasing number of chunks by factor of 64' which actually is a good thing in this case!) The notebook I used for the computation is here: https://nbviewer.jupyter.org/github/tomwhite/gwas-benchmark/blob/1a73ff865d100da7ea51c73456aa1a8526ff29b7/gwas_simulation.ipynb Summary:
|
I ran the benchmark on 16x data (135,104 variants) using the This data is comparable in size to chr21 (141,910 variants), and according to related-sciences/ukb-gwas-pipeline-nealelab#32, it took 150s on a 60 node n1-highmem-16 cluster.
So this is a ~4x speedup. |
What is the cost improvement? What would it cost if we used preemptible instances? I know Eric had problems with our workloads running reliably on preemptible instances, so I'm curious if that's even an option for us. |
Cost improvement is probably a bit better than 4x since I wasn't using highmem instances (more like 5x maybe) - but these are all estimates, so there's likely to be some variation. (Also we don't know if the scaling is linear all the way up to chr1 size data.) I'm not sure about how well preemptible instances work with Dask in general, or for our workload. There is still potentially a lot of scope for reducing transfer times (e.g. for the second perf report in https://github.com/pystatgen/sgkit/issues/390#issuecomment-768332568, the compute time is ~5000s and the transfer time is ~3000s), by improving the chunking further. My change looks like it fixed one of the more egregious cases, but there are likely other things we can do along the lines that @ravwojdyla has suggested. This would involve going through the chunk report in more detail. |
To summarize the state of this issue, I think the main ways to improve performance have been identified (and fixed in a couple of cases):
|
Thanks for the update Tom. This has been a very effective case study to
motivate general performance enhancements. Should you ever find some free
time (ha!) I encourage you to write things up . I think that there are
some folks out there that would find it interesting.
…On Tue, Feb 9, 2021 at 4:10 AM Tom White ***@***.***> wrote:
To summarize the state of this issue, I think the main ways to improve
performance have been identified (and fixed in a couple of cases):
1. Improve scalability of Dask matmul (fixed in dask/dask#7000
<dask/dask#7000>)
2. Rechunking the covariates array (fixed in #454
<https://github.com/pystatgen/sgkit/pull/454>), ~4x speedup
3. Persisting the input dataset in cluster memory (#449
<https://github.com/pystatgen/sgkit/issues/449>), ~1.5x speedup
4. Using preemptible instances (#453
<https://github.com/pystatgen/sgkit/issues/453>), up to ~5x cost saving
5. Using standard linear regression, not a mixed model (#448
<https://github.com/pystatgen/sgkit/issues/448>), ~2x speedup
6. Further rechunking improvements (#461
<https://github.com/pystatgen/sgkit/issues/461>), unknown speedup
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/pystatgen/sgkit/issues/390#issuecomment-775822748>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AACKZTCU5DAM3HRIJLQIW73S6EC2BANCNFSM4TY6UJFA>
.
|
Another summary of the latest state. I think these three changes should solve the problem for the time being:
The next step would be to run on a suitable subset of the original UKB data to see if they work well enough in practice. (Note that we also have |
@tomwhite thanks for the summary and hard work! If I may, I would also like to document some potential experiments to further improve the performance (when we get back to this issue):
|
It appears that this function does not scale well when run on a cluster.
Notes from my most recent attempt:
CPU utilization across worker VMs
Status Page
Full Task List
The job ultimately failed with the error "ValueError: Could not find dependent ('transpose-e1c6cc7244771a105b73686cc88c4e43', 42, 21). Check worker logs".
Several of the workers show log messages like this:
Perhaps this is what happens when one node unexpectedly becomes unreachable? I'm not sure.
I will run this again on a smaller dataset that didn't fail to get a performance report and task graph screenshot (which doesn't work on this data because the UI won't render so many nodes).
The text was updated successfully, but these errors were encountered: