xarray groupby monthly mean fail case #99

Closed

rabernat opened this issue Feb 6, 2018 · 25 comments

rabernat commented Feb 6, 2018

I made a notebook for pangeo.pydata.org that reproduces part of the dask / xarray groupby fail case discussed in pydata/xarray#1832.

https://gist.github.com/rabernat/7fe92f2a41dbfe651493d6864e46031a

It would be great if anyone (e.g. @mrocklin) wanted to have a look at this and try to debug it...
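
For context, here is a minimal sketch of the kind of workload the notebook exercises (illustrative only; the real notebook is in the gist above and runs against pangeo.pydata.org):

```python
# Minimal sketch of the groupby monthly-mean workload (illustrative, not the exact gist).
import gcsfs
import xarray as xr
from dask.distributed import Client

client = Client()  # assumes an already-running dask-distributed cluster

# The zarr store mentioned later in this thread; any chunked zarr store on GCS would do.
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
ds = xr.open_zarr(gcsmap)

ds_mm = ds.groupby('time.month').mean(dim='time')  # monthly climatology, built lazily
ds_mm.load()                                       # triggers the distributed computation
```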

rabernat commented Feb 6, 2018

Here is the error that shows up in the notebook:

tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-vgos-getitem-29a065de9258b983a53e9570b918b0c6', 65, 0, 0)" processing>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-sla-getitem-453ed0cf1aab33146a7d8b830a0f63af', 70, 0, 0)" memory>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-err-getitem-b4ef20c160bc0e632d685a6c35581650', 82, 0, 0)" released>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-err-getitem-56fe902ed1eff1d89793bc1e03b65b9b', 9, 0, 0)" released>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-sla-getitem-453ed0cf1aab33146a7d8b830a0f63af', 144, 0, 0)" memory>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-sla-getitem-ef09b564f2602108276c7193a05bef9d', 134, 0, 0)" processing>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-err-getitem-ac0f88df71851092cb63dc607d305fe7', 133, 0, 0)" released>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-vgos-getitem-29a065de9258b983a53e9570b918b0c6', 162, 0, 0)" released>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-vgos-getitem-bde6e2bbcfa88594415ca26d6bb0e3da', 96, 0, 0)" released>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-sla-getitem-80db18454a7927074b4b6a9dd7053b8a', 113, 0, 0)" memory>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-vgosa-getitem-6be14088263b14471b4e8ffb060a8929', 10, 0, 0)" erred>
tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 349, in balance
    duration, cost_multiplier)
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 274, in perhaps_move_task
    self.move_task_request(ts, sat, idl)
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 176, in move_task_request
    self.scheduler.worker_comms[victim.address].send(
KeyError: 'tcp://10.20.11.26:33793'

jhamman commented Feb 6, 2018

@rabernat et al. I just wanted to add that I've also been seeing this error, plus the one you mentioned in dask/distributed#1736, on Cheyenne.

mrocklin commented Feb 7, 2018

@jhamman do you run into this in the context of workers dying?

jhamman commented Feb 7, 2018

Yes, workers dying is generally what I was assuming was a byproduct of the two sets of errors I was getting. I should say, I've been spinning up a new workflow over the past few days and I hadn't yet gotten to the point of determining the source of my lack of success.

mrocklin commented Feb 7, 2018

Glad to hear you're trying new things. My apologies for the frustration. I think I've identified the source of the KeyErrors above. I would hope that they wouldn't halt execution, though we did see some odd behavior on the pangeo.pydata.org setup.

mrocklin commented Feb 7, 2018

Actually no, I haven't identified the source. If anyone can check the scheduler for the transition story of such a key I would be grateful.

cluster.scheduler.story(key_name)
# like 
cluster.scheduler.story("('zarr-vgosa-getitem-6be14088263b14471b4e8ffb060a8929', 10, 0, 0)")

jhamman commented Feb 7, 2018

Okay, so my workers do die, but I'm not convinced it's for exactly the same reasons as @rabernat's.

My notebook gets a traceback like this one:

distributed.utils - ERROR - ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122', 'tcp://10.148.9.198:42458')
Traceback (most recent call last):
  File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/utils.py", line 236, in f
    result[0] = yield make_coro()
  File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/client.py", line 1343, in _gather
    traceback)
  File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
distributed.scheduler.KilledWorker: ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122', 'tcp://10.148.9.198:42458')

with a reraise of:

---------------------------------------------------------------------------
KilledWorker                              Traceback (most recent call last)
<ipython-input-12-6b7b12cb4545> in <module>()
      1 test = plot_data[stat]['RCP 8.5']['bcsd']
      2 
----> 3 test.plot(robust=True, col='gcm')

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/plot/plot.py in __call__(self, **kwargs)
    316 
    317     def __call__(self, **kwargs):
--> 318         return plot(self._da, **kwargs)
    319 
    320     @functools.wraps(hist)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/plot/plot.py in plot(darray, row, col, col_wrap, ax, rtol, subplot_kws, **kwargs)
    148     kwargs['ax'] = ax
    149 
--> 150     return plotfunc(darray, **kwargs)
    151 
    152 

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/plot/plot.py in newplotfunc(darray, x, y, figsize, size, aspect, ax, row, col, col_wrap, xincrease, yincrease, add_colorbar, add_labels, vmin, vmax, cmap, center, robust, extend, levels, infer_intervals, colors, subplot_kws, cbar_ax, cbar_kwargs, **kwargs)
    498             allargs['plotfunc'] = globals()[plotfunc.__name__]
    499 
--> 500             return _easy_facetgrid(**allargs)
    501 
    502         plt = import_matplotlib_pyplot()

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/plot/plot.py in _easy_facetgrid(darray, plotfunc, x, y, row, col, col_wrap, sharex, sharey, aspect, size, subplot_kws, **kwargs)
     78                   sharex=sharex, sharey=sharey, figsize=figsize,
     79                   aspect=aspect, size=size, subplot_kws=subplot_kws)
---> 80     return g.map_dataarray(plotfunc, x, y, **kwargs)
     81 
     82 

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/plot/facetgrid.py in map_dataarray(self, func, x, y, **kwargs)
    223         """
    224         # These should be consistent with xarray.plot._plot2d
--> 225         cmap_kwargs = {'plot_data': self.data.values,
    226                        # MPL default
    227                        'levels': 7 if 'contour' in func.__name__ else None,

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/core/dataarray.py in values(self)
    409     def values(self):
    410         """The array's data as a numpy.ndarray"""
--> 411         return self.variable.values
    412 
    413     @values.setter

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/core/variable.py in values(self)
    390     def values(self):
    391         """The variable's data as a numpy.ndarray"""
--> 392         return _as_array_or_item(self._data)
    393 
    394     @values.setter

~/anaconda/envs/pangeo/lib/python3.6/site-packages/xarray/core/variable.py in _as_array_or_item(data)
    214     TODO: remove this (replace with np.asarray) once these issues are fixed
    215     """
--> 216     data = np.asarray(data)
    217     if data.ndim == 0:
    218         if data.dtype.kind == 'M':

~/anaconda/envs/pangeo/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    529 
    530     """
--> 531     return array(a, dtype, copy=False, order=order)
    532 
    533 

~/anaconda/envs/pangeo/lib/python3.6/site-packages/dask/array/core.py in __array__(self, dtype, **kwargs)
   1186 
   1187     def __array__(self, dtype=None, **kwargs):
-> 1188         x = self.compute()
   1189         if dtype and x.dtype != dtype:
   1190             x = x.astype(dtype)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/dask/base.py in compute(self, **kwargs)
    133         dask.base.compute
    134         """
--> 135         (result,) = compute(self, traverse=False, **kwargs)
    136         return result
    137 

~/anaconda/envs/pangeo/lib/python3.6/site-packages/dask/base.py in compute(*args, **kwargs)
    331     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
    332                     else (None, a) for a in args]
--> 333     results = get(dsk, keys, **kwargs)
    334     results_iter = iter(results)
    335     return tuple(a if f is None else f(next(results_iter), *a)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, **kwargs)
   2025                 secede()
   2026             try:
-> 2027                 results = self.gather(packed, asynchronous=asynchronous)
   2028             finally:
   2029                 for f in futures.values():

~/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/client.py in gather(self, futures, errors, maxsize, direct, asynchronous)
   1463             return self.sync(self._gather, futures, errors=errors,
   1464                              direct=direct, local_worker=local_worker,
-> 1465                              asynchronous=asynchronous)
   1466 
   1467     @gen.coroutine

~/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/client.py in sync(self, func, *args, **kwargs)
    601             return future
    602         else:
--> 603             return sync(self.loop, func, *args, **kwargs)
    604 
    605     def __repr__(self):

~/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    250             e.wait(1000000)
    251     if error[0]:
--> 252         six.reraise(*error[0])
    253     else:
    254         return result[0]

~/anaconda/envs/pangeo/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

~/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/utils.py in f()
    234             yield gen.moment
    235             thread_state.asynchronous = True
--> 236             result[0] = yield make_coro()
    237         except Exception as exc:
    238             logger.exception(exc)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

~/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

~/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

~/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/client.py in _gather(self, futures, errors, direct, local_worker)
   1341                             six.reraise(type(exception),
   1342                                         exception,
-> 1343                                         traceback)
   1344                     if errors == 'skip':
   1345                         bad_keys.add(key)

~/anaconda/envs/pangeo/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

KilledWorker: ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122', 'tcp://10.148.9.198:42458')

Looking at a worker log, I see:

distributed.worker - INFO - Start worker at: tcp://10.148.9.198:39212

distributed.worker - INFO - Listening to: tcp://10.148.9.198:39212

distributed.worker - INFO - nanny at: 10.148.9.198:40642

distributed.worker - INFO - bokeh at: 10.148.9.198:42492

distributed.worker - INFO - Waiting to connect to: tcp://10.148.7.111:60879

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Threads: 4

distributed.worker - INFO - Memory: 6.00 GB

distributed.worker - INFO - Local Directory: /glade/p/work/jhamman/loca/notebooks/dask-worker-space/worker-d9vnmevh

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Registered to: tcp://10.148.7.111:60879

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 221, in write yield future File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info tornado.iostream.StreamClosedError: Stream is closed During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/worker.py", line 533, in get_data compressed = yield comm.write(msg) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 225, in write convert_stream_closed_error(self, e) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc)) distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - failed during get data Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 221, in write yield future File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info tornado.iostream.StreamClosedError: Stream is closed During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/worker.py", line 533, in get_data compressed = yield comm.write(msg) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 225, in write convert_stream_closed_error(self, e) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc)) distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.148.9.196:46445 Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 174, in read n_frames = yield stream.read_bytes(8) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info tornado.iostream.StreamClosedError: Stream is closed During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/worker.py", line 1754, in gather_dep who=self.address) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/core.py", line 518, in send_recv_from_rpc result = yield send_recv(comm=comm, op=key, **kwargs) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/core.py", line 352, in send_recv response = yield comm.read() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 188, in read convert_stream_closed_error(self, e) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc)) distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Connection to scheduler broken. Reregistering

distributed.worker - INFO - -------------------------------------------------

distributed.worker - INFO - Comm closed

distributed.worker - INFO - Comm closed

distributed.worker - INFO - Registered to: tcp://10.148.7.111:60879

distributed.worker - INFO - -------------------------------------------------

distributed.worker - ERROR - failed during get data Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 221, in write yield future File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info tornado.iostream.StreamClosedError: Stream is closed During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/worker.py", line 533, in get_data compressed = yield comm.write(msg) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 225, in write convert_stream_closed_error(self, e) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc)) distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - ERROR - Worker stream died during communication: tcp://10.148.9.196:51572 Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 174, in read n_frames = yield stream.read_bytes(8) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info tornado.iostream.StreamClosedError: Stream is closed During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/worker.py", line 1754, in gather_dep who=self.address) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/core.py", line 518, in send_recv_from_rpc result = yield send_recv(comm=comm, op=key, **kwargs) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/core.py", line 352, in send_recv response = yield comm.read() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 188, in read convert_stream_closed_error(self, e) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 124, in convert_stream_closed_error raise CommClosedError("in %s: %s: %s" % (obj, exc.__class__.__name__, exc)) distributed.comm.core.CommClosedError: in <closed TCP>: ConnectionResetError: [Errno 104] Connection reset by peer

distributed.worker - INFO - Can't find dependencies for key ('getitem-mean_agg-aggregate-mean_chunk-stack-transpose-nanmin-82cc2d7ffbd2a676cbd37ef152971123', 1, 4, 0, 0)

distributed.worker - ERROR - Worker stream died during communication: tcp://10.148.9.196:40104 Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 182, in read frame = yield stream.read_bytes(length) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info tornado.iostream.StreamClosedError: Stream is closed During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/worker.py", line 1754, in gather_dep who=self.address) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/core.py", line 518, in send_recv_from_rpc result = yield send_recv(comm=comm, op=key, **kwargs) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/core.py", line 352, in send_recv response = yield comm.read() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run value = future.result() File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result raise_exc_info(self._exc_info) File "<string>", line 4, in raise_exc_info File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run yielded = self.gen.throw(*exc_info) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 188, in read convert_stream_closed_error(self, e) File "/glade/u/home/jhamman/anaconda/envs/pangeo/lib/python3.6/site-packages/distributed/comm/tcp.py", line 126, in convert_stream_closed_error raise CommClosedError("in %s: %s" % (obj, exc)) distributed.comm.core.CommClosedError: in <closed TCP>: Stream is closed

distributed.worker - INFO - Can't find dependencies for key ('getitem-mean_agg-aggregate-mean_chunk-stack-transpose-nanmin-b462b34ffe29b17359b6e287ce5173c1', 1, 5, 0, 0)

distributed.worker - INFO - Dependent not found: ('getitem-a00d7597b60afe6cb9c6c04adfd2e13a', 1, 19, 0, 0) 0 . Asking scheduler

distributed.worker - INFO - Dependent not found: ('getitem-82287b2ccc50e670f2dd9ce85333933a', 1, 16, 0, 0) 0 . Asking scheduler

Finally,

cluster.scheduler.story(('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122'))
[('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'processing',
  'released',
  OrderedDict([('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
                'waiting')]),
  1517975493.3357968),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'released',
  'waiting',
  OrderedDict([('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
                'processing')]),
  1517975493.3358078),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'waiting',
  'processing',
  {},
  1517975493.3358288),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'processing',
  'released',
  OrderedDict([('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
                'waiting')]),
  1517975494.0983822),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'released',
  'waiting',
  OrderedDict([('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
                'processing')]),
  1517975494.0983922),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'waiting',
  'processing',
  {},
  1517975494.098415),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'processing',
  'released',
  OrderedDict([('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
                'waiting')]),
  1517975633.8895779),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'released',
  'waiting',
  OrderedDict([('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
                'processing')]),
  1517975633.8895876),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'waiting',
  'processing',
  {},
  1517975633.8896098),
 ('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122',
  'processing',
  'erred',
  {"('open_dataset-concatenate-fc4e88c5e6d82a68d04406b24582ceca', 25, 0, 0)": 'erred'},
  1517975633.943889)]

Sorry for the length of this post, but I wanted to be clear about what I was seeing.

mrocklin commented Feb 7, 2018

I agree that this seems to be different. Your workers seem to be dying. The most common reason we've seen for this recently is overuse of memory. Is this happening in your situation?
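
If memory is the suspect, here is a small diagnostic sketch (not from the thread) for checking resident memory on each worker from the notebook, assuming a connected distributed `client`:

```python
# Rough per-worker memory check (sketch; psutil ships as a dependency of distributed).
import psutil

def rss_gb():
    return psutil.Process().memory_info().rss / 1e9

print(client.run(rss_gb))  # dict of worker address -> resident memory in GB
```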

mrocklin commented Feb 7, 2018

Also, long posts are fine in my book. If you really wanted to make things clean, you could optionally use the <details> tag, like so:

<details>

```
some long traceback
```

</details>

Rendered below

some long traceback

mrocklin commented Feb 7, 2018

I'm just reporting some things here on that notebook as I see them:

This cell oddly causes some computation. Two zarr tasks are fired off as a result of running it:

ds_mm = ds.groupby('time.month').mean(dim='time')
ds_mm

mrocklin commented Feb 7, 2018

It currently takes 35 seconds to open this zarr dataset:

%%prun
import gcsfs
#gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
ds = xr.open_zarr(gcsmap)
ds
         21265779 function calls (21260090 primitive calls) in 35.502 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    12059   20.756    0.002   20.756    0.002 {built-in method _openssl.SSL_read}
      100    7.389    0.074   30.072    0.301 core.py:435(ls)
 19540431    3.778    0.000    3.778    0.000 {method 'startswith' of 'str' objects}
      153    0.664    0.004    0.676    0.004 decoder.py:345(raw_decode)
      185    0.372    0.002    0.372    0.002 {method 'read' of '_io.FileIO' objects}
        9    0.186    0.021    0.220    0.024 core.py:131(<listcomp>)
      100    0.111    0.001   18.760    0.188 core.py:374(_list_bucket)
     1513    0.076    0.000    0.076    0.000 {built-in method posix.stat}
    12059    0.068    0.000   20.908    0.002 SSL.py:1598(recv_into)
      5/4    0.064    0.013    0.068    0.017 {built-in method _imp.create_dynamic}
        1    0.061    0.061    0.081    0.081 core.py:471(<listcomp>)
     8947    0.058    0.000    1.003    0.000 response.py:346(read)
    53836    0.053    0.000    0.144    0.000 os.py:664(__getitem__)
     1507    0.050    0.000    1.080    0.001 {method 'join' of 'bytes' objects}
    56741    0.049    0.000    0.248    0.000 _collections_abc.py:742(__iter__)
    12217    0.045    0.000    0.072    0.000 SSL.py:1417(_raise_ssl_error)
      101    0.040    0.000    0.040    0.000 models.py:830(text)
        1    0.036    0.036    0.036    0.036 {built-in method _openssl.SSL_CTX_load_verify_locations}
   105913    0.035    0.000    0.035    0.000 {method 'decode' of 'bytes' objects}
        4    0.033    0.008    0.033    0.008 {built-in method _socket.getaddrinfo}
   105108    0.032    0.000    0.066    0.000 os.py:746(decode)
      157    0.032    0.000    0.032    0.000 {built-in method _openssl.SSL_write}
    53836    0.031    0.000    0.057    0.000 os.py:742(encode)
      318    0.031    0.000    0.274    0.001 request.py:2458(getproxies_environment)
     8949    0.029    0.000    0.814    0.000 client.py:438(read)
    12063    0.028    0.000   20.990    0.002 socket.py:572(readinto)
    56995    0.026    0.000    0.026    0.000 core.py:131(<genexpr>)
    52871    0.025    0.000    0.057    0.000 os.py:687(__iter__)
86547/85977    0.024    0.000    0.044    0.000 {built-in method builtins.isinstance}

mrocklin commented Feb 7, 2018

Yeah, other small operations are triggering computation as well:

averaged = ds - ds_mm_p

My first guess is that some object doesn't have a dtype or something, and so we're computing the dtype in a small task?
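
As a generic illustration of that guess (plain dask, not xarray's actual code path): when an output dtype is unknown, dask infers it by calling the function on a tiny sample array, which is exactly the kind of small extra computation that can fire off before anything is explicitly computed.

```python
# Generic dtype-inference illustration (not xarray's code path): without dtype=,
# dask runs the function on a small sample array to figure out the output dtype.
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))

known = da.map_blocks(lambda b: b * 2, x, dtype=x.dtype)  # dtype given: fully lazy
inferred = da.map_blocks(lambda b: b * 2, x)              # dtype inferred from a sample call
print(known.dtype, inferred.dtype)
```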

mrocklin commented Feb 7, 2018

It looks like even after finishing ds_mm_p there is still a lot of data held in memory on the workers. It's not clear to me where this is coming from. The result we're holding in the workers is relatively small. Are there any caches in zarr or xarray that could account for this?
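
One way to poke at this (a diagnostic sketch, not from the thread) is to ask each worker what it is still holding, assuming a connected `client`:

```python
# Diagnostic sketch: how many results each worker still stores, and which keys
# remain in cluster memory.
from distributed import get_worker

def n_held():
    return len(get_worker().data)  # results kept in this worker's data store

print(client.run(n_held))  # worker address -> number of stored results
print(client.who_has())    # key -> workers holding it
```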

mrocklin commented Feb 7, 2018

I was surprised to find this thing in a task graph:

>>> dep
<Task 'zarr-adt-0f90b3f56f247f966e5ef01277f31374' memory>
>>> Future(dep.key).result()
ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=<xarray.backends.zarr.ZarrArrayWrapper object at 0x7fa921fec278>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))

>>> len(dep.dependents)
1781

Ideally we wouldn't need results like this, and we could instead just have our leaf tasks generate numpy arrays directly.
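
To make that distinction concrete, here is a toy contrast (not the real xarray/zarr graph) between a single shared leaf task that returns a wrapper-like object, which every chunk task then indexes into, and leaf tasks that each read their own chunk and return a numpy array directly:

```python
# Toy contrast (not the actual backend graph) between a shared-wrapper leaf and
# leaves that produce numpy chunks directly.
import numpy as np
import dask

source = np.arange(40).reshape(4, 10)  # stand-in for an on-disk zarr array

# Style 1: one leaf task returns the wrapper; every getitem task depends on it,
# so the wrapper object has to move around the cluster (1781 dependents above).
wrapper = dask.delayed(lambda: source)()
chunks_shared = [dask.delayed(lambda w, i: np.asarray(w[i]))(wrapper, i) for i in range(4)]

# Style 2: each leaf task slices its own chunk and yields a numpy array directly.
chunks_direct = [dask.delayed(lambda i: source[i].copy())(i) for i in range(4)]

print(dask.compute(chunks_shared, chunks_direct))
```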

mrocklin commented Feb 7, 2018

For whatever reason, moving these things around is taking tens of seconds.

jhamman commented Feb 7, 2018

> I was surprised to find this thing in a task graph...

@mrocklin - if you think this is a serious problem, let's open an xarray issue on the subject. These array wrappers are fairly new and I imagine some of the xarray devs will have more to say on the subject than @rabernat and I do.

mrocklin commented Feb 7, 2018

pydata/xarray#1895

rabernat commented Feb 7, 2018

Yes! I am convinced that this is the root of many of our problems. Most xarray backends (zarr and netCDF) wrap the raw numpy arrays in layers of "adapter" classes. We are trying to factor this out of xarray.

mrocklin commented Feb 7, 2018

I'm not at all confident that these are at fault, but there are some difficult-to-explain things happening around them.

jhamman commented Feb 13, 2018

@mrocklin - I'm sorry if I missed it, but was there a solution to the following stealing issues?

tornado.application - ERROR - Exception in callback <bound method WorkStealing.balance of <distributed.stealing.WorkStealing object at 0x7f0747afe780>>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tornado/ioloop.py", line 1026, in _run
    return self.callback()
  File "/opt/conda/lib/python3.6/site-packages/distributed/stealing.py", line 332, in balance
    stealable.remove(ts)
KeyError: <Task "('zarr-sla-getitem-453ed0cf1aab33146a7d8b830a0f63af', 70, 0, 0)" memory>

@mrocklin

Not yet. Are these actually affecting your work, or are they just noisy?

jhamman commented Feb 13, 2018

Yes and no. Computations continue to run, but for long-running notebook applications the errors tend to pile up and make the notebook pretty clumsy to navigate.
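
As a possible stopgap (a sketch not suggested in the thread), those repeated tracebacks come through standard Python loggers, so they can be quieted in the notebook until the underlying stealing bug is fixed:

```python
# Possible stopgap sketch (not from the thread): silence the loggers that emit the
# repeated work-stealing tracebacks so they stop flooding the notebook output.
import logging

logging.getLogger("tornado.application").setLevel(logging.CRITICAL)
logging.getLogger("distributed.stealing").setLevel(logging.CRITICAL)
```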

mrocklin commented Feb 13, 2018 via email


stale bot commented Jun 25, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 25, 2018

stale bot commented Jul 2, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

stale bot closed this as completed Jul 2, 2018