xarray groupby monthly mean fail case #99
Comments
Here is the error that shows up in the notebook
|
@rabernat et al. I just wanted to add that I've also been seeing this error, plus the one you mentioned in dask/distributed#1736, on Cheyenne as well. |
@jhamman do you run into this in the context of workers dying? |
Yes, I had generally assumed the dying workers were a byproduct of these two sets of errors. I should say, I've been spinning up a new workflow over the past few days and hadn't yet gotten to the point of pinning down why it wasn't succeeding. |
Glad to hear you're trying new things, and my apologies for the frustration. I think I've identified the source of the KeyErrors above; I would hope they wouldn't halt execution, though we did see some odd behavior on the pangeo.pydata.org setup. |
Actually no, I haven't identified the source. If anyone can check the scheduler for the transition story of such a key I would be grateful:

```python
cluster.scheduler.story(key_name)
# like
cluster.scheduler.story("('zarr-vgosa-getitem-6be14088263b14471b4e8ffb060a8929', 10, 0, 0)")
```
|
Okay, so my workers do die, but I'm not convinced it's for exactly the same reasons as @rabernat's. My notebook gets a traceback like this one:
with a reraise of:
Looking at a worker log, I see:
Finally, here is the scheduler story call for one of the keys:

```python
cluster.scheduler.story(('open_dataset-9fd2c1a157b50dbacfd6eb5462d93a4ftotal runoff-106687d7775cb8f7409fcc2c9aeb8122'))
```
Sorry for the length of this post but I wanted to be clear about what I was seeing. |
I agree that this seems to be different. Your workers seem to be dying. The most common reason we've seen for this recently is over-use of memory. Is this happening in your situation? |
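As a hedged aside (my sketch, not something posted in the thread): one rough way to check whether workers are running short on memory is to ask each worker process for its resident memory, assuming a `dask.distributed` Client connected to the cluster:

```python
# Sketch only: `client` is assumed to be connected to the same cluster.
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical address

def rss_bytes():
    import psutil  # psutil is a dependency of distributed's workers
    return psutil.Process().memory_info().rss

# Returns a dict mapping each worker address to its resident memory in bytes
print(client.run(rss_bytes))
```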
Also, long posts are fine in my book. If you really wanted to make things clean you could optionally use the
Rendered below
|
I'm just reporting some things here on that notebook as I see them. This line oddly causes some computation:

```python
ds_mm = ds.groupby('time.month').mean(dim='time')
ds_mm
```
|
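For reference, a minimal sketch with synthetic data (not the pangeo dataset) of the behaviour one would expect here: the groupby/mean line should only build a lazy dask graph, and real computation should only start when explicitly requested:

```python
# Sketch only: synthetic stand-in for the zarr dataset used in the notebook.
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", periods=365)
ds = xr.Dataset(
    {"sla": (("time", "lat", "lon"), np.random.rand(365, 4, 8))},
    coords={"time": time},
).chunk({"time": 30})

ds_mm = ds.groupby("time.month").mean(dim="time")  # expected: lazy, graph only
monthly = ds_mm.compute()                          # computation happens here
```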
It currently takes 35 seconds to open this zarr dataset:

```python
%%prun
import gcsfs
#gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3')
gcsmap = gcsfs.mapping.GCSMap('pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt')
ds = xr.open_zarr(gcsmap)
ds
```
|
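A small sketch (my addition, not from the thread) for checking where that time goes by timing just the open step, assuming the same bucket path as above:

```python
# Sketch only: times the metadata/open step separately from any data reads.
import time
import gcsfs
import xarray as xr

gcsmap = gcsfs.mapping.GCSMap(
    'pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt'
)

t0 = time.perf_counter()
ds = xr.open_zarr(gcsmap)
print(f"open_zarr took {time.perf_counter() - t0:.1f} s")
```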
Yeah, other small operations are triggering computation
My first guess is that some object doesn't have a dtype or something and so we're computing the dtype on a small task? |
It looks like even after finishing |
I was surprised to find this thing in a task graph:

```python
>>> dep
<Task 'zarr-adt-0f90b3f56f247f966e5ef01277f31374' memory>
>>> Future(dep.key).result()
ImplicitToExplicitIndexingAdapter(array=LazilyIndexedArray(array=<xarray.backends.zarr.ZarrArrayWrapper object at 0x7fa921fec278>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))))
>>> len(dep.dependents)
1781
```

Ideally we wouldn't need results like this; we could instead just have our leaf tasks generate numpy arrays directly.
|
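A hedged sketch of how one might look for such objects from the client side, given an already opened `ds` as above ('adt' is taken from the task key in the comment; the non-tuple filter is only a heuristic, and the attribute names are xarray internals that may differ between versions):

```python
# Sketch only: find graph entries that are plain objects rather than
# (function, args) task tuples, then peel back their wrapper layers.
graph = dict(ds['adt'].data.__dask_graph__())
leaves = [v for v in graph.values() if not isinstance(v, tuple)]
for leaf in leaves[:1]:
    layer = leaf
    while hasattr(layer, 'array'):       # xarray's lazy wrappers nest via .array
        print(type(layer).__name__)
        layer = layer.array
    print(type(layer).__name__)          # innermost backend array wrapper
```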
For whatever reason, moving these things around is taking tens of seconds. |
Yes! I am convinced that this is the root of many of our problems. Most xarray backends (zarr and netCDF) wrap the raw numpy arrays in layers of "adapter" classes. We are trying to factor this out of xarray. |
I'm not at all confident that these are at fault, but there are some difficult-to-explain things happening around them. |
@mrocklin - I'm sorry if I missed it, but was there a solution to the following work-stealing issues?
|
Not yet? Are these actually affecting your work or are they just noisy? |
Yes and no. Computations continue to run, but for long-running notebook applications they tend to pile up and make the notebook pretty clumsy to navigate. |
OK, I'll add it back onto my todo list. Last time I checked it out I wasn't able to determine what was triggering that issue. I'll give it another shot, hopefully sometime today.
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
I made a notebook for pangeo.pydata.org that reproduces part of the dask / xarray groupby fail case discussed in pydata/xarray#1832.
https://gist.github.com/rabernat/7fe92f2a41dbfe651493d6864e46031a
Would be great if anyone (e.g. @mrocklin) wants to have a look at this and try to debug...
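For readers who don't open the gist, a rough sketch of the core steps such a notebook contains (the cluster address is a placeholder; the real notebook is the gist linked above):

```python
# Sketch only: approximate outline of the fail case, not the actual notebook.
import gcsfs
import xarray as xr
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # hypothetical pangeo cluster

gcsmap = gcsfs.mapping.GCSMap(
    'pangeo-data/dataset-duacs-rep-global-merged-allsat-phy-l4-v3-alt'
)
ds = xr.open_zarr(gcsmap)

# Monthly climatology over the full record; this is the step that fails
ds_mm = ds.groupby('time.month').mean(dim='time')
ds_mm.compute()
```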