open_mfdataset() memory error in v0.10 #1745

Closed
nick-weber opened this issue Nov 28, 2017 · 24 comments
@nick-weber

Code Sample

import xarray

ncfiles = '/example/path/to/wrf/netcdfs/*'
dropvars = ['list', 'of', 'many', 'vars', 'to', 'drop']

dset = xarray.open_mfdataset(ncfiles, drop_variables=dropvars, concat_dim='Time',
                             autoclose=True, decode_cf=False)

Problem description

I am trying to load 73 model (WRF) output files using open_mfdataset(), concatenating them along a new 'Time' dimension. Each netCDF file has dimensions {'x': 405, 'y': 282, 'z': 37} and roughly 20 variables (excluding the other ~20 in dropvars).

When I run the above code with v0.9.6, it completes in roughly 7 seconds. But with v0.10, it crashes with the following error:

*** Error in `~/anaconda3/bin/python': corrupted size vs. prev_size: 0x0000560e9b6ca7b0 ***

which, as I understand it, means I'm exceeding my memory allocation. Any thoughts on what could be the source of this issue?

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

xarray: 0.10.0
pandas: 0.20.3
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.2.4
h5netcdf: 0.5.0
Nio: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
matplotlib: 2.0.2
cartopy: None
seaborn: 0.8.0
setuptools: 27.2.0
pip: 9.0.1
conda: 4.3.29
pytest: 3.1.3
IPython: 6.1.0
sphinx: 1.6.2

@shoyer
Member

shoyer commented Nov 29, 2017

I think this was introduced by #1551, where we started loading coordinates that are compared for equality into memory. This speeds up open_mfdataset, but does increase memory usage.

We might consider adding an option for reduced memory usage at the price of speed. @crusaderky @jhamman @rabernat any thoughts?

@crusaderky
Contributor

crusaderky commented Nov 29, 2017

It sounds weird. Even if all 20 variables he's dropping were coords on the longest dim, and the code was loading them into memory and then dropping them (that would be wrong, but I haven't checked the code yet to verify whether that's the case), we're talking about... 405*20*73 ≈ 590k points? That's about 5 MB of RAM if they're float64.

@njweber2 how large are these files? Is it feasible to upload them somewhere? If not, could you write a script that generates equivalent dummy data and reproduce the problem with that?
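
For reference, a dummy-data script along those lines might look something like this (the variable names, file names, and float32 dtype are illustrative; the dimension sizes match the ones quoted above, so the full set of files takes tens of GB of disk):

import numpy as np
import xarray as xr

nx, ny, nz, nfiles, nvars = 405, 282, 37, 73, 20

for i in range(nfiles):
    data_vars = {
        'var%02d' % k: (('z', 'y', 'x'),
                        np.random.rand(nz, ny, nx).astype('float32'))
        for k in range(nvars)
    }
    xr.Dataset(data_vars).to_netcdf('dummy_%03d.nc' % i)

# then try to reproduce with:
# dset = xr.open_mfdataset('dummy_*.nc', concat_dim='Time',
#                          autoclose=True, decode_cf=False)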

@shoyer
Member

shoyer commented Nov 29, 2017

(405*282*37)*20*8 bytes = 676 MB, so running out of memory here seems plausible to me.
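
A quick back-of-the-envelope check of that figure:

nx, ny, nz = 405, 282, 37      # per-file grid size
nvars, nbytes = 20, 8          # ~20 variables, float64 values
print(nx * ny * nz * nvars * nbytes / 1e6)  # ~676 MB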

@crusaderky
Contributor

Only if the coords are three-dimensional...

@nick-weber
Author

Thank you for the responses.

Turns out my eyeball estimation of the dropped/kept variables was way off. My dropvars list is actually 88 variables and I am keeping 58 variables. Most of these have dimensions (time, y, x) and many are full-dimensional (time, z, y, x).

The size of one netcdf file (which only contains one time step) is ~335 MB. You can look at one of these files here. It's a hefty dataset overall.
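
Should anyone want to double-check the per-file variable count and dimension sizes, something like this works (the filename below is just a placeholder):

import xarray as xr

ds = xr.open_dataset('wrfout_single_time.nc', decode_cf=False)
print(len(ds.data_vars))   # number of data variables
print(dict(ds.dims))       # dimension sizes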

@braaannigan
Contributor

I'm getting a similar error. The file size is very small (a few KB), so I don't think it's the size issue above. Instead, the error I get is due to something strange happening in core.utils.is_remote_uri(path). The error occurs when I'm reading netCDF3 files with the default netCDF4 engine (which should, of course, be able to handle netCDF3).
There is a workaround: I can use the scipy reader to read the netCDF3 files with no problems. Note that whenever I refer to the "error" below, I mean the error that produces the following output rather than a Python exception.

The error message is:
*** Error in `/path/anaconda2/envs/base3/bin/python': corrupted size vs. prev_size: 0x0000000001814930 ***
Aborted (core dumped)

The function where the problem arises is:

def is_remote_uri(path):
    return bool(re.search('^https?\://', path))

The function is called a few times during open_dataset (or open_mfdataset; I get the same error with both). On the third or fourth call it triggers the error. As I'm not using remote datasets, I can hard-code the output of the function to be

return False

and then the file reads with no problems.

The is_remote_uri(path) call is made a few times. However, it's only on line 233 of netCDF4_.py with is_remote_uri(self._filename) that the error is triggered.

I've printed the argument to the is_remote_uri() function each time it's called. In the first call the argument is the filename; in the second call it is the filename with the absolute path; and in the third (and fatal) call it is also the filename with the absolute path.

I can't see any difference between the arguments to the function on the second and third calls. When I copy them, assign them to variables, and check equality in Python, it evaluates to True.

I've added in a simpler call to re.search in the function:

def is_remote_uri(path):
    print((re.search('.nc','.nc')))
    return bool(re.search('^https?\://', path))

This also triggers the error on the third call to the function. As such we can rule out something to do with the path name.

I've played around with the print((re.search('.nc','.nc'))) line that I added. It only triggers an error on the third call when the first argument of re.search has a dot in the string: re.search('.nc','.nc') causes the error, but re.search('nc','.nc') doesn't. The error isn't dependent on .nc in any way; '.AAA' in the arguments will cause the same error. The error doesn't replicate if I simply import re in IPython.

The error does not occur in xarray 0.9.6. The same function is called in a similar way and the function evaluates to False each time.

I'm not really sure what to do next, though. The obvious workaround is to set engine='scipy' if you're working with netcdf3 files.
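
For example (assuming a netCDF3 file named grid.nc):

import xarray as xr

ds = xr.open_dataset('grid.nc', engine='scipy')  # the scipy backend reads netCDF3 files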

Can anyone replicate this error?

@braaannigan
Contributor

I've played around with it a bit more. It seems like it's the ^ character in the re.search term that's causing the issue. If this is removed and the function is simply:

def is_remote_uri(path):
    return bool(re.search('https?\://', path))

then I can load the file.

@shoyer
Member

shoyer commented Dec 13, 2017

@braaannigan Can you share the name of your problematic file?

One possibility is that re.search() is not thread-safe, even though I don't think we call is_remote_uri from multiple threads. We can test that by adding a lock, and seeing if that resolves the issue. Try replacing is_remote_uri with:

import threading

LOCK = threading.Lock()

def is_remote_uri(path):
    with LOCK:
        return bool(re.search('^https?\://', path))

@braaannigan
Contributor

Hi @shoyer, thanks for getting back to me.

That hasn't worked, unfortunately. The only difference the with LOCK statement makes is that the file load seems to work, but then the core dump happens when you try to access the object, e.g. with the ds line below:

import xarray as xr
ds = xr.open_dataset('grid.nc')
ds

As above, removing the ^ avoids the crash when the with LOCK statement is used.

@braaannigan
Contributor

braaannigan commented Dec 14, 2017

There is also some filename dependence. The file load works for g.nc, gr.nc, and gri.nc, and then fails for grid.nc. The file load also works for grida.nc.

@shoyer
Member

shoyer commented Dec 14, 2017

@braaannigan what about replacing re.search('^https?\://', path) with re.match('https?\://', path)? Can you share the output of running python -c 'import sys; print(sys.getfilesystemencoding())' at the command line? Also, please try engine='scipy' or engine='h5netcdf' with open_dataset. The output of xarray.show_versions() would also be helpful.

@braaannigan
Contributor

Hi @shoyer

The crash does not occur when the ^ is removed.

When I run python -c 'import sys; print(sys.getfilesystemencoding())', the output is: utf-8

The file loads with the scipy engine. I get a module import error with h5netcdf, even though conda list shows that I have version 0.5 installed.

xr.show_versions() gives:
INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-101-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.0
pandas: 0.21.0
numpy: 1.13.1
scipy: 0.19.1
netCDF4: 1.2.4
h5netcdf: None
Nio: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.16.0
matplotlib: 2.0.2
cartopy: None
seaborn: 0.7.1
setuptools: 36.7.1
pip: 9.0.1
conda: None
pytest: 3.2.1
IPython: 6.2.1
sphinx: None

@braaannigan
Contributor

If the ^ isn't strictly necessary, I'm happy to put together a PR with it removed.

@shoyer
Member

shoyer commented Dec 14, 2017

re.match(pattern, string) is equivalent to re.search('^' + pattern, string), so arguably this is a cleaner solution anyway. But ideally I'd like to understand why this is a problem for you, so we can fix the underlying cause and not do it again.
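
For concreteness, a quick illustration of that equivalence:

import re

pattern = 'https?://'
for s in ('http://example.com/data.nc', 'grid.nc'):
    assert bool(re.match(pattern, s)) == bool(re.search('^' + pattern, s))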

@shoyer
Member

shoyer commented Dec 14, 2017

@braaannigan can you try adding print(repr(path)) to is_remote_uri() so we can see exactly what these offending strings look like?

@braaannigan
Contributor

With print(repr(path)) I get:
'grid.nc'
'/path/verification/cabl/y2d/mnc_test_0008_1day_restoring/grid.nc'
'/path/verification/cabl/y2d/mnc_test_0008_1day_restoring/grid.nc'
where I've changed the first part of the filename to "/path/".

@braaannigan
Contributor

I've also now tried out the re.match approach you suggested above, but it generates the same core dump as the re.search('^...') approach.

@shoyer
Member

shoyer commented Dec 14, 2017

I think there is probably a bug buried inside the netCDF4.Dataset.filepath() method somewhere. For example, on netCDF4-python 1.2.4, this would crash if you have any non-ASCII characters in the path. But that doesn't seem to be the issue here.

@braaannigan
Contributor

Interesting. I've tried to look at this a bit more by running the following in netCDF4_.py:

self._filename = self.ds.filepath()
print(self.ds)
self.is_remote = is_remote_uri(self._filename)

So, all I did was add a print statement print(self.ds).

In this case the open_dataset call worked fine.

@shoyer
Member

shoyer commented Dec 14, 2017

Can you reproduce this just using netCDF4-python?

Try:

import netCDF4
ds = netCDF4.Dataset(path)
# print(ds)
print(ds.filepath())

If so, it would be good to file a bug upstream.

Actually, it looks like this might be Unidata/netcdf4-python#506

@braaannigan
Contributor

Hi @shoyer
I've tried the print(ds.filepath()) suggestion and it reproduces the crash when I use the full-length file path, which has 88 characters.
Again, the segfault doesn't arise if I add or subtract a character from the file path (after copying the underlying file to a new name).

This dependence on 88 characters is consistent with the bug here:
Unidata/netcdf4-python#585

@shoyer
Member

shoyer commented Dec 16, 2017

If upgrading to a newer version of netcdf4-python isn't an option, we might need to figure out a workaround for xarray...

It seems that anaconda is still distributing netCDF4 1.2.4, which doesn't help here.

@braaannigan
Contributor

Hi @shoyer

Updating netcdf4 to version 1.3.1 solves the problem. I'm trying to think what the potential solutions are. Essentially, we would need to modify the function ds.filepath(). However, this isn't possible inside xarray.

Is there anything we can do other than add a warning message with the recommendation to upgrade netcdf4 when the file path has 88 characters and netcdf4 is version 1.2.4?
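
For concreteness, a guard along those lines might look something like this (the helper name and exact message are only illustrative, not actual xarray code):

import warnings
from distutils.version import LooseVersion
import netCDF4

def _maybe_warn_about_filepath_bug(path):
    # Dataset.filepath() in netCDF4-python <= 1.2.4 can segfault for certain
    # path lengths (Unidata/netcdf4-python#585), so warn instead of crashing.
    if (LooseVersion(netCDF4.__version__) <= LooseVersion('1.2.4')
            and len(path) == 88):
        warnings.warn('netCDF4-python <= 1.2.4 may crash in Dataset.filepath() '
                      'for paths of this length; consider upgrading to 1.3.1+.')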

Should we also submit an issue to anaconda to get the default package updated?

Happy to prepare these if you think it's the best way to proceed.

Liam

@shoyer
Member

shoyer commented Jan 9, 2018

Both the warning message and the upstream anaconda issue seem like good ideas to me.
