open_mfdataset memory leak, very simple case. v0.12 #3200
Comments
Hi, xarray doesn't have any global objects that I know of that can cause the leak - I'm willing to bet on the underlying libraries.
Oh, but first and foremost: CPython's memory management is designed so that, when PyMem_Free() is invoked, CPython holds on to the freed memory and does not call the underlying free(), hoping to reuse it on the next PyMem_Alloc().
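A minimal sketch of that effect (assuming psutil is installed; the exact numbers depend on platform and allocator):

import gc
import psutil

proc = psutil.Process()

def rss_mib():
    # resident set size of this process, in MiB
    return proc.memory_info().rss / 2**20

print('baseline      :', rss_mib())
blobs = [bytearray(1024) for _ in range(500_000)]  # roughly 0.5 GiB of buffers
print('after allocate:', rss_mib())
del blobs
gc.collect()
# RSS usually does not drop all the way back to the baseline: the interpreter
# and the C allocator keep freed memory around for reuse.
print('after delete  :', rss_mib())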
Thanks for the profiling script. I ran a few permutations of this and plotted the memory usage over the run; the plots are not reproduced here.
So in conclusion, it looks like there are two memory leaks: (1) in the netCDF4 backend (engine='netcdf4'), and (2) a smaller one in xarray's open_mfdataset itself.
(1) looks like by far the bigger issue, which you can work around by switching to scipy or h5netcdf to read your files. (2) is an issue for xarray. We do do some caching, specifically with our backend file manager, but the issue only seems to appear when using open_mfdataset. Note: I modified your script to set xarray's file cache size to 1, which helps smooth out the memory usage:
import glob

import numpy as np
import xarray as xr
from memory_profiler import profile  # provides the @profile decorator


def CreateTestFiles():
    # create a bunch of small netCDF files, one per time step
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)
    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]], dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfile_{}.nc'.format(i))


@profile
def ReadFiles():
    # for i in range(100):
    #     ds = xr.open_dataset('testfile_{}.nc'.format(i), engine='netcdf4')
    #     ds.close()
    ds = xr.open_mfdataset(glob.glob('testfile_*'), engine='h5netcdf', concat_dim='time')
    ds.close()


if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()
    xr.set_options(file_cache_maxsize=1)
    # loop through the file read step
    for i in range(100):
        ReadFiles()
Also, if you're having memory issues I would definitely recommend upgrading to a newer version of xarray. There was a recent fix that helps ensure that files get automatically closed when they are garbage collected, even if you don't call .close() explicitly.
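A minimal sketch of a pattern that makes the closing deterministic instead of relying on garbage collection (reusing the test files and options from the script above):

import glob
import xarray as xr

# The dataset (and its underlying file handles) is closed as soon as the
# `with` block exits, even if an exception is raised inside it.
with xr.open_mfdataset(glob.glob('testfile_*'), concat_dim='time') as ds:
    result = ds.mean().compute()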
Awesome, thanks @shoyer and @crusaderky for looking into this. I've tested it with the h5netcdf engine and the leak is mostly mitigated... for the simple case at least. Unfortunately the actual model files that I'm working with do not appear to be compatible with h5py (I believe related to this issue h5py/h5py#719). But that's another problem entirely! @crusaderky, I will hopefully get to trying your suggestions (3) and (4). As for your last point, I haven't tested explicitly, but yes, I believe that it does continue to grow linearly with more iterations.
I have observed a similar memory leak (see config below). Example for loading a 1.2 GB netCDF file:

Line #    Mem usage    Increment   Line Contents
================================================
    31    168.9 MiB    168.9 MiB   @profile
    32                             def load_and_unload_ds():
    33    173.0 MiB      4.2 MiB       ds = xr.open_dataset(LFS_DATA_DIR + '/dist2coast_1deg_merged.nc')
    34   2645.4 MiB   2472.4 MiB       ds.load()
    35   2645.4 MiB      0.0 MiB       ds.close()
    36    173.5 MiB      0.0 MiB       del ds
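For reference, a self-contained version of the profiled function would look roughly like this (a sketch: the LFS_DATA_DIR value here is a placeholder, not the reporter's actual path):

import xarray as xr
from memory_profiler import profile

LFS_DATA_DIR = '/path/to/data'  # placeholder for the reporter's local data directory

@profile
def load_and_unload_ds():
    ds = xr.open_dataset(LFS_DATA_DIR + '/dist2coast_1deg_merged.nc')
    ds.load()   # pulls the whole ~1.2 GB file into memory
    ds.close()
    del ds

if __name__ == '__main__':
    # repeat to watch whether memory grows across iterations
    for _ in range(10):
        load_and_unload_ds()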
Output of xr.show_versions():
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-62-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.6.2
xarray: 0.12.3
I was iterating today over a large dataset loaded with open_mfdataset, and I can confirm the same behaviour.
MCVE Code Sample
usage:
mprof run simplest_case.py
mprof plot
(mprof is a python memory profiling library)
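The code sample itself is not reproduced above; the following is a sketch of the same pattern, based on the modified script quoted earlier in the thread (file names and loop counts are illustrative):

import glob
import xarray as xr
from memory_profiler import profile

@profile
def read_files():
    # open the same set of netCDF files, concatenated along 'time', then close them
    ds = xr.open_mfdataset(glob.glob('testfile_*'), concat_dim='time')
    ds.close()

if __name__ == '__main__':
    # memory use is expected to stay flat across iterations, but grows instead
    for _ in range(100):
        read_files()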
Problem Description
dask version 1.1.4
xarray version 0.12
python 3.7.3
There appears to be a persistent memory leak in open_mfdataset. I'm creating a model calibration script that runs for ~1000 iterations, opening and closing the same set of files (dimensions are the same, but the data is different) with each iteration. I eventually run out of memory because of the leak. This simple case captures the same behavior. Closing the files with .close() does not fix the problem.
Is there a workaround for this? I've perused some of the related issues but cannot tell if this has been resolved.
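For what it's worth, the workarounds that come up in the discussion above reduce to two knobs; a sketch, assuming the files can be read by h5netcdf:

import glob
import xarray as xr

# 1. Keep at most one file handle in xarray's internal file cache.
xr.set_options(file_cache_maxsize=1)

# 2. Read with the scipy or h5netcdf backend instead of the default netcdf4
#    engine, which showed the larger leak in the profiling above.
ds = xr.open_mfdataset(glob.glob('testfile_*'), engine='h5netcdf', concat_dim='time')
ds.close()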
Output of xr.show_versions():
INSTALLED VERSIONS
commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2
xarray: 0.12.0
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: 1.5.5
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: None
setuptools: 41.0.1
pip: 19.1.1
conda: None
pytest: None
IPython: 7.3.0
sphinx: None