open_mfdataset memory leak, very simple case. v0.12 #3200

Open
bsu-wrudisill opened this issue Aug 9, 2019 · 7 comments
@bsu-wrudisill commented Aug 9, 2019

MCVE Code Sample

import glob
import os
import xarray as xr
import numpy as np
from memory_profiler import profile

def CreateTestFiles():
    # create a bunch of small files, one time step each
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)

    os.makedirs('testfiles', exist_ok=True)
    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]], dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfiles/datafile_{}.nc'.format(i))

@profile
def ReadFiles():
    xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time')

if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()

    # loop through the file-read step
    for i in range(100):
        ReadFiles()

usage:

mprof run simplest_case.py
mprof plot

(mprof is the command-line tool from the memory_profiler Python package)

Problem Description

dask version 1.1.4
xarray version 0.12
python 3.7.3

There appears to be a persistent memory leak in open_mfdataset. I'm creating a model calibration script that runs for ~1000 iterations, opening and closing the same set of files (dimensions are the same, but the data is different) with each iteration. I eventually run out of memory because of the leak. This simple case captures the same behavior. Closing the files with .close() does not fix the problem.

Is there a workaround for this? I've perused some of the issues but cannot tell if this has been resolved.

[figure: mprof memory-usage plot (Figure_1)]

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.17.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.16.2
scipy: 1.2.1
netCDF4: 1.4.2
pydap: None
h5netcdf: None
h5py: None
Nio: 1.5.5
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.2.1
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.2
cartopy: 0.17.0
seaborn: None
setuptools: 41.0.1
pip: 19.1.1
conda: None
pytest: None
IPython: 7.3.0
sphinx: None

@crusaderky (Contributor) commented:

Hi,

xarray doesn't have any global objects that I know of that could cause the leak - I'm willing to bet it's in the underlying libraries.

  1. Given your installed packages, open_mfdataset should be defaulting to netCDF4. Please repeat your measurement after setting the engine explicitly: open_mfdataset(..., engine='netcdf4').
  2. See if the problem disappears when you pass engine='h5netcdf' instead.
  3. Once you have confirmed the actual underlying library, try using it directly, without xarray, in your ReadFiles test: for every file returned by glob, open it with the netCDF4 package and load all of the coords (not the data) into memory; see the sketch after this list.
  4. If netCDF4 is confirmed to be the culprit and you are able to, it would be great if you could rewrite the read part of the test in C against the netCDF C library, to figure out whether the leak is in the library itself or in the Python wrapper.
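
For reference, a minimal sketch of step 3, assuming the testfiles/ layout and the x/y/time coordinate names from the reproduction script above; it opens each file with the raw netCDF4 library and reads only the coordinates:

import glob
import numpy as np
import netCDF4
from memory_profiler import profile

@profile
def read_coords_only():
    # open each file directly with netCDF4 (no xarray), read only the
    # coordinate variables into memory, then close the file
    for path in sorted(glob.glob('testfiles/*')):
        with netCDF4.Dataset(path) as nc:
            for name in ('x', 'y', 'time'):
                _ = np.asarray(nc.variables[name][:])

if __name__ == '__main__':
    for _ in range(100):
        read_coords_only()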

@crusaderky (Contributor) commented Aug 10, 2019

Oh, but first and foremost: CPython's memory management is designed so that, when PyMem_Free() is invoked, CPython will often hold on to the memory rather than returning it to the operating system via the underlying free(), hoping to reuse it on the next PyMem_Alloc().
An increase in RAM usage from 160 to 200 MB could very well be explained by this.
Try increasing the number of loops in your test 100-fold and see whether the memory growth also increases roughly 100-fold (the ~40 MB increase becoming ~4 GB). If yes, it's a real leak; if it remains much more contained, it's normal CPython behaviour.
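
One hypothetical way to run that check, assuming the testfiles/ from the script above and that the psutil package is available: record the process RSS before and after a run and compare the growth for different loop counts.

import glob
import os
import psutil
import xarray as xr

def rss_mb():
    # resident set size of the current process, in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1e6

def growth(n_loops):
    start = rss_mb()
    for _ in range(n_loops):
        xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time').close()
    return rss_mb() - start

# a real leak should grow roughly linearly with the loop count;
# allocator retention should level off instead
print('growth over  100 loops: %.1f MB' % growth(100))
print('growth over 1000 loops: %.1f MB' % growth(1000))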

@shoyer (Member) commented Aug 10, 2019

Thanks for the profiling script. I ran a few permutations of this:

  • xarray.open_mfdataset with engine='netcdf4' (default)
  • xarray.open_mfdataset with engine='h5netcdf'
  • xarray.open_dataset with engine='netcdf4' (default)
  • xarray.open_dataset with engine='h5netcdf'

Here are some plots:

xarray.open_mfdataset with engine='netcdf4': pretty noticeable memory leak, about 0.5 MB / open_mfdataset call:
[figure: netcdf4-maxsize1]

xarray.open_mfdataset with engine='h5netcdf': looks like a small memory leak, about 0.1 MB / open_mfdataset call:
[figure: h5netcdf-maxsize1]

xarray.open_dataset with engine='netcdf4' (default): definitely has a memory leak:
[figure: open_dataset - netcdf4]

xarray.open_dataset with engine='h5netcdf': does not appear to have a memory leak:
[figure: open_dataset - h5netcdf]

So in conclusion, it looks like there are memory leaks:

  1. when using netCDF4-Python (I was also able to confirm these without using xarray at all, just using netCDF4.Dataset)
  2. when using xarray.open_mfdataset

(1) looks like by far the bigger issue, which you can work around by switching to scipy or h5netcdf to read your files.

(2) is an issue for xarray. We do some caching of our own, specifically with our backend file manager, but given that the issue only seems to appear when using open_mfdataset, I suspect it has more to do with the interaction with Dask, though to be honest I'm not exactly sure how.

Note: I modified your script to set xarray's file cache size to 1, which helps smooth out the memory usage:

import glob
import numpy as np
import xarray as xr
from memory_profiler import profile

def CreateTestFiles():
    # create a bunch of files
    xlen = int(1e2)
    ylen = int(1e2)
    xdim = np.arange(xlen)
    ydim = np.arange(ylen)

    nfiles = 100
    for i in range(nfiles):
        data = np.random.rand(xlen, ylen, 1)
        datafile = xr.DataArray(data, coords=[xdim, ydim, [i]], dims=['x', 'y', 'time'])
        datafile.to_netcdf('testfile_{}.nc'.format(i))

@profile
def ReadFiles():
    # for i in range(100):
    #     ds = xr.open_dataset('testfile_{}.nc'.format(i), engine='netcdf4')
    #     ds.close()
    ds = xr.open_mfdataset(glob.glob('testfile_*'), engine='h5netcdf', concat_dim='time')
    ds.close()

if __name__ == '__main__':
    # write out files for testing
    CreateTestFiles()

    xr.set_options(file_cache_maxsize=1)

    # loop through the file-read step
    for i in range(100):
        ReadFiles()

@shoyer (Member) commented Aug 10, 2019

Also, if you're having memory issues, I would definitely recommend upgrading to a newer version of xarray. There was a recent fix that helps ensure files get automatically closed when they are garbage collected, even if you don't call close() or use a context manager explicitly.
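
For reference, the explicit clean-up patterns this refers to look roughly like the following (file pattern borrowed from the reproduction script above):

import glob
import xarray as xr

# explicit close
ds = xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time')
# ... work with ds ...
ds.close()

# or a context manager, which closes the underlying files on exit
with xr.open_mfdataset(glob.glob('testfiles/*'), concat_dim='time') as ds:
    pass  # ... work with ds ...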

@bsu-wrudisill (Author) commented:

Awesome, thanks @shoyer and @crusaderky for looking into this. I've tested it with the h5netcdf engine and the leak is mostly mitigated... for the simple case at least. Unfortunately, the actual model files that I'm working with do not appear to be compatible with h5py (I believe this is related to h5py/h5py#719). But that's another problem entirely!

@crusaderky, I will hopefully get to trying your suggestions 3) and 4). As for your last point, I haven't tested it explicitly, but yes, I believe the memory usage does continue to grow roughly linearly with more iterations.

@floschl commented Sep 12, 2019

I have observed a similar memory leak (config below).
It occurs with both engine='netcdf4' and engine='h5netcdf'.

Example loading a 1.2GB netCDF file:
The bulk of the memory (~2.6 GB) is only released by a del ds on the object; ds.close() has no effect. A "minor" memory leak (~4 MB) also remains each time open_dataset is called. See the output below from the memory_profiler package, and the workaround sketch after the list:

Line #    Mem usage    Increment   Line Contents
================================================
    31    168.9 MiB    168.9 MiB   @profile
    32                             def load_and_unload_ds():
    33    173.0 MiB      4.2 MiB       ds = xr.open_dataset(LFS_DATA_DIR + '/dist2coast_1deg_merged.nc')
    34   2645.4 MiB   2472.4 MiB       ds.load()
    35   2645.4 MiB      0.0 MiB       ds.close()
    36    173.5 MiB      0.0 MiB       del ds
  • there is no difference when using open_dataset(file, engine='h5netcdf'); the minor memory leak is even larger there (~9 MB).
  • the memory leak persists if an additional chunks parameter is passed to open_dataset.
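
A minimal sketch of the behaviour described above, assuming a placeholder path for the large file; dropping the last reference with del is what actually returns the memory:

import gc
import xarray as xr

# placeholder path: substitute your own large netCDF file
path = 'dist2coast_1deg_merged.nc'

ds = xr.open_dataset(path)
ds.load()      # pulls the ~2.6 GB of data into memory
ds.close()     # closes the file handle but, as observed above, does not free the data
del ds         # dropping the last reference is what actually releases the memory
gc.collect()   # optional: force a collection so the release happens promptly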

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7 | packaged by conda-forge | (default, Jul 2 2019, 02:18:42)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-62-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.6.2

xarray: 0.12.3
pandas: 0.25.1
numpy: 1.16.4
scipy: 1.2.1
netCDF4: 1.5.1.2
pydap: None
h5netcdf: 0.7.4
h5py: 2.9.0
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.3.0
distributed: 2.3.2
matplotlib: 3.1.1
cartopy: 0.17.0
seaborn: None
numbagg: None
setuptools: 41.0.1
pip: 19.2.3
conda: None
pytest: 5.0.1
IPython: 7.7.0
sphinx: None

@deeplycloudy (Contributor) commented:

I was iterating today over a large dataset loaded with open_mfdataset, and had been observing memory usage growing from 2GB to 8GB+.

I can confirm that xr.set_options(file_cache_maxsize=1) kept memory use at a steady 2GB, properly releasing memory.

libnetcdf                 4.8.1           nompi_h261ec11_106    conda-forge
netcdf4                   1.6.0           nompi_py310h0a86a1f_103    conda-forge
xarray                    2023.1.0           pyhd8ed1ab_0    conda-forge
dask                      2023.1.0           pyhd8ed1ab_0    conda-forge
dask-core                 2023.1.0           pyhd8ed1ab_0    conda-forge
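
A minimal sketch of that workaround, assuming a hypothetical file pattern and nested concatenation along 'time' (adjust combine/concat_dim for your own data):

import glob
import xarray as xr

# keep at most one file handle in xarray's internal file cache
xr.set_options(file_cache_maxsize=1)

# hypothetical file pattern; adjust for your dataset layout
files = sorted(glob.glob('testfiles/*.nc'))
with xr.open_mfdataset(files, combine='nested', concat_dim='time') as ds:
    for t in ds['time'].values:
        ds.sel(time=t).load()  # work on one step at a time; memory stays bounded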
