
Memory issue due to flow field data #668

Closed
VictorOnink opened this issue Oct 9, 2019 · 16 comments

@VictorOnink

VictorOnink commented Oct 9, 2019

I've just updated Python to version 3.7 and therefore reinstalled parcels with the most up-to-date version (v2.1, which was just released). However, with this new version of parcels my runs crash due to a memory error. Specifically, the error message reads:

MemoryError: Unable to allocate array with shape (1, 3251, 4500) and data type float32

The shape of the array matches that of one time step of the HYCOM surface circulation data I have been using, which makes me suspect that the issue is related to files not being released from memory once the simulation has passed them. While running a simulation I have also monitored the working memory used by the script, which keeps increasing until the program crashes. Furthermore, when I run the script with only one flow field data file, it runs without issue.
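(For scale, a quick back-of-the-envelope check of how much memory one such time slice takes:)

```python
import numpy as np

# Size of one time slice with the shape from the error message above
shape = (1, 3251, 4500)
nbytes = np.prod(shape) * np.dtype(np.float32).itemsize
print(nbytes / 1024**2)  # ~55.8 MiB per slice, so undeleted slices add up quickly
```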

I've come across the same issue with parcels installed on both a Linux and a Windows machine, while it didn't come up when I installed parcels v2.0, so it likely lies somewhere in the more recent additions to the parcels code. It already came up last week, so Philippe has the code and data files that led to the issue on my machines.

@delandmeterp
Contributor

Hi Victor, could you provide the exact file you are running?
We ran your file last week with totalKernal=AdvectionRK4+diffusion+Beach, and it was going fine on macOS. I've run it again now, and on gemini as well, with similarly fine results (see below).

linux_memory
macos_memory

@VictorOnink
Author

The file I used is exactly the same as last week's, with the only difference being that the file paths were changed because it was run on a different computer. It is odd, then, that the error doesn't come up on your servers. I've now been using v2.0 and that works fine without any errors, but I'm not sure what the issue is with v2.1.

Since posting the issue I've kept working on my code, so quite a few of the kernels have changed somewhat. I'll try reinstalling v2.1 at some later point and see if a similar error comes up again. It might just be an issue specific to my computer or to some part of a kernel.

@CKehl
Contributor

CKehl commented Jan 17, 2020

I have expanded @delandmeterp's tests on the field chunking (see https://nbviewer.jupyter.org/github/OceanParcels/parcels/blob/master/parcels/examples/documentation_MPI.ipynb) by tracking memory consumption and running them with MPI. Currently, this is a toy example looking at particle number and time steps, but it already shows a trend. The simulations are run with 48 particles, a runtime of 7 days and a dt of 1 h.

testing the chunking without MPI-mode:
mpiChunking_plot_np4_48p_7days_woGC

testing the chunking with MPI-mode:
mpiChunking_plot_MPI_np4_48p_7days_woGC

Now, inspired by @delandmeterp, I tried out what difference garbage collection makes.

testing the chunking with Garbage Collection without MPI-mode:
mpiChunking_plot_np4_48p_7days_withGC

testing the chunking with Garbage collection with MPI-mode:
mpiChunking_plot_MPI_np4_48p_7days_withGC

Clearly, we can see that memory consumption builds up over the run, which should not be the case, and that garbage collection actually ensures near-constant memory consumption for this example.
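For reference, a minimal sketch of one way to force garbage collection between execution chunks, which is roughly what the 'with Garbage Collection' runs above do (the ParticleSet and kernel are stand-ins for those in the linked notebook, not the exact test script):

```python
import gc
from datetime import timedelta

def execute_with_gc(pset, kernel, days=7):
    """Run the advection in daily chunks and force a garbage-collection pass
    after each chunk; pset is a parcels ParticleSet, kernel e.g. AdvectionRK4."""
    for _ in range(days):
        pset.execute(kernel, runtime=timedelta(days=1), dt=timedelta(hours=1))
        gc.collect()  # collect unreferenced field data between chunks
```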

Still, aside from using garbage collection, I don't know in detail where this problematic memory behaviour has its roots.

PS: the green line tracks the number of open files, which I also log to investigate the errors with respect to job queueing systems. For the memory, you can ignore that line; the yellow one is the important one.

@CKehl
Contributor

CKehl commented Jan 21, 2020

I went further into the profiling, and it turns out that the old flow fields don't get deleted. To show this, we can look at a simple example that advects 96 particles with plain RK4 (plus deleting out-of-bounds particles) on the CMEMS data, over a time period of 7 days with dt=1h. To study the memory behaviour, we first exclude garbage collection (the default) and don't use MPI over more than 1 processor.
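A minimal sketch of this kind of set-up (the file pattern, variable/dimension names and release line are assumptions, not the exact test script); the resulting memory profile follows below:

```python
from datetime import timedelta
from parcels import FieldSet, ParticleSet, JITParticle, AdvectionRK4, ErrorCode

# Hypothetical CMEMS file pattern, variable names and dimension names
filenames = {"U": "CMEMS_data_*.nc", "V": "CMEMS_data_*.nc"}
variables = {"U": "uo", "V": "vo"}
dimensions = {"lon": "longitude", "lat": "latitude", "time": "time"}
fieldset = FieldSet.from_netcdf(filenames, variables, dimensions)

# 96 particles released on a line (release locations are illustrative)
pset = ParticleSet.from_line(fieldset, pclass=JITParticle,
                             start=(-20.0, 30.0), finish=(-15.0, 35.0), size=96)

def DeleteParticle(particle, fieldset, time):
    particle.delete()  # remove out-of-bounds particles instead of raising an error

pset.execute(AdvectionRK4, runtime=timedelta(days=7), dt=timedelta(hours=1),
             recovery={ErrorCode.ErrorOutOfBounds: DeleteParticle})
```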

testMem_noMPI_woGC

If we assess the yellow curve, we see a steady, minute increase of a few MB each hour, which comes from the particles. This is not a major issue in this case and could also come from the "repeatdt". The major memory leak is seen at the end of each day (which is when new field data are loaded), where memory accumulates. Basically, this directly shows that new fields are loaded from disk, but the old data in memory are not deleted. This can quickly cause memory overflows in long-running simulations over many days and months. From what Python's memory_profiler tells me, this leakage occurs in the field's "computeTimeChunk" method. Still, the error probably needs to be found somewhere else, where the last two time steps are actually shifted back to make space for the new data (hypothesis).
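The kind of memory_profiler usage behind that observation, as a generic sketch (the decorated function is an illustrative stand-in, not parcels' actual computeTimeChunk):

```python
import numpy as np
from memory_profiler import profile

@profile  # prints per-line memory increments each time the function runs
def load_next_time_chunk(shape=(2041, 4320)):
    # Stand-in for reading a new field time slice from NetCDF into memory
    new_slice = np.zeros(shape, dtype=np.float32)
    return new_slice

if __name__ == "__main__":
    # Keeping the returned slices around mimics time slices that are never freed
    slices = [load_next_time_chunk() for _ in range(3)]
```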

Actually, calling the garbage collector after each advection does NOT solve the problem, as the curves with and without GC look exactly the same:

testMem_noMPI_wGC

If we compare the same run with 1 processor and then in MPI with 2 processors, we see that the problem compounds (obviously), as the leakage occurs for each processor individually.

testMem_MPI_woGC

@CKehl
Contributor

CKehl commented Jan 21, 2020

A small addition: if we look at 31 days of simulation, we see that the memory sort of 'resets' when wrapping around the 30-day period covered by the CMEMS model.

testMem_MPI_wGC_31days

@VictorOnink
Author

That matches what I remember of the original error. The error I got indicated that the memory error arose when the next time step of the data was to be loaded, since it gave the array dimensions of the variable that caused the error, and these matched exactly one time slice of flow field data. The simulations also did not use more particles than in previous runs, so the memory error is consistent with the old fields not being deleted.

As for your addition: there is a sort of reset of the memory, but right after it the memory jumps up again to a higher level than before, so is it really a reset, or more of a temporary blip?

@CKehl
Contributor

CKehl commented Jan 21, 2020

No, Victor, you are absolutely right: the memory bumps back to the previous high level (I'm having a deeper look at that right now too). This is what explains the high blue bars at that point: it unloads the data and then somehow reloads them in each later iteration, probably due to data interpolation between timesteps for particles in the various field chunks. But why it then loads ALL the timesteps again is beyond me right now ... looking into it.

Thanks for your confirmation - that helps in tracking the error.

I still need to run all of this with SciPy and with the previous version to see what's what.

@CKehl
Contributor

CKehl commented Jan 21, 2020

33 days - I can't run more locally because my memory taps out beyond that. The spikes are weird, but what is even weirder is that the bar for open files (the green one) drops to 2 and stays there. Basically, it loads field after field into memory, file by file, after 'wrapping around' the time domain.

testMem_MPI_wGC_33days

The good thing is that memory consumption is the same for any number of cores. Basically, splitting a grid into N equally-sized subgrids for MPI works, but the actual data loading from file does not.

testMem_noMPI_wGC_33days

Comment CK: this is run with 'allow_time_extrapolation', which causes the weird spike after 30 days of simulation.

@CKehl
Contributor

CKehl commented Jan 22, 2020

It seems that the core of the issue is the NetcdfFileBuffer itself.

If we run the current version and track what is happening there, the log looks like this (for NetcdfFileBuffer.data):

NetCDF engine: netcdf4
NetCDF dataset[uo] as <class 'xarray.core.dataarray.DataArray'>: <xarray.DataArray 'uo' (time: 1, depth: 50, latitude: 2041, longitude: 4320)>
[440856000 values with dtype=float32]
Coordinates:
  * longitude  (longitude) float32 -180.0 -179.91667 ... 179.83333 179.91667
  * latitude   (latitude) float32 -80.0 -79.916664 -79.833336 ... 89.916664 90.0
  * depth      (depth) float32 0.494025 1.541375 2.645669 ... 5274.784 5727.917
  * time       (time) datetime64[ns] 2016-07-01T12:00:00
Attributes:
    long_name:      Eastward velocity
    standard_name:  eastward_sea_water_velocity
    units:          m s-1
    unit_long:      Meters per second
    valid_min:      -3454
    valid_max:      4455
    cell_methods:   area: mean
Type of actual data: <class 'numpy.ndarray'>
NetCDF indiced: {'lon': range(0, 4320), 'lat': range(0, 2041), 'depth': [0]}
dask-xarray shape: (1, 1, 2041, 4320)

The important bit is the line Type of actual data: <class 'numpy.ndarray'>, meaning that it makes no difference whether the rest of the code uses Dask, because the data from NetCDF are already loaded in full at that stage.

Going through the documentation of xarray and Dask, the following turns out:
the xarray.open_dataset() method ONLY uses Dask and lazy allocation if chunking information is provided in the open call, as in xarray.open_dataset(..., chunks=...). If that is not the case, this NetCDF call WILL allocate the data as numpy.ndarray, and thus any later chunking has nearly no effect because the data are already fully in memory.
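A minimal illustration of the difference (the filename is a placeholder, not one of the actual CMEMS files):

```python
import xarray as xr

# Without 'chunks', accessing the variable yields an eagerly loaded numpy array
ds_eager = xr.open_dataset("cmems_slice.nc")              # placeholder filename
print(type(ds_eager["uo"].data))                          # <class 'numpy.ndarray'>

# With a 'chunks' argument, xarray wraps the variable in a lazy dask array,
# so the data are only read from disk chunk by chunk when actually computed
ds_lazy = xr.open_dataset("cmems_slice.nc",
                          chunks={"time": 1, "latitude": 512, "longitude": 512})
print(type(ds_lazy["uo"].data))                           # <class 'dask.array.core.Array'>
```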

This is verified: when we just add a "blueprint" chunks argument to the xr.open_dataset(...) call, the log looks like this:

NetCDF engine: netcdf4
NetCDF dataset[uo] as <class 'xarray.core.dataarray.DataArray'>: <xarray.DataArray 'uo' (time: 1, depth: 50, latitude: 2041, longitude: 4320)>
dask.array<open_dataset-9e81bb2c309698efdb108cdc66af94eduo, shape=(1, 50, 2041, 4320), dtype=float32, chunksize=(1, 50, 2041, 4320), chunktype=numpy.ndarray>
Coordinates:
  * longitude  (longitude) float32 -180.0 -179.91667 ... 179.83333 179.91667
  * latitude   (latitude) float32 -80.0 -79.916664 -79.833336 ... 89.916664 90.0
  * depth      (depth) float32 0.494025 1.541375 2.645669 ... 5274.784 5727.917
  * time       (time) datetime64[ns] 2016-07-01T12:00:00
Attributes:
    long_name:      Eastward velocity
    standard_name:  eastward_sea_water_velocity
    units:          m s-1
    unit_long:      Meters per second
    valid_min:      -3454
    valid_max:      4455
    cell_methods:   area: mean
Type of actual data: <class 'dask.array.core.Array'>
NetCDF indiced: {'lon': range(0, 4320), 'lat': range(0, 2041), 'depth': [0]}
dask-xarray shape: (1, 1, 2041, 4320)

As we can see from Type of actual data: <class 'dask.array.core.Array'>, the xarray NetCDF loader now uses lazily allocated Dask arrays for the data.

I'm memory-profiling the change to see if that makes the impact I expect it to make.

@CKehl
Contributor

CKehl commented Jan 22, 2020

update for a 7-day run:

before bugfix:
fix_pretest

after bugfix:
fix

There is still a growing trend, but that can be explained by the particles becoming more and more spread out over time, requiring more and more chunks.

Now running the month-long tests to verify stability.

@CKehl
Contributor

CKehl commented Jan 22, 2020

One drawback of this whole process: if one does NOT want chunking and does want all 3 field datasets in memory, then that won't work that easily. The problem is: if you concatenate a fixed-allocated array (e.g. via numpy) with dask-concatenate and treat it from then on as Dask, then shifting and concatenation operations will NOT automatically free unused numpy arrays (which is why this error was there in the first place). In other words: as soon as an array becomes a dask array, unused data in memory are not freed, because dask-array indexing operations don't do anything to the memory.

Hence, if one defines a FieldSet by initialising it with "chunking=False", all array calls that are now fixed on Dask need to be replaced with xarray or numpy, and probably the whole chunking (i.e. all function calls related to it) needs to be skipped. So, making that work will require a larger overhaul.
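A small self-contained sketch of the dask-wrapping behaviour described above (shapes are illustrative):

```python
import numpy as np
import dask.array as da

# Two eagerly allocated numpy blocks, standing in for two field time slices
old_slice = np.zeros((1, 2041, 4320), dtype=np.float32)
new_slice = np.ones((1, 2041, 4320), dtype=np.float32)

# Wrapping them as dask arrays and concatenating keeps references to both numpy blocks
stacked = da.concatenate([da.from_array(old_slice), da.from_array(new_slice)], axis=0)

# Slicing only builds a new task graph; it does NOT release the memory behind
# old_slice as long as the graph (or our own variable) still references it
shifted = stacked[1:, ...]
print(shifted.shape)  # (1, 2041, 4320) -- but old_slice is still resident in memory
```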

@CKehl
Contributor

CKehl commented Jan 22, 2020

The fix also seems to work on the longer scale, though wrapping fieldsets in time (either by periodic wrapping or time extrapolation) still has some flaws. Here again, for comparison, the 33-day runs. Watch the yellow bar for memory. Keep in mind that the plot without the fix is measured in gigabytes [= 1000 MB], while the plot with the fix is measured in tenths of a gigabyte [= 100 MB]. Thus, though the bars look similar, there is actually a difference of a whole order of magnitude in memory consumption.

before bugfix:
testMem_noMPI_wGC_33days

after bugfix:
testMem_noMPI_woGO_33days

Comment CK: here, we have a periodic_time of 30 days - allow_time_extrapolation breaks completely for some reason.

@CKehl
Contributor

CKehl commented Jan 22, 2020

I'm fixing some MPI-related problems, but in MPI, too, it starts working for 33 days:
testMem_MPI_woGC_33days

@CKehl
Contributor

CKehl commented Jan 27, 2020

Here are some results from recent runs. Mind that I check the graphs for backward optimization mode, MP and submission systems too; they look the same.

Run forward with extrapolation (fieldsize=2048):
testMem_noMPI_extrapolation_fwd_33days_fs2048_fix

Run forward without deferred arrays with extrapolation (fieldsize=2048):
testMem_noMPI_extrapolation_noDefer_fwd_33days_fs2048_fix

Run forward with periodic wrapping (fieldsize=2048):
testMem_noMPI_periodic_fwd_33days_fs2048_fix

Run forward with extrapolation (fieldsize=256):
testMem_noMPI_extrapolation_fwd_33days_fs256_fix

Run forward with periodic wrapping (fieldsize='auto' while having a valid dask config yaml file):
testMem_noMPI_periodic_fwd_33days_fsAUTO_fix

@CKehl
Contributor

CKehl commented Jan 28, 2020

Here the normal forward simulation with extrapolation and without repeatdt, meaning without continuous particle addition:
testMem_noMPI_extrapolation_fwd_33days_fs2048_fix_noParticleAdd

@CKehl
Contributor

CKehl commented Jan 28, 2020

Please continue testing and further discussion at #719.
