
Memory blowout on large domains #67

Closed · joelfiddes opened this issue Mar 21, 2023 · 15 comments

Labels: bug (Something isn't working), topo_scale

Comments

@joelfiddes
Collaborator

I'm still getting the memory blowout even with the IO:split option turned off and for a single year. This is admittedly a big domain, but one I ran with no problem several versions ago (pre-implementation of IO:split). What confuses me is that if split is turned off, it should behave the same as the older version, no? Yet something seems to be fundamentally different in memory use: it blows through 15 GB of memory in about 15 s! So something is scaling up pretty fast...

joelfiddes added the bug (Something isn't working) label on Mar 21, 2023
@ArcticSnow
Owner

ArcticSnow commented Mar 21, 2023

Oh no, sorry. Can you confirm that the version prior to commit e30c720 does not show the same behavior?
Using split or not does not influence the use of multithreading and multicore. Are you able to pinpoint at what step in the code you see the leak?
FYI, I am currently running a project of 4000 clusters, split in 5-year chunks for a total of 70 years, on a server. It has been running smoothly over multiple days, so I am quite puzzled by this problem.

How big is the DEM file? I may try with a large DEM too, so that I can reproduce the bug.

@joelfiddes
Collaborator Author

I've gone back to release 0.1.7 and it is running fine now. I think, as before, this is not related to cluster number or timeseries length but to the ERA5 domain size. I don't think it depends on DEM size: in the last case where we saw this (Naryn, KG), it was a big ERA5 domain with a small DEM, and once I cropped ERA5 to the DEM extent it was fine. Here the ERA5 domain is 21 x 17 cells and the DEM is 625 x 555 (500 m cells). Are you interested specifically in the version prior to commit e30c720?
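
If useful, here is a minimal sketch (not TopoPyScale code; the file name and bounds are hypothetical placeholders) of the workaround that fixed the Naryn case, i.e. cropping each ERA5 file to the DEM extent before downscaling:

import xarray as xr

lon_min, lon_max = 73.0, 78.0     # hypothetical DEM bounding box
lat_min, lat_max = 40.5, 42.5

ds = xr.open_dataset("inputs/climate/PLEV_199909.nc")   # hypothetical file name
# ERA5 stores latitude in descending order, hence the reversed slice
ds_crop = ds.sel(longitude=slice(lon_min, lon_max),
                 latitude=slice(lat_max, lat_min))
ds_crop.to_netcdf("inputs/climate/PLEV_199909_crop.nc")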

@ArcticSnow
Owner

That commit is prior to the merge with the parallelizing branch. I wonder if it has to do with the change from opening the ERA5 data with open_mfdataset() to an xr.concat() of a list of filenames.
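
For reference, a rough illustration of the two opening strategies in question (file pattern assumed from the project layout): concatenating individually opened files ends up reading every variable into memory, whereas open_mfdataset keeps them as lazy dask chunks.

import glob
import xarray as xr

flist = sorted(glob.glob("inputs/climate/PLEV*.nc"))

# Eager: concatenating file-backed datasets pulls all values into memory
ds_eager = xr.concat([xr.open_dataset(f) for f in flist], dim="time")

# Lazy: variables stay as dask arrays and are loaded chunk by chunk on demand
ds_lazy = xr.open_mfdataset(flist, combine="by_coords", parallel=True)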

@ealonsogzl
Collaborator

ealonsogzl commented Mar 23, 2023

Hey guys, I was reading this thread. There are some memory leaks reported here and there in netCDF at the C level, which the Python garbage collector cannot handle. I personally found one some time ago in MFDataset() that forced me to open the files with a loop instead. Try to keep the netCDF dependencies you are using up to date; maybe that helps.

This may be relevant pydata/xarray#3200
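
In xarray terms, the loop-based workaround I mean looks roughly like this (file names are placeholders): open each file in a context manager so the underlying netCDF handle is released as soon as the values are read.

import xarray as xr

flist = ["PLEV_199909.nc", "PLEV_199910.nc"]   # hypothetical files

pieces = []
for f in flist:
    with xr.open_dataset(f) as ds:       # the handle is closed on exit
        pieces.append(ds.load())         # read the values before closing

ds_all = xr.concat(pieces, dim="time")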

@joelfiddes
Collaborator Author

I think the issue is here:

def _open_dataset_climate(flist):

    ds__list = []
    for file in flist:
        ds__list.append(xr.open_dataset(file))

where all the ERA5 files are loaded and appended. In my case, with a large domain, this blows the memory. How can we make this scale?
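
One possible direction (a sketch only, not a tested fix): keep the same loop but pass chunks= so that xarray wraps the variables in dask arrays, which keeps the concatenation lazy instead of reading every file into memory.

import xarray as xr

def _open_dataset_climate_lazy(flist):           # hypothetical variant
    ds__list = [xr.open_dataset(file, chunks={"time": 720}) for file in flist]
    return xr.concat(ds__list, dim="time")       # dask-backed, still lazy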

@joelfiddes
Collaborator Author

The above is l.149 of topo_scale.py.

@joelfiddes
Collaborator Author

I think this is why it occurs whether or not you specify the time split option: in both cases def downscale_climate is the same and contains the function above.

Basically what you already said above, Simon, I think: #67 (comment)

@joelfiddes
Collaborator Author

@ArcticSnow, can we go back to using open_mfdataset(), or does that not work with the new split timeseries code?

@joelfiddes
Collaborator Author

joelfiddes commented May 9, 2023

New split timeseries code:

In [19]: flist = flist_PLEV

In [20]:     ds__list = []
    ...:     for file in flist:
    ...:          ds__list.append(xr.open_dataset(file))
    ...: 
    ...:     ds_ = xr.concat(ds__list, dim='time')

In [21]: ds_
Out[21]: 
<xarray.Dataset>
Dimensions:    (time: 1464, longitude: 24, latitude: 11, level: 8)
Coordinates:
  * time       (time) datetime64[ns] 1999-09-01 ... 1999-10-31T23:00:00
  * longitude  (longitude) float32 72.9 73.15 73.4 73.65 ... 78.15 78.4 78.65
  * latitude   (latitude) float32 42.55 42.3 42.05 41.8 ... 40.55 40.3 40.05
  * level      (level) float64 300.0 500.0 600.0 700.0 800.0 850.0 900.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 9.28e+04 ... 2.27e+03
    t          (time, level, latitude, longitude) float32 235.7 235.7 ... 286.5
    u          (time, level, latitude, longitude) float32 19.76 19.63 ... 0.6399
    v          (time, level, latitude, longitude) float32 11.27 11.9 ... -0.6594
    r          (time, level, latitude, longitude) float32 37.86 51.65 ... 63.77
    q          (time, level, latitude, longitude) float32 0.0001289 ... 0.003548
Attributes:
    CDI:          Climate Data Interface version 1.9.9rc1 (https://mpimet.mpg...
    Conventions:  CF-1.6
    history:      Thu Mar 09 22:37:48 2023: cdo sellonlatbox,72.6960777169421...
    CDO:          Climate Data Operators version 1.9.9rc1 (https://mpimet.mpg...

Original open_mfdataset:

In [24]: ds_plev = xr.open_mfdataset(project_directory + 'inputs/climate/PLEV*.nc', parallel=True)

In [25]: ds_plev
Out[25]: 
<xarray.Dataset>
Dimensions:    (time: 1464, longitude: 24, latitude: 11, level: 8)
Coordinates:
  * time       (time) datetime64[ns] 1999-09-01 ... 1999-10-31T23:00:00
  * longitude  (longitude) float32 72.9 73.15 73.4 73.65 ... 78.15 78.4 78.65
  * latitude   (latitude) float32 42.55 42.3 42.05 41.8 ... 40.55 40.3 40.05
  * level      (level) float64 300.0 500.0 600.0 700.0 800.0 850.0 900.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
Attributes:
    CDI:          Climate Data Interface version 1.9.9rc1 (https://mpimet.mpg...
    Conventions:  CF-1.6
    history:      Thu Mar 09 22:37:48 2023: cdo sellonlatbox,72.6960777169421...
    CDO:          Climate Data Operators version 1.9.9rc1 (https://mpimet.mpg...

The results seem to be identical, so a change back should work?
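
If it helps, the equivalence can also be checked directly rather than by eye (assuming ds_ and ds_plev from the snippets above):

import xarray as xr

xr.testing.assert_identical(ds_, ds_plev)   # names, attributes and values all match
# or, ignoring attributes and allowing floating-point tolerance:
xr.testing.assert_allclose(ds_, ds_plev)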

@joelfiddes
Collaborator Author

Another small point: why is the time subset done for SURF but not PLEV? Before, it was done for both:

[screenshot]
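
For clarity, the kind of temporal subset being discussed looks roughly like this (illustrative only, since the screenshot is not reproduced here; the SURF file pattern and the dates are assumptions):

import xarray as xr

start_date, end_date = "1999-09-01", "1999-10-31"    # hypothetical split window
ds_plev = xr.open_mfdataset("inputs/climate/PLEV*.nc").sel(time=slice(start_date, end_date))
ds_surf = xr.open_mfdataset("inputs/climate/SURF*.nc").sel(time=slice(start_date, end_date))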

@joelfiddes
Collaborator Author

Edits:

[screenshot]

and

[screenshot]

It seems to work so far.

@ArcticSnow
Owner

I had changed open_mfdataset() to the other method because open_mfdataset() did not work with the parallelizing system I ended up using. So we could have both options, then. Can you check whether, when using your edit, you can still spread the computational load over multiple cores?

Good catch on line 189. No idea why that happened; strange.
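
Conceptually, what I want to make sure survives the edit is the point jobs being mapped over a pool of workers, something like the pattern below (downscale_point and point_ids are placeholders, not the actual TopoPyScale code):

from multiprocessing import Pool

n_cores = 6                            # as set in the config file
point_ids = range(24)                  # hypothetical cluster/point ids

def downscale_point(pid):
    # the per-point downscaling would run here
    return pid

if __name__ == "__main__":
    with Pool(n_cores) as pool:
        results = pool.map(downscale_point, point_ids)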

joelfiddes added a commit that referenced this issue May 11, 2023
… blowout on large domains. Returned to xr.open_mfdataset, see issue:

#67 (comment)

other small changes:

1. pd.date_range deprecated arg "closed" changed to "inclusive"

2. deprecated np.int -> np.int32

3. added missing temporal subset l.187
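
The two deprecation fixes from this commit, shown in isolation with illustrative values (not the actual lines from the code base):

import numpy as np
import pandas as pd

# pandas renamed the "closed" argument of pd.date_range to "inclusive"
times = pd.date_range("1999-09-01", "1999-10-31", freq="1H", inclusive="left")

# the np.int alias is removed in recent NumPy; use an explicit width instead
n_levels = np.int32(8)
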
@ArcticSnow
Owner

OK, I tested with the edits above, and now it no longer parallelizes the downscaling. I'll include it again and add an option, then.

@joelfiddes
Collaborator Author

I'm getting the number of point jobs launched equal to the cores specified (6), and a shedload of processes launched during downscaling. Do you mean you just have a single process running?

[screenshot]

@ArcticSnow
Owner

Never mind, I had my config file set to one core. Sorry, it works great!

Should we close the topic then?
