
Memory blowout on large domains #67

Closed · joelfiddes opened this issue Mar 21, 2023 · 15 comments

Labels: bug (Something isn't working), topo_scale

Comments

@joelfiddes
Collaborator

I'm still getting the memory blowout even with the IO:split option turned off and for a single year. This is admittedly a big domain, but one I ran with no problem several versions ago (pre-implementation of IO:split). What confuses me is that if split is turned off, it should behave the same as the older version, no? Yet something seems to be fundamentally different in memory use: it blows through 15 GB of memory in about 15 s! So something is scaling up pretty fast...

joelfiddes added the bug (Something isn't working) label on Mar 21, 2023
@ArcticSnow
Owner

ArcticSnow commented Mar 21, 2023

Oh no, sorry. Can you confirm that the version prior to commit e30c720 does not show the same behavior?
Using split or not does not influence the use of multithreading and multicore. Are you able to pinpoint at what step in the code you see the leak?
FYI, I am currently running a project of 4000 clusters, split in 5-year chunks for a total of 70 years, on a server. It has been running smoothly over multiple days, so I am quite puzzled by this problem.

How big is the DEM file? I may try with a large DEM too, so that I can reproduce the bug.

@joelfiddes
Collaborator Author

I've gone back to release 0.1.7 and it is running fine now. I think, as before, this is not related to cluster number or timeseries length but to the ERA5 domain size. I don't think it depends on DEM size: in the last case where we saw this (Naryn, KG), it was a big ERA5 domain with a small DEM, and once I cropped ERA5 to the DEM extent it was fine. Here the ERA5 domain is 21 x 17 cells and the DEM is 625 x 555 (500 m cells). Are you interested specifically in the version prior to commit e30c720?
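
If useful, here is a minimal sketch (not TopoPyScale code; the file name and bounds are hypothetical placeholders) of the workaround that fixed the Naryn case, i.e. cropping each ERA5 file to the DEM extent before downscaling:

import xarray as xr

lon_min, lon_max = 73.0, 78.0     # hypothetical DEM bounding box
lat_min, lat_max = 40.5, 42.5

ds = xr.open_dataset("inputs/climate/PLEV_199909.nc")   # hypothetical file name
# ERA5 stores latitude in descending order, hence the reversed slice
ds_crop = ds.sel(longitude=slice(lon_min, lon_max),
                 latitude=slice(lat_max, lat_min))
ds_crop.to_netcdf("inputs/climate/PLEV_199909_crop.nc")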

@ArcticSnow
Owner

That commit is prior to the merge with the parallelizing branch. I wonder if it has to do with the change from opening the ERA5 data with open_mfdataset() to an xr.concat() of a list of filenames.
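
For reference, a rough illustration of the two opening strategies in question (file pattern assumed from the project layout): concatenating individually opened files ends up reading every variable into memory, whereas open_mfdataset keeps them as lazy dask chunks.

import glob
import xarray as xr

flist = sorted(glob.glob("inputs/climate/PLEV*.nc"))

# Eager: concatenating file-backed datasets pulls all values into memory
ds_eager = xr.concat([xr.open_dataset(f) for f in flist], dim="time")

# Lazy: variables stay as dask arrays and are loaded chunk by chunk on demand
ds_lazy = xr.open_mfdataset(flist, combine="by_coords", parallel=True)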

@ealonsogzl
Collaborator

ealonsogzl commented Mar 23, 2023

Hey guys, I was reading this thread. There are some memory leaks reported here and there in netCDF at the C level, which the Python garbage collector cannot handle. I personally found one some time ago in MFDataset() that forced me to open the files with a loop instead. Try to keep the netCDF dependencies you are using up to date; maybe that helps.

This may be relevant pydata/xarray#3200
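
In xarray terms, the loop-based workaround I mean looks roughly like this (file names are placeholders): open each file in a context manager so the underlying netCDF handle is released as soon as the values are read.

import xarray as xr

flist = ["PLEV_199909.nc", "PLEV_199910.nc"]   # hypothetical files

pieces = []
for f in flist:
    with xr.open_dataset(f) as ds:       # the handle is closed on exit
        pieces.append(ds.load())         # read the values before closing

ds_all = xr.concat(pieces, dim="time")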

@joelfiddes
Collaborator Author

I think the issue is here:

def _open_dataset_climate(flist):

    ds__list = []
    for file in flist:
        ds__list.append(xr.open_dataset(file))

where all the ERA5 files are loaded and appended. In my case, with a large domain, this blows the memory. How can we make this scale?
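
One possible direction (a sketch only, not a tested fix): keep the same loop but pass chunks= so that xarray wraps the variables in dask arrays, which keeps the concatenation lazy instead of reading every file into memory.

import xarray as xr

def _open_dataset_climate_lazy(flist):           # hypothetical variant
    ds__list = [xr.open_dataset(file, chunks={"time": 720}) for file in flist]
    return xr.concat(ds__list, dim="time")       # dask-backed, still lazy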

@joelfiddes
Collaborator Author

The above is l.149 of topo_scale.py.

@joelfiddes
Collaborator Author

I think this is why it occurs whether or not you specify the time split option: in both cases def downscale_climate is the same and contains the function above.

Basically what you already said above, Simon, I think: #67 (comment)

@joelfiddes
Collaborator Author

@ArcticSnow, can we go back to using open_mfdataset(), or does that not work with the new split timeseries code?

@joelfiddes
Collaborator Author

joelfiddes commented May 9, 2023

New split timeseries code:

In [19]: flist = flist_PLEV

In [20]:     ds__list = []
    ...:     for file in flist:
    ...:          ds__list.append(xr.open_dataset(file))
    ...: 
    ...:     ds_ = xr.concat(ds__list, dim='time')

In [21]: ds_
Out[21]: 
<xarray.Dataset>
Dimensions:    (time: 1464, longitude: 24, latitude: 11, level: 8)
Coordinates:
  * time       (time) datetime64[ns] 1999-09-01 ... 1999-10-31T23:00:00
  * longitude  (longitude) float32 72.9 73.15 73.4 73.65 ... 78.15 78.4 78.65
  * latitude   (latitude) float32 42.55 42.3 42.05 41.8 ... 40.55 40.3 40.05
  * level      (level) float64 300.0 500.0 600.0 700.0 800.0 850.0 900.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 9.28e+04 ... 2.27e+03
    t          (time, level, latitude, longitude) float32 235.7 235.7 ... 286.5
    u          (time, level, latitude, longitude) float32 19.76 19.63 ... 0.6399
    v          (time, level, latitude, longitude) float32 11.27 11.9 ... -0.6594
    r          (time, level, latitude, longitude) float32 37.86 51.65 ... 63.77
    q          (time, level, latitude, longitude) float32 0.0001289 ... 0.003548
Attributes:
    CDI:          Climate Data Interface version 1.9.9rc1 (https://mpimet.mpg...
    Conventions:  CF-1.6
    history:      Thu Mar 09 22:37:48 2023: cdo sellonlatbox,72.6960777169421...
    CDO:          Climate Data Operators version 1.9.9rc1 (https://mpimet.mpg...

Original open_mfdataset:

In [24]: ds_plev = xr.open_mfdataset(project_directory + 'inputs/climate/PLEV*.nc', parallel=True)

In [25]: ds_plev
Out[25]: 
<xarray.Dataset>
Dimensions:    (time: 1464, longitude: 24, latitude: 11, level: 8)
Coordinates:
  * time       (time) datetime64[ns] 1999-09-01 ... 1999-10-31T23:00:00
  * longitude  (longitude) float32 72.9 73.15 73.4 73.65 ... 78.15 78.4 78.65
  * latitude   (latitude) float32 42.55 42.3 42.05 41.8 ... 40.55 40.3 40.05
  * level      (level) float64 300.0 500.0 600.0 700.0 800.0 850.0 900.0 1e+03
Data variables:
    z          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    t          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    u          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    r          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
    q          (time, level, latitude, longitude) float32 dask.array<chunksize=(720, 8, 11, 24), meta=np.ndarray>
Attributes:
    CDI:          Climate Data Interface version 1.9.9rc1 (https://mpimet.mpg...
    Conventions:  CF-1.6
    history:      Thu Mar 09 22:37:48 2023: cdo sellonlatbox,72.6960777169421...
    CDO:          Climate Data Operators version 1.9.9rc1 (https://mpimet.mpg...

The results seem to be identical, so a change back should work?
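
If it helps, the equivalence can also be checked directly rather than by eye (assuming ds_ and ds_plev from the snippets above):

import xarray as xr

xr.testing.assert_identical(ds_, ds_plev)   # names, attributes and values all match
# or, ignoring attributes and allowing floating-point tolerance:
xr.testing.assert_allclose(ds_, ds_plev)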

@joelfiddes
Collaborator Author

Another small point: why is the time subset done for SURF but not PLEV? Before, it was done for both:

[screenshot]
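
For clarity, the kind of temporal subset being discussed looks roughly like this (illustrative only, since the screenshot is not reproduced here; the SURF file pattern and the dates are assumptions):

import xarray as xr

start_date, end_date = "1999-09-01", "1999-10-31"    # hypothetical split window
ds_plev = xr.open_mfdataset("inputs/climate/PLEV*.nc").sel(time=slice(start_date, end_date))
ds_surf = xr.open_mfdataset("inputs/climate/SURF*.nc").sel(time=slice(start_date, end_date))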

@joelfiddes
Collaborator Author

Edits:

[screenshot]

and

[screenshot]

It seems to work so far.

@ArcticSnow
Owner

I had changed open_mfdataset() to the other method because open_mfdataset() did not work with the parallelizing system I ended up using. So we could have both options, then. Can you check whether, when using your edit, you can still spread the computational load over multiple cores?

Good catch on line 189. No idea why that happened; strange.
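
Conceptually, what I want to make sure survives the edit is the point jobs being mapped over a pool of workers, something like the pattern below (downscale_point and point_ids are placeholders, not the actual TopoPyScale code):

from multiprocessing import Pool

n_cores = 6                            # as set in the config file
point_ids = range(24)                  # hypothetical cluster/point ids

def downscale_point(pid):
    # the per-point downscaling would run here
    return pid

if __name__ == "__main__":
    with Pool(n_cores) as pool:
        results = pool.map(downscale_point, point_ids)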

joelfiddes added a commit that referenced this issue May 11, 2023
… blowout on large domains. Returned to xr.open_mfdataset, see issue:

#67 (comment)

other small changes:

1. pd.date_range deprecated arg "closed" changed to "inclusive"

2. deprecated np.int -> np.int32

3. added missing temporal subset l.187
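
The two deprecation fixes from this commit, shown in isolation with illustrative values (not the actual lines from the code base):

import numpy as np
import pandas as pd

# pandas renamed the "closed" argument of pd.date_range to "inclusive"
times = pd.date_range("1999-09-01", "1999-10-31", freq="1H", inclusive="left")

# the np.int alias is removed in recent NumPy; use an explicit width instead
n_levels = np.int32(8)
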
@ArcticSnow
Owner

OK, I tested with the edits above, and now it no longer parallelizes the downscaling. I'll include it again and add an option, then.

@joelfiddes
Collaborator Author

I'm getting the number of point jobs launched equal to the cores specified (6), and a shedload of processes launched during downscaling. Do you mean you just have a single process running?

[screenshot]

@ArcticSnow
Owner

Never mind, I had my config file set to one core. Sorry, it works great!

Should we close the topic then?
