Generating descriptive statistics composited over ENSO phase for very large ocean reanalysis data using xarray, dask, and flox. #17
Keeping in mind these helpful suggestions from @dcherian here: #16 (comment):

> Some general comments (applicable to latest flox/xarray):
> - What chunking [...]
> - Let's assume that the benefit of working from a re-loaded [...]. Given that the native [...]
> - The number of chunks, number of tasks, and size of chunks for this open with [...]. Next step: write "analysis ready data" (ARD) [...]
> - (1) I would just call [...]
Thanks heaps for all the advice above, @dcherian. Re: (4) We're blessed with a fair bit of /scratch space, and my current thinking for all these kinds of time-reduction operations is that it's always best to first convert the many [`netcdf` files into a single rechunked `zarr` collection]. It sounds like my next step is figuring out the best possible way to get that very large [`zarr` collection written].
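A minimal sketch of that conversion step, under this thread's assumptions (the path pattern, store name, and chunk sizes below are illustrative, not from the original discussion; for an 11TB rechunk a dedicated tool like `rechunker` may be needed to bound memory):

```python
import xarray as xr

# Lazily combine the ~9000 monthly netcdf files into one dataset.
ds = xr.open_mfdataset(
    "/path/to/BRAN2020/ocean_temp_*.nc",  # hypothetical path pattern
    combine="by_coords",
    parallel=True,
)

# Rechunk so all 11138 time steps share one chunk per spatial block
# (~4GB per chunk in float32 at these sizes; tune to your memory),
# then write the "analysis ready data" (ARD) zarr store to /scratch.
ard = ds.chunk({"Time": -1, "st_ocean": 1, "yt_ocean": 300, "xt_ocean": 300})
ard.to_zarr("/scratch/BRAN2020_temp_alltime.zarr", mode="w")
```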
Sorry, I meant [...]
Well yes, but then you have a different problem. But yeah, if you get this [rechunked collection written], then even medians and quantiles will work, and will work quite well.
Thanks again @dcherian - have been using [...]. So the approach sounds like: [...]

Lastly, by [...] I assume you are saying that a [rechunked array won't work well for this]?
Not at all. It will work very well indeed. Just saying that the need for a rechunked array isn't in your "Objective" listed above.
All-time base stats completed for the 11TB 3D @ daily variables. Cluster: MegaMem, ncpus=48, mem=2990GB. ("Base" stats doesn't include [`median` or `quantile`].)
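For reference, a minimal sketch of what such a "base" reduction could look like with flox-backed xarray groupby (the store path and the exact set of statistics are assumptions, not from the original comment):

```python
import xarray as xr  # flox accelerates these groupby reductions when installed

# Open the all-time-chunked ARD store (hypothetical name from the sketch above).
ds = xr.open_zarr("/scratch/BRAN2020_temp_alltime.zarr")

# Build the groupby once and reuse it for each reduction.
gb = ds["temp"].groupby("Time.month")

# "Base" statistics: reductions that don't require sorting the data.
monthly_base = xr.Dataset(
    {
        "mean": gb.mean("Time"),
        "std": gb.std("Time"),
        "min": gb.min("Time"),
        "max": gb.max("Time"),
    }
)
```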
With that chunking, quantile with flox will work quite well too. I recommend just adding [...]
Thanks for those recommendations! I have a 3D @ daily variable chunked "for all-time" in an ARD [`zarr` collection]. If I specify [...]
@dcherian - further, when I try to run (with [...]) I get this error: [...]

I print out the chunking right before I call the above: [...]
o_O nice find. But yes, you just need it installed and [...]
Another aside, it's nice to save the result of the groupby:

```python
gb = ds.groupby(time_dim + '.month')
quant = gb.quantile(...)
```

This will save some repeated effort in setting up the groupby problem.
Hmmm.. potentially a bug.
Oh right, this won't work. Quantiles work by sorting (really, partitioning the data) so you need all time points for each month in a single chunk. Since you have a time-rechunked dataset, use that as an input.
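A short sketch of that suggestion, combined with the groupby-reuse tip above (store path, variable name, and quantile levels are illustrative assumptions):

```python
import xarray as xr

# Open the time-rechunked store: every month's daily values sit in a
# single chunk along Time, which is what quantile's sort requires.
ds = xr.open_zarr("/scratch/BRAN2020_temp_alltime.zarr")

# Build the groupby once and reuse it for the sorting-based reductions.
gb = ds["temp"].groupby("Time.month")
monthly_median = gb.median("Time")
monthly_quantiles = gb.quantile([0.05, 0.5, 0.95], dim="Time")
```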
Nice tip, thank you. Always keen to shave off walltime.
Thanks for thinking this through. So, it looks like I need to fix the problems with my [...]. It has been really useful to learn through this discussion that [these groupby reductions can run even without the rechunked data]?
Yes, everything but quantiles. We could do an [...]. See https://github.com/xarray-contrib/flox/pull/284/files if you want to test that out.
What we learned here has been factored into new code.
Title: Generating descriptive statistics composited over ENSO phase for very large ocean reanalysis data using xarray, dask, and flox.
[The dataset is] `float32` data over nearly 9000 netcdf file assets in total. For this problem we'll focus on only the daily 3D `temp` variable. [The data lives on the] `Gadi` supercomputer (national science resources - not public), [accessed] using a custom beta catalog I wrote using `intake-esm` and `ecgtools` - https://github.com/Thomas-Moore-Creative/BRAN2020-intake-catalog.

The `temp` variable is stored in a `short` data type with native chunking like: `temp:_ChunkSizes = 1, 1, 300, 300`. Each `netcdf` file has one month of daily data, and so the length of `Time` varies from 28 to 31 for any one of the 366 files concatenated together. The compressed 5GB single `netcdf` file becomes 34GB when loaded as `float32`. [The full] `temp` variable xarray object is **Time**: 11138, **st_ocean**: 51, **yt_ocean**: 1500, **xt_ocean**: 3600 - 11.16TB in `float32`.
[The lazy-load chunks stay at the native size] unless `xarray_open_kwargs` are employed to increase that; however, I believe that `Time` chunks on lazy load can't be larger than the size of each individual `netcdf` file? With `xarray_open_kwargs` of `Time: 31` we'll get actual chunks like [31, 28, 31, 30, 31, 30, ...] and some years will be [..., 31, 29, 31, 30, ...].

Objective: generate descriptive statistics, including (`median` and `quantile`), as both monthly and daily climatologies from daily data organised as 366 x one-month-per-netcdf file.

Questions:

1. What `xarray_open_kwargs` do we use on the initial loading with `open_mfdataset`? (See the sketch after this list.)
2. What `ChunkSizes` [work best for] `groupby` operations via `flox`?
3. Would [writing an intermediate] `zarr` collection be useful? If so, what `Time` chunking to choose: all-time (`{'Time': -1}`) or a monthly frequency?
4. Which `flox` methods are best to choose?
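A sketch of the lazy load in question 1, under stated assumptions (the path pattern is hypothetical; the requested `Time: 31` chunk is capped at each file's month length, giving the ragged [31, 28, 31, 30, ...] chunks described above):

```python
import xarray as xr

ds = xr.open_mfdataset(
    "/path/to/BRAN2020/ocean_temp_*.nc",  # hypothetical path pattern
    combine="by_coords",
    parallel=True,
    # Requested chunking; per-file Time chunks can't exceed each
    # file's own length, so months yield 28-31 sized chunks.
    chunks={"Time": 31, "st_ocean": 51, "yt_ocean": 300, "xt_ocean": 300},
)
print(ds["temp"].chunksizes["Time"][:6])  # e.g. (31, 28, 31, 30, 31, 30)
```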