Advice on rechunking NetCDF files #393
NikosAlexandris asked this question in Q&A · Unanswered · 0 replies
What tools do you use for rechunking? What is your chunking strategy for your use case?
Context
Mixed chunking shapes, or let us say a non-uniform chunking layout, among files that belong to the same series of a larger data product is not that uncommon. It may well be the result of indifference: with ever-increasing computational power, many use cases don't mind moderate differences in the time required to consume a longer time series. It may also be the result of automated routines that decide on a (different) optimal chunking shape each time new or recent years of data are produced and added to an archive.
Example
Such is the case for some SARAH3 products. For example, SID data are 'chunked' per year as follows ¹:
1999 1300 x 1300
2000 1300 x 1300
2001 1300 x 1300
2002 1300 x 1300
2003 1300 x 1300
2004 1300 x 1300
2005 1300 x 1300
2006 2600 x 1
2007 1300 x 1300
2008 2600 x 1
2009 2600 x 1
2010 2600 x 1
2011 2600 x 1
2012 2600 x 1
2013 2600 x 1
2014 1300 x 1300
2015 2600 x 1
2016 2600 x 1
2017 2600 x 1
2018 2600 x 1
2019 2600 x 1
2020 2600 x 1
2021 2600 x 2600
2022 2600 x 2600
UPDATE: or using Xarray.
Kerchunk
Kerchunk requires a uniform chunking layout throughout the complete time series. So be it. The time it takes to rechunk data depends, I guess, on a number of factors. Nonetheless, here I focus on the obvious ones: the input and the target chunking shape of the data. For some combinations (considering the dimensions of the products in question, input time x lat x lon --> output time x lat x lon) it takes about as much time as it would take to copy the data. For others, more. For some combinations, however, it takes very long, e.g. running for a day and having covered only a small fragment of the input data. Such is the case when rechunking data from (lat, lon) 2600 x 1.
I've only successfully rechunked the original NetCDF files using nccopy. A simple example would be to rechunk from 1 x 2600 x 1 to 1 x 2600 x 2600, as sketched below.
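A minimal sketch of such a call, assuming the dimensions are named time, lat and lon and using placeholder file names (adapt both to the actual SARAH3 files):

```bash
# Rechunk from (time, lat, lon) chunks of 1 x 2600 x 1 to 1 x 2600 x 2600.
# -c takes dimension/chunksize pairs; file and dimension names are placeholders.
nccopy -c time/1,lat/2600,lon/2600 SIDin_2010_input.nc SIDin_2010_rechunked.nc
```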
Rechunking the same data to 48 x 650 x 650 instead is painfully slow and translates to higher energy consumption too. Another example is selective processing using GNU Parallel, from 1 x 1300 x 1300 to 48 x 650 x 650 for the good years, which works quite reasonably fast; a sketch of such a parallelised run follows.
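The following is a sketch only, assuming one NetCDF file per year, dimensions named time, lat and lon, and illustrative file name patterns and job count; adapt all of these to the actual data:

```bash
# Rechunk the years that are already chunked as 1 x 1300 x 1300
# to 48 x 650 x 650, running one nccopy process per yearly file.
mkdir -p rechunked
parallel --jobs 4 \
  nccopy -c time/48,lat/650,lon/650 {} rechunked/{/} \
  ::: SIDin*1999*.nc SIDin*2003*.nc SIDin*2007*.nc SIDin*2014*.nc
```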
Tools
I wonder what other tools are out there to do this? I know about rechunker, but it only works with the Zarr and TileDB formats. Two more ideas, which I did not manage to make work, would be 1) to use the createVariable function from the netcdf4-python library, and 2) to use xarray with Dask support.
Notes
For the sake of completeness: the page https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html discusses chunking in detail and mentions, among other things, its relevance to the physical block size. On the other hand, as noted in #153 (comment), modern storage is fast enough to render this not that important.
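For what it's worth, a quick way to check that physical block size on a GNU/Linux system might be something like the following sketch (the path is a placeholder):

```bash
# Report the preferred I/O block size of the filesystem holding the data
# (GNU coreutils stat; the path is a placeholder).
stat --file-system --format='block size: %s bytes' /path/to/sarah3/data
```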
Footnotes
¹ These are the block sizes ("Blocks") reported via gdalinfo.
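For reference, a sketch of such a check, with a placeholder file name and assuming the SID variable is exposed as a NETCDF subdataset:

```bash
# Print GDAL's view of the SID subdataset and keep the lines reporting
# the block (chunk) size; the file name is a placeholder.
gdalinfo NETCDF:"SIDin_2010.nc":SID | grep Block
```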