Advice on rechunking NetCDF files #393
NikosAlexandris asked this question in Q&A · Unanswered · 0 replies
What tools do you use for rechunking? What is your chunking strategy for your use case?
Context
Mixed chunking shapes, or let us say a non-uniform chunking layout, among files that belong to the same series of a larger data product is not that uncommon. It may well be the result of indifference: with ever-increasing computational power, many use cases don't mind moderate differences in the time required to consume a longer time series. It may also be the result of automated routines that decide on a (different) optimal chunking shape each time new or recent years of data are produced and added to an archive.
Example
Such is the case for some SARAH3 products. For example, SID data are 'chunked' per year as follows ¹:
1999 1300 x 1300
2000 1300 x 1300
2001 1300 x 1300
2002 1300 x 1300
2003 1300 x 1300
2004 1300 x 1300
2005 1300 x 1300
2006 2600 x 1
2007 1300 x 1300
2008 2600 x 1
2009 2600 x 1
2010 2600 x 1
2011 2600 x 1
2012 2600 x 1
2013 2600 x 1
2014 1300 x 1300
2015 2600 x 1
2016 2600 x 1
2017 2600 x 1
2018 2600 x 1
2019 2600 x 1
2020 2600 x 1
2021 2600 x 2600
2022 2600 x 2600
UPDATE: or using Xarray.
Kerchunk
Kerchunk requires a uniform chunking layout throughout the complete time series. So be it. The time it takes to rechunk data depends, I guess, on a number of factors. Nonetheless, here I focus on the obvious ones: the input and the target chunking shape of the data. For some combinations (considering the dimensions of the products in question, input time x lat x lon --> output time x lat x lon) it takes about as much time as it would take to copy the data. For others, more. For some combinations, however, it takes very long, e.g. running for a day and having covered only a small fragment of the input data. Such is the case when rechunking data from (lat, lon) 2600 x 1.
I've only successfully rechunked the original NetCDF files using nccopy. A simple example would be to rechunk from 1 x 2600 x 1 to 1 x 2600 x 2600, as sketched below.
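A minimal sketch of such a call, assuming the dimensions are named time, lat and lon and using placeholder file names (adapt both to the actual SARAH3 files):

```bash
# Rechunk from (time, lat, lon) chunks of 1 x 2600 x 1 to 1 x 2600 x 2600.
# -c takes dimension/chunksize pairs; file and dimension names are placeholders.
nccopy -c time/1,lat/2600,lon/2600 SIDin_2010_input.nc SIDin_2010_rechunked.nc
```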
Rechunking the same data to 48 x 650 x 650 instead is painfully slow and translates to higher energy consumption too. Another example is selective processing using GNU Parallel, from 1 x 1300 x 1300 to 48 x 650 x 650 for the good years, which works quite reasonably fast; a sketch of such a parallelised run follows.
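The following is a sketch only, assuming one NetCDF file per year, dimensions named time, lat and lon, and illustrative file name patterns and job count; adapt all of these to the actual data:

```bash
# Rechunk the years that are already chunked as 1 x 1300 x 1300
# to 48 x 650 x 650, running one nccopy process per yearly file.
mkdir -p rechunked
parallel --jobs 4 \
  nccopy -c time/48,lat/650,lon/650 {} rechunked/{/} \
  ::: SIDin*1999*.nc SIDin*2003*.nc SIDin*2007*.nc SIDin*2014*.nc
```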
Tools
I wonder what other tools are out there to do this? I know about rechunker, but it only works with the Zarr and TileDB formats. Two more ideas, which I did not manage to make work, would be 1) to use the createVariable function from the netcdf4-python library, and 2) to use xarray with Dask support.
Notes
For the sake of completeness: the page https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html discusses chunking in detail and mentions, among other things, its relevance to the physical block size. On the other hand, as noted in #153 (comment), modern storage is fast enough to render this not that important.
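For what it's worth, a quick way to check that physical block size on a GNU/Linux system might be something like the following sketch (the path is a placeholder):

```bash
# Report the preferred I/O block size of the filesystem holding the data
# (GNU coreutils stat; the path is a placeholder).
stat --file-system --format='block size: %s bytes' /path/to/sarah3/data
```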
Footnotes
¹ These are the block sizes ("Blocks") reported via gdalinfo.
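For reference, a sketch of such a check, with a placeholder file name and assuming the SID variable is exposed as a NETCDF subdataset:

```bash
# Print GDAL's view of the SID subdataset and keep the lines reporting
# the block (chunk) size; the file name is a placeholder.
gdalinfo NETCDF:"SIDin_2010.nc":SID | grep Block
```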