Why did you choose the Zarr format? #1

skaae · 2024-01-21T16:28:32Z

Hi,

I'm writing a data loader for GRIB weather data and found this project while browsing GitHub. I'm currently considering which format to use in the data loader, and I hope you have time to explain some of your design choices. I need to download data in GRIB format from either ECMWF, GFS or HRRR and store it in a format that can be used for ML.

For the dataloader I experimented with storing the data as:

- uncompressed npy arrays (very fast to load, but very big: 10x the size of GRIB)
- compressed npy arrays (~4x bigger than GRIB but faster to load)
- Zarr format (very big files)
- GRIB files split into a single file per field (highly compressed, but requires a lot of CPUs to decompress)

Zarr is easy to load and compatible with Xarray, but it was also much bigger than the original GRIB files.
Currently I'm leaning towards storing the data as individual GRIB files, one per field, because that requires the least disk space. Maybe you could shed some light on why you chose the Zarr format?
Is it because it's fast to load, or is it to stay compatible with WeatherBench2?
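
For anyone wanting to reproduce the npy part of the size comparison in the list above, here is a minimal sketch. It uses a synthetic field in place of decoded GRIB data; the grid shape, the helper name `stored_sizes`, and the file names are assumptions made for illustration, and real fields will compress differently.

```python
import os
import tempfile

import numpy as np


def stored_sizes(field: np.ndarray) -> dict:
    """Return on-disk sizes in bytes for an uncompressed .npy and a compressed .npz copy."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, "field.npy")
        packed_path = os.path.join(tmp, "field.npz")
        np.save(raw_path, field)                       # uncompressed
        np.savez_compressed(packed_path, field=field)  # zip + deflate
        return {
            "npy": os.path.getsize(raw_path),
            "npz": os.path.getsize(packed_path),
        }


# Synthetic smooth field standing in for one decoded GRIB field (hypothetical 1-degree grid).
lat = np.deg2rad(np.linspace(-90, 90, 181))
lon = np.deg2rad(np.linspace(0, 359, 360))
field = (np.sin(lat)[:, None] * np.cos(lon)[None, :]).astype(np.float32)

sizes = stored_sizes(field)
print(sizes)
```

Comparing these numbers against the size of the source GRIB message gives the ratios quoted in the list for a given dataset.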

b8raoult · 2024-01-26T10:09:39Z

Because that format fits our need to run 100 epochs over multi-terabyte datasets for training a weather forecasting model. Each chunk is one date/time with all the variables. Our datasets range between 7TB and 70TB.
