I'm writing a data loader for GRIB weather data and found this project while browsing GitHub. I'm currently considering what format to use in the dataloader. I hope you have time to explain some of your design choices. I need to download data in GRIB format from either ECMWF, GFS or HRRR and store it in a format that can be used for ML.
For the dataloader I experimented with storing the data as:
uncompressed npy arrays (very fast to load, but very big: 10x the size of GRIB)
compressed npy arrays (~4x bigger than GRIB but faster to load)
Zarr format (very big files)
GRIB files split into a single file per field (highly compressed, but requires a lot of CPUs to decompress)
Zarr is easy to load and compatible with Xarray, but it was also way bigger than the original GRIB files.
Currently I'm leaning towards storing the data as individual GRIB files, one per field, because that requires the least disk space. Maybe you could shed some light on why you chose the Zarr format?
Is it because it's fast to load, or is it to stay compatible with WeatherBench2?
Because that format fits our need to run 100 epochs over multi-terabyte datasets when training a weather forecasting model. Each chunk is one date/time with all the variables. Our datasets range between 7 TB and 70 TB.
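To make the chunking point concrete, here is a rough back-of-the-envelope sketch (not the project's actual code) comparing how many chunks a loader has to open to build one training sample under two layouts: one chunk per date/time holding all variables, versus one chunk (or file) per field. The grid size (721 x 1440, a 0.25-degree global grid), the count of 80 fields, the float32 dtype and the helper names are all assumptions chosen for illustration.

```python
from math import ceil, prod


def chunk_nbytes(chunk_shape, itemsize=4):
    """Uncompressed size in bytes of one chunk (itemsize=4 assumes float32)."""
    return prod(chunk_shape) * itemsize


def chunks_touched(read_shape, chunk_shape):
    """Number of chunks overlapped by a read that starts on a chunk boundary."""
    return prod(ceil(r / c) for r, c in zip(read_shape, chunk_shape))


# Hypothetical dataset dimensions: (time, variable, lat, lon).
N_VARS, N_LAT, N_LON = 80, 721, 1440

# One training sample = one date/time with every variable.
sample = (1, N_VARS, N_LAT, N_LON)

# Layout A: one chunk per date/time, all variables together.
per_time = (1, N_VARS, N_LAT, N_LON)
# Layout B: one chunk per field, similar to one GRIB file per field.
per_field = (1, 1, N_LAT, N_LON)

reads_a = chunks_touched(sample, per_time)
reads_b = chunks_touched(sample, per_field)
size_a = chunk_nbytes(per_time)

print(f"per-time layout:  {reads_a} chunk read per sample")
print(f"per-field layout: {reads_b} chunk reads per sample")
print(f"per-time chunk size (uncompressed): {size_a / 2**20:.1f} MiB")
```

Under these assumptions, the per-time layout needs a single read per sample while the per-field layout needs one read and one decompression per field. That difference is repeated for every sample in every epoch, which is where the trade-off against disk space shows up.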