
Why did you choose the Zarr format? #1

Open
skaae opened this issue Jan 21, 2024 · 1 comment
skaae commented Jan 21, 2024

Hi,

I'm writing a data loader for GRIB weather data and found this project while browsing GitHub. I'm currently considering which format to use in the data loader, and I hope you have time to explain some of your design choices. I need to download data in GRIB format from ECMWF, GFS, or HRRR and store it in a format that can be used for ML.

For the dataloader I experimented with storing the data as:

  • uncompressed npy arrays (very fast to load, but very big: 10x the size of GRIB)
  • compressed npy arrays (~4x bigger than GRIB but faster to load)
  • Zarr format (very big files)
  • GRIB files split into a single file per field (highly compressed, but requires a lot of CPUs to decompress)

Zarr is easy to load and compatible with Xarray, but it was also much bigger than the original GRIB files.
Currently I'm leaning towards storing the data as individual GRIB files, one per field, because that requires the least disk space. Maybe you could shed some light on why you chose the Zarr format?
Is it because it's fast to load, or is it to stay compatible with WeatherBench2?

@b8raoult
Contributor

Because that format fits our need to run 100 epochs over multi-terabyte datasets when training a weather forecasting model. Each chunk is one date/time with all the variables. Our datasets range between 7TB and 70TB.
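
To illustrate why a chunk per date/time suits this access pattern, here is a minimal sketch in plain Python. It is not taken from the project: the axis order (time, variable, latitude, longitude), the grid size, the variable count, and the alternative "per_variable" layout are all assumptions chosen for illustration. It counts how many chunks a single training sample touches and how many uncompressed bytes would be fetched if every touched chunk had to be read whole.

```python
from math import prod


def chunks_touched(chunks, selection):
    """Count the chunks that a box-shaped selection overlaps.

    chunks: chunk length along each axis.
    selection: one (start, stop) pair per axis, stop exclusive.
    """
    count = 1
    for (start, stop), size in zip(selection, chunks):
        count *= (stop - 1) // size - start // size + 1
    return count


def bytes_read(chunks, selection, itemsize=4):
    """Uncompressed bytes fetched if every touched chunk is read whole."""
    return chunks_touched(chunks, selection) * prod(chunks) * itemsize


# Hypothetical dimensions, not the project's actual ones.
N_VARS, N_LAT, N_LON = 80, 721, 1440

# One training sample: every variable at a single date/time (index 42).
sample = [(42, 43), (0, N_VARS), (0, N_LAT), (0, N_LON)]

per_time = (1, N_VARS, N_LAT, N_LON)    # one chunk per date/time, all variables
per_variable = (100, 1, N_LAT, N_LON)   # 100 time steps of a single variable

for name, layout in [("per_time", per_time), ("per_variable", per_variable)]:
    print(name, chunks_touched(layout, sample), bytes_read(layout, sample))
```

With these assumed numbers, the per-date/time layout serves a sample with a single chunk read, while the per-variable layout touches one chunk per variable and pulls in time steps the sample does not need. Over 100 epochs that difference in read volume adds up, even if the files themselves are larger than GRIB.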
