GFS GRIBs in S3: scan_grib and fast referencing using the index file, streaming with a Dask cluster on Coiled #530
-
cc @emfdavid
-
So glad you found this work helpful!
-
Can you share more about your work and what you are doing with the data?
-
This is related to using Ensemble Prediction System (EPS) data (GEFS and ECMWF) for impact-based forecasting and anticipatory action, especially for disasters such as floods and droughts. The data is being used to train ML models for EPS post-processing and bias correction. @emfdavid Thanks for the heads-up on testing. Planning more tests on ensemble members, on the long-term parquet references, and on data integrity, hence storing the references on Hugging Face.
-
Thanks to kerchunk's scan_grib and the dynamic GRIB chunking method for fast referencing, I was able to build GFS GRIB references into a parquet file for a single 00z run.
Because of the duplicated-references issue (#407), a list of variables and their levels is passed when building the references, which appears to avoid the problem during referencing and zarr creation.
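A minimal sketch of the filtered scan, assuming kerchunk's `scan_grib` with its `filter` argument; the GFS object key and the level filter below are illustrative, not the exact selection used in the scripts:

```python
# Minimal sketch: scan one GFS GRIB file and keep only messages matching a
# variable/level filter, which is one way to keep duplicated messages (#407)
# out of the reference set. The object key and filter values are illustrative.
from kerchunk.grib2 import scan_grib

url = "s3://noaa-gfs-bdp-pds/gfs.20240101/00/atmos/gfs.t00z.pgrb2.0p25.f000"

refs_per_message = scan_grib(
    url,
    storage_options={"anon": True},  # the NOAA GFS bucket is public
    filter={"typeOfLevel": "heightAboveGround", "level": 2},  # e.g. 2 m fields only
)
# scan_grib returns one reference dict per matching GRIB message.
print(len(refs_per_message))
```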
The file run_day_gfs_gik.py creates the parquet file using custom functions from utils.py. A pre-made scan_grib reference for each time step is stored as parquet and reused during the process, which lets the referencing for the next 120 hours complete in about 2 minutes in a Coiled notebook running in the same region as the GFS S3 bucket.
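Roughly, combining per-step references into a single parquet reference store might look like the sketch below; it re-scans a few forecast hours so it is self-contained (the actual workflow reloads the pre-made scan_grib parquet instead), and the concat dimension and output path are assumptions:

```python
# Sketch only: build per-step references, combine them along the forecast time
# axis, and persist the result as a parquet reference store.
from kerchunk.combine import MultiZarrToZarr
from kerchunk.df import refs_to_dataframe
from kerchunk.grib2 import scan_grib

per_step_refs = [
    scan_grib(
        f"s3://noaa-gfs-bdp-pds/gfs.20240101/00/atmos/gfs.t00z.pgrb2.0p25.f{h:03d}",
        storage_options={"anon": True},
        filter={"typeOfLevel": "heightAboveGround", "level": 2},
    )[0]  # one matching GRIB message per file, for brevity
    for h in (0, 3, 6)
]

mzz = MultiZarrToZarr(
    per_step_refs,
    concat_dims=["valid_time"],                 # assumed forecast-step dimension
    identical_dims=["latitude", "longitude"],   # shared grid coordinates
    remote_protocol="s3",
    remote_options={"anon": True},
)
combined = mzz.translate()

# Write the combined references as a parquet store rather than one large JSON file.
refs_to_dataframe(combined, "gfs_00z_refs.parq")
```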
The file run_day_stream_gfs_gik_to_zarr.py reads the created parquet references into zarr and streams 15 variables, nearly 15 GB in size, in under 5 minutes.
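The streaming side could be sketched like this (not run_day_stream_gfs_gik_to_zarr.py itself; the cluster size, region, and output bucket are made up):

```python
# Sketch: open the parquet reference store lazily with xarray's zarr engine and
# write it out to zarr on a Coiled Dask cluster. Names and sizes are illustrative.
import coiled
import xarray as xr

cluster = coiled.Cluster(n_workers=10, region="us-east-1")  # same region as the GFS bucket
client = cluster.get_client()

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "gfs_00z_refs.parq",       # parquet reference store from the previous step
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
    chunks={},                               # keep the variables dask-backed / lazy
)

ds.to_zarr("s3://my-output-bucket/gfs_00z.zarr", mode="w")  # hypothetical destination
```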
The parquet reference files for the 15 variables are stored at https://huggingface.co/datasets/Nishadhka/gfs_s3_gik_refs/tree/main
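To pull those reference files locally, a small sketch with huggingface_hub (the repo id comes from the link above; the downloaded layout is assumed to match the parquet stores described earlier):

```python
# Sketch: download the published reference dataset from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Nishadhka/gfs_s3_gik_refs", repo_type="dataset")
# local_dir now holds the per-variable parquet reference stores, which can be passed
# as `fo=` to the reference filesystem as in the streaming sketch above.
print(local_dir)
```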