GFS GRIBs in S3: scan_grib and fast referencing using the index file, streaming with a Dask cluster on Coiled #530
-
cc @emfdavid
-
So glad you found this work helpful!
-
Can you share more about your work and what you are doing with the data?
-
This is related to using Ensemble Prediction System (EPS) data (GEFS and ECMWF) for impact-based forecasting and anticipatory action, especially for disasters such as floods and droughts. The data is being used to train ML models for EPS post-processing and bias correction. @emfdavid Thanks for the heads-up on testing. Planning more tests on ensemble members, on the long-term parquet references, and on data integrity, hence storing the references on Hugging Face.
-
Thanks to kerchunk's scan_grib and the dynamic GRIB chunking method for fast referencing, I was able to build GFS GRIB references into a parquet file for a single 00z run.
Because of the duplicated-references issue (#407), a list of variables and their levels is passed when building the references, which appears to avoid the problem during referencing and zarr creation.
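A minimal sketch of the filtered scan, assuming kerchunk's `scan_grib` with its `filter` argument; the GFS object key and the level filter below are illustrative, not the exact selection used in the scripts:

```python
# Minimal sketch: scan one GFS GRIB file and keep only messages matching a
# variable/level filter, which is one way to keep duplicated messages (#407)
# out of the reference set. The object key and filter values are illustrative.
from kerchunk.grib2 import scan_grib

url = "s3://noaa-gfs-bdp-pds/gfs.20240101/00/atmos/gfs.t00z.pgrb2.0p25.f000"

refs_per_message = scan_grib(
    url,
    storage_options={"anon": True},  # the NOAA GFS bucket is public
    filter={"typeOfLevel": "heightAboveGround", "level": 2},  # e.g. 2 m fields only
)
# scan_grib returns one reference dict per matching GRIB message.
print(len(refs_per_message))
```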
The file run_day_gfs_gik.py creates the parquet file using custom functions from utils.py. A pre-made scan_grib reference for each time step is stored as parquet and reused during the process, which lets the referencing for the next 120 hours complete in about 2 minutes in a Coiled notebook running in the same region as the GFS S3 bucket.
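Roughly, combining per-step references into a single parquet reference store might look like the sketch below; it re-scans a few forecast hours so it is self-contained (the actual workflow reloads the pre-made scan_grib parquet instead), and the concat dimension and output path are assumptions:

```python
# Sketch only: build per-step references, combine them along the forecast time
# axis, and persist the result as a parquet reference store.
from kerchunk.combine import MultiZarrToZarr
from kerchunk.df import refs_to_dataframe
from kerchunk.grib2 import scan_grib

per_step_refs = [
    scan_grib(
        f"s3://noaa-gfs-bdp-pds/gfs.20240101/00/atmos/gfs.t00z.pgrb2.0p25.f{h:03d}",
        storage_options={"anon": True},
        filter={"typeOfLevel": "heightAboveGround", "level": 2},
    )[0]  # one matching GRIB message per file, for brevity
    for h in (0, 3, 6)
]

mzz = MultiZarrToZarr(
    per_step_refs,
    concat_dims=["valid_time"],                 # assumed forecast-step dimension
    identical_dims=["latitude", "longitude"],   # shared grid coordinates
    remote_protocol="s3",
    remote_options={"anon": True},
)
combined = mzz.translate()

# Write the combined references as a parquet store rather than one large JSON file.
refs_to_dataframe(combined, "gfs_00z_refs.parq")
```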
The file run_day_stream_gfs_gik_to_zarr.py reads the created parquet references into zarr and streams 15 variables, nearly 15 GB in size, in under 5 minutes.
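The streaming side could be sketched like this (not run_day_stream_gfs_gik_to_zarr.py itself; the cluster size, region, and output bucket are made up):

```python
# Sketch: open the parquet reference store lazily with xarray's zarr engine and
# write it out to zarr on a Coiled Dask cluster. Names and sizes are illustrative.
import coiled
import xarray as xr

cluster = coiled.Cluster(n_workers=10, region="us-east-1")  # same region as the GFS bucket
client = cluster.get_client()

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "gfs_00z_refs.parq",       # parquet reference store from the previous step
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
    chunks={},                               # keep the variables dask-backed / lazy
)

ds.to_zarr("s3://my-output-bucket/gfs_00z.zarr", mode="w")  # hypothetical destination
```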
The parquet reference files for the 15 variables are stored at https://huggingface.co/datasets/Nishadhka/gfs_s3_gik_refs/tree/main
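To pull those reference files locally, a small sketch with huggingface_hub (the repo id comes from the link above; the downloaded layout is assumed to match the parquet stores described earlier):

```python
# Sketch: download the published reference dataset from Hugging Face.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Nishadhka/gfs_s3_gik_refs", repo_type="dataset")
# local_dir now holds the per-variable parquet reference stores, which can be passed
# as `fo=` to the reference filesystem as in the streaming sketch above.
print(local_dir)
```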