[New Implementation] Dramatically speed up dataset creation by caching geographic coordinates #341

meridionaljet · 2023-06-01T06:47:23Z

This is an updated implementation of #338, addressing a massive performance bottleneck when opening a GRIB file as an xarray dataset. Currently, cfgrib calls cfgrib.dataset.build_geography_coordinates() for every parameter in the index when creating a dataset. Each call invokes ecCodes' grib_get_array, which reads the coordinate arrays from disk. This is prohibitively expensive for large files with many records, and almost always unnecessary, since the records in a GRIB file typically share an identical grid.

This pull request introduces automatic caching of geographic coordinate data, enabled by default when calling cfgrib.open_dataset() or cfgrib.open_datasets(). The caching logic is embedded in cfgrib.dataset.build_variable_components() and is keyed on the MD5 sum of the Grid Definition Section of the GRIB file (thanks @iainrussell for that suggestion).
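
The caching idea can be illustrated with a minimal sketch. This is not the code from the PR: `_GEO_COORD_CACHE`, `get_geo_coords` and the `compute` callback are hypothetical names, and a plain string stands in for the MD5 value of a message's grid section. The key also includes the `encode_cf` setting, since one of the commits in this PR notes that the cache key must depend on it.

```python
from typing import Any, Callable, Dict, Hashable, Sequence, Tuple

# Hypothetical module-level cache: grid identity -> computed coordinates.
_GEO_COORD_CACHE: Dict[Tuple[Hashable, ...], Any] = {}


def get_geo_coords(
    grid_md5: str,
    encode_cf: Sequence[str],
    compute: Callable[[], Any],
    cache_geo_coords: bool = True,
) -> Any:
    """Return coordinates for a grid, computing them at most once per key.

    `compute` stands in for the expensive step (reading coordinate arrays
    from disk); it is only called on a cache miss or when caching is off.
    """
    if not cache_geo_coords:
        return compute()
    key = (grid_md5, tuple(encode_cf))
    if key not in _GEO_COORD_CACHE:
        _GEO_COORD_CACHE[key] = compute()
    return _GEO_COORD_CACHE[key]


if __name__ == "__main__":
    calls = []

    def expensive_read() -> Dict[str, list]:
        calls.append(1)  # count how often the expensive step runs
        return {"latitude": [10.0, 20.0], "longitude": [0.0, 5.0]}

    # 100 records sharing one grid trigger a single expensive read.
    for _ in range(100):
        get_geo_coords("same-grid-md5", ("geography",), expensive_read)
    print(len(calls))
```

Because the cached object is shared between callers, a real implementation would need to make sure consumers do not mutate it in place.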

This approach reduces the cfgrib.open_dataset() time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened with cfgrib.open_datasets(), the time drops from 38 seconds to 2 seconds. That is a speedup of one to two orders of magnitude, depending on the size of the file and the number of unique hypercubes.
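
For readers who want to check timings like these on their own files, a small timing helper is sketched below. `time_open` is a hypothetical helper, not part of cfgrib; it accepts any opener callable, so it can wrap cfgrib.open_dataset or cfgrib.open_datasets. The file name in the commented usage is a placeholder, and results will vary with the file, the disk and the machine.

```python
import time
from typing import Any, Callable, List


def time_open(opener: Callable[[str], Any], path: str, repeats: int = 3) -> List[float]:
    """Return the wall-clock seconds taken by each of `repeats` calls to opener(path)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        opener(path)
        durations.append(time.perf_counter() - start)
    return durations


# Usage sketch (requires cfgrib and a GRIB file; "example.grib2" is a placeholder):
#   import cfgrib
#   print(time_open(cfgrib.open_dataset, "example.grib2"))
```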

The only negative side effect I can see is a small one: the cache is global, so in theory it can grow without bound in a long-lived application in which cfgrib opens many different grid geometries. I have therefore included a way for the user to opt out of coordinate caching by passing cache_geo_coords=False in backend_kwargs. In practice this should rarely be needed, since the total data size would cause memory issues for a typical user long before the coordinate cache would, and most workflows read a small number of unique grid geometries.
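
Opting out might look like the following sketch. The `cache_geo_coords` key is the option described above; `open_kwargs` and `open_without_geo_cache` are hypothetical helper names, and the second function assumes that xarray and a cfgrib version containing this change are installed.

```python
from typing import Any, Dict


def open_kwargs(cache_geo_coords: bool = True) -> Dict[str, Any]:
    """Build keyword arguments for xarray.open_dataset with the cfgrib engine."""
    return {
        "engine": "cfgrib",
        "backend_kwargs": {"cache_geo_coords": cache_geo_coords},
    }


def open_without_geo_cache(path: str):
    """Open a GRIB file without using the global coordinate cache."""
    import xarray as xr  # imported lazily so open_kwargs has no dependencies

    return xr.open_dataset(path, **open_kwargs(cache_geo_coords=False))
```

This is mainly of interest for long-lived processes that open many different grid geometries; for typical scripts the default is expected to be the better choice.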

The speedup removes a significant bottleneck in data processing workflows that use xarray and cfgrib, especially for large files, and makes xarray dataset creation for GRIB almost as cheap as it is for other data formats such as NetCDF and Zarr.

meridionaljet · 2023-06-01T07:01:30Z

Fixed the failing code format check

codecov-commenter · 2023-06-01T08:38:51Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03 🎉

Comparison is base (2b2e190) 95.62% compared to head (001f003) 95.65%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   95.62%   95.65%   +0.03%     
==========================================
  Files          26       26              
  Lines        2056     2073      +17     
  Branches      236      238       +2     
==========================================
+ Hits         1966     1983      +17     
  Misses         59       59              
  Partials       31       31

| Impacted Files | Coverage Δ | |
|---|---|---|
| cfgrib/xarray_plugin.py | `88.40% <ø> (ø)` | |
| cfgrib/dataset.py | `98.45% <100.00%> (+0.05%)` | ⬆️ |
| tests/test_40_xarray_store.py | `100.00% <100.00%> (ø)` | |

iainrussell · 2023-06-01T08:50:45Z

Many thanks @meridionaljet, this is a really nice improvement. I just added a couple of comments above, then I think we're close to merging it in!

meridionaljet · 2023-06-01T15:28:10Z

Requested tweaks by @iainrussell have been implemented

iainrussell · 2023-06-02T08:13:15Z

Thank you @meridionaljet! I really like this improvement, and the fact that you added documentation, a test, and a way to disable it in case it is used as part of a long-running server. Thanks also for being patient with my suggestions and taking them on board. I think this solution is nice because it works 'out of the box' and does not carry the risk of corrupted xarrays if the incoming GRIB file has multiple geometries. Thanks again!

martindurant · 2023-06-05T17:57:29Z

For kerchunk's use, we would really most like to simply not calculate coordinates at all, as we can store them elsewhere. If it were possible, then, to skip the bytes that define the geometry and go straight to the actual measurements in a given message, all the better. Do you think this is possible?

iainrussell · 2023-06-13T15:27:37Z

Hi @martindurant, could you create a new issue for this use case please? It would be good to see an example of a GRIB file and how you would like the resulting xarray to look. It's not clear if you want to remove all the coordinates, including the time and vertical dimensions, and if this is for performance, memory or aesthetics. So if it really would be useful, pop it in another issue and we can discuss there!
Cheers,
Iain

TAdeJong · 2024-01-24T10:40:06Z

Edit: cfgrib 0.9.11.0 incorporating these changes has now been released! 😀

This pull request greatly increases the speed of our workflow. However, installing from source is somewhat of a hassle.
@iainrussell, I see you have recently been working on this repository again. Are there plans for a new release soon? It would greatly help us, and I am sure a lot of other people using GRIB files and xarray 😄.

~~(I couldn't really think of another place to ask this, so I hope this way is OK.)~~

meridionaljet added 10 commits April 14, 2023 14:41

allow precomputed geographic coordinate data to be passed to the backend when opening a dataset

4e1f778

add test for using precomputed geographic coordinate data when creating a dataset

4a45b95

reformat

f2948d4

move precomputed geo coords test to proper file

01ffc96

add documentation section for optimizing dataset creation with precomputed geographical coordinates

1ce00c1

Merge branch 'master' into precompute-geocoords

d65ccd7

cache geo coords by default; add cache_geo_coords option to backend kwargs

315d19e

BUGFIX: geometry cache key must depend on encode_cf

aaa1ff0

add test for cached grid geometry

70521c6

update documentation for opting out of coordinate caching

1e20f58

meridionaljet mentioned this pull request Jun 1, 2023

Dramatically speed up dataset creation by pre-computing geographic coordinates #338

Merged

tlmquintino requested review from iainrussell and sandorkertesz June 1, 2023 06:51

black formatting

93051e3

iainrussell requested changes Jun 1, 2023

cfgrib/dataset.py (outdated, resolved)

tests/test_40_xarray_store.py (outdated, resolved)

meridionaljet added 2 commits June 1, 2023 05:22

BUGFIX: caching default is True, so test against False

cce708d

use "md5GridSection" generic key to avoid reading GRIB edition

001f003

iainrussell merged commit cccbdb7 into ecmwf:master Jun 2, 2023

martindurant mentioned this pull request Jun 14, 2023

Only read payload buffer #343

Open
