
Dramatically speed up dataset creation by pre-computing geographic coordinates #338

Merged
merged 6 commits into ecmwf:master
Jun 2, 2023

Conversation

@meridionaljet (Contributor) commented Apr 15, 2023

This addresses a massive performance bottleneck when opening a GRIB file as an xarray dataset. Currently, cfgrib calls cfgrib.dataset.build_geography_coordinates() for every parameter in the index when creating a dataset. Each call invokes eccodes' grib_get_array, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and often unnecessary, since such files typically have identical grids for every record.

This pull request introduces the option for the user to include pre-computed geographic coordinate data in backend_kwargs when calling cfgrib.open_dataset() or cfgrib.open_datasets() in cases where the user knows that their GRIB file has identical geometry for each parameter. The utility function cfgrib.dataset.get_first_geo_coords() has been added, which computes coordinates for the first record in the index. The user can then use the output as follows:

geocoords = cfgrib.dataset.get_first_geo_coords(filename)
ds = cfgrib.open_dataset(filename, backend_kwargs=dict(precomputed_geo_coords=geocoords))

geocoords could be serialized and reused for multiple files with identical geometry, requiring no further calls to build_geography_coordinates() after the first computation.
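
For example, the coordinates could be cached to disk with pickle (a sketch; it assumes the geocoords object is picklable, and other_filename is an illustrative path to a file on the same grid):

    import pickle

    import cfgrib

    # Save the coordinates computed above for later reuse:
    with open("geocoords.pkl", "wb") as f:
        pickle.dump(geocoords, f)

    # Reopen another file known to share the same grid without recomputing:
    with open("geocoords.pkl", "rb") as f:
        cached = pickle.load(f)
    other_filename = "another_file_on_the_same_grid.grib2"
    ds = cfgrib.open_dataset(other_filename, backend_kwargs=dict(precomputed_geo_coords=cached))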

This approach reduces the cfgrib.open_dataset() time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened with cfgrib.open_datasets(), the time taken drops from 38 seconds to 2 seconds. Overall, this is a speedup of one to two orders of magnitude, depending on the size of the file and the number of unique hypercubes.

A basic test has been added to the test suite as part of this pull request.

I have not contributed to cfgrib before, but as far as I can tell, adding this optional feature cannot break anything unless the user applies it to a file for which it is inappropriate, and such files are uncommon. The speedup removes a significant bottleneck in data-processing workflows that use xarray and cfgrib, especially for large files, making xarray dataset creation for GRIB almost as cheap as it is for other data formats like NetCDF and Zarr.

@FussyDuck commented Apr 15, 2023

CLA assistant check: all committers have signed the CLA.

@sandorkertesz (Contributor)

Thank you for your excellent work; it is quite an impressive improvement.

However, as you can see, not all of the tests have passed. I suggest you move your test into test_40_xarray_store.py, which already deals with the backend_kwargs options; then you do not need:

from cfgrib import open_dataset

and can simply use

xarray_store.open_dataset(...)

to open your data.
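
Applied to this PR, the relocated test might look roughly like this (a sketch: the test name and assertion are illustrative, and TEST_DATA stands in for one of the suite's sample GRIB files):

    from cfgrib import dataset, xarray_store

    TEST_DATA = "era5-levels-members.grib"  # illustrative sample file

    def test_open_dataset_precomputed_geo_coords() -> None:
        # Pre-compute the coordinates once, then pass them via backend_kwargs.
        geocoords = dataset.get_first_geo_coords(TEST_DATA)
        res = xarray_store.open_dataset(
            TEST_DATA, backend_kwargs={"precomputed_geo_coords": geocoords}
        )
        assert "latitude" in res.coords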

@codecov-commenter commented Apr 19, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03% 🎉

Comparison is base (3951355) 95.44% compared to head (1ce00c1) 95.47%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #338      +/-   ##
==========================================
+ Coverage   95.44%   95.47%   +0.03%     
==========================================
  Files          26       26              
  Lines        2040     2057      +17     
  Branches      236      237       +1     
==========================================
+ Hits         1947     1964      +17     
  Misses         62       62              
  Partials       31       31              
Impacted Files                   Coverage Δ
cfgrib/xarray_plugin.py          88.40% <ø> (ø)
cfgrib/dataset.py                98.45% <100.00%> (+0.05%) ⬆️
tests/test_40_xarray_store.py    100.00% <100.00%> (ø)



@sandorkertesz sandorkertesz self-requested a review April 19, 2023 09:38
@meridionaljet (Contributor, Author)

Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.

@tlmquintino (Member)

> Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.

Hi @meridionaljet, have the concerns from @sandorkertesz been addressed?
I see that there is currently a conflict. Can you please fix it?
This PR may need to be merged into the develop branch first, but I leave that for @sandorkertesz to decide.

@meridionaljet (Contributor, Author)

> Hi @meridionaljet, have the concerns from @sandorkertesz been addressed? I see that there is currently a conflict. Can you please fix it? [...]

Yes, the commits on April 18th addressed what @sandorkertesz had said. I just fixed the new conflict due to an update to master - should be good to go now unless you guys want to adjust the implementation.

@iainrussell (Member)

Hi @meridionaljet,

Thank you for your contribution, it clearly makes a big difference in these cases! I have a concern that it opens up the possibility for inexperienced users to use this feature and end up with an incorrect dataset: this could happen if the GRIB file contains fields with different geometries but the user uses the feature anyway, unaware that it will break things, and I'm not sure the resulting error would be easy for them to track down.

I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check whether the geometry section is the same as the first one (for GRIB2 this can be checked via the key 'md5Section3'; for GRIB1 it would be 'md5Section2'). If they are the same, re-use the stored geometry. We can then decide whether to always compare against the first field or to store the geometry of the last field whose geometry changed. This approach would work without the user having to do anything and would not, I think, be prone to accidental errors. The only disadvantage is that you could not store the geometry offline, but I'd be worried about users applying the 'wrong' geometry files if they stored them offline anyway.
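
For illustration, such a geometry check could be written with the eccodes Python bindings along these lines (a sketch; the helper name is hypothetical):

    import eccodes

    def fields_share_geometry(path):
        # Collect the md5 hash of each message's grid-definition section
        # ('md5Section3' for GRIB2, 'md5Section2' for GRIB1).
        hashes = set()
        with open(path, "rb") as f:
            while True:
                gid = eccodes.codes_grib_new_from_file(f)
                if gid is None:
                    break
                key = "md5Section3" if eccodes.codes_get(gid, "edition") == 2 else "md5Section2"
                hashes.add(eccodes.codes_get(gid, key))
                eccodes.codes_release(gid)
        return len(hashes) <= 1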

Cheers,
Iain

@meridionaljet (Contributor, Author) commented May 25, 2023

> I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check whether the geometry section is the same as the first one [...]

I took a look at the feasibility of this. For cfgrib.open_dataset(), it is straightforward to implement a cache inside cfgrib.dataset.build_dataset_components().

However, for cfgrib.open_datasets(), build_dataset_components() is called multiple times, and the execution path of open_datasets() runs through xarray_store.open_dataset(). This would require the cache to be implemented globally, which is easy enough, but in turn requires some state management to keep the cache from growing without bound in long-lived applications that open many different grid geometries. One possibility is to use functools.lru_cache to restrict the number of entries in the global cache, though lru_cache has no knowledge of the memory size of each entry, so it is difficult to pick an appropriate limit.

That said, we could reasonably assume that the geometry alone is small enough to be a non-issue for 99.9% of workflows. I suppose we could also make geometry caching an optional feature enabled or disabled by the user via a backend_kwarg. A sketch of the bounded-cache idea follows.
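
As a rough sketch of that idea (names hypothetical; compute_geo_coords stands in for the expensive coordinate computation that build_geography_coordinates() performs):

    import functools

    def compute_geo_coords(path):
        # Placeholder for the expensive per-grid coordinate computation.
        raise NotImplementedError

    @functools.lru_cache(maxsize=16)  # bounds the entry count, not the memory use
    def geo_coords_for_grid(grid_md5, path):
        # grid_md5 (e.g. the eccodes 'md5Section3' key) is part of the cache
        # key, so each distinct geometry is computed at most once while cached.
        return compute_geo_coords(path)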

In light of these observations, what do you think about this approach?

@meridionaljet (Contributor, Author)

@iainrussell @tlmquintino @sandorkertesz I have posted an alternative (and I believe better) implementation at #341.

This is different enough that I felt it warranted a separate PR to avoid confusion. Please let me know what you think!

@iainrussell merged commit d65ccd7 into ecmwf:master on Jun 2, 2023