
Dramatically speed up dataset creation by pre-computing geographic coordinates #338

Merged
merged 6 commits into ecmwf:master
Jun 2, 2023

Conversation

@meridionaljet (Contributor) commented Apr 15, 2023

This addresses a massive performance bottleneck when opening a GRIB file as an xarray dataset. Currently, cfgrib calls cfgrib.dataset.build_geography_coordinates() for every parameter in the index when creating a dataset. Each call invokes eccodes' grib_get_array, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and often unnecessary, since such files typically have identical grids for every record.

This pull request introduces the option for the user to include pre-computed geographic coordinate data in backend_kwargs when calling cfgrib.open_dataset() or cfgrib.open_datasets() in cases where the user knows that their GRIB file has identical geometry for each parameter. The utility function cfgrib.dataset.get_first_geo_coords() has been added, which computes coordinates for the first record in the index. The user can then use the output as follows:

geocoords = cfgrib.dataset.get_first_geo_coords(filename)
ds = cfgrib.open_dataset(filename, backend_kwargs=dict(precomputed_geo_coords=geocoords))

geocoords could be serialized and reused for multiple files with identical geometry, requiring no further calls to build_geography_coordinates() after the first computation.
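
For example, the coordinates could be cached to disk with pickle (a sketch; it assumes the geocoords object is picklable, and other_filename is an illustrative path to a file on the same grid):

    import pickle

    import cfgrib

    # Save the coordinates computed above for later reuse:
    with open("geocoords.pkl", "wb") as f:
        pickle.dump(geocoords, f)

    # Reopen another file known to share the same grid without recomputing:
    with open("geocoords.pkl", "rb") as f:
        cached = pickle.load(f)
    other_filename = "another_file_on_the_same_grid.grib2"
    ds = cfgrib.open_dataset(other_filename, backend_kwargs=dict(precomputed_geo_coords=cached))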

This approach reduces the cfgrib.open_dataset() time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened with cfgrib.open_datasets(), the time taken drops from 38 seconds to 2 seconds. Overall, this is a speedup of one to two orders of magnitude, depending on the size of the file and the number of unique hypercubes.

A basic test has been added to the test suite as part of this pull request.

I have not contributed to cfgrib before, but as far as I can tell, adding this optional feature cannot break anything unless the user applies it to a file for which it is inappropriate, and such files are uncommon. The speedup removes a significant bottleneck in data-processing workflows that use xarray and cfgrib, especially for large files, making xarray dataset creation for GRIB almost as cheap as it is for other data formats like NetCDF and Zarr.

@FussyDuck commented Apr 15, 2023

CLA assistant check: all committers have signed the CLA.

@sandorkertesz (Contributor)

Thank you for your excellent work; it is quite an impressive improvement.

However, as you can see, not all of the tests have passed. I suggest you move your test into test_40_xarray_store.py, which already deals with the backend_kwargs options; then you do not need:

from cfgrib import open_dataset

and can simply use

xarray_store.open_dataset(...)

to open your data.
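
Applied to this PR, the relocated test might look roughly like this (a sketch: the test name and assertion are illustrative, and TEST_DATA stands in for one of the suite's sample GRIB files):

    from cfgrib import dataset, xarray_store

    TEST_DATA = "era5-levels-members.grib"  # illustrative sample file

    def test_open_dataset_precomputed_geo_coords() -> None:
        # Pre-compute the coordinates once, then pass them via backend_kwargs.
        geocoords = dataset.get_first_geo_coords(TEST_DATA)
        res = xarray_store.open_dataset(
            TEST_DATA, backend_kwargs={"precomputed_geo_coords": geocoords}
        )
        assert "latitude" in res.coords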

@codecov-commenter commented Apr 19, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03% 🎉

Comparison is base (3951355) 95.44% compared to head (1ce00c1) 95.47%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #338      +/-   ##
==========================================
+ Coverage   95.44%   95.47%   +0.03%     
==========================================
  Files          26       26              
  Lines        2040     2057      +17     
  Branches      236      237       +1     
==========================================
+ Hits         1947     1964      +17     
  Misses         62       62              
  Partials       31       31              
Impacted Files                   Coverage Δ
cfgrib/xarray_plugin.py          88.40% <ø> (ø)
cfgrib/dataset.py                98.45% <100.00%> (+0.05%) ⬆️
tests/test_40_xarray_store.py    100.00% <100.00%> (ø)



@sandorkertesz sandorkertesz self-requested a review April 19, 2023 09:38
@meridionaljet (Contributor, Author)

Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.

@tlmquintino (Member)

> Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.

Hi @meridionaljet, have the concerns from @sandorkertesz been addressed?
I see that there is currently a conflict. Can you please fix it?
This PR may need to be merged into the develop branch first, but I leave that for @sandorkertesz to decide.

@meridionaljet (Contributor, Author)

> Hi @meridionaljet, have the concerns from @sandorkertesz been addressed? I see that there is currently a conflict. Can you please fix it? [...]

Yes, the commits on April 18th addressed what @sandorkertesz had said. I just fixed the new conflict due to an update to master - should be good to go now unless you guys want to adjust the implementation.

@iainrussell (Member)

Hi @meridionaljet,

Thank you for your contribution, it clearly makes a big difference in these cases! I have a concern that it opens up the possibility for inexperienced users to use this feature and end up with an incorrect dataset: this could happen if the GRIB file contains fields with different geometries but the user uses the feature anyway, unaware that it will break things, and I'm not sure the resulting error would be easy for them to track down.

I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check whether the geometry section is the same as the first one (for GRIB2 this can be checked via the key 'md5Section3'; for GRIB1 it would be 'md5Section2'). If they are the same, re-use the stored geometry. We can then decide whether to always compare against the first field or to store the geometry of the last field whose geometry changed. This approach would work without the user having to do anything and would not, I think, be prone to accidental errors. The only disadvantage is that you could not store the geometry offline, but I'd be worried about users applying the 'wrong' geometry files if they stored them offline anyway.
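
For illustration, such a geometry check could be written with the eccodes Python bindings along these lines (a sketch; the helper name is hypothetical):

    import eccodes

    def fields_share_geometry(path):
        # Collect the md5 hash of each message's grid-definition section
        # ('md5Section3' for GRIB2, 'md5Section2' for GRIB1).
        hashes = set()
        with open(path, "rb") as f:
            while True:
                gid = eccodes.codes_grib_new_from_file(f)
                if gid is None:
                    break
                key = "md5Section3" if eccodes.codes_get(gid, "edition") == 2 else "md5Section2"
                hashes.add(eccodes.codes_get(gid, key))
                eccodes.codes_release(gid)
        return len(hashes) <= 1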

Cheers,
Iain

@meridionaljet (Contributor, Author) commented May 25, 2023

> I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check whether the geometry section is the same as the first one [...]

I took a look at the feasibility of this. For cfgrib.open_dataset(), it is straightforward to implement a cache inside cfgrib.dataset.build_dataset_components().

However, for cfgrib.open_datasets(), build_dataset_components() is called multiple times, and the execution path of open_datasets() runs through xarray_store.open_dataset(). This would require the cache to be implemented globally, which is easy enough, but in turn requires some state management to keep the cache from growing without bound in long-lived applications that open many different grid geometries. One possibility is to use functools.lru_cache to restrict the number of entries in the global cache, though lru_cache has no knowledge of the memory size of each entry, so it is difficult to pick an appropriate limit.

That said, we could reasonably assume that the geometry alone is small enough to be a non-issue for 99.9% of workflows. I suppose we could also make geometry caching an optional feature enabled or disabled by the user via a backend_kwarg. A sketch of the bounded-cache idea follows.
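
As a rough sketch of that idea (names hypothetical; compute_geo_coords stands in for the expensive coordinate computation that build_geography_coordinates() performs):

    import functools

    def compute_geo_coords(path):
        # Placeholder for the expensive per-grid coordinate computation.
        raise NotImplementedError

    @functools.lru_cache(maxsize=16)  # bounds the entry count, not the memory use
    def geo_coords_for_grid(grid_md5, path):
        # grid_md5 (e.g. the eccodes 'md5Section3' key) is part of the cache
        # key, so each distinct geometry is computed at most once while cached.
        return compute_geo_coords(path)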

In light of these observations, what do you think about this approach?

@meridionaljet (Contributor, Author)

@iainrussell @tlmquintino @sandorkertesz I have posted an alternative (and I believe better) implementation at #341.

This is different enough that I felt it warranted a separate PR to avoid confusion. Please let me know what you think!

@iainrussell merged commit d65ccd7 into ecmwf:master on Jun 2, 2023