Dramatically speed up dataset creation by pre-computing geographic coordinates #338
Conversation
Thank you for your excellent work, it is quite an impressive improvement. However, as you can see, not all the tests have passed. I suggest you move your test into `test_40_xarray_store.py`, which already deals with the `backend_kwargs` options; then you do not need […] and can simply use […] to open your data.
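The code snippets from this comment were lost in extraction; as a hedged illustration of the kind of simplification meant here (the file name and the `indexpath` option are placeholders, not the reviewer's actual suggestion), a test in `test_40_xarray_store.py` can open the data through the xarray entry point directly:

```python
import xarray as xr

# Open the GRIB test file through the cfgrib engine; extra options such as
# pre-computed coordinates would travel via backend_kwargs.
ds = xr.open_dataset(
    "sample-data/example.grib",
    engine="cfgrib",
    backend_kwargs={"indexpath": ""},  # illustrative option only
)
```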
Codecov Report

Patch coverage: […]

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #338      +/-   ##
==========================================
+ Coverage   95.44%   95.47%   +0.03%
==========================================
  Files          26       26
  Lines        2040     2057      +17
  Branches      236      237       +1
==========================================
+ Hits         1947     1964      +17
  Misses         62       62
  Partials       31       31

☔ View full report in Codecov by Sentry.
Hello - any chance of merging this soon? Looks like new conflicts with […]
Hi @meridionaljet, have the concerns from @sandorkertesz been addressed?
Yes, the commits on April 18th addressed what @sandorkertesz had said. I just fixed the new conflict due to an update to […]
Hi @meridionaljet,

Thank you for your contribution, it clearly makes a big difference in these cases!

I have a concern that it opens up the possibility for inexperienced users to use this feature and end up with a dataset that is incorrect (it could happen if the GRIB file contains fields with different geometry, but the user uses this feature anyway, unaware that they will break things, and I'm not sure the error would be easy for them to track down).

I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check whether the geometry section is the same as the first one (for GRIB2 this can be checked via the key 'md5Section3'; for GRIB1 it would be 'md5Section2'). If they are the same, re-use the stored geometry. We can then decide whether to always compare against the first field, or whether to store the geometry of the last field whose geometry changed. This approach would work without the user having to do anything, and would not, I think, be prone to accidental errors. The only disadvantage would be that you could not store the geometry offline, but I'd be worried about users using 'wrong' geometry files if they stored them offline anyway.

Cheers,
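A rough sketch of this automatic check, assuming the eccodes Python bindings; `build_coords` below is a hypothetical stand-in for the expensive `cfgrib.dataset.build_geography_coordinates()` call, and the real integration point inside cfgrib would differ:

```python
import eccodes

def geometry_md5(gid):
    # The grid definition lives in section 3 for GRIB2 and section 2 for GRIB1,
    # so pick the matching md5 key for the message's edition.
    edition = eccodes.codes_get(gid, "edition")
    key = "md5Section3" if edition == 2 else "md5Section2"
    return eccodes.codes_get(gid, key)

def coords_for_message(gid, cache, build_coords):
    # Build the geographic coordinates only for geometries not seen yet;
    # otherwise reuse the stored result keyed by the section hash.
    sig = geometry_md5(gid)
    if sig not in cache:
        cache[sig] = build_coords(gid)  # the expensive per-geometry call
    return cache[sig]
```

Keyed this way, a file with mixed geometries would still get correct coordinates, while the common single-geometry case pays the cost only once.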
I took a look at the feasibility of this. For […]. However, for […]. That said, it's possible we could reasonably assume that the geometry alone is small enough to be unconcerned about for 99.9% of workflows. I suppose we could also make geometry caching an optional feature, enabled/disabled by the user as a `backend_kwargs` option. In light of these observations, what do you think about this approach?
@iainrussell @tlmquintino @sandorkertesz I have posted an alternative (and I believe better) implementation at #341. This is different enough that I felt it warranted a separate PR to avoid confusion. Please let me know what you think!
This addresses a massive performance bottleneck when opening a GRIB file as an `xarray` dataset. Currently, `cfgrib` calls `cfgrib.dataset.build_geography_coordinates()` for every parameter in the index when creating a dataset. Each call requires `eccodes`'s `grib_get_array` to be invoked, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and often unnecessary, since such files typically have identical grids for each record.

This pull request introduces the option for the user to include pre-computed geographic coordinate data in `backend_kwargs` when calling `cfgrib.open_dataset()` or `cfgrib.open_datasets()` in cases where the user knows that their GRIB file has identical geometry for each parameter. The utility function `cfgrib.dataset.get_first_geo_coords()` has been added, which computes coordinates for the first record in the index. The user can then use the output as follows:
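(The usage snippet itself was not captured here; the following is a minimal sketch, assuming `get_first_geo_coords()` takes the GRIB path and that the coordinates are passed under a `geocoords` key in `backend_kwargs`; both details are assumptions, only the function and variable names come from the description above.)

```python
import cfgrib

# Compute the geographic coordinates once, from the first record in the index
# (assumed call signature).
geocoords = cfgrib.dataset.get_first_geo_coords("hrrr.t00z.wrfsfcf00.grib2")

# Reuse the pre-computed coordinates when building the dataset
# (the 'geocoords' kwarg name is an assumption).
ds = cfgrib.open_dataset(
    "hrrr.t00z.wrfsfcf00.grib2",
    backend_kwargs={"geocoords": geocoords},
)
```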
`geocoords` could be serialized and reused by the user for multiple files with identical geometry, requiring zero calls to `build_geography_coordinates()` once computed for the first time.

This approach reduces the `cfgrib.open_dataset()` time for a 262 MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400 MB HRRR file with 43 different hypercube types is opened with `cfgrib.open_datasets()`, the time drops from 38 seconds to 2 seconds. This amounts to a speedup of one to two orders of magnitude, depending on the size of the file and the number of unique hypercubes.

A basic test has been added to the test suite as part of this pull request.

I have not contributed to `cfgrib` before, but as far as I can tell, adding this optional feature cannot break anything unless the user employs it on a file for which it is inappropriate, and such files are uncommon. The speedup removes a significant bottleneck in data processing workflows using `xarray` and `cfgrib`, especially for large files, making `xarray` dataset creation for GRIB almost as cheap as it is for other data formats like NetCDF and zarr.