[New Implementation] Dramatically speed up dataset creation by caching geographic coordinates #341
Fixed the failing code format check.
Codecov Report
Patch coverage:
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   95.62%   95.65%   +0.03%
==========================================
  Files          26       26
  Lines        2056     2073      +17
  Branches      236      238       +2
==========================================
+ Hits         1966     1983      +17
  Misses         59       59
  Partials       31       31
☔ View full report in Codecov by Sentry.
Many thanks @meridionaljet, this is a really nice improvement - I just added a couple of comments above, then I think we're close to merging it in!
Requested tweaks by @iainrussell have been implemented.
Thank you @meridionaljet! I really like this improvement, and the fact that you added documentation, a test, and also a way to disable it in case it is used as part of a long-running server. Thanks also for being patient with my suggestions and taking them on board. I think this solution is nice because it works 'out of the box' and does not carry the risk of corrupted xarrays if the incoming GRIB file has multiple geometries. Thanks again!
For kerchunk's use, we would really most like to simply not calculate coordinates at all, as we can store them elsewhere. If it were possible, then, to just skip past the bytes that define the geometry and go straight to the actual measurements in a given message, all the better. Do you think this is possible?
Hi @martindurant, could you create a new issue for this use case please? It would be good to see an example of a GRIB file and how you would like the resulting xarray to look. It's not clear if you want to remove all the coordinates, including the time and vertical dimensions, and if this is for performance, memory or aesthetics. So if it really would be useful, pop it in another issue and we can discuss there!
Edit: cfgrib 0.9.11.0 incorporating these changes has now been released! 😀
This is an updated implementation of #338, addressing a massive performance bottleneck when opening a GRIB file as an `xarray` dataset. Currently, `cfgrib` calls `cfgrib.dataset.build_geography_coordinates()` for every parameter in the index when creating a dataset. Each call requires `eccodes`'s `grib_get_array` to be called, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and almost always unnecessary since GRIB files typically have identical grids for each record.

This pull request introduces automatic caching of geographic coordinate data by default when calling `cfgrib.open_dataset()` or `cfgrib.open_datasets()`. The caching logic is embedded into `cfgrib.dataset.build_variable_components()`, utilizing the md5sum of the Grid Definition Section of the GRIB file (thanks @iainrussell for that suggestion).

This approach reduces the `cfgrib.open_dataset()` time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened with `cfgrib.open_datasets()`, the time taken is reduced from 38 seconds to 2 seconds. This thus results in a speedup of 1-2 orders of magnitude, depending on the size of the file and the number of unique hypercubes.

The only possible negative side effect that I can see is a small one: the cache must be implemented globally and thus can theoretically grow unboundedly in a long-lived application wherein `cfgrib` opens many different grid geometries. I have thus included a mechanism for the user to opt out of coordinate caching by passing `cache_geo_coords=False` to `backend_kwargs`. Practically, this should be a rare need, since the total data size would cause memory issues for a typical user long before the coordinate cache would, and most workflows read a small number of unique grid geometries.

The speedup offered here releases a significant bottleneck in data processing workflows using `xarray` and `cfgrib`, especially for large files, making `xarray` dataset creation for GRIB almost as cheap as it is for other data formats like `NetCDF` and `zarr`.