-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
open_datatree performance improvement on NetCDF, H5, and Zarr files #9014
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. |
…to datatree-zarr merging into same branch
…to datatree-zarr merging branches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thoughts about the legacyhdf5 api and how it might be incorporated.
renaming variables Co-authored-by: Tom Nicholas <tom@cworthy.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very close! We should try to get it in.
This deserves a whats-new.rst entry!
We don't need to wait for anything here - one of the maintainers can just add this by pushing to this branch.
Would you be willing to add an benchmark test? You can see here how we benchmark opening and loading a single netCDF file
I'm fine to leave this to a later PR, we should just create an issue so we remember to add it later. The main purpose is as a performance regression test - we already know this PR is a big improvement!
We shouldn't be adding extra kwargs though - as far as I can tell open_datatree
should have exactly the same signature as open_dataset
. See Kai's comment. Again @owenlittlejohns or @flamingbear you can just push to this branch to push it over the finish line. Once that's done I'm happy to approve and merge.
Hi @TomNicholas, Thanks for your review and my apologies for not joining today's meeting. My internet connection is quite limited, which is causing some difficulties in staying fully connected. I will ensure to have some of this addressed by our upcoming meeting on 06/18. However, I have some comments to share.
I can work on this for our next meeting. It won't take to much work.
I may be able to work on it, but I'm not sure if it will be ready for next week. Let me check it out
Regarding this last point, I have been using the |
…netCDF4 and Hdf5 backends
This looks to be updated, and I would like to try to merge it before next week. The items remaining from @TomNicholas comment are addressed but for the benchmark test that can be done in a separate PR. The keywords are not exact with the open_dataset, but have removed those related to writing. That seems fine to me, but looking for second opinions via thumbs up. Otherwise I will add a |
Nice! Merging since this has been thoroughly looked at. |
Wooooo great work all!! |
Thank you @aladinor and all! This is a great contribution. |
I was nice to work in this PR. See you all in the next datatree meeting. |
* upstream/main: [skip-ci] Try fixing hypothesis CI trigger (pydata#9112) Undo custom padding-top. (pydata#9107) add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110) Add user survey announcement to docs (pydata#9101) skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104) Adds Matt Savoie to CITATION.cff (pydata#9103) [skip-ci] Fix skip-ci for hypothesis (pydata#9102) open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014) Migrate datatree io.py and common.py into xarray/core (pydata#9011) Micro optimizations to improve indexing (pydata#9002) (fix): don't handle time-dtypes as extension arrays in `from_dataframe` (pydata#9042)
* main: new whats-new section (pydata#9115) release v2024.06.0 (pydata#9113) release notes for 2024.06.0 (pydata#9092) [skip-ci] Try fixing hypothesis CI trigger (pydata#9112) Undo custom padding-top. (pydata#9107) add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110) Add user survey announcement to docs (pydata#9101) skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104) Adds Matt Savoie to CITATION.cff (pydata#9103) [skip-ci] Fix skip-ci for hypothesis (pydata#9102) open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
…9014) * open_datatree performance improvement on NetCDF files * fixing issue with forward slashes * fixing issue with pytest * open datatree in zarr format improvement * fixing incompatibility in returned object * passing group parameter to opendatatree method and reducing duplicated code * passing group parameter to opendatatree method - NetCDF * Update xarray/backends/netCDF4_.py renaming variables Co-authored-by: Tom Nicholas <tom@cworthy.org> * renaming variables * renaming variables * renaming group_store variable * removing _open_datatree_netcdf function not used anymore in open_datatree implementations * improving performance of open_datatree method * renaming 'i' variable within list comprehension in open_store method for zarr datatree * using the default generator instead of loading zarr groups in memory * fixing issue with group path to avoid using group[1:] notation. Adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args * fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files * fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files * adding 'mode' parameter to open_datatree method * adding 'mode' parameter to H5NetCDFStore.open method * adding new entry related to open_datatree performance improvement * adding new entry related to open_datatree performance improvement * Getting rid of unnecessary parameters for 'open_datatree' method for netCDF4 and Hdf5 backends --------- Co-authored-by: Tom Nicholas <tom@cworthy.org> Co-authored-by: Kai Mühlbauer <kai.muehlbauer@uni-bonn.de>
Great work Alfonso! |
open_datatree performance improvement on NetCDF files
whats-new.rst