open_datatree performance improvement on NetCDF, H5, and Zarr files #9014

Merged
merged 38 commits into from
Jun 12, 2024

Contributor

@aladinor aladinor commented May 7, 2024

open_datatree performance improvement on NetCDF files

welcome bot commented May 7, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

aladinor added 5 commits May 7, 2024 14:44

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.

@TomNicholas added the topic-DataTree label (Related to the implementation of a DataTree class) May 8, 2024
@Illviljan added the run-benchmark label (Run the ASV benchmark workflow) May 10, 2024
aladinor added 4 commits May 10, 2024 07:59

…to datatree-zarr

merging into same branch
@aladinor changed the title from "open_datatree performance improvement on NetCDF files" to "open_datatree performance improvement on NetCDF and Zarr files" May 10, 2024

aladinor added 5 commits May 18, 2024 17:36

…to datatree-zarr

merging branches

Member

@flamingbear flamingbear left a comment

I had thoughts about the legacyhdf5 api and how it might be incorporated.

Review thread on xarray/backends/netCDF4_.py (resolved)
aladinor and others added 3 commits May 28, 2024 17:07

renaming variables

Co-authored-by: Tom Nicholas <tom@cworthy.org>

Member

@TomNicholas TomNicholas left a comment

This is very close! We should try to get it in.

This deserves a whats-new.rst entry!

We don't need to wait for anything here - one of the maintainers can just add this by pushing to this branch.

Would you be willing to add a benchmark test? You can see here how we benchmark opening and loading a single netCDF file

I'm fine to leave this to a later PR, we should just create an issue so we remember to add it later. The main purpose is as a performance regression test - we already know this PR is a big improvement!
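
A minimal sketch of what such a benchmark might look like, written in the `setup` / `time_*` / `teardown` style that ASV discovers. The class name, file name, group names, and sizes are hypothetical. It also assumes an xarray version that exposes `open_datatree` at the top level and an installed netCDF backend. It is an illustration only, not the benchmark that was later added to the repository.

```python
import os
import tempfile


class OpenDataTreeNetCDF:
    """ASV-style benchmark: time opening a small multi-group netCDF file."""

    n_groups = 10  # hypothetical size; tune until timings are meaningful

    def setup(self):
        # Imported here so the module can be collected without the optional deps.
        import numpy as np
        import xarray as xr

        self.tmpdir = tempfile.TemporaryDirectory()
        self.path = os.path.join(self.tmpdir.name, "tree.nc")

        ds = xr.Dataset({"v": ("x", np.arange(100.0))})
        ds.to_netcdf(self.path, mode="w")  # write the root group first
        for i in range(self.n_groups):
            # Append each child group to the same file.
            ds.to_netcdf(self.path, group=f"group_{i}", mode="a")

        # Assumption: a top-level xr.open_datatree is available.
        self.open_datatree = xr.open_datatree

    def time_open_datatree(self):
        # Only the open (and the matching close) is timed.
        tree = self.open_datatree(self.path)
        tree.close()

    def teardown(self):
        self.tmpdir.cleanup()
```

Keeping file creation in `setup` means only the open call contributes to the reported time, which is what a performance regression test for this PR would need to track.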

We shouldn't be adding extra kwargs though - as far as I can tell open_datatree should have exactly the same signature as open_dataset. See Kai's comment. Again @owenlittlejohns or @flamingbear you can just push to this branch to push it over the finish line. Once that's done I'm happy to approve and merge.
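
One way to check this kind of signature drift is to compare parameter names with `inspect.signature`. The sketch below uses two hypothetical stand-in functions (`open_dataset_like` and `open_datatree_like`) rather than the real xarray functions, whose argument lists are much longer; the helper name `signature_diff` is also an assumption.

```python
import inspect


def signature_diff(reference, candidate):
    """Return (missing, extra) parameter names of candidate relative to reference."""
    ref = set(inspect.signature(reference).parameters)
    cand = set(inspect.signature(candidate).parameters)
    return sorted(ref - cand), sorted(cand - ref)


# Hypothetical stand-ins; the real functions take many more arguments.
def open_dataset_like(filename_or_obj, *, engine=None, chunks=None, decode_times=None):
    ...


def open_datatree_like(filename_or_obj, *, engine=None, mode="r", decode_times=None):
    ...


missing, extra = signature_diff(open_dataset_like, open_datatree_like)
print("missing:", missing)  # parameters the candidate lacks
print("extra:", extra)      # parameters only the candidate has
```

With the real functions, the same helper could be pointed at the two callables directly, for example in a unit test that fails when an extra keyword argument appears.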

@aladinor
Contributor Author

Hi @TomNicholas,

Thanks for your review and my apologies for not joining today's meeting. My internet connection is quite limited, which is causing some difficulties in staying fully connected.

I will make sure to have some of this addressed by our upcoming meeting on 06/18. However, I have some comments to share.

This is very close! We should try to get it in.

This deserves a whats-new.rst entry!

We don't need to wait for anything here - one of the maintainers can just add this by pushing to this branch.

I can work on this for our next meeting. It won't take too much work.

Would you be willing to add a benchmark test? You can see here how we benchmark opening and loading a single netCDF file

I may be able to work on it, but I'm not sure if it will be ready for next week. Let me check it out.

I'm fine to leave this to a later PR, we should just create an issue so we remember to add it later. The main purpose is as a performance regression test - we already know this PR is a big improvement!

We shouldn't be adding extra kwargs though - as far as I can tell open_datatree should have exactly the same signature as open_dataset. See Kai's comment. Again @owenlittlejohns or @flamingbear you can just push to this branch to push it over the finish line. Once that's done I'm happy to approve and merge.

Regarding this last point, I have been using the open_dataset parameters since the beginning. However, I completely agree with you and @kmuehlbauer about removing those unnecessary parameters. This will be ready as soon as possible.

aladinor added 4 commits June 11, 2024 20:27

…netCDF4 and Hdf5 backends
@flamingbear
Member

This looks to be updated, and I would like to try to merge it before next week. The remaining items from @TomNicholas's comment are addressed, except for the benchmark test, which can be done in a separate PR.

The keywords do not exactly match those of open_dataset, but the ones related to writing have been removed. That seems fine to me, but I am looking for second opinions via thumbs up. Otherwise I will add a plan to merge label and wait.

@flamingbear added the plan to merge label (Final call for comments) Jun 12, 2024
@dcherian
Contributor

Nice! Merging since this has been thoroughly looked at.

@dcherian dcherian merged commit 3967351 into pydata:main Jun 12, 2024
33 of 34 checks passed

welcome bot commented Jun 12, 2024

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

@mgrover1
Contributor

Wooooo great work all!!

@TomNicholas
Member

Thank you @aladinor and all! This is a great contribution.

@aladinor
Contributor Author

It was nice to work on this PR. See you all in the next datatree meeting.

@aladinor aladinor deleted the datatree-zarr branch June 12, 2024 20:54
dcherian added a commit to dcherian/xarray that referenced this pull request Jun 13, 2024
* upstream/main:
  [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  Undo custom padding-top. (pydata#9107)
  add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  Add user survey announcement to docs (pydata#9101)
  skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  Adds Matt Savoie to CITATION.cff (pydata#9103)
  [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
  Migrate datatree io.py and common.py into xarray/core (pydata#9011)
  Micro optimizations to improve indexing (pydata#9002)
  (fix): don't handle time-dtypes as extension arrays in `from_dataframe` (pydata#9042)
dcherian added a commit to mraspaud/xarray that referenced this pull request Jun 13, 2024
* main:
  new whats-new section (pydata#9115)
  release v2024.06.0 (pydata#9113)
  release notes for 2024.06.0 (pydata#9092)
  [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  Undo custom padding-top. (pydata#9107)
  add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  Add user survey announcement to docs (pydata#9101)
  skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  Adds Matt Savoie to CITATION.cff (pydata#9103)
  [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
andersy005 pushed a commit that referenced this pull request Jun 14, 2024

…9014)

* open_datatree performance improvement on NetCDF files

* fixing issue with forward slashes

* fixing issue with pytest

* open datatree in zarr format improvement

* fixing incompatibility in returned object

* passing group parameter to opendatatree method and reducing duplicated code

* passing group parameter to opendatatree method - NetCDF

* Update xarray/backends/netCDF4_.py

renaming variables

Co-authored-by: Tom Nicholas <tom@cworthy.org>

* renaming variables

* renaming variables

* renaming group_store variable

* removing _open_datatree_netcdf function not used anymore in open_datatree implementations

* improving performance of open_datatree method

* renaming 'i' variable within list comprehension in open_store method for zarr datatree

* using the default generator instead of loading zarr groups in memory

* fixing issue with group path to avoid using group[1:] notation. Adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args

* fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files

* fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files

* adding 'mode' parameter to open_datatree method

* adding 'mode' parameter to H5NetCDFStore.open method

* adding new entry related to open_datatree performance improvement

* adding new entry related to open_datatree performance improvement

* Getting rid of unnecessary parameters for 'open_datatree' method for netCDF4 and Hdf5 backends

---------

Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: Kai Mühlbauer <kai.muehlbauer@uni-bonn.de>
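
Several of the commit messages above mention avoiding the `group[1:]` notation for group paths. The sketch below illustrates why slicing is fragile and how a path object avoids the problem. It uses the standard library's `PurePosixPath` as a stand-in, and the helper name `relative_group` is hypothetical; it is not the code that was merged.

```python
from pathlib import PurePosixPath


def relative_group(group, parent="/"):
    """Return `group` relative to `parent`, without assuming a leading slash."""
    root = PurePosixPath("/")
    absolute = root / group          # "a/b" and "/a/b" both become "/a/b"
    absolute_parent = root / parent
    return str(absolute.relative_to(absolute_parent))


print(relative_group("/a/b"))        # a/b
print(relative_group("a/b"))         # a/b, whereas "a/b"[1:] would give "/b"
print(relative_group("/a/b", "/a"))  # b
```

Slicing silently drops the first character whether or not it is a slash, while `relative_to` raises `ValueError` when the group is not actually under the given parent.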
@shoyer
Member

shoyer commented Jun 14, 2024

Great work Alfonso!

Labels: io, plan to merge, run-benchmark, topic-backends, topic-DataTree, topic-performance

Successfully merging this pull request may close these issues:

Improving performance of open_datatree
9 participants