open_datatree performance improvement on NetCDF, H5, and Zarr files #9014

aladinor · 2024-05-07T19:24:11Z

open_datatree performance improvement on NetCDF files

Closes Improving performance of open_datatree #8994 (NetCDF + Zarr datatree)
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

welcome · 2024-05-07T19:24:14Z

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

xarray/backends/zarr.py

flamingbear

I had thoughts about the legacyhdf5 api and how it might be incorporated.

xarray/backends/netCDF4_.py

TomNicholas

This is very close! We should try to get it in.

This deserves a whats-new.rst entry!

We don't need to wait for anything here - one of the maintainers can just add this by pushing to this branch.

Would you be willing to add a benchmark test? You can see here how we benchmark opening and loading a single netCDF file.

I'm fine to leave this to a later PR, we should just create an issue so we remember to add it later. The main purpose is as a performance regression test - we already know this PR is a big improvement!

We shouldn't be adding extra kwargs though - as far as I can tell open_datatree should have exactly the same signature as open_dataset. See Kai's comment. Again @owenlittlejohns or @flamingbear you can just push to this branch to push it over the finish line. Once that's done I'm happy to approve and merge.
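The point about matching signatures can be checked mechanically. Below is a minimal sketch using the standard library's `inspect.signature`. The two functions are hypothetical stand-ins with invented parameter lists; they are not the real `open_dataset` or `open_datatree` signatures.

```python
import inspect


def open_dataset_stub(filename_or_obj, *, engine=None, chunks=None, decode_times=None):
    """Hypothetical stand-in for a dataset opener (parameters are invented)."""


def open_datatree_stub(filename_or_obj, *, engine=None, decode_times=None, mode="r"):
    """Hypothetical stand-in for a tree opener whose signature has drifted."""


def signature_diff(reference, candidate):
    """Return (missing, extra) parameter names of `candidate` relative to `reference`."""
    ref = inspect.signature(reference).parameters
    cand = inspect.signature(candidate).parameters
    missing = [name for name in ref if name not in cand]
    extra = [name for name in cand if name not in ref]
    return missing, extra


missing, extra = signature_diff(open_dataset_stub, open_datatree_stub)
print("missing:", missing)  # parameters the tree opener lacks
print("extra:", extra)      # parameters the dataset opener does not have
```

The same helper could be pointed at the real functions in place of the stubs, for example inside a unit test that fails when either list is non-empty.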

aladinor · 2024-06-12T01:26:39Z

Hi @TomNicholas,

Thanks for your review and my apologies for not joining today's meeting. My internet connection is quite limited, which is causing some difficulties in staying fully connected.

I will make sure some of this is addressed by our upcoming meeting on 06/18. However, I have some comments to share.

> This is very close! We should try to get it in.
>
> This deserves a whats-new.rst entry!
>
> We don't need to wait for anything here - one of the maintainers can just add this by pushing to this branch.

I can work on this for our next meeting. It won't take too much work.

> Would you be willing to add a benchmark test? You can see here how we benchmark opening and loading a single netCDF file

I may be able to work on it, but I'm not sure if it will be ready for next week. Let me check it out.

> I'm fine to leave this to a later PR, we should just create an issue so we remember to add it later. The main purpose is as a performance regression test - we already know this PR is a big improvement!
>
> We shouldn't be adding extra kwargs though - as far as I can tell open_datatree should have exactly the same signature as open_dataset. See Kai's comment. Again @owenlittlejohns or @flamingbear you can just push to this branch to push it over the finish line. Once that's done I'm happy to approve and merge.

Regarding this last point, I have been using the open_dataset parameters since the beginning. However, I completely agree with you and @kmuehlbauer about removing those unnecessary parameters. This will be ready as soon as possible.

flamingbear · 2024-06-12T15:24:20Z

This looks to be updated, and I would like to try to merge it before next week. The remaining items from @TomNicholas's comment are addressed, except for the benchmark test, which can be done in a separate PR.

The keyword arguments do not exactly match those of open_dataset, but those related to writing have been removed. That seems fine to me, but I'm looking for second opinions via thumbs up. Otherwise I will add a plan to merge label and wait.

dcherian · 2024-06-12T15:42:18Z

Nice! Merging since this has been thoroughly looked at.

welcome · 2024-06-12T15:42:28Z

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

mgrover1 · 2024-06-12T15:48:30Z

Wooooo great work all!!

TomNicholas · 2024-06-12T16:09:30Z

Thank you @aladinor and all! This is a great contribution.

aladinor · 2024-06-12T20:39:14Z

It was nice to work on this PR. See you all in the next datatree meeting.

* upstream/main:
  [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  Undo custom padding-top. (pydata#9107)
  add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  Add user survey announcement to docs (pydata#9101)
  skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  Adds Matt Savoie to CITATION.cff (pydata#9103)
  [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
  Migrate datatree io.py and common.py into xarray/core (pydata#9011)
  Micro optimizations to improve indexing (pydata#9002)
  (fix): don't handle time-dtypes as extension arrays in `from_dataframe` (pydata#9042)

* main:
  new whats-new section (pydata#9115)
  release v2024.06.0 (pydata#9113)
  release notes for 2024.06.0 (pydata#9092)
  [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  Undo custom padding-top. (pydata#9107)
  add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  Add user survey announcement to docs (pydata#9101)
  skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  Adds Matt Savoie to CITATION.cff (pydata#9103)
  [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)

…9014)

* open_datatree performance improvement on NetCDF files
* fixing issue with forward slashes
* fixing issue with pytest
* open datatree in zarr format improvement
* fixing incompatibility in returned object
* passing group parameter to opendatatree method and reducing duplicated code
* passing group parameter to opendatatree method - NetCDF
* Update xarray/backends/netCDF4_.py

  renaming variables

  Co-authored-by: Tom Nicholas <tom@cworthy.org>
* renaming variables
* renaming variables
* renaming group_store variable
* removing _open_datatree_netcdf function not used anymore in open_datatree implementations
* improving performance of open_datatree method
* renaming 'i' variable within list comprehension in open_store method for zarr datatree
* using the default generator instead of loading zarr groups in memory
* fixing issue with group path to avoid using group[1:] notation. Adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree for h5 files. Finally, separating positional from keyword args
* fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for netCDF files
* fixing issue with group path to avoid using group[1:] notation and adding group variable typing hints (str | Iterable[str] | callable) under the open_datatree method for zarr files
* adding 'mode' parameter to open_datatree method
* adding 'mode' parameter to H5NetCDFStore.open method
* adding new entry related to open_datatree performance improvement
* adding new entry related to open_datatree performance improvement
* Getting rid of unnecessary parameters for 'open_datatree' method for netCDF4 and Hdf5 backends

---------

Co-authored-by: Tom Nicholas <tom@cworthy.org>
Co-authored-by: Kai Mühlbauer <kai.muehlbauer@uni-bonn.de>
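Two of the commit messages above mention handling group paths without the `group[1:]` slicing idiom and iterating over groups with a generator instead of loading them all into memory. The sketch below illustrates both ideas in isolation and does not reproduce the code merged in this PR. It uses a plain nested dict as a stand-in for a file's group hierarchy, and `iter_group_paths` and `relative_to_root` are hypothetical helpers, not xarray functions.

```python
from collections.abc import Iterator, Mapping
from pathlib import PurePosixPath


def iter_group_paths(groups: Mapping[str, Mapping], parent: str = "/") -> Iterator[str]:
    """Lazily yield the absolute path of every group, parents before children."""
    for name, children in groups.items():
        path = str(PurePosixPath(parent) / name)
        yield path
        yield from iter_group_paths(children, parent=path)


def relative_to_root(path: str, root: str = "/") -> str:
    """Return `path` relative to `root` without slicing off a leading slash."""
    return str(PurePosixPath(path).relative_to(root))


# Hypothetical hierarchy: keys are group names, values are child groups.
hierarchy = {"sweep_0": {"metadata": {}}, "sweep_1": {}}

for group_path in iter_group_paths(hierarchy):
    print(group_path, "->", relative_to_root(group_path))
```

Because `iter_group_paths` is a generator, a caller can stop early or process one group at a time. `PurePosixPath.relative_to` also raises a `ValueError` when a path is not under the given root, which string slicing would silently ignore.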

shoyer · 2024-06-14T22:59:31Z

Great work Alfonso!

open_datatree performance improvement on NetCDF files

14aaf56

aladinor added 5 commits May 7, 2024 14:44

fixing issue with forward slashes

3a5edb4

Merge branch 'main' into datatree-zarr

72d7660

fixing issue with pytest

d9dde29

fixing issue with pytest

2bc5e73

Merge branch 'main' into datatree-zarr

89fb4fb

TomNicholas added the topic-DataTree Related to the implementation of a DataTree class label May 8, 2024

Illviljan added the run-benchmark Run the ASV benchmark workflow label May 10, 2024

aladinor added 4 commits May 10, 2024 07:59

open datatree in zarr format improvement

0343f10

Merge branch 'main' into datatree-zarr

93e1d59

fixing incompatibility in returned object

ac11b3e

Merge branch 'datatree-zarr' of https://github.com/aladinor/xarray in…

6d0ee13

…to datatree-zarr merging into same branch

aladinor changed the title ~~open_datatree performance improvement on NetCDF files~~ open_datatree performance improvement on NetCDF and Zarr files May 10, 2024

Merge branch 'main' into datatree-zarr

91c5f0a

shoyer reviewed May 14, 2024

xarray/backends/zarr.py (resolved)

aladinor added 5 commits May 18, 2024 17:36

Merge branch 'main' into datatree-zarr

3363e91

passing group parameter to opendatatree method and reducing duplicate…

7bba52c

…d code

Merge branch 'datatree-zarr' of https://github.com/aladinor/xarray in…

725aed7

…to datatree-zarr merging branches

passing group parameter to opendatatree method - NetCDF

903effd

Merge branch 'main' into datatree-zarr

d468478

flamingbear reviewed May 20, 2024

xarray/backends/netCDF4_.py (resolved)

TomNicholas added topic-backends io topic-performance labels May 28, 2024

TomNicholas reviewed May 28, 2024

xarray/backends/netCDF4_.py (outdated, resolved)

aladinor and others added 3 commits May 28, 2024 17:07

Update xarray/backends/netCDF4_.py

51da175

renaming variables Co-authored-by: Tom Nicholas <tom@cworthy.org>

Merge branch 'main' into datatree-zarr

24881bd

renaming variables

5f4bff1

TomNicholas mentioned this pull request Jun 11, 2024

Coordinate inheritance for xarray.DataTree #9077

Closed

TomNicholas requested changes Jun 11, 2024

Merge branch 'main' into datatree-zarr

e298ac4

aladinor added 4 commits June 11, 2024 20:27

Merge branch 'main' into datatree-zarr

833c978

adding new entry related to open_datatree performance improvement

4ff6035

adding new entry related to open_datatree performance improvement

3844dea

Getting rid of unnecessary parameters for 'open_datatree' method for …

456ce29

…netCDF4 and Hdf5 backends

flamingbear added the plan to merge Final call for comments label Jun 12, 2024

flamingbear mentioned this pull request Jun 12, 2024

Add benchmark test for open_datatree #9100

Closed

dcherian merged commit 3967351 into pydata:main Jun 12, 2024
33 of 34 checks passed

aladinor deleted the datatree-zarr branch June 12, 2024 20:54

kmnhan mentioned this pull request Jun 19, 2024

open_datatree not accepting backend-specific keyword arguments #9135

Closed

5 tasks

TomNicholas mentioned this pull request Jun 19, 2024

open_dict_of_datasets function to open any file containing nested groups #9137

Closed

aladinor restored the datatree-zarr branch June 24, 2024 02:01

aladinor deleted the datatree-zarr branch June 24, 2024 02:30

jthielen mentioned this pull request Jul 12, 2024

DRAFT: Implement open_datatree in BackendEntrypoint for preliminary DataTree support #7437

Closed

3 tasks

JessicaS11 mentioned this pull request Jul 26, 2024

add tests for keyword arguments passed to engines when opening a datatree #9283

Draft

4 tasks

slevang mentioned this pull request Sep 8, 2024

DataTree.to_zarr() is very slow writing to high latency store #9455

Open
