Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added methods to convert InferenceData to and from Zarr [docs] #1518

Merged
merged 31 commits into from
Feb 6, 2021

Conversation

semohr
Copy link
Contributor

@semohr semohr commented Jan 22, 2021

Description

This pull request adds two methods to the 'InferenceData' class:

  • .from_zarr() to construct an 'InferenceData' object via xarrays .open_zarr method.
  • .to_zarr() to convert a zarr 'store' or a path to an 'InferenceData' object

These two methods are quite useful to save and restore 'InferenceData' objects to disk with zarr. Zarr is still experimental in xarray, but I think it could be a good addition.

For the implementation, I tried to use a similar approach as in the from_dataframe and to_dataframe methods. I also tried to add the changes to the changelog, please take a look. I would appreciate if someone could assist me with writing tests for these methods.

  • Someone may also need to add a mock import in sphinx for Zarr. I tried to generate the documentation locally but I got a weird runtime-error so I skipped that for now.

Best,
Sebastian

References

Zarr

Checklist

  • Follows official PR format
  • New features are properly documented (with an example if appropriate)?
  • Includes new or updated tests to cover the new feature
    • Move tests into a different file using the same approach as for numba and bokeh
  • Code style correct (follows pylint and black guidelines)
  • Changes are listed in changelog

@ahartikainen ahartikainen changed the title Added methods to convert InferenceData to and from Zarr Added methods to convert InferenceData to and from Zarr [docs] Jan 22, 2021
@ahartikainen
Copy link
Contributor

Hi, great idea to use zarr.

I added [docs] tag for the title, so our CI will build the documentation.

Copy link
Contributor

@ahartikainen ahartikainen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added initial comments. Looks great.

arviz/data/inference_data.py Outdated Show resolved Hide resolved
arviz/data/inference_data.py Outdated Show resolved Hide resolved
arviz/data/inference_data.py Outdated Show resolved Hide resolved
arviz/data/inference_data.py Outdated Show resolved Hide resolved
# Create zarr group in store with same group name
getattr(self, group).to_zarr(store=store, group=group, mode="w")

return zarr.open(store) # Open store to get overarching group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does zarr handle open files ?

Are they closed manually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand it only keeps metafiles to the location of the files and loads them on demand.

From the documentation:
Files are only held open while they are being read or written and are
closed immediately afterwards, so there is no need to manually close any files.

FYI: There is a great talk from one of the developers here.

@ahartikainen
Copy link
Contributor

There is linting error so add this to top of the file

# pylint: disable=too-many-lines,too-many-public-methods

Also variable g to maybe some 2chr name

@ahartikainen
Copy link
Contributor

ahartikainen commented Jan 22, 2021

Hi, for code style, code structure and import order stuff

from project folder you can do

black arviz
isort arviz
pylint arviz

black -> reformat code
isort -> sort imports
pylint -> check codestyle

The settings used by the tools can be found in pyproject.toml / pylint.rc (if you are interested what they are doing)

@ahartikainen
Copy link
Contributor

Oh yes, we probably don't use isort officially, but it gets the job done.

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! I have added some comments but nothing big.

Regarding tests, my guess is that following https://github.com/arviz-devs/arviz/blob/master/arviz/tests/base_tests/test_data.py#L1194 but with zarr will be enough.

arviz/data/inference_data.py Outdated Show resolved Hide resolved
arviz/data/inference_data.py Outdated Show resolved Hide resolved
arviz/data/inference_data.py Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jan 24, 2021

Codecov Report

Merging #1518 (dd129a6) into main (525e311) will decrease coverage by 0.03%.
The diff coverage is 80.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1518      +/-   ##
==========================================
- Coverage   91.14%   91.10%   -0.04%     
==========================================
  Files         105      105              
  Lines       11342    11382      +40     
==========================================
+ Hits        10338    10370      +32     
- Misses       1004     1012       +8     
Impacted Files Coverage Δ
arviz/data/inference_data.py 83.59% <80.95%> (-0.22%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 525e311...dd129a6. Read the comment docs.

@canyon289
Copy link
Member

Im not opposed to this, just curious. Why would people want to use zarr over hdf5? I didnt see anything on their docs that explained it so curious if you know

@OriolAbril
Copy link
Member

I think this presentation is a good summary: https://zarr-developers.github.io/slides/scipy-2019.html (note to those not familiar with RISE/reveal.js, press space for next slide)

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests! This is looking really good

@@ -1246,6 +1249,86 @@ def test_empty_inference_data_object(self):
assert not os.path.exists(filepath)


class TestDataZarr:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what would be the best approach here (pinging @canyon289 and @ahartikainen to see what they think), but we should either create a test_data_zarr.py or add a class level skipif. As zarr is an optional dependency, we should make sure its test are skipped if it's not installed instead of failing the test suite. Something similar to what we do with numba: https://github.com/arviz-devs/arviz/blob/master/arviz/tests/base_tests/test_diagnostics_numba.py#L16 and bokeh https://github.com/arviz-devs/arviz/blob/master/arviz/tests/base_tests/test_plots_bokeh.py#L48

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Idea was quite easy to add 👍 Haven't created a new file though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with Oriol that a pytest marker should be used to skip tests if zarr isnt present instead of failing the test suite.

A separate file would be nice

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pylint can be quite a headache, I hope this helps fixing it so we can merge :)


import numpy as np
import pytest
import zarr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work because it will raise an exception here before getting to the skipif

arviz/tests/base_tests/test_data.py Outdated Show resolved Hide resolved
arviz/tests/base_tests/test_data_zarr.py Show resolved Hide resolved
@semohr
Copy link
Contributor Author

semohr commented Jan 26, 2021

Thank you, I was starting do get crazy :) Well, now there is an azure error in precompile models... I don't think I changed something related to that.

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, now there is an azure error in precompile models... I don't think I changed something related to that.

You did not, let's see if it still happens tomorrow and if it does we'll try to see why it it, sometimes it's only azures fault or some release lag in related packages (i.e. tf probability releasing a couple days later than tf).

We do have to solve the import error I commented above. Currently test_data_zarr imports zarr and then checks if its available in order to skip the tests. Therefore, if zarr is not installed, pylint will raise an import error instead of skipping the tests. It should be imported below and make pylint ignore the "wrong" order.

@semohr
Copy link
Contributor Author

semohr commented Jan 26, 2021

True, we can try to move zarr after the pytest statement and disable pylint. Not sure if this is the right approach though.

pytestmark = pytest.mark.skipif(  # pylint: disable=invalid-name
    importlib.util.find_spec("zarr") is None and not running_on_ci(),
    reason="test requires zarr which is not installed",
)

import zarr #pylint: disable=wrong-import-position

Copy link
Member

@OriolAbril OriolAbril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to merge, thank you so much! I hope it was an enjoyable and enriching experience

I will wait until we figure out what is happening with CI and then merge

@semohr
Copy link
Contributor Author

semohr commented Jan 26, 2021

Yes it was quite telling, learned a lot about the azur pipeline. Thanks for taking your time 👍

Base automatically changed from master to main January 26, 2021 19:44
semohr and others added 23 commits February 6, 2021 11:46
- Moved MutableMapping to top
- Check if zarr is installed
- Removed ifs observed_data,constant_data,predictions_constant_data
- Renamed g to zarr_handle

Removed depreciated docstring
Co-authored-by: Ari Hartikainen <ahartikainen@users.noreply.github.com>
It is working locally without errors
Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>
Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>
@OriolAbril OriolAbril merged commit 2be3611 into arviz-devs:main Feb 6, 2021
@OriolAbril
Copy link
Member

Thanks @semohr!

utkarsh-maheshwari pushed a commit to utkarsh-maheshwari/arviz that referenced this pull request May 27, 2021
)

* Added a to_zarr method which converts the InferenceData object
to a hierarchical zarr group.

* Added a from_zarr method which create an inferenceData object from a zarr store.

* Fixed small typo in to_zarr and from_zarr

* Added the ability to create InferenceData from zarr.hierarchy.Group

* Oversight in zarr.hierachy.group.groups() generator call.

* Forgot dictionary definition.

* Cleanup zarr:
- added reference to zarr docs
- replaced type with isInstance
- replaced MemoryStore with TempStore

* Added to methods to CHANGELOG

* PR-Comments:
- Moved MutableMapping to top
- Check if zarr is installed
- Removed ifs observed_data,constant_data,predictions_constant_data
- Renamed g to zarr_handle

Removed depreciated docstring

* Removed typo

* Replaced last occurence of g  with zarr_handle

* Added from packaging import version

* Fixed wrong import order, I did not know this is a thing in python, wow

* Even later import of version

* Yet another try to fix the import order

* Local pylint is working with this formatting...

* Docstring formatting changes and removed deprecated parts.

* Fixed local black version mismatch.

* Update arviz/data/inference_data.py

Co-authored-by: Ari Hartikainen <ahartikainen@users.noreply.github.com>

* Added tests and docs intersphinx_mapping.

* Improved test coverage for to_zarr and from_zarr functions.

* Added pytest.mark.skipif to zarr test class

* Moved test class to new file called 'test_data_zarr'

* Fixed small pylint wrong-import-position error

* Yet another import-order fix

* Pylint is still black magic for me...
It is working locally without errors

* Update arviz/tests/base_tests/test_data_zarr.py

Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>

* Removed running_on_ci

* Reverted last change and removed running_on_ci in right place now

* Moved zarr import down and added `# pylint: disable=wrong-import-position`

* Update arviz/tests/base_tests/test_data_zarr.py

Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>

Co-authored-by: Ari Hartikainen <ahartikainen@users.noreply.github.com>
Co-authored-by: Oriol Abril-Pla <oriol.abril.pla@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants