
Add N5 Support #309

Merged: 76 commits into zarr-developers:master on Feb 26, 2019

Conversation

funkey
Contributor

@funkey funkey commented Oct 18, 2018

This adds support for reading from and writing to N5 containers. The N5Store handling the conversion between the zarr and N5 formats will automatically be selected whenever the path for a container ends in .n5 (similar to how the ZipStore is used for files ending in .zip).
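The extension-based dispatch can be sketched roughly like this (a simplified illustration of the idea, not zarr's actual implementation; the helper name and return values below are hypothetical):

```python
import os

def select_store(path):
    """Pick a store kind based on the path suffix.

    Simplified sketch of zarr's extension-based store selection;
    the real logic lives inside zarr's store normalization.
    """
    ext = os.path.splitext(path)[1].lower()
    if ext == ".n5":
        return "N5Store"         # transparent zarr <-> N5 conversion
    elif ext == ".zip":
        return "ZipStore"        # zip-file-backed store
    else:
        return "DirectoryStore"  # plain directory of chunk files

print(select_store("volume.n5"))  # N5Store
```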

The conversion is done mostly transparently, with one exception being the N5ChunkWrapper: This is a Codec with id n5_wrapper that will automatically be wrapped around the requested compressor. For example, if you create an array with a zlib compressor, in fact the n5_wrapper codec will be used that delegates to the zlib codec internally. The additional codec was necessary to introduce N5's chunk headers and ensure big endian storage.
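For reference, the N5 default-mode block header stores a mode flag, the number of dimensions, and the per-dimension block size as big-endian integers ahead of the compressed payload. A minimal sketch of packing such a header with the standard library (field layout per my reading of the N5 format; the helper itself is illustrative, not the actual `N5ChunkWrapper` code):

```python
import struct

def pack_n5_header(block_shape, mode=0):
    """Pack an N5 default-mode block header: big-endian uint16 mode,
    uint16 ndim, then one uint32 size per dimension."""
    ndim = len(block_shape)
    header = struct.pack(">HH", mode, ndim)          # mode, ndim
    header += struct.pack(">%dI" % ndim, *block_shape)  # per-dim sizes
    return header

hdr = pack_n5_header((64, 64, 64))
print(len(hdr))  # 2 + 2 + 3*4 = 16 bytes
```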

On a related note, gzip-compressed N5 arrays cannot currently be read, since numcodecs treats zlib and gzip as synonyms, which they are not (their compression headers differ). PR zarr-developers/numcodecs#87 solves this issue.
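The header difference is easy to see with the standard library: raw zlib output starts with a two-byte zlib header (first byte 0x78 at default settings), while gzip streams start with the magic bytes 0x1f 0x8b:

```python
import zlib
import gzip

data = b"some chunk bytes"

z = zlib.compress(data)  # zlib wrapper: 2-byte header + deflate + adler32
g = gzip.compress(data)  # gzip wrapper: 10-byte header + deflate + crc32

print(z[:1].hex())  # '78'   -> zlib header byte
print(g[:2].hex())  # '1f8b' -> gzip magic
```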

See also https://github.com/zarr-developers/zarr/issues/231

TODO:

  • Add unit tests and/or doctests in docstrings
  • Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or
    pytest -v --doctest-modules zarr)
  • Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or
    pytest -v zarr)
  • PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI pass
  • Test coverage to 100% (Coveralls passes)

funkey added 29 commits October 13, 2018 17:12
This requires numcodecs GZip support to be compatible with the N5
standard. Added with 410af66be0ea470d77b923e516d63cf5238114db to
numcodecs.
As we are only using the logger to issue warnings, just issue warnings
instead and drop the logger. Picked `RuntimeWarning`s for these two
warnings as they concern "dubious runtime behavior", namely using
compressors that are not explicitly supported by many N5
implementations to write N5 files.
Make sure that `N5Store` is raising `RuntimeWarnings` for some
compressors that are not widely supported.
As N5 stores attributes and metadata in the same JSON object, there are
some keys that are simply not allowed to be attributes as they would
conflict with the metadata (and possibly corrupt the data if they were
written in). The `N5Store` correctly raises for these cases, but this
was not being tested previously. Here we add a test to cover this case.
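Because N5 keeps array metadata and user attributes in the same attributes.json, the metadata keys must be off-limits as attribute names. A sketch of such a guard (the reserved keys below match N5's array metadata fields; the helper itself is illustrative, not zarr's actual code):

```python
# Keys N5 reserves for array metadata inside attributes.json.
N5_RESERVED = {"dimensions", "blockSize", "dataType", "compression"}

def check_attrs(attrs):
    """Raise if user attributes would collide with N5 metadata keys."""
    clashes = N5_RESERVED.intersection(attrs)
    if clashes:
        raise ValueError(
            "reserved N5 keys used as attributes: %s" % sorted(clashes))
    return attrs

check_attrs({"resolution": [4, 4, 40]})  # fine: no clash with metadata
```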
Check that valid `fill_values` work with N5Store-backed arrays and that
invalid `fill_values` raise.
We are explicitly setting the `blocksize` to `0` internally. So this
assertion is not a user-facing error. It would also be difficult to test
from the API. As we set the `blocksize` to `0` in a different function,
it does make sense to `assert` this here.
@jakirkham
Member

jakirkham commented Feb 23, 2019

Alright, I have updated all user-facing exceptions to be something other than asserts. Mostly these are `ValueError`s, but I am happy to change them if others have thoughts. A couple are still asserts, as they are actually internal checks.

Also played around with some sample data that @constantinpape and @hanslovsky have compiled in the zarr_implementations repo. This worked pretty well, but there were a few issues around LZ4 data from N5 (LZ4 support has been dropped from this PR). Otherwise I was able to load all other N5 files generated by N5 or z5py. More details about how this was tested, along with the results, can be found in this comment.

In my view, this code is ready to be merged. The LZ4 issues should be explored, but they can be placed in a separate issue. Would appreciate it if some other people did some final reviews of the PR.

Note: Previously this mentioned there were issues with XZ. Some recent commits have been pushed to support XZ as a special case of Numcodecs' LZMA, which works well. Also LZ4 support has been disabled in N5Store for now. Raised issue ( zarr-developers/numcodecs#175 ) about the LZ4 differences.

@jakirkham jakirkham mentioned this pull request Feb 23, 2019
Numcodecs' LZMA compressor is able to handle XZ compressed data as a
special case. This maps XZ compression in N5 to Zarr's LZMA compression
with the proper arguments as well as maps Zarr's LZMA support back to
N5's XZ support if the right options are set.
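The mapping relies on the fact that the xz container is one of the formats Python's lzma module (which backs Numcodecs' LZMA codec) can emit directly; a quick stdlib check (codec parameters here are illustrative, not the exact options the PR uses):

```python
import lzma

data = b"N5 block payload"

# FORMAT_XZ produces a standard .xz stream, recognizable by its magic bytes.
compressed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=6)

print(compressed[:6] == b"\xfd7zXZ\x00")    # True: xz magic bytes
print(lzma.decompress(compressed) == data)  # True: round-trips
```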
Adds some tests for XZ support in N5 using Numcodecs' LZMA compressor
with specific options. Also tests LZMA options that are not currently
supported by N5 to ensure they still warn and are handled correctly.
There is not a clear mapping between LZ4 in the Zarr and N5
implementations at present. So this drops LZ4 from the supported codecs
in `N5Store` for now.
@constantinpape

As far as I understand it, this is not fully compatible with the n5 format yet, because it doesn't resize/clip edge chunks. This will make datasets whose edges don't align with chunks unreadable in n5-java (and, for that matter, also in z5py). See also the example by @hanslovsky above.
If I remember correctly, @funkey and @axtimwalde were discussing whether this should be changed in the n5 spec; until this is decided, maybe delay merging this?
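For reference, cropping a trailing block to the array bounds is a small computation; a sketch of what readers expecting clipped edge chunks do (illustrative, not code from either implementation):

```python
def edge_block_shape(array_shape, block_shape, block_index):
    """Shape of the block at `block_index`, cropped to the array bounds."""
    return tuple(
        min(b, s - i * b)
        for s, b, i in zip(array_shape, block_shape, block_index)
    )

# A 100-long axis with 32-long blocks: the last block holds only 4 elements.
print(edge_block_shape((100,), (32,), (3,)))  # (4,)
```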

@axtimwalde
Contributor

@constantinpape you're right that this only implements a subset of n5, i.e. it does not store arbitrary block sizes in any case and cannot deal with multiset types or other varlength-based tricks. As such, however, it is already useful. N5 has no spec that requires blocks to have any specific size. N5-ImgLib2 until recently expected that trailing blocks are cropped to size, but I have changed this there (saalfeldlab/n5-imglib2@dd26251) because this seemed easier than supporting variable-size blocks in zarr. The corresponding N5-ImgLib2 release is 3.2.0: https://github.com/saalfeldlab/n5-imglib2/tree/3.2.0.

@jakirkham
Member

That's very cool! Thanks for sharing that, @axtimwalde. Is N5-ImgLib2 the primary way you are reading in N5 files or are there other libraries we should also be thinking about?

I understand some things are still not supported, @constantinpape, but in my view this is a first pass, which will be easier to iterate on once merged. Raising issues for additional functionality is always possible and can be more easily pursued in parallel once this is integrated.

Also have talked to several people already who are pip installing this branch as opposed to a normal Zarr release. While it's nice that they are able to get started so easily, this is not great from a support perspective. For instance they may be missing other bug fixes that are in master or they may be getting a different product each time they (or their colleagues) install. A release would really help set a baseline for how things work going forward.

@alimanfoo
Member

alimanfoo commented Feb 24, 2019 via email

@axtimwalde
Contributor

N5-ImgLib2 is the preferred way to load actual tensor data, but we also use the core block loading/storing API directly; e.g. multiset pixels are still handled elsewhere, and Paintera uses it for indexing and agglomeration.

Thanks to @alimanfoo for the support to add this as an experimental feature. I also like the idea of decoupling the backend from the access API. I do indeed have a bunch of N5 containers sitting on Google Cloud, and our stitching pipeline https://github.com/saalfeldlab/stitching-spark is meant to run on AWS, loading from and exporting to it.

@constantinpape

constantinpape commented Feb 25, 2019

I understand some things are still not supported, @constantinpape, but in my view this is a first pass, which will be easier to iterate on once merged. Raising issues for additional functionality is always possible and can be more easily pursued in parallel once this is integrated.

Fully agree. My concern was that this would not be compatible with core n5-java for edge chunks that do not align. @axtimwalde has clarified that this is not the case, so no objections to merge from my side.

@jakirkham
Member

Thanks for the info. Is it pretty easy for the other libraries to use the same edge chunk handling strategy?

Thanks @alimanfoo. Have already added some notes to the docs explaining this is experimental and subject to change.

The transformational layer is probably useful for a few cases and does sound interesting. That said, I lack the time to pursue this personally and am guessing @funkey is similarly busy. Could I suggest that you raise a new issue with some more info about what you have in mind? Maybe we can discuss this a bit more after this is merged.

@alimanfoo
Member

The transformational layer is probably useful for a few cases and does sound interesting. That said, I lack the time to pursue this personally and am guessing @funkey is similarly busy. Could I suggest that you raise a new issue with some more info about what you have in mind? Maybe we can discuss this a bit more after this is merged.

SGTM, have raised zarr-developers/n5py#9. Hopefully that makes sense, but happy to elaborate.

@jakirkham jakirkham merged commit a780691 into zarr-developers:master Feb 26, 2019
@jakirkham jakirkham added this to the v2.3 milestone Feb 26, 2019
@jakirkham
Member

Thanks everyone! 🍾 🎉 😄

As it sounds like we are ok with this as is for now, have gone ahead and merged it. If you encounter any problems, please feel free to raise an issue and we can discuss how best to address it.

@axtimwalde
Contributor

Thanks a lot everybody!

@jakirkham jakirkham mentioned this pull request Feb 26, 2019