z5 library (Zarr/N5 interoperability) #44
Comments
I started writing z5 because I needed access to chunked data storage that allows parallel I/O. The main differences I can think of right now:
|
@jakirkham - thank you for posting. With the recent addition of zarr as a backend to the xarray project, we have been discussing that a potentially key limitation of the library/storage format is the lack of an implementation in a low-level language (e.g. C++) for others, beyond the Python world, to use. It sounds like this has been accomplished by z5, which is indeed encouraging on many fronts. |
Is it a sensible goal to evolve both projects to implement the same spec? |
Possibly. FWIW here's N5's spec. |
Thanks for the info @constantinpape. That's very helpful. :) Regarding axis convention, did you play at all with Zarr Array's Could you please elaborate on the partial chunks point? What parts are being stored? What is tracked in the header? Also could you elaborate on what it means to store multiple values per index? As for the directory layout, a recent PR ( zarr-developers/zarr-python#177 ) provided the option to nest directories with Zarr as well. So this option should be available in the 2.2.0 release. It also seems that N5 stores the attributes in a different file (though still JSON). |
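To make the directory layout point concrete, here is a minimal sketch of how a chunk's grid index could map to a flat key versus a nested path. The function name `chunk_key` and its `separator` parameter are hypothetical and only for illustration; this is not the Zarr or N5 API, and it ignores any axis-order differences between the two formats.

```python
def chunk_key(chunk_index, separator="."):
    """Build a storage key for a chunk from its grid index.

    Hypothetical helper for illustration only, not part of Zarr, N5, or z5.
    A "." separator gives one flat file name per chunk, while "/" gives
    a nested directory path.
    """
    return separator.join(str(i) for i in chunk_index)


flat_key = chunk_key((1, 2, 3))         # flat layout
nested_key = chunk_key((1, 2, 3), "/")  # nested directories
```

With a nested separator, each leading index becomes a directory, so a reader that expects one layout has to be told (or has to detect) which separator was used when the data was written.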
The axis convention actually matters in two different places: First, in how the attributes (i.e. shape and chunk shape) are stored and, accordingly, in which order the chunks are addressed. Here, Zarr uses the Second, in how chunks are stored on disk. Here, Zarr supports both orders with the Regarding chunks:
Note that the chunk shape is adjusted such that each chunk "fits" into the dataset. Zarr doesn't need it, because it always stores the chunk with full shape, e.g. in the case above Good to know that Zarr will also support nested directories. N5 stores all metadata in |
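As a rough sketch of the difference described above: when edge chunks are shortened to fit the dataset, their shape can be computed from the array shape, the nominal chunk shape, and the chunk's grid index. The function name `edge_chunk_shape` and the example sizes are assumptions made for illustration; this is not code from z5, N5, or Zarr.

```python
def edge_chunk_shape(array_shape, chunk_shape, chunk_index):
    """Return the shape of one chunk after clipping it to the array bounds.

    Hypothetical helper for illustration only. Interior chunks keep the
    nominal chunk shape; chunks on the upper edge of an axis are shortened
    so that they do not extend past the end of the array.
    """
    clipped = []
    for size, chunk, idx in zip(array_shape, chunk_shape, chunk_index):
        start = idx * chunk
        if not 0 <= start < size:
            raise IndexError(f"chunk index {idx} is outside an axis of length {size}")
        clipped.append(min(chunk, size - start))
    return tuple(clipped)


# Assumed example: a 100 x 100 array with nominal 30 x 30 chunks.
interior = edge_chunk_shape((100, 100), (30, 30), (0, 0))  # (30, 30)
corner = edge_chunk_shape((100, 100), (30, 30), (3, 3))    # (10, 10)
```

A format that always writes full-size chunks would store 30 x 30 elements for the corner chunk as well, and a reader would discard the part outside the array.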
Thanks for the follow-up @constantinpape. Was wondering if the chunk naming was the cause for the C/F difference. It might be possible to make this configurable in Zarr for the intention of closer compatibility with other formats. Raised issue ( zarr-developers/zarr-python#232 ) to discuss this point and have laid out a potential way forward. Though other thoughts/suggestions would certainly be welcome. As to the chunk layout, we probably could investigate shortening edge chunks somehow. Have written up some thoughts on it in issue ( zarr-developers/zarr-python#233 ) with a few possibilities. Think we could implement this without a header, but it would be good to know if I'm missing anything. Also would generally appreciate feedback on the ideas there and any other approaches that would be worth considering. Am still a little confused on As to the metadata, raised issue ( saalfeldlab/n5#24 ) suggesting these be broken out into two files. Though that is admittedly a breaking change as it is proposed. Perhaps there is a way to do it so that it is non-breaking? Any thoughts on any of these issues would be appreciated. |
The current plan for N5 is to support zarr as one of many possible backends. Towards this end, I introduced extensible compression schemes in 2.0.0 because zarr uses blosc instead of the Java internals that we use out of convenience. n5-hdf5 is an example of such an alternative backend. It implements the N5 API on top of HDF5 at a best effort level. For zarr, this best effort will be better because the concepts are more similar. I reached out to @alimanfoo about the similarities and differences and we are both interested in converging, yet limited in how much time we can spend. The new "/" block separator for zarr is one of the outcomes of this; this project "z5" is another ;). Z5 currently only supports the file system spec of both formats, not the AWS and Google cloud implementations. Shortening edge chunks without a header means that you have to do some metadata math before loading, and I wanted to keep it simple. The header also allows overlapping blocks, which is starting to become handy in some of our applications. varlength is for datatypes that do not store scalar pixel values but e.g. multisets. I am undecided what the most flexible yet simple and efficient spec would be, and we will see it evolve. |
This has been folded into various other discussions so I'm going to close it, but I'll also transfer the issue to the zarr-specs repo so it's nearer to other spec-related discussions. |
Ran across z5 recently, which allows reading and writing of both Zarr and N5 in C++ and Python. As Zarr and N5 have both grown FWICT for similar reasons, but in different languages (Python and Java respectively), am interested to understand the similarities and differences between them. Along those lines, it would be good to learn in what areas interoperability between Zarr and N5 can be improved. I think we would be in a really great place if data can more smoothly move between these two formats and different languages.
cc @constantinpape @saalfeldlab