Zarr N5 spec diff #3
Comments
Sorry for any confusion, nesting datasets inside datasets is not allowed in zarr. I.e., you can put a group inside another group, or you can put a dataset inside a group. The word "hierarchy" in the zarr spec is used to mean a tree of groups and datasets, starting from some root group.
In zarr, the "C" or "F" order refers to the ordering of items within a chunk. Not completely sure what you mean by "indexing" here. E.g., do you mean how we refer to a specific chunk within the grid of chunks for a given array? If so, the indexing of chunks within the chunk grid is only ever done in zarr in row-major order. E.g., for a 2D array of shape (100, 100) and chunk shape (10, 10), chunk "0.1" always means the chunk with rows 0-9 and columns 10-19.
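To make the chunk-grid example concrete, here is a minimal sketch of how a chunk key maps to the region of the array it covers. The helper name `chunk_region` is hypothetical and not part of the zarr API; it only assumes that the i-th index in the key counts chunks along the i-th axis.

```python
def chunk_region(key, chunks, sep="."):
    """Return the half-open (start, stop) index range of a chunk along each axis."""
    indices = [int(part) for part in key.split(sep)]
    if len(indices) != len(chunks):
        raise ValueError("chunk key and chunk shape differ in dimensionality")
    # Chunk i along an axis with chunk length c starts at i * c and stops before (i + 1) * c.
    return [(i * c, (i + 1) * c) for i, c in zip(indices, chunks)]


# Chunk "0.1" of a (100, 100) array with (10, 10) chunks:
# rows 0-9 and columns 10-19 (the stop values are exclusive).
print(chunk_region("0.1", chunks=(10, 10)))  # [(0, 10), (10, 20)]
```

The same function works for `/` separated keys by passing `sep="/"`.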
Thanks for clarifying, that makes total sense.
Yes, that's what I meant.
In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features. E.g., the default file-system storage class in zarr python (DirectoryStore) does the obvious thing of mapping keys to file paths without any transformation, so you will get a file called "2.4". But there is an alternative implementation of file-system storage (NestedDirectoryStore) which applies a transformation on the chunk keys to get to file paths, so you get file paths like "2/4".

This is an example of how in the zarr storage spec there is a separation between the store interface, which is assumed to be an abstract key-value interface and does not make any assumptions about how that will get implemented in terms of files or objects or memory or whatever; and the underlying storage implementation, which makes concrete decisions about what files to create (if using a file system) and what each file should contain. The zarr storage spec does not place any constraint on the storage implementation, as long as you can provide a key-value interface over it then any form of storage is allowed. E.g., storing data inside an sqlite3, bdb or lmdb database, or zip file, are all valid ways of storing zarr data on a file system.

That said I fully take the point that several people have made that it would be useful to also have some concrete storage implementations documented, either within the zarr storage spec or in some associated specs, so that e.g. anyone who wants to implement a specific file format can do so more easily.
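The separation between the key-value interface and the storage implementation can be sketched in a few lines. This is a toy illustration only: `NestedKeyStore` is a hypothetical class backed by a dict, not the zarr-python `NestedDirectoryStore`, and the rule it uses to recognise chunk keys (all-digit parts separated by periods) is an assumption made for the example.

```python
class NestedKeyStore:
    """Toy key-value store that nests chunk keys, e.g. "2.4" is stored as "2/4"."""

    def __init__(self):
        self._data = {}  # stands in for a file system: path -> bytes

    @staticmethod
    def _to_path(key):
        prefix, _, name = key.rpartition("/")
        # Only transform chunk keys such as "2.4"; metadata keys like ".zarray" stay as they are.
        if name and all(part.isdigit() for part in name.split(".")):
            name = name.replace(".", "/")
        return f"{prefix}/{name}" if prefix else name

    def __setitem__(self, key, value):
        self._data[self._to_path(key)] = value

    def __getitem__(self, key):
        return self._data[self._to_path(key)]

    def paths(self):
        return sorted(self._data)


store = NestedKeyStore()
store["data/2.4"] = b"chunk bytes"
store["data/.zarray"] = b"{}"
print(store.paths())      # ['data/.zarray', 'data/2/4']
print(store["data/2.4"])  # b'chunk bytes'
```

Callers keep using the period-separated key; only the store knows that the value ends up under a nested path.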
Many thanks @constantinpape, great summary. Hopefully comments have clarified a few points, but very happy to expand on any areas.
Ok, this makes perfect sense. There might be implementations where nested does not have any meaning.
Yes, that would be very helpful indeed.
Thanks for clarifying @alimanfoo.
Along these lines PR (zarr-developers/zarr-python#309) might be of interest. This effectively remaps keys to allow access of N5 content from within the Python Zarr library.
Thanks for pointing this out @jakirkham. I think that in general consolidating the specs would be of great use nevertheless.
This is great, thank you!
👍 For the items you listed under The chunk differences seem to reverse that though....
Yes that captures it pretty well. For chunks, n5 is more expressive, as it supports clipped edge chunks and varlength mode by means of the header data.
Overview of the diff between zarr and n5 specs with the potential goal of consolidating the two formats.
@alimanfoo, @jakirkham / @axtimwalde please correct me if I am misrepresenting zarr / N5 spec or if you think there is something to add here.
Note that the zarr and n5 specs have different naming conventions.
The data containers are called arrays in zarr and datasets in n5.
Zarr refers to the nested storage of data containers as hierarchies or groups
(it is not quite clear to me what the actual difference is, see below); n5 only refers to groups.
I will use the group / dataset notation.
Edit:
Some corrections from @alimanfoo; I left in the original statements but struck them out.
Groups
- zarr: `.zgroup`, which MUST contain `zarr_format` and MUST NOT contain any other keys. They CAN contain additional attributes in `.zattrs`.
- n5: `attributes.json`, containing arbitrary json serializable attributes. The root group `"/"` MUST contain the key `n5` with the n5 version.
- ~~zarr makes a distinction between hierarchies and groups. I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset.~~ Zarr does not allow nested datasets (i.e. a dataset containing another dataset). This is not allowed in n5 either, I think. The spec does not explicitly forbid it though.

Datasets
- zarr: metadata in `.zarray`.
- n5: metadata in `attributes.json`.
- zarr: supports `C` (row-major) and `F` (column major) indexing, which determines ~~how chunks are indexed and~~ how chunks are stored. This is determined via the key `order`. Chunks are always indexed as row-major.
- n5: column-major (`F`).
- zarr: key `dtype` holds numpy type encoding. Importantly, supports big- and little-endian, which MUST be specified.
- n5: key `dataType`, only numerical types and only big endian.
- zarr: supports `compressors`.
- n5: `raw` (= no compression), `bzip2`, `gzip`, `lz4` and `xz`. There is a mechanism to support additional compressors. Stored in key `compression`.
- zarr: supports `filters`.
- zarr: key `fill_value`.
- zarr: additional attributes in `.zattrs`.
- n5: additional attributes in `attributes.json`. MUST NOT override keys reserved for metadata.

In addition, zarr and n5 store the shape of the dataset and of the chunks in the metadata with the keys `shape`, `chunks` / `dimensions`, `blockSize`.
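As a rough illustration of the metadata keys compared above, the sketch below puts two hand-written metadata documents side by side and lists the zarr keys without a direct n5 counterpart. The values are made up for the example, the `COUNTERPARTS` table is an assumption based on the key names in this comparison plus the zarr v2 key `compressor`, and no attempt is made to translate values (dtype encoding, endianness or axis order).

```python
# Illustrative zarr array metadata, as it might appear in .zarray
zarray = {
    "zarr_format": 2,
    "shape": [100, 100],
    "chunks": [30, 30],
    "dtype": "<f8",
    "compressor": None,  # no compression
    "fill_value": 0,
    "order": "C",
    "filters": None,
}

# Illustrative n5 dataset metadata, as it might appear in attributes.json
n5_attributes = {
    "dimensions": [100, 100],
    "blockSize": [30, 30],
    "dataType": "float64",
    "compression": {"type": "raw"},
}

# Assumed pairing of key names only: zarr key -> n5 key.
COUNTERPARTS = {
    "shape": "dimensions",
    "chunks": "blockSize",
    "dtype": "dataType",
    "compressor": "compression",
}

# Every paired n5 key is present in the n5 example.
assert all(n5_key in n5_attributes for n5_key in COUNTERPARTS.values())

# zarr keys with no n5 dataset-level counterpart in this pairing
# (the n5 version lives in the root group's attributes instead).
zarr_only = sorted(key for key in zarray if key not in COUNTERPARTS)
print(zarr_only)  # ['fill_value', 'filters', 'order', 'zarr_format']
```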

Chunk storage
- zarr: edge chunks are stored with the full chunk shape (e.g. chunk shape `(30, 30)` and dataset shape `(100, 100)`).
- zarr: chunks are indexed by `.` separated keys, e.g. `2.4`. ~~I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec.~~ These keys get mapped to a representation appropriate for the implementation. E.g. on the filesystem, keys can be trivially mapped to files called `2.4` or nested as `2/4`.
- n5: nested, e.g. `2/4`. (This is also implementation dependent. There are implementations where nested might not make sense. The difference is only `.` separated vs. `/` separated.)
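For the edge-chunk example with chunk shape `(30, 30)` and dataset shape `(100, 100)`, the following sketch computes the chunk grid and the valid (clipped) extent of a chunk, which is what the clipped edge chunks mentioned for n5 in the comments would hold, whereas a full-size chunk always has the shape `(30, 30)`. The helper names `grid_shape` and `clipped_chunk_shape` are hypothetical and not taken from either library.

```python
import math


def grid_shape(shape, chunks):
    """Number of chunks along each axis, counting partially filled edge chunks."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def clipped_chunk_shape(index, shape, chunks):
    """Shape of the part of a chunk that lies inside the dataset."""
    return tuple(min(c, s - i * c) for i, s, c in zip(index, shape, chunks))


shape, chunks = (100, 100), (30, 30)
print(grid_shape(shape, chunks))                   # (4, 4)
print(clipped_chunk_shape((0, 0), shape, chunks))  # (30, 30), an interior chunk
print(clipped_chunk_shape((0, 3), shape, chunks))  # (30, 10), clipped along the second axis
print(clipped_chunk_shape((3, 3), shape, chunks))  # (10, 10), the corner chunk
```

The last chunk along each axis starts at index 90, so only 10 of its 30 elements fall inside the dataset.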