Add components and flexibility pages #131
base: main
Conversation
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.
I feel like this needs an explicit section in the specification, even if it's pretty trivial.
Turns out it does (at least for filesystems - there's nothing for object storage). See #131 (comment) for more context.
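To make the "keys mapped unaltered to paths" idea concrete, here is a minimal stdlib-only sketch (the keys and helper name are illustrative, not from the spec): each abstract key is used verbatim as a relative path under a root directory, which is what produces the native on-disk layout.

```python
import tempfile
from pathlib import Path

# Hypothetical sketch of the "native format" idea: each abstract
# key-value-store key maps unaltered to a relative path under a root
# directory, so the bytes land on disk exactly where the key says.

def write_key(root: Path, key: str, value: bytes) -> None:
    path = root / key  # the key is used verbatim as a filesystem path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(value)

root = Path(tempfile.mkdtemp())
write_key(root, "group/array/zarr.json", b'{"zarr_format": 3}')
write_key(root, "group/array/c/0", b"\x00" * 16)

assert (root / "group/array/zarr.json").is_file()
```

An implementation that instead hashed keys, or packed values into a database, could still satisfy the key-value store interface while producing a completely different layout on disk.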
components/index.md
Outdated
**Zarr-Python Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
Feels weird to have "abstract" base classes in the "concrete" section, but I think jumping back and forth between talking about zarr-python and language-agnostic concepts would be more confusing.
**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic.
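The tree-of-groups-and-arrays model described above can be sketched in plain Python (hypothetical classes for illustration, not zarr-python's actual types): groups hold child groups or arrays, and every node carries optional attributes.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the zarr data model: a hierarchical tree of
# groups and arrays, with arbitrary metadata ("attributes") at every node.

@dataclass
class Array:
    shape: tuple
    dtype: str
    attributes: dict = field(default_factory=dict)

@dataclass
class Group:
    members: dict = field(default_factory=dict)  # name -> Group | Array
    attributes: dict = field(default_factory=dict)

# A root group containing a subgroup with one array; note metadata can
# live at every level of the tree, and none of it is domain-specific.
root = Group(attributes={"project": "example"})
root.members["measurements"] = Group(attributes={"instrument": "sim"})
root.members["measurements"].members["temperature"] = Array(
    shape=(100, 200), dtype="float64", attributes={"units": "K"}
)

assert root.members["measurements"].members["temperature"].shape == (100, 200)
```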
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Is it okay for me to enshrine the name "Native Zarr Format" here?
what does "native" mean here?
Following #131 (comment), the word "native" is perhaps redundant if we have a clear understanding of what "format" refers to.
See the following GitHub repositories for more information:

* [Zarr Python](https://github.com/zarr-developers/zarr)
* [Zarr Specs](https://github.com/zarr-developers/zarr-specs)
* [Numcodecs](https://github.com/zarr-developers/numcodecs)
* [Z5](https://github.com/constantinpape/z5)
* [N5](https://github.com/saalfeldlab/n5)
* [Zarr.jl](https://github.com/meggart/Zarr.jl)
* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala)
I think it's deeply unhelpful to immediately point at specific implementations here as the source of further explanation. That's not what their docs are for!
@@ -51,6 +45,7 @@ See the following GitHub repositories for more information:
## Features

* Chunk multi-dimensional arrays along any dimension.
* Compress array chunks via an extensible system of compressors.
Seemed like an important omission.
components/index.md
Outdated
These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.
**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data and metadata as byte streams via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
> and metadata as byte streams
small nit: the spec doesn't say the metadata has to be serialized as bytes. (e.g. a memorystore or other database could keep the metadata in a dict-like object)
Should be addressed by 3514d41
components/index.md
Outdated
- **NCZarr** and **Lindi** can both in some sense be considered the opposite of VirtualiZarr - they allow interacting with zarr-formatted data on disk via a non-zarr API.
Lindi maps zarr's data model to the HDF data model and allows access via the `h5py` library through the [`LindiH5pyFile`](https://github.com/NeurodataWithoutBorders/lindi/blob/b125c111880dd830f2911c1bc2084b2de94f6d71/lindi/LindiH5pyFile/LindiH5pyFile.py#L28) class.
[NCZarr](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) allows interacting with zarr-formatted data via the netcdf-c library. Note that both libraries implement optional additional optimizations by going beyond the zarr specification and format on disk, which is not recommended.
I'm not very confident that I've actually understood what NCZarr does properly.
components/index.md
Outdated
- **MongoDBStore** is a concrete store implementation in python, which stores values in a MongoDB NoSQL database under zarr keys.
It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format.
Does this still exist anywhere? I wanted an example of a python store implementation that wasn't in zarr-python v3's `zarr.storage` module, and didn't use the zarr native format on disk.
index.md
Outdated
For more details read about the various [Components of Zarr](https://zarr.dev/components/),
see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation,
or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language.
I'm not sure how to do relative links on this site. These links are broken in the preview docs build because they don't exist on the released site.
components/index.md
Outdated
These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data as byte streams as well as store metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
It might be more accurate to call this the "Zarr Protocol" - that's what it actually is, a set of rules for transferring data between devices. The "specification" then could refer to the description of the protocol + of the data model + of the zarr native format specification.
I edited this - the more I think about it the more I think that the spec itself should explicitly talk about the protocol and the format as separate things.
See #131 (comment) for more explanation
Thanks Tom! I'm on the road for the next week and will read ASAP but I love the idea. 🙌🏼
## Features

* Serialize NumPy-like arrays in a simple and fast way.
I felt like the applications and features were mixed up together.
* Store arrays in memory, on disk, inside a Zip file, on S3, etc.
* Read and write arrays concurrently from multiple threads or processes.
* Organize arrays into hierarchies via annotatable groups.
* Extend easily thanks to the [flexible design](https://zarr.dev/flexibility/).
The link here is intended to start the reader reading through each page in turn, as the other technical pages I added also have a link at the bottom to the next one along.
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
i feel like it makes more sense to lead with the data model. the spec, i.e. the protocol, defines operations (create group, create array, write chunks to an array, etc) that only make sense in light of that particular data model.
> the spec, i.e. the protocol
I think I disagree that these are one and the same (see #131 (comment)), but otherwise agree with your suggestion here.
what's the difference between the contents of the zarr v2 / v3 specs and the zarr v2 / v3 protocols?
See my long comment below: #131 (comment)
**Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, using a `Store` ABC, which is based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
In zarr-python v2 the store API was based on `MutableMapping`, but IMO the zarr-python v3 `Store` API is not really `MutableMapping`-like. Instead it's a pretty vanilla "read and write stuff to kv storage" API.
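To make the contrast concrete, here is a minimal sketch of the two styles. These are hypothetical classes for illustration only, not zarr-python's real `zarr.abc.store.Store` (whose actual methods and signatures differ): a v2-style store that *is* a `MutableMapping`, versus a plain "read and write to kv storage" API in the spirit of v3.

```python
from collections.abc import MutableMapping

# v2 style (sketch): the store is literally a MutableMapping of byte
# values, so dict syntax (store[key] = value) is the store API.
class DictStore(MutableMapping):
    def __init__(self):
        self._d = {}
    def __getitem__(self, key): return self._d[key]
    def __setitem__(self, key, value): self._d[key] = value
    def __delitem__(self, key): del self._d[key]
    def __iter__(self): return iter(self._d)
    def __len__(self): return len(self._d)

# v3 spirit (sketch): explicit get/set/delete/exists verbs rather than
# mapping dunders. Method names here are illustrative assumptions, not
# zarr-python's actual signatures.
class KVStore:
    def __init__(self):
        self._d = {}
    def get(self, key):
        return self._d.get(key)
    def set(self, key, value):
        self._d[key] = value
    def delete(self, key):
        self._d.pop(key, None)
    def exists(self, key):
        return key in self._d

store = KVStore()
store.set("group/array/zarr.json", b'{"zarr_format": 3}')
assert store.exists("group/array/zarr.json")
```

Either style realizes the same abstract key-value store interface; the difference is purely in the python syntax each one enforces.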
**Protocol**: All zarr-related projects use the Zarr Protocol, described in the [Zarr Specification](https://zarr-specs.readthedocs.io/), which allows transfer of chunked array data and metadata between devices (or between memory regions of the same device).
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
I would try to distinguish how metadata documents are stored vs how chunk data is stored. for example, it's significant that the compresspr / filters (v2) and codecs (v3) define the encoding of chunk data, not metadata documents.
My wording was intended to make that distinction already, because Joe said the same thing in an earlier comment. Clearly I need to distinguish them better though.
I think the prose only needs a minor adjustment, since in the previous section you distinguish array data and metadata. It might be sufficient to just disambiguate what exactly is encoded and serialized by the codecs (i.e., the chunks of an array).
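The distinction being discussed can be sketched in a few lines of stdlib Python (keys and metadata fields are illustrative, not taken from the spec): metadata documents are stored as plain JSON bytes, while only chunk data passes through the codec pipeline, with `zlib` standing in for a real zarr compressor codec.

```python
import json
import zlib

# Sketch of the metadata/chunk distinction: metadata is serialized
# directly as JSON, while only *chunk* bytes go through the codec chain.
store: dict = {}

# Metadata document: no codecs involved (keys/fields are illustrative).
metadata = {"shape": [4], "dtype": "int32"}
store["array/zarr.json"] = json.dumps(metadata).encode()

# Chunk data: raw bytes -> codec chain (here just zlib) -> stored bytes.
chunk_bytes = (1).to_bytes(4, "little") * 4
store["array/c/0"] = zlib.compress(chunk_bytes)

# Reading reverses only the chunk codec chain; metadata is parsed as-is.
assert json.loads(store["array/zarr.json"])["dtype"] == "int32"
assert zlib.decompress(store["array/c/0"]) == chunk_bytes
```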
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.
**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be abstract, or enforced by implementations or client libraries however they like, but generally should be opt-in.
what does opt-in mean here? if you are using xarray with zarr, the xarray extensions to zarr are mandatory.
Fair point. All extensions are by definition not required (as then they would be core), but specific tools might well require you to use a certain extension, so calling things "opt-in" or "opt-out" doesn't make much sense.
thanks for working on this, here are a few rambling thoughts that hopefully you find useful: you list 4 abstract components of zarr:
I'm having trouble placing these 4 things in separate conceptual categories. For me, a clearer "abstract parts list" would be something like this:
It feels weird calling this latter description a "protocol" without defining some verbs, but we could restructure the statements to take the form "to create an array at

When discussing extensibility, I think it's important to distinguish between a few scenarios that all get called "extensions":
so basically I don't see "extensibility" as a core abstract component of zarr. Instead I see extensibility as a vital property / feature of different layers of the zarr model, and this varies the version of the zarr format. And I'm not sure what you mean when you say an extension is "abstract".
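The "protocol needs verbs" point above can be sketched concretely: operations like creating an array or writing a chunk, defined purely in terms of key-value store operations, with no claim about where the store keeps its bytes. All names, keys, and metadata fields below are hypothetical illustrations, not the spec's actual definitions.

```python
import json

# Hypothetical sketch of "protocol verbs": each operation is defined
# only in terms of key-value store reads and writes, so any KV backend
# (filesystem, object storage, database, in-memory dict) can carry it.

def create_array(store: dict, path: str, shape: list, dtype: str) -> None:
    doc = {"node_type": "array", "shape": shape, "data_type": dtype}
    store[f"{path}/zarr.json"] = json.dumps(doc).encode()

def write_chunk(store: dict, path: str, index: int, data: bytes) -> None:
    store[f"{path}/c/{index}"] = data

def read_chunk(store: dict, path: str, index: int) -> bytes:
    return store[f"{path}/c/{index}"]

store: dict = {}  # stand-in for any key-value backend
create_array(store, "temperature", shape=[8], dtype="float32")
write_chunk(store, "temperature", 0, b"\x00" * 32)

assert read_chunk(store, "temperature", 0) == b"\x00" * 32
```

Framed this way, the "format" is just what these verbs leave behind when the backend happens to map keys to paths verbatim.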
You've convinced me that extensions are not a core abstract component, they are something else. I can edit this PR to reflect that. That leaves
Before going through these, let me re-describe the conceptual confusion that I'm attempting to clarify with this nomenclature. I used to think that "zarr" was simply a format, which was laid out in a specific way in filesystem or object storage, and that the spec described this format. I think a lot of other people assumed (and still assume) this.

Then once Icechunk came out I was told that icechunk was a valid implementation of the zarr specification but did not use the same format on-disk. This was quite surprising and confusing, because I had thought the specification dictated the format on-disk. Now I realise that VirtualiZarr's

For this to all be consistent, one of the following must be true:
I'm trying to find a nomenclature that more clearly separates these two components. Perhaps that nomenclature already exists, but if so then it's not documented at all on the main

We seem to agree that the data model is its own abstract component. You then mention
this is basically what I mean by the "protocol". I was looking for a word to play counterpart to "format" (i.e. "icechunk obeys the X, not the format"). I think protocol is quite a good word for it - it's an agreement between two systems (or parts of a system) on a scheme for transferring chunk data and metadata. It makes no claims about the type of system implementing the protocol. It's not a networking protocol, but still seems to fit the broader definition of a protocol.
I had thought this didn't exist anywhere, but it turns out that it's here - https://zarr-specs.readthedocs.io/en/latest/v3/stores/filesystem/v1.0.html#file-system-store-v1. (At least that document covers filesystem storage - I think there should be another one for object storage too.) So (1) above is incorrect. That leaves a choice between (2) and (3): whether we say that there is one zarr specification with a mandatory "protocol" and an optional "format", or we say that zarr has a "protocol" and an optional "format", with separate specifications describing each. I have no strong opinion on that, I only request that we have some word other than "specification" to describe the non-format abstract component of zarr.
I agree with this, and I think we should emphasize the protocol angle. In this context I think a key difference between a format and a protocol is that a format is a state, but a protocol is set of rules (generally speaking). So a good framing of the zarr specs would be:
So I think my preference would be to give primacy to the protocol. The stored representation of metadata and chunks should be considered the interaction between the zarr protocol and the behavior of a storage backend.
Implements the suggestion in zarr-developers/zarr-python#2956.
Not quite finished yet. This is ready for review (@d-v-b @joshmoore)