
Add components and flexibility pages #131

Open · wants to merge 29 commits into `main`
Conversation

@TomNicholas (Member) commented Apr 4, 2025

Implements the suggestion in zarr-developers/zarr-python#2956.

Not quite finished yet, but this is ready for review (@d-v-b @joshmoore)

Comment on lines +24 to +25
**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.
TomNicholas (Member Author):

I feel like this needs an explicit section in the specification, even if it's pretty trivial.

TomNicholas (Member Author):

Turns out it does (at least for filesystems - there's nothing for object storage). See #131 (comment) for more context.
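To illustrate what "mapped unaltered" means in the quoted lines above, here is a minimal sketch; the root directory and key are hypothetical:

```python
# Minimal sketch of the "unaltered mapping": each abstract store key becomes
# a relative path under some root directory. Root and key are made up.
from pathlib import Path

def key_to_path(root: Path, key: str) -> Path:
    return root / key

# "mygroup/myarray/c/0/0" is a v3-style chunk key
print(key_to_path(Path("/data/example.zarr"), "mygroup/myarray/c/0/0"))
# -> /data/example.zarr/mygroup/myarray/c/0/0
```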

Comment on lines 35 to 36
**Zarr-Python Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
TomNicholas (Member Author):

Feels weird to have "abstract" base classes in the "concrete" section, but I think jumping back and forth between talking about zarr-python and language-agnostic concepts would be more confusing.
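For concreteness, a minimal sketch of the `MutableMapping`-style store idea the quoted lines describe; this is illustrative only, not the actual `zarr.abc` classes:

```python
# A dict-backed key-value store: keys are strings, values are raw bytes.
# Any object with this shape could, in principle, back a zarr hierarchy.
from collections.abc import MutableMapping

class DictStore(MutableMapping):
    def __init__(self):
        self._data: dict[str, bytes] = {}

    def __getitem__(self, key: str) -> bytes:
        return self._data[key]

    def __setitem__(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def __delitem__(self, key: str) -> None:
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)
```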

**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic.
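A sketch of that tree-of-groups-and-arrays model using zarr-python (method names assumed from the current v3 docs; the group, array, and attribute names are made up):

```python
# A tree of groups and arrays, each node carrying arbitrary attributes.
import zarr

root = zarr.group()  # in-memory by default
measurements = root.create_group("measurements")
measurements.attrs["instrument"] = "hypothetical-sensor"  # per-node metadata
temperature = measurements.create_array(
    "temperature", shape=(100, 100), dtype="float32"
)
```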

**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
TomNicholas (Member Author), Apr 4, 2025:

Is it okay for me to enshrine the name "Native Zarr Format" here?

Reply:

what does "native" mean here?

TomNicholas (Member Author):

Following #131 (comment), the word "native" is perhaps redundant if we have a clear understanding of what "format" refers to.
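For reference, a hedged sketch of what the native layout looks like when keys are mapped unaltered to a local filesystem; file names assume the v3 format and zarr-python v3 defaults:

```python
# Writing a tiny array with zarr-python v3 (API assumed from current docs)...
import zarr

arr = zarr.create_array(store="example.zarr", shape=(4, 4), chunks=(2, 2), dtype="int32")
arr[:] = 42

# ...should leave a layout like this on a POSIX filesystem:
#
# example.zarr/
# ├── zarr.json        <- array metadata document
# └── c/               <- chunk prefix ("/" is the default separator)
#     ├── 0/0, 0/1
#     └── 1/0, 1/1
```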

Comment on lines -35 to -43
See the following GitHub repositories for more information:

* [Zarr Python](https://github.com/zarr-developers/zarr)
* [Zarr Specs](https://github.com/zarr-developers/zarr-specs)
* [Numcodecs](https://github.com/zarr-developers/numcodecs)
* [Z5](https://github.com/constantinpape/z5)
* [N5](https://github.com/saalfeldlab/n5)
* [Zarr.jl](https://github.com/meggart/Zarr.jl)
* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala)
TomNicholas (Member Author):

I think it's deeply unhelpful to immediately point at specific implementations here as the source of further explanation. That's not what their docs are for!

@@ -51,6 +45,7 @@ See the following GitHub repositories for more information:
## Features

* Chunk multi-dimensional arrays along any dimension.
* Compress array chunks via an extensible system of compressors.
TomNicholas (Member Author):

Seemed like an important omission.


These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data and metadata as byte streams via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
Reply (Member):

> and metadata as byte streams

small nit: the spec doesn't say the metadata has to be serialized as bytes. (e.g. a memorystore or other database could keep the metadata in a dict-like object)

TomNicholas (Member Author):

Should be addressed by 3514d41


- **NCZarr** and **Lindi** can both in some sense be considered the opposite of VirtualiZarr - they allow interacting with zarr-formatted data on disk via a non-zarr API.
Lindi maps zarr's data model to the HDF data model and allows access via the `h5py` library through the [`LindiH5pyFile`](https://github.com/NeurodataWithoutBorders/lindi/blob/b125c111880dd830f2911c1bc2084b2de94f6d71/lindi/LindiH5pyFile/LindiH5pyFile.py#L28) class.
[NCZarr](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) allows interacting with zarr-formatted data via the netcdf-c library. Note that both libraries implement optional additional optimizations that go beyond the zarr specification and on-disk format, which is not recommended.
TomNicholas (Member Author):

I'm not very confident that I've actually understood what NCZarr does properly.

Comment on lines 51 to 52
- **MongoDBStore** is a concrete store implementation in python, which stores values in a MongoDB NoSQL database under zarr keys.
It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format.
TomNicholas (Member Author):

Does this still exist anywhere? I wanted an example of a python store implementation that wasn't in zarr-python v3's zarr.storage module, and didn't use the zarr native format on disk.

index.md Outdated
Comment on lines 35 to 37
For more details read about the various [Components of Zarr](https://zarr.dev/components/),
see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation,
or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language.
TomNicholas (Member Author):

I'm not sure how to do relative links on this site. These links are broken in the preview docs build because they don't exist on the released site.


These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Specification**: All zarr-related projects obey the [Zarr Specification](https://zarr-specs.readthedocs.io/), which formally describes how to serialize and de-serialize array data as byte streams as well as store metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
TomNicholas (Member Author), Apr 5, 2025:

It might be more accurate to call this the "Zarr Protocol" - that's what it actually is: a set of rules for transferring data between devices. The "specification" could then refer to the description of the protocol + the data model + the zarr native format.

TomNicholas (Member Author), Apr 5, 2025:

I edited this - the more I think about it the more I think that the spec itself should explicitly talk about the protocol and the format as separate things.

TomNicholas (Member Author):

See #131 (comment) for more explanation

@joshmoore (Member):

Thanks Tom! I'm on the road for the next week and will read ASAP but I love the idea. 🙌🏼

@TomNicholas TomNicholas changed the title Add components page Add components and flexibility page Apr 5, 2025
@TomNicholas TomNicholas changed the title Add components and flexibility page Add components and flexibility pages Apr 5, 2025
@TomNicholas TomNicholas added the enhancement New feature or request label Apr 5, 2025

## Features

* Serialize NumPy-like arrays in a simple and fast way.
TomNicholas (Member Author):

I felt like the applications and features were mixed up together.

* Store arrays in memory, on disk, inside a Zip file, on S3, etc.
* Read and write arrays concurrently from multiple threads or processes.
* Organize arrays into hierarchies via annotatable groups.
* Extend easily thanks to the [flexible design](https://zarr.dev/flexibility/).
TomNicholas (Member Author):

The link here is intended to start the reader reading through each page in turn, as the other technical pages I added also have a link at the bottom to the next one along.

The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.

**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
Reply:

i feel like it makes more sense to lead with the data model. the spec, i.e. the protocol, defines operations (create group, create array, write chunks to an array, etc) that only make sense in light of that particular data model.

TomNicholas (Member Author), Apr 5, 2025:

> the spec, i.e. the protocol

I think I disagree that these are one and the same (see #131 (comment)), but otherwise agree with your suggestion here.

Reply:

what's the difference between the contents of the zarr v2 / v3 specs and the zarr v2 / v3 protocols?

TomNicholas (Member Author):

See my long comment below: #131 (comment)

Comment on lines +36 to +37
**Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, using a `Store` ABC, which is based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
Reply:

In zarr-python v2 the store API was based on `MutableMapping`, but IMO the zarr-python v3 `Store` API is not really MutableMapping-like. Instead it's a pretty vanilla "read and write stuff to key-value storage" API.
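A hypothetical sketch of the flavour of API being described here; these are not the actual `zarr.abc.store.Store` signatures, just an illustration of plain async reads and writes of bytes against string keys rather than a `MutableMapping`:

```python
# An invented async key-value interface, for illustration only.
from abc import ABC, abstractmethod

class KVStore(ABC):
    @abstractmethod
    async def get(self, key: str) -> bytes | None:
        """Return the value for `key`, or None if absent."""

    @abstractmethod
    async def set(self, key: str, value: bytes) -> None:
        """Write `value` under `key`."""

    @abstractmethod
    async def delete(self, key: str) -> None:
        """Remove `key` if present."""
```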


**Protocol**: All zarr-related projects use the Zarr Protocol, described in the [Zarr Specification](https://zarr-specs.readthedocs.io/), which allows transfer of chunked array data and metadata between devices (or between memory regions of the same device).
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
Reply:

I would try to distinguish how metadata documents are stored vs how chunk data is stored. For example, it's significant that the compressor / filters (v2) and codecs (v3) define the encoding of chunk data, not metadata documents.

TomNicholas (Member Author):

My wording was intended to make that distinction already, because Joe said the same thing in an earlier comment. Clearly I need to distinguish them better though.

Reply:

I think the prose only needs a minor adjustment, since in the previous section you distinguish array data and metadata. It might be sufficient to just disambiguate what exactly is encoded and serialized by the codecs (i.e., the chunks of an array).
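A small sketch of that distinction using numcodecs (the codec choice here is arbitrary): the codec pipeline encodes chunk data only, while metadata documents stay plain JSON.

```python
# Round-trip one chunk through a codec; this is what gets written under a
# chunk key. Metadata documents never pass through this pipeline.
import numpy as np
from numcodecs import Zstd

chunk = np.arange(16, dtype="int32").reshape(4, 4)
codec = Zstd(level=3)

encoded = codec.encode(chunk)  # compressed bytes destined for the store
decoded = np.frombuffer(codec.decode(encoded), dtype="int32").reshape(4, 4)
assert (decoded == chunk).all()
```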

**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Most, but not all, zarr implementations will serialize to this format.

**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be abstract, or enforced by implementations or client libraries however they like, but generally should be opt-in.
Reply:

what does opt-in mean here? if you are using xarray with zarr, the xarray extensions to zarr are mandatory.

TomNicholas (Member Author):

Fair point. All extensions are by definition not required (as then they would be core), but specific tools might well require you to use a certain extension, so calling things "opt-in" or "opt-out" doesn't make much sense.
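A contrived sketch of the metadata-convention kind of extension mentioned in the quoted lines: structured attributes that a particular tool may require, but that any plain zarr reader can safely ignore. The convention name and fields here are made up.

```python
# Attach convention-specific attributes to an ordinary group.
import zarr

root = zarr.group()
root.attrs["my_convention"] = {  # hypothetical convention identifier
    "version": "1.0",
    "units": "kelvin",
}
```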

@d-v-b commented Apr 6, 2025:

thanks for working on this, here are a few rambling thoughts that hopefully you find useful:

you list 4 abstract components of zarr:

  • the protocol
  • the data model
  • the format
  • extensions

I'm having trouble placing these 4 things in separate conceptual categories. For me, a clearer "abstract parts list" would be something like this:

  • the data model
    • there are two entities: arrays and groups
    • arrays and groups have arbitrary user-defined attributes
    • arrays contain n-dimensional typed values
    • groups contain arrays or other groups
  • a scheme for representing this data model in key-value storage (not sure if this is the format or the protocol? see the key sketch below)
    • an array or group named x is denoted by a structured JSON metadata document at the key x/<metadata document name>
    • attributes for an array or group x are stored at x/.zattrs for v2, or in a special field in the array / group metadata document for v3.
    • chunks, etc
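A sketch of that key scheme for a group `g` containing an array `a` (names are hypothetical and metadata values are elided):

```python
# Keys a store would hold for a group "g" containing an array "a".
v2_keys = {
    "g/.zgroup":   b'{"zarr_format": 2}',  # group metadata document
    "g/.zattrs":   b"...",                 # user attributes (separate document)
    "g/a/.zarray": b"...",                 # array metadata document
    "g/a/0.0":     b"...",                 # first chunk ("." separator)
}
v3_keys = {
    "g/zarr.json":   b"...",  # group metadata, attributes embedded
    "g/a/zarr.json": b"...",  # array metadata, attributes embedded
    "g/a/c/0/0":     b"...",  # first chunk (default "/" separator)
}
```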

It feels weird calling this latter description a "protocol" without defining some verbs, but we could restructure the statements to take the form "to create an array at x ....", then it would feel more protocol-like.

When discussing extensibility, I think it's important to distinguish between a few scenarios that all get called "extensions":

  • "extensible" metadata
    The zarr v2 spec doesn't constrain the set of codecs. It just requires that codecs have a certain JSON structure. So in a sense the compressor and filters fields are "extensible" insofar as there's an infinite set of spec-compliant codecs. But from a spec POV, creating and using your own codec is not really an extension of the spec, any more than using a non-0 fill value for integer data is an "extension".

  • subsetting / narrowing conventions:
if we consider the space of all possible zarr hierarchies as a set, there are many zarr extensions (or conventions) that can be defined as a subset of this larger set. For example, some zarr conventions impose requirements on particular metadata fields. A contrived example would be a zarr convention where all the arrays must be 8-bit integers, all groups must contain one array with the name "foo", and the user attributes must contain a field called "name" that has a string value. This doesn't add any new entities to the base zarr data model or the stored representation. I think ome-ngff and geozarr fall into this category.

  • true extensions
    consolidated metadata in zarr v2 adds a new metadata object and defines semantics for this object that interact with the creation / mutation of arrays and groups. For this reason consolidated metadata is a true extension (addition) to the zarr stored representation. But IMO zarr v2 consolidated metadata doesn't expand the basic zarr data model. By contrast, I think one could argue that Zarr v3 consolidated metadata does expand the basic zarr data model, because it gives groups a new attribute (an image of their contents). Another example of a true extension to the data model would be allowing arrays to contain other arrays (e.g., making "group" a trait that arrays could implement).

so basically I don't see "extensibility" as a core abstract component of zarr. Instead I see extensibility as a vital property / feature of different layers of the zarr model, and this varies with the version of the zarr format. And I'm not sure what you mean when you say an extension is "abstract".

@TomNicholas (Member Author):

> so basically I don't see "extensibility" as a core abstract component of zarr.

You've convinced me that extensions are not a core abstract component, they are something else. I can edit this PR to reflect that.

That leaves

  • the protocol
  • the data model
  • the format

Before going through these, let me re-describe the conceptual confusion that I'm attempting to clarify with this nomenclature.

I used to think that "zarr" was simply a format, which was laid out in a specific way in filesystem or object storage, and that the spec described this format. I think a lot of other people assumed (and still assume) this.

Then once Icechunk came out I was told that Icechunk was a valid implementation of the zarr specification but did not use the same on-disk format. This was quite surprising and confusing, because I had thought the specification dictated the on-disk format.

Now I realise that VirtualiZarr's ManifestStore is another example of a specification-compliant implementation of a store that does not use the same format on-disk (it is a read-only store that uses archival formats such as netCDF instead).

For this to all be consistent, one of the following must be true:

  1. The "specification" doesn't actually specify anything about the on-disk format. That's what I assumed in DOC: Missing page on layers of Zarr abstractions zarr-python#2956.
  2. The "specification" is split into multiple parts, one of which does not specify the on-disk format, and that's the one people mean when they refer to "the specification" in the context of icechunk.
  3. There are actually multiple specifications, one specifying the on-disk format, and another specifying the thing icechunk implements.

I'm trying to find a nomenclature that more clearly separates these two components. Perhaps that nomenclature already exists, but if so then it's not documented at all on the main zarr.dev website, even though users do need to understand this distinction (they need to know what format they are writing!).


We seem to agree that the data model is its own abstract component. You then mention

> a scheme for representing this data model in key-value storage

this is basically what I mean by the "protocol". I was looking for a word to play counterpart to "format" (i.e. "icechunk obeys the X, not the format"). I think protocol is quite a good word for it - it's an agreement between two systems (or parts of a system) on a scheme for transferring chunk data and metadata. It makes no claims about the type of system implementing the protocol. It's not a networking protocol, but still seems to fit the broader definition of a protocol.

> the format

I had thought this didn't exist anywhere, but it turns out that it's here - https://zarr-specs.readthedocs.io/en/latest/v3/stores/filesystem/v1.0.html#file-system-store-v1. (At least that document covers filesystem storage - I think there should be another one for object storage too.) So (1) above is incorrect.

That leaves a choice between (2) and (3): whether we say that there is one zarr specification with a mandatory "protocol" and an optional "format", or we say that zarr has a "protocol" and an optional "format", with separate specifications describing each. I have no strong opinion on that, I only request that we have some word other than "specification" to describe the non-format abstract component of zarr.

@d-v-b commented Apr 6, 2025:

> this is basically what I mean by the "protocol". I was looking for a word to play counterpart to "format" (i.e. "icechunk obeys the X, not the format"). I think protocol is quite a good word for it - it's an agreement between two systems (or parts of a system) on a scheme for transferring chunk data and metadata. It makes no claims about the type of system implementing the protocol. It's not a networking protocol, but still seems to fit the broader definition of a protocol.

I agree with this, and I think we should emphasize the protocol angle.

In this context I think a key difference between a format and a protocol is that a format is a state, but a protocol is a set of rules (generally speaking). So a good framing of the zarr specs would be:

  • the specs define a format for in-memory metadata documents (they must be JSON with such and such fields)

  • the specs define rules (a protocol) for the storage of those metadata documents.

  • the specs do not explicitly define a format for the stored representation of metadata documents. "POSIX file system, but all filenames are reversed" could implement the zarr protocol (we would have v2 array metadata files called `yarraz.`).

    But implementations using the simplest interpretation of the storage protocol, targeting POSIX file systems or commercial cloud storage, will produce compatible stored representations.

So I think my preference would be to give primacy to the protocol. The stored representation of metadata and chunks should be considered the interaction between the zarr protocol and the behavior of a storage backend.
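To make the reversed-filenames thought experiment concrete, a tiny sketch (pure illustration, not a real store):

```python
# Reverse each path component of a store key before touching the filesystem.
# Such a store could still satisfy the zarr protocol, but its stored
# representation would not be the native format.
def reverse_key(key: str) -> str:
    return "/".join(part[::-1] for part in key.split("/"))

print(reverse_key("g/.zarray"))  # -> "g/yarraz."
```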
