DOC: Missing page on layers of Zarr abstractions #2956

Open
TomNicholas opened this issue Apr 4, 2025 · 3 comments
Labels: documentation, help wanted

Comments

@TomNicholas
Member

TomNicholas commented Apr 4, 2025

Describe the issue linked to the documentation

tl;dr: A page is missing from the Zarr docs which distinguishes between the different layers of specs, APIs, ABCs, formats etc.

It took me a year of working closely with zarr (via VirtualiZarr) to fully understand that "Zarr" is a multi-layered thing, which you can opt in to or out of at many different levels.

What is zarr?

My current understanding is that "Zarr" encompasses all of these things:

  1. A canonical on-disk file format, for both file and object storage, sometimes known as "native zarr",
  2. A specification for how to serialize and de-serialize array data and metadata as byte streams to an arbitrary key-value store,
  3. A python ABC for Store subclasses, which are key-value stores with a standardized API implementing the spec, but otherwise can do whatever they like behind the scenes, including not writing using the "native zarr" format,
  4. A set of canonical python Store implementations, which generally do write using the "native zarr" format,
  5. A python API for interacting with those Store subclasses, allowing python client code to treat many storage systems as interchangeable,
  6. A set of informal extensions, metadata standards, and a nascent framework for formalizing the extensions.
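
To make the difference between (1) and (2) concrete, here is a minimal sketch using only the Python standard library. It is a toy, not zarr-python code: `serialize_array` and `write_as_directory` are hypothetical names, the metadata fields follow my reading of the Zarr v3 core spec and have not been validated against it, and the sketch assumes int32 data with no compression and a row count that divides evenly into chunks.

```python
import json
import struct
import tempfile
from pathlib import Path


def serialize_array(rows, chunk_rows):
    """Layer (2): encode a 2-D list of int32 values as {key: bytes} entries.

    Assumes len(rows) is a multiple of chunk_rows, so no chunk needs padding.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    # Assumed v3-style array metadata; illustrative, not spec-validated.
    metadata = {
        "zarr_format": 3,
        "node_type": "array",
        "shape": [n_rows, n_cols],
        "data_type": "int32",
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": [chunk_rows, n_cols]},
        },
        "chunk_key_encoding": {
            "name": "default",
            "configuration": {"separator": "/"},
        },
        "fill_value": 0,
        "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
    }
    entries = {"zarr.json": json.dumps(metadata).encode("utf-8")}
    for start in range(0, n_rows, chunk_rows):
        # Flatten one block of rows and pack it as little-endian int32.
        flat = [v for row in rows[start:start + chunk_rows] for v in row]
        key = f"c/{start // chunk_rows}/0"
        entries[key] = struct.pack(f"<{len(flat)}i", *flat)
    return entries


def write_as_directory(entries, root):
    """Layer (1), roughly: map each key onto a relative file path under root."""
    for key, value in entries.items():
        path = Path(root) / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)


entries = serialize_array([[1, 2], [3, 4]], chunk_rows=1)
in_memory_store = dict(entries)  # any mapping can hold the entries
with tempfile.TemporaryDirectory() as tmp:
    write_as_directory(entries, tmp)  # and so can a directory tree
    on_disk_keys = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
```

The keys and bytes are the same in both cases. Only the last step, turning keys into file paths, is what I am calling "native zarr", and nothing in `serialize_array` depends on it.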

I feel this nuance and hierarchy is not clearly documented anywhere.

"Zarr" projects and how they fit in

It matters because there are now many projects which opt in to some of these layers but not others. For example:

  • The "zarr specification" is (2) and only (2), it doesn't actually touch on any of the other layers.
  • Much of the data in the wild today follows (1), even though AFAIK that layout isn't actually formally described anywhere official?! It obeys (2), but we're just lucky that the mapping from file/object storage to a KV store is so obvious that it's still easy to write readers implemented in any language without a formal description of (1).
  • Zarr-python
    • Zarr-python's zarr.abc provides (3),
    • Zarr-python's zarr.storage provides (4), which writes to local and object storage using (1) but without explicitly noting that,
    • Zarr-python's zarr.api provides (5), and can only interact with implementations of (3), such as (4),
  • Zarr implementations in other languages (such as zarr-js) generally use (1) as their format, following (2), and take vague inspiration from (4) and (5).
    • Tensorstore is included in that category, as it uses (1) on disk.
  • VirtualiZarr's new ManifestStore class is a concrete implementation of (3), but it eschews (1), with the aim of allowing access to an extensible set of non-zarr data formats on disk via (5).
  • Icechunk
    • Icechunk's python API implements (3) so that it can be used with (5),
    • Icechunk's spec is an alternative format to (1), but still follows (2),
    • Icechunk's rust client obeys (2) but otherwise nothing else IIUC,
    • A non-python library binding to Icechunk's rust client (as zarrs has done) would follow (2) and the Icechunk spec, potentially with API inspiration from (3) and (5) but otherwise nothing else formal.
  • Xarray is a key user of (5), but also quietly does a few things that fall under (6) (e.g. the coordinates attribute it adds),
  • GeoZarr and other extension efforts are only about (6),
  • OME-Zarr is possibly also just (6)?
  • Consolidated metadata is an example of (6), but one that's supported by (3), (4), and (5) (but not necessarily by other Store implementations).
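
As a rough illustration of how a project can implement (3) while skipping (1), here is a toy sketch. All names (`ToyStore`, `DictStore`, `ToyManifestStore`, `read_int32_chunk`) are hypothetical, and the interface is deliberately far simpler than zarr-python's real Store ABC or VirtualiZarr's ManifestStore. The point is only that the client function sees one interface, whether the bytes come from stored chunks or from byte ranges inside a non-zarr file.

```python
import struct
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class ToyStore(ABC):
    """Hypothetical stand-in for layer (3): a read-only key-value interface."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        ...


class DictStore(ToyStore):
    """Holds chunk bytes directly, one value per key."""

    def __init__(self, data):
        self._data = dict(data)

    def get(self, key):
        return self._data.get(key)


class ToyManifestStore(ToyStore):
    """Serves chunk keys from byte ranges inside an existing non-zarr file."""

    def __init__(self, manifest):
        # manifest maps key -> (path, offset, length)
        self._manifest = dict(manifest)

    def get(self, key):
        entry = self._manifest.get(key)
        if entry is None:
            return None
        path, offset, length = entry
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)


def read_int32_chunk(store, key):
    """Client code in the spirit of (5): it only uses the ToyStore interface."""
    raw = store.get(key)
    if raw is None:
        raise KeyError(key)
    return struct.unpack(f"<{len(raw) // 4}i", raw)


with tempfile.TemporaryDirectory() as tmp:
    # A made-up "legacy" file: an 8-byte header followed by four int32 values.
    legacy_file = Path(tmp) / "legacy.bin"
    legacy_file.write_bytes(b"HEADER--" + struct.pack("<4i", 1, 2, 3, 4))

    native = DictStore(
        {"c/0": struct.pack("<2i", 1, 2), "c/1": struct.pack("<2i", 3, 4)}
    )
    virtual = ToyManifestStore(
        {"c/0": (legacy_file, 8, 8), "c/1": (legacy_file, 16, 8)}
    )
    native_chunks = [read_int32_chunk(native, k) for k in ("c/0", "c/1")]
    virtual_chunks = [read_int32_chunk(virtual, k) for k in ("c/0", "c/1")]
```

Both stores return the same chunks to the client, even though the second one never writes or reads anything laid out as "native zarr".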

Current docs

The main zarr homepage only says

Zarr is a community project to develop specifications and software for storage of large N-dimensional typed arrays

which is true but insufficient.

This is a problem because it leads to confusion as to what zarr "is" - for example many people understandably but mistakenly think that (1) is Zarr, and that (2) is the specification for this file format. It also makes it harder for potential contributors (e.g. @nenb) to place their ideas within the framework of the zarr project.

I think this separation of layers is awesome, I just wish I could have understood it a year earlier via reading the docs, instead of having to have it explained to me one-to-one by the likes of @d-v-b, @jhamman, and @rabernat.

cc also @maxrjones @paraseba

Suggested fix for documentation

We should have a new page on the main Zarr docs explaining these layers, and a page on the zarr-python docs explaining how it fits into this framework. Other projects such as Icechunk and VirtualiZarr can then more easily explain their relationship to Zarr.

@TomNicholas added the documentation and help wanted labels Apr 4, 2025
@d-v-b
Contributor

d-v-b commented Apr 4, 2025

thanks for the writeup, I totally agree that we need to do a better job at explaining the various abstractions at work here.

I wonder if it would be helpful either in the zarr-python docs or elsewhere to demonstrate a "zarr from scratch" demo, where we build a tiny zarr implementation from minimal components. This could illustrate the moving parts pretty well.

@TomNicholas
Member Author

I think well before that even just including some kind of bulleted list like I have done would be significantly more helpful.

@TomNicholas
Member Author

I took a crack at this in zarr-developers/zarr-developers.github.io#131
