Add components and flexibility pages #131

Open
wants to merge 29 commits into
base: main
53318b9
hagne description to point to components page, zarr-python, and the i…
TomNicholas Apr 4, 2025
d91c743
add compression as another key feature of zarr
TomNicholas Apr 4, 2025
237b674
describe abstract components
TomNicholas Apr 4, 2025
072d3c5
add section on concrete components
TomNicholas Apr 4, 2025
96cbd4d
add heading for section on flexibility
TomNicholas Apr 4, 2025
750e99c
make each sentence a new line
TomNicholas Apr 4, 2025
4a24927
add section on extensions
TomNicholas Apr 4, 2025
dfd10e4
add section on TensorStore
TomNicholas Apr 4, 2025
d6d0c13
add extensions, icechunk, and mongodb
TomNicholas Apr 4, 2025
0469920
NCZarr and Lindi
TomNicholas Apr 5, 2025
3ab1fbd
add virtualizarr
TomNicholas Apr 5, 2025
1bb6238
format onto one sentence per line
TomNicholas Apr 5, 2025
5a655e6
virtualizarr clarifications
TomNicholas Apr 5, 2025
13fc385
linebreak
TomNicholas Apr 5, 2025
3514d41
don't imply metadata are serialized as byte streams
TomNicholas Apr 5, 2025
8179105
add to sidebar and fix link
TomNicholas Apr 5, 2025
d16deab
fix some links
TomNicholas Apr 5, 2025
bda516f
redirection layer
TomNicholas Apr 5, 2025
6d73ece
specification->protocol
TomNicholas Apr 5, 2025
180b5ba
organize sidebar better
TomNicholas Apr 5, 2025
a1dcf26
create separate page to describe flexibility
TomNicholas Apr 5, 2025
1c05a9c
add types of flexibility
TomNicholas Apr 5, 2025
b487799
more links between pages
TomNicholas Apr 5, 2025
1f9613c
link to external example libraries
TomNicholas Apr 5, 2025
22df358
add flexibility as a feature
TomNicholas Apr 5, 2025
76cf0ef
add more applications
TomNicholas Apr 5, 2025
fbdfcda
split the applications and the features better
TomNicholas Apr 5, 2025
87cb05c
be consistent about bullet points
TomNicholas Apr 5, 2025
d499af0
Spelling
TomNicholas Apr 6, 2025
22 changes: 14 additions & 8 deletions _data/navigation.yml
@@ -19,7 +19,19 @@ sidebar:
         url: "#sponsorship"
       - title: "Videos"
         url: "#videos"
-  - title: Subpages
+  - title: Technical
+    children:
+      - title: "Components"
+        url: '/components'
+      - title: "Flexibility"
+        url: '/flexibility'
+      - title: "Implementations"
+        url: '/implementations'
+      - title: "Specification"
+        url: https://zarr-specs.readthedocs.io/
+      - title: "ZEPs"
+        url: '/zeps'
+  - title: Community
     children:
       - title: "Adopters"
         url: "/adopters"
@@ -31,13 +43,7 @@ sidebar:
         url: '/conventions'
       - title: "Datasets"
         url: '/datasets'
-      - title: "Implementations"
-        url: '/implementations'
       - title: "Office Hours"
         url: "/office-hours"
       - title: "Slides"
-        url: "/slides"
-      - title: "Specification"
-        url: https://zarr-specs.readthedocs.io/
-      - title: "ZEPs"
-        url: '/zeps'
+        url: "/slides"
46 changes: 46 additions & 0 deletions components/index.md
@@ -0,0 +1,46 @@
---
layout: single
author_profile: false
title: Zarr Components
sidebar:
title: "Components"
nav: sidebar
---

Zarr consists of several components, both abstract and concrete.
These span both the physical storage layer and the conceptual structural layer.
Zarr-related projects all use the Zarr Protocol (and hence data model), described by the [Zarr Specification](https://zarr-specs.readthedocs.io/), but otherwise may choose to implement other layers however they wish.

## Abstract components

These abstract components together describe what type of data can be stored in zarr, and how to store it, without assuming you are working in a particular programming language, or with a particular storage system.

**Protocol**: All zarr-related projects use the Zarr Protocol, described in the [Zarr Specification](https://zarr-specs.readthedocs.io/), which allows transfer of chunked array data and metadata between devices (or between memory regions of the same device).
The protocol works by serializing and de-serializing array data as byte streams and storing both this data and accompanying metadata via an [Abstract Key-Value Store Interface](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#abstract-store-interface).
A system of [Codecs](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#chunk-encoding) is used to describe the encoding and serialization steps.
Review comment:

I would try to distinguish how metadata documents are stored vs how chunk data is stored. For example, it's significant that the compressor / filters (v2) and codecs (v3) define the encoding of chunk data, not metadata documents.

Review comment (Member Author):

My wording was intended to make that distinction already, because Joe said the same thing in an earlier comment. Clearly I need to distinguish them better though.

Review comment:

I think the prose only needs a minor adjustment, since in the previous section you distinguish array data and metadata. It might be sufficient to just disambiguate what exactly is encoded and serialized by the codecs (i.e., the chunks of an array).
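
As a rough illustration of the Protocol paragraph under discussion, here is a minimal sketch using only the Python standard library. It is not zarr-python code: the `write_chunk` and `read_chunk` helpers and the trimmed-down metadata document are assumptions made for illustration, and a plain `dict` stands in for the abstract key-value store. It also shows the distinction raised in this thread: only chunk data passes through the (toy) encoding steps, while the metadata document is stored as plain JSON.

```python
import json
import struct
import zlib

# A plain dict stands in for the abstract key-value store (str keys -> bytes values).
store = {}


def write_chunk(store, key, values):
    """Encode one chunk of int32 values and store it under `key`."""
    raw = struct.pack(f"<{len(values)}i", *values)  # serialize to little-endian bytes
    store[key] = zlib.compress(raw)                 # then compress (a toy "codec" step)


def read_chunk(store, key):
    """Fetch the bytes stored under `key` and reverse the encoding steps."""
    raw = zlib.decompress(store[key])
    return list(struct.unpack(f"<{len(raw) // 4}i", raw))


# The metadata document is stored as JSON and does NOT pass through the codecs.
# (Simplified for illustration; real Zarr metadata has more required fields.)
metadata = {"shape": [4], "chunk_shape": [2], "data_type": "int32"}
store["zarr.json"] = json.dumps(metadata).encode("utf-8")

# A 1-D array [1, 2, 3, 4] split into two chunks, each stored under its own key.
write_chunk(store, "c/0", [1, 2])
write_chunk(store, "c/1", [3, 4])

print(sorted(store))             # ['c/0', 'c/1', 'zarr.json']
print(read_chunk(store, "c/1"))  # [3, 4]
```

Because the store only ever sees keys and bytes, the same two helpers would work unchanged against any other object offering item assignment and lookup.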


**Data Model**: The specification's description of the [Stored Representation](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#stored-representation) implies a particular data model, based on the [HDF Abstract Data Model](https://support.hdfgroup.org/documentation/hdf5/latest/_h5_d_m__u_g.html).
Review comment:

i feel like it makes more sense to lead with the data model. the spec, i.e. the protocol, defines operations (create group, create array, write chunks to an array, etc) that only make sense in light of that particular data model.

Review comment (Member Author, @TomNicholas, Apr 5, 2025):

> the spec, i.e. the protocol

I think I disagree that these are one and the same (see #131 (comment)), but otherwise agree with your suggestion here.

Review comment:

what's the difference between the contents of the zarr v2 / v3 specs and the zarr v2 / v3 protocols?

Review comment (Member Author):

See my long comment below: #131 (comment)

It consists of a hierarchical tree of groups and arrays, with optional arbitrary metadata at every node. This model is completely domain-agnostic.
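
The data model can be pictured with a small sketch. The `Array` and `Group` classes and the `walk` helper below are hypothetical names chosen for illustration only; they are not part of zarr-python or the specification, and they model just the tree structure and per-node metadata, not the array contents.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """A leaf node: an n-dimensional array described by its shape and data type."""
    shape: tuple
    dtype: str
    attributes: dict = field(default_factory=dict)  # optional arbitrary metadata


@dataclass
class Group:
    """An interior node holding named child groups and arrays."""
    children: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)  # optional arbitrary metadata


def walk(node, path="/"):
    """Yield (path, node) pairs for every node in the hierarchy, depth-first."""
    yield path, node
    if isinstance(node, Group):
        for name, child in node.children.items():
            yield from walk(child, path.rstrip("/") + "/" + name)


root = Group(
    attributes={"title": "example hierarchy"},
    children={
        "measurements": Group(
            children={
                "temperature": Array(shape=(10, 20), dtype="float32",
                                     attributes={"units": "K"}),
            }
        ),
        "mask": Array(shape=(10, 20), dtype="bool"),
    },
)

for path, node in walk(root):
    print(path, type(node).__name__)
```

Nothing in the sketch refers to a scientific domain: the attribute dictionaries are free-form, which is what leaves room for domain-specific conventions on top.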

**Format**: If the keys in the abstract key-value store interface are mapped unaltered to paths in a POSIX filesystem or prefixes in object storage, the data written to disk will follow the "Native Zarr Format".
Review comment (Member Author, @TomNicholas, Apr 4, 2025):

Is it okay for me to enshrine the name "Native Zarr Format" here?

Review comment:

what does "native" mean here?

Review comment (Member Author):

Following #131 (comment), the word "native" is perhaps redundant if we have a clear understanding of what "format" refers to.

Most, but not all, zarr implementations will serialize to this format.
Review comment on lines +25 to +26 (Member Author):

I feel like this needs an explicit section in the specification, even if it's pretty trivial.

Review comment (Member Author):

Turns out it does (at least for filesystems - there's nothing for object storage). See #131 (comment) for more context.
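
To make the Format paragraph concrete, the sketch below maps store keys unaltered to relative paths in a directory. The helper names and the store contents are hypothetical, and only the standard library is used; this is an illustration of the key-to-path idea, not how any particular implementation writes its files.

```python
import tempfile
from pathlib import Path


def write_store_to_directory(store, root):
    """Write each key-value pair to a file whose relative path equals the key."""
    for key, value in store.items():
        path = Path(root).joinpath(*key.split("/"))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)


def read_key_from_directory(root, key):
    """Look a key up again by treating it as a relative path under `root`."""
    return Path(root).joinpath(*key.split("/")).read_bytes()


# Hypothetical store contents: one metadata document and two chunks.
store = {
    "zarr.json": b'{"node_type": "array"}',
    "c/0/0": b"\x00\x01",
    "c/0/1": b"\x02\x03",
}

with tempfile.TemporaryDirectory() as root:
    write_store_to_directory(store, root)
    files = sorted(
        p.relative_to(root).as_posix() for p in Path(root).rglob("*") if p.is_file()
    )
    roundtrip = read_key_from_directory(root, "c/0/1")

print(files)      # ['c/0/0', 'c/0/1', 'zarr.json']
print(roundtrip)  # b'\x02\x03'
```

The file listing is identical to the set of keys, which is the property the paragraph describes; a store that renames, packs, or redirects keys would no longer produce this layout.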


**Extensions**: Zarr provides a core set of generally-useful features, but extensions to this core are encouraged. These might take the form of domain-specific [metadata conventions](https://zarr.dev/conventions/), new codecs, or additions to the data model via [extension points](https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#extension-points). These can be abstract, or enforced by implementations or client libraries however they like, but generally should be opt-in.
Review comment:

what does opt-in mean here? if you are using xarray with zarr, the xarray extensions to zarr are mandatory.

Review comment (Member Author):

Fair point. All extensions are by definition not required (as then they would be core), but specific tools might well require you to use a certain extension, so calling things "opt-in" or "opt-out" doesn't make much sense.


## Concrete components

Concrete implementations of the abstract components can be written in any language.
The canonical reference implementation is [Zarr-Python](https://github.com/zarr-developers/zarr-python), but there are many [other implementations](https://zarr.dev/implementations/).
Zarr-Python contains reference examples of useful constructs that can be re-implemented in other languages.

**Abstract Base Classes**: Zarr-python's [`zarr.abc`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains abstract base classes enforcing a particular python realization of the specification's Key-Value Store interface, using a `Store` ABC, which is based on a `MutableMapping`-like API.
This component is concrete in the sense that it is implemented in a specific programming language, and enforces particular syntax for getting and setting values in a key-value store.
Review comment on lines +36 to +37:

In zarr-python v2 the store API was based on MutableMapping, but IMO the zarr-python v3 Store api is not really MutableMapping like. Instead it's a pretty vanilla "read and write stuff to kv storage" API.
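
The idea of an abstract base class pinning down a language-specific store interface can be sketched as follows. `KeyValueStore`, `InMemoryStore`, and `copy_store` are hypothetical names invented for this example; the real `zarr.abc.store.Store` has a different and larger API. The sketch follows the plain "read and write to key-value storage" style mentioned in the comment above rather than a `MutableMapping`.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Optional


class KeyValueStore(ABC):
    """Hypothetical minimal store interface; NOT the real zarr.abc.store.Store."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the bytes stored under `key`, or None if the key is absent."""

    @abstractmethod
    def set(self, key: str, value: bytes) -> None:
        """Store `value` under `key`, replacing any existing value."""

    @abstractmethod
    def delete(self, key: str) -> None:
        """Remove `key` if it exists."""

    @abstractmethod
    def list_prefix(self, prefix: str) -> Iterator[str]:
        """Iterate over all keys that start with `prefix`."""


class InMemoryStore(KeyValueStore):
    """A concrete implementation that keeps everything in a dict."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = bytes(value)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

    def list_prefix(self, prefix: str) -> Iterator[str]:
        return (k for k in sorted(self._data) if k.startswith(prefix))


def copy_store(source: KeyValueStore, target: KeyValueStore, prefix: str = "") -> int:
    """Code written against the interface works with any implementation of it."""
    count = 0
    for key in source.list_prefix(prefix):
        target.set(key, source.get(key))
        count += 1
    return count


a, b = InMemoryStore(), InMemoryStore()
a.set("zarr.json", b"{}")
a.set("c/0", b"\x00")
copied = copy_store(a, b, prefix="c/")
print(copied, b.get("c/0"), b.get("zarr.json"))  # 1 b'\x00' None
```

`copy_store` plays the role of user-facing code here: it never needs to know which storage system sits behind the interface.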


**Store Implementations**: Zarr-python's [`zarr.storage`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains concrete implementations of the `Store` ABC for interacting with particular storage systems.
The zarr-python store implementations which write to local filesystems or object storage write data in the Native Zarr Format.
It's expected that most users of zarr from python will just use one of these implementations.

**User API**: Zarr-python's [`zarr.api`](https://zarr.readthedocs.io/en/stable/api/zarr/abc/index.html) module contains functions and classes for interacting with any concrete implementation of the `zarr.abc.Store` interface.
This allows user applications to use a standard zarr API to read from and write to a variety of common storage systems.

These various components allow for a huge amount of [flexibility](https://zarr.dev/flexibility/).
58 changes: 58 additions & 0 deletions flexibility/index.md
@@ -0,0 +1,58 @@
---
layout: single
author_profile: false
title: Zarr's Flexibility
sidebar:
title: "Flexibility"
nav: sidebar
---

One of Zarr's greatest strengths is its flexibility, or "hackability".
This largely comes from the separation of distinct [Zarr Components](https://zarr.dev/components/), but a range of other properties make zarr flexible too.

## Types of flexibility

This flexibility comes in several forms:
- The Zarr protocol is device agnostic.
- The Zarr data model is domain agnostic.
- Key-value stores are an almost universal abstraction in data systems, and so can almost always be mapped to existing system interfaces.
- The Zarr format on-disk is extremely simple.
- Storing each chunk under a different key allows implementations to scale their IO throughput in a variety of simple ways.
- The reference Zarr implementation is written in Python, a very hackable language, with ABCs you can use when creating new store implementations.
- Components are separated: the protocol, file format, standard API, ABC, and store implementations are all distinct.
- There is no requirement to use more than one zarr component - individual projects can achieve powerful functionality by intelligently using only some of the Zarr components.
- You can define your own codecs.
- You are free to create your own domain-specific metadata standard and enforce it upon zarr stores however you like.
- Zarr v3 has nascent support for other extension points, including defining your own type of chunk grid, data types, and more.
- [Zarr Enhancement Proposals](https://zarr.dev/zeps/) (or "ZEPs") provide a mechanism for enhancing or adding to the specification in a community-standardized way.
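
One item in the list above, storing each chunk under a different key, lends itself to a short sketch. Because every chunk is an independent key lookup, a reader can issue many lookups at once. The `read_chunks_concurrently` helper is a hypothetical name, a `dict` stands in for the store, and a thread pool is just one of several ways to do this; with a remote store each lookup would be a separate request.

```python
from concurrent.futures import ThreadPoolExecutor


def read_chunks_concurrently(store, keys, max_workers=4):
    """Fetch several chunk keys in parallel; each key is an independent lookup."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `keys`.
        values = list(pool.map(store.__getitem__, keys))
    return dict(zip(keys, values))


# A dict stands in for a key-value store holding eight 4-byte chunks.
store = {f"c/{i}": bytes([i]) * 4 for i in range(8)}

chunks = read_chunks_concurrently(store, ["c/1", "c/5", "c/7"])
print(chunks["c/5"])  # b'\x05\x05\x05\x05'
```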

## Examples

Here are a few zarr-related software projects, which each make use of a selected subset of different zarr components to achieve interesting functionality.
These particular projects are more than simply zarr implementations written in a different language (you can find a [list of implementations here](https://zarr.dev/implementations/)).

- **MongoDBStore** is a concrete store implementation in python, which stores values in a MongoDB NoSQL database under zarr keys.
It is therefore spec-compliant, and can be interacted with via the zarr-python user API, but does not write data in the native zarr format.

- [**VirtualiZarr**](https://github.com/zarr-developers/VirtualiZarr) provides a concrete store implementation in python (the `ManifestStore`) which stores, inside "chunk manifests", references to the locations and byte ranges of chunks on disk; the chunks themselves reside inside files stored in other binary formats such as netCDF.
These references are generated by "readers", which do the job of parsing the file structure and mapping the contents to the zarr data model.
VirtualiZarr therefore eschews the native zarr format but still provides spec-compliant access to non-zarr-formatted data using zarr-python's API, without duplicating the original data.
The manifests effectively act as an indirection layer between the zarr-spec-compliant key interface, and the actual location of the chunks in storage.

- [**NCZarr**](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) and [**Lindi**](https://github.com/NeurodataWithoutBorders/lindi) can both in some sense be considered as the opposite of VirtualiZarr - they allow interacting with zarr-formatted data on disk via a non-zarr API.
Lindi maps zarr's data model to the HDF data model and allows access via the `h5py` library through the [`LindiH5pyFile`](https://github.com/NeurodataWithoutBorders/lindi/blob/b125c111880dd830f2911c1bc2084b2de94f6d71/lindi/LindiH5pyFile/LindiH5pyFile.py#L28) class.
[NCZarr](https://docs.unidata.ucar.edu/nug/current/nczarr_head.html) allows interacting with zarr-formatted data via the netcdf-c library.
Note that both libraries implement optional additional optimizations by going beyond the zarr specification and format on disk, which is not recommended.

- [**Tensorstore**](https://github.com/google/tensorstore) is a general storage library written in C++ that can write to the Zarr format (so is a spec-compliant non-python "native" store implementation) but also to other array formats such as N5.
As it can write to multiple different storage systems, it effectively has its own set of concrete store implementations.
Additional features are provided, notably using an Optionally-Cooperative Distributed B+Tree (OCDBT) on top of a base key-value store to implement ACID transactions.
It still stores all data using the native Zarr Format, but versions keys at the store level.

- [**Icechunk**](https://icechunk.io/) is a cloud-native tensor storage engine which also provides ACID transactions, but does so via indirection between a zarr-spec-compliant key-value store interface and a specialized non-zarr-native storage layout on-disk (for which Icechunk has its own format specification).
Whilst the core icechunk client is written in rust, the `icechunk-python` client implements a concrete subclass of the zarr-python `Store` ABC.
Therefore libraries such as [xarray](https://xarray.dev/) can use the zarr-python user API to read from and write to icechunk stores, effectively treating them as version-controlled zarr stores.
Icechunk also integrates with VirtualiZarr as a serialization format for byte range references.
Together they allow data stored in non-zarr formats to be committed to a persistent icechunk store and read back later via the zarr-python API without duplicating the original data chunks.
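
The indirection idea that appears in the VirtualiZarr and Icechunk examples above can be sketched in a few lines. `ManifestBackedStore` is a hypothetical class written for this illustration and is not VirtualiZarr's `ManifestStore`; the file layout and offsets are made up. The manifest maps zarr-style chunk keys to byte ranges inside an existing file, so chunks can be served without copying the original data.

```python
import tempfile
from pathlib import Path


class ManifestBackedStore:
    """Hypothetical read-only store; not VirtualiZarr's actual ManifestStore."""

    def __init__(self, manifest):
        # Maps a zarr-style chunk key -> (file path, byte offset, byte length).
        self._manifest = manifest

    def get(self, key):
        """Resolve the key through the manifest and read just that byte range."""
        path, offset, length = self._manifest[key]
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a file in some other binary format: a 6-byte header
    # followed by two 4-byte chunks.
    legacy_file = Path(tmp) / "legacy.bin"
    legacy_file.write_bytes(b"HEADER" + b"AAAA" + b"BBBB")

    manifest = {
        "c/0": (str(legacy_file), 6, 4),
        "c/1": (str(legacy_file), 10, 4),
    }
    store = ManifestBackedStore(manifest)
    first, second = store.get("c/0"), store.get("c/1")

print(first, second)  # b'AAAA' b'BBBB'
```

In this sketch the "reader" step, which would parse the legacy file to discover the offsets, is replaced by a hand-written manifest.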

We also have a full list of [zarr implementations](https://zarr.dev/implementations/).
24 changes: 12 additions & 12 deletions index.md
@@ -32,28 +32,28 @@ can be represented as a key-value store, including most commonly POSIX file
systems and cloud object storage but also zip files as well as relational and
document databases.

-See the following GitHub repositories for more information:
-
-* [Zarr Python](https://github.com/zarr-developers/zarr)
-* [Zarr Specs](https://github.com/zarr-developers/zarr-specs)
-* [Numcodecs](https://github.com/zarr-developers/numcodecs)
-* [Z5](https://github.com/constantinpape/z5)
-* [N5](https://github.com/saalfeldlab/n5)
-* [Zarr.jl](https://github.com/meggart/Zarr.jl)
-* [ndarray.scala](https://github.com/lasersonlab/ndarray.scala)
Review comment on lines -35 to -43 (Member Author):

I think it's deeply unhelpful to immediately point at specific implementations here as the source of further explanation. That's not what their docs are for!

+For more details read about the various [components of Zarr](https://zarr.dev/components/),
+see the canonical [Zarr-Python](https://github.com/zarr-developers/zarr-python) implementation,
+or look through [other Zarr implementations](https://zarr.dev/implementations/) for one in your preferred language.

## Applications

-* Simple and fast serialization of NumPy-like arrays, accessible from languages including Python, C, C++, Rust, Javascript and Java
-* Multi-scale n-dimensional image storage, e.g. in light and electron microscopy
-* Geospatial rasters, e.g. following the NetCDF / CF metadata conventions
+* Multi-scale n-dimensional image storage, e.g. in light and electron microscopy.
+* Genomics data, e.g. for quantitative and population genetics.
+* Gridded scientific data in various domains, such as CFD or Plasma Physics.
+* Geospatial rasters, e.g. following the NetCDF data model.
+* Checkpointing ML model weights.

## Features

+* Serialize NumPy-like arrays in a simple and fast way.
Review comment (Member Author):

I felt like the applications and features were mixed up together.

+* Access from languages including Python, C, C++, Rust, JavaScript and Java.
* Chunk multi-dimensional arrays along any dimension.
+* Compress array chunks via an extensible system of compressors.
Review comment (Member Author):

Seemed like an important omission.

* Store arrays in memory, on disk, inside a Zip file, on S3, etc.
* Read and write arrays concurrently from multiple threads or processes.
* Organize arrays into hierarchies via annotatable groups.
+* Extend easily thanks to the [flexible design](https://zarr.dev/flexibility/).
Review comment (Member Author):

The link here is intended to start the reader reading through each page in turn, as the other technical pages I added also have a link at the bottom to the next one along.


## Sponsorship
