From e13c62f58b1576532e37de5501058ab25d908aeb Mon Sep 17 00:00:00 2001 From: Alistair Miles Date: Wed, 29 Apr 2020 16:07:15 +0100 Subject: [PATCH 01/34] Update date to reflect recent commits --- docs/protocol/core/v3.0.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index fa116aea..950f08dd 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -4,7 +4,7 @@ Zarr core protocol (version 3.0) ================================== ----------------------------- - Editor's draft 19 June 2019 + Editor's draft 25 April 2020 ----------------------------- Specification URI: From 1703cb9f5aaa507a5491622d7f2e612c8a593711 Mon Sep 17 00:00:00 2001 From: Alistair Miles Date: Wed, 29 Apr 2020 16:07:41 +0100 Subject: [PATCH 02/34] update copyright --- docs/protocol/core/v3.0.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 950f08dd..93c7fe5f 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -16,7 +16,7 @@ Issue tracking: Suggest an edit for this spec: `GitHub editor `_ -Copyright 2019 `Zarr core development +Copyright 2019, 2020 `Zarr core development team `_ (@@TODO list institutions?). This work is licensed under a `Creative Commons Attribution 3.0 Unported From 148f48c5349366741be2f72c23d4b85f8a4c3a63 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Thu, 14 May 2020 16:19:59 -0700 Subject: [PATCH 03/34] Reiterate some of the examples in table form. It seem there was a typo/error where ocreating a key was suggesting to concatenate "meta" instead of meta/root if my reading is correct. 
--- docs/protocol/core/v3.0.rst | 49 ++++++++++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 93c7fe5f..cba45b10 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -440,6 +440,18 @@ grid will be (2, 10, 8), meaning that there will be 2 chunks along the first dimension, 10 along the second dimension, and 8 along the third dimension. +.. list-table:: Regular Grid Example + :header-rows: 1 + + * - Array Shape + - Chunk Shape + - Chunk Grid Shape + - Notes + * - (10, 200, 3000) + - (5, 20, 400) + - (2, 10, 8) + - The grid does overhang the edge of the array on the 3rd dimension. + An element of an array with coordinates (`a`, `b`, `c`, ...) will occur within the chunk at grid index (`a` // `dx`, `b` // `dy`, `c` // `dz`, ...), where "//" is the floor division operator. The element @@ -450,6 +462,7 @@ that chunk, where "%" is the modulo operator. For example, if a is contained within the chunk at grid index (1, 7, 2) and has coordinates (2, 10, 100) within that chunk. + The identifier for chunk with grid index (`i`, `j`, `k`, ...) is formed by joining together ASCII string representations of each index using a separator. The default value for the separator is the period @@ -1036,7 +1049,7 @@ For example, for a group at hierarchy path "/foo/bar", the corresponding metadata key is "meta/root/foo/bar.group". For an array at a non-root hierarchy path `P`, the metadata key for -the array metadata document is formed by concatenating "meta", `P`, +the array metadata document is formed by concatenating "meta/root", `P`, and ".array". The data key for array chunks is formed by concatenating "data", `P`, "/", and the chunk identifier as defined by the chunk grid layout. @@ -1051,6 +1064,40 @@ the root node is an array, the metadata key is "meta/root.array", and the data keys are formed by concatenating "data/" and the chunk identifier. +.. 
list-table:: Metadata Storage Key example + :header-rows: 1 + + * - Type + - Path "P" + - Key for Metadata at path `P` + * - Array (Root) + - `/` + - `meta/root.array` + * - Group (Root) + - `/` + - `meta/root.group` + * - Group + - `/foo/bar` + - `meta/root/foo/bar.group` + * - Array + - `foo/baz` + - `meta/root/foo/baz.array` + + + + + +.. list-table:: Data Storage Key example + :header-rows: 1 + + * - Type + - Path `P` of Chunck + - Data Key + * - Chunk + - `foo/baz/0.0` + - `data/foo/baz/0.0` + + Protocol operations ------------------- From 738ecedcf6ad139d5f7e73942a560dee1c6daeff Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Wed, 20 May 2020 09:59:08 -0700 Subject: [PATCH 04/34] Suggestion changes to the spec, This is rough suggestions, I am happy to change any naming/section or remove of course. This is not particularly organised, but intended to get the discussion going. Feel free to push chaanges to this branch --- docs/protocol/core/v3.0.rst | 84 ++++++++++++++++++++++++++++++++++++- 1 file changed, 82 insertions(+), 2 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 93c7fe5f..0ca4ac50 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -269,6 +269,8 @@ node names: * must not be a string composed only of period characters, e.g. "." or ".." +TODO: likely want to reserver names that start with just a single ``.`` + * must be at most 255 characters long Note that node names are used to form storage keys, and that some @@ -386,10 +388,22 @@ Core data types - double precision float: sign bit, 11 bits exponent, 52 bits mantissa - 8 - big-endian + * - ``r*`` (Optional) + - raw bits/bytes, use for extension type fallbacks + - variable, given by ``*``, multiple of 8 + - N/A + Floating point types correspond to basic binary interchange formats as defined by IEEE 754-2008. 
+Additionally to these base types, an implementation should also handle the raw +pass through type designated by the lowercase letter ``r`` followed by the +number of bits as a multiple of 8. For example, ``r8``, ``r104`` should be +understood as fallback types of respectively 1 and 13 bytes longs, and +implementation should ensure to pass or accept those types unchanged when using +extension fallback. + Chunk grids =========== @@ -650,6 +664,16 @@ containing the following names: ``must_understand`` is ``true`` then processing must terminate and an appropriate error raised. + TODO: can we have must-understand to "true" but limitted to a subset of + datasets / groups using this extension ? + + Each extension can also have a field ``requires``, that contain an unordered + list of URI to other extensions specification on which they rely for proper + behavior. Implementation MAY rely on the value of these requires to decide + in which order to load extensions and resolve types. If an extension + requires another one, then implementation should ensure that the required + extension is listed in the top-level extensions field. 
+ For example, below is an entry point metadata document, specifying that JSON is being used for encoding of group and array metadata:: @@ -678,6 +702,35 @@ ignored if not understood:: } + +For example, below is an entry point metadata document as above, but also +specifying that a protocol extension is being used which may be +ignored if not understood:: + + { + "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0", + "metadata_encoding": "application/json", + "extensions": [ + { + "extension": "http://example.org/zarr/extension/multiverse/1.0", + "must_understand": false, + "requires": ["http://example.org/zarr/extension/complex_datetime/1.0"] + }, + { + "extension": "http://example.org/zarr/extension/complex_datetime/1.0", + "requires": ["http://example.org/zarr/extension/complex/1.0"] + "must_understand": false, + }, + { + "extension": "http://example.org/zarr/extension/complex/1.0", + "must_understand": true, + " + } + ] + } + + + Array metadata -------------- @@ -711,7 +764,19 @@ following names: identifiers defined in this specification. If an implementation does not recognise the extension, but a ``fallback`` is present, then the implementation may proceed using the ``fallback`` value - as the data type. + as the data type. For fallback types that do not correspond to base + known types, extensions can fallback on on a raw number of bytes using + + TODO: should data_type extension be also available in the entry-point + metadata ? I'm going to guess a number of things can't won't be able to + dynamically load extensions so need to know immediately. + That would also allow to define datatypes only once on top-level hierarchy. + Do we allow multiple version of the same extenions ? + + Restrict extensions types charset/format to not block the core types to be + extended ? Allow different extension to have same named type in multiple + context and dispatch depending on extension ? 
+ ``chunk_grid`` @@ -780,7 +845,9 @@ following names: ``extensions`` - The value @@TODO + The value of the extension field should be a list of URI of extensions used + to write/read the chunks present in this layout. Extension listed in this + field must be provided in the entry point metadata document. ``attributes`` @@ -937,6 +1004,7 @@ operations: | Parameters: none | Output: set of `keys` + ``list_prefix`` - Retrieve all keys with a given prefix. | Parameters: `prefix` @@ -955,6 +1023,8 @@ which do not contain the character "/" after the given prefix. "a/f/g", then ``list_dir("a/")`` would return keys "a/b" and "a/c" and prefixes "a/d/" and "a/f/". + @@TODO@@ more general list with depth XXX + Store implementations --------------------- @@ -1139,6 +1209,16 @@ Protocol extensions @@TODO define different types of protocol extension. +Many types of extensions can exists for a Zarr Protocol, they can be regrouped +in mostly 3 categories: + + - Core Datatypes Extensions – for example adding ability store fixed size + types like complex and datetime in chunks. + - Arrays Extensions – Non rectilinear grids, and + variable length types. + - Group extensions – Better understanding of a group layout : example pyramidal + resolution. + References ========== From bdc9cc40edf76ee94cdc184f76ea55f0426ab2cb Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Thu, 21 May 2020 12:01:53 -0700 Subject: [PATCH 05/34] Misc extra sugestions --- docs/protocol/core/v3.0.rst | 108 +++++++++++++++++++++--------------- 1 file changed, 64 insertions(+), 44 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 0ca4ac50..16340d53 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -316,75 +316,75 @@ Core data types - Numerical type - Size (no. 
bytes) - Byte order - * - `bool` - - Boolean, with False encoded as `\\x00` and True encoded as `\\x01` + * - ``bool`` + - Boolean, with False encoded as ``x00`` and True encoded as ``x01`` - 1 - None - * - `i1` + * - ``i1`` - signed integer - 1 - None - * - `i2` + * - ``i2`` - signed integer - 2 - big-endian - * - `>i4` + * - ``i4`` - signed integer - 4 - big-endian - * - `>i8` + * - ``i8`` - signed integer - 8 - big-endian - * - `u1` + * - ``u1`` - signed integer - 1 - None - * - `f2` + * - ``f2`` - half precision float: sign bit, 5 bits exponent, 10 bits mantissa - 2 - big-endian - * - `>f4` + * - ``f4`` - single precision float: sign bit, 8 bits exponent, 23 bits mantissa - 4 - big-endian - * - `>f8` + * - ``f8`` - double precision float: sign bit, 11 bits exponent, 52 bits mantissa - 8 - big-endian @@ -589,9 +589,10 @@ Metadata ======== This section defines the structure of metadata documents for Zarr hierarchies, -which consists of three types metadata documents: an entry point metadata document, -array metadata documents, and group metadata documents. Each type of metadata -document is described in the following subsections. +which consists of three types metadata documents: an entry point metadata +document ( e.g. ``zarr.json``), array metadata documents, and group metadata +documents. Each type of metadata document is described in the following +subsections. Metadata documents are defined here using the JSON type system defined in [RFC8259]_. In this section, the terms "value", @@ -667,6 +668,15 @@ containing the following names: TODO: can we have must-understand to "true" but limitted to a subset of datasets / groups using this extension ? + suggestion: + affects : "root", "groups", "arrays" + + affects: root : This extension is necessary to understand the complete layout of this hierarchy. For example the groups themselves have a specific structure. 
+ affects: groups: This extension may only affect some groups, implementation may decide to only bail out when encountering groups that need this extension to be understood, or lazily load this extension when entering those groups. Example Multiresolution convention in groups. + affects: arrays: This extension will only affect understanding the arras storage or layout, browsing the hierachy still make sens. + + + Each extension can also have a field ``requires``, that contain an unordered list of URI to other extensions specification on which they rely for proper behavior. Implementation MAY rely on the value of these requires to decide @@ -704,8 +714,7 @@ ignored if not understood:: For example, below is an entry point metadata document as above, but also -specifying that a protocol extension is being used which may be -ignored if not understood:: +specifying that many protocol extensions are being used; two of which are not required to be understood, one of which is necessary to be able to interprete arrays encoding:: { "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0", @@ -724,7 +733,7 @@ ignored if not understood:: { "extension": "http://example.org/zarr/extension/complex/1.0", "must_understand": true, - " + "affects": "arrays", } ] } @@ -735,7 +744,7 @@ Array metadata -------------- Each Zarr array in a hierarchy must have an array metadata -document. This document must contain a single object with the +document (``*.array``). This document must contain a single object with the following names: ``shape`` @@ -834,15 +843,22 @@ following names: value must be a number with no fraction or exponent part and must be within the range of the data type. - @@TODO what about floating point data types? - @@TODO what about nan and inf? - @@TODO what about extension data types? 
- For any data type, if the ``fill_value`` is the literal ``null`` then the fill value is undefined and the implementation may use any arbitrary value that is consistent with the data type as the fill value. + If the ``data_type`` of an array is defined in a ``data_type`` extension, + then said extension is responsible for interpreting the value of + ``fill_value`` and return a suitable type that can be used. + + For core ``data_type`` which ``fill_value`` are not permitted in JSON or + for which decimal representation could be lossy, a string representing of + the binary (starting with ```0b``) or hexadecimal value (starting with + ``0x``) is accepted. This string must include all leading or trailing + zeroes necessary to match the given type size. The string values ``"NaN"``, + ``"+Infinity"`` and ``"-Infinity"`` are also understood for floating point datatypes. + ``extensions`` The value of the extension field should be a list of URI of extensions used @@ -854,6 +870,8 @@ following names: The value must be an object. The object may contain any name/value pairs. +TODO see https://github.com/zarr-developers/zarr-specs/issues/72 to potentially split large metadata documents. + All other names within the array metadata object are reserved for future versions of this specification. @@ -918,10 +936,10 @@ chunking as above, but using an extension data type:: Group metadata -------------- -A Zarr group metadata object must contain the ``extensions`` and ``attributes`` -names as defined above in the `Array metadata`_ section. All other names -are reserved for future versions of this specification. See also the section -on `Protocol extensions`_ below. +A Zarr group metadata object (``*.group``) must contain the ``extensions`` and +``attributes`` names as defined above in the `Array metadata`_ section. All +other names are reserved for future versions of this specification. See also +the section on `Protocol extensions`_ below. 
For example, the JSON document below defines a group:: @@ -937,12 +955,14 @@ For example, the JSON document below defines a group:: Metadata encoding ----------------- -The entry point metadata document must be encoded as JSON. The array and -group metadata documents must be encoded as per the MIME type given in +The entry point metadata document must be encoded as JSON. The array (``*.arrays``) and +group metadata documents (``*.groups``) must be encoded as per the MIME type given in the ``metadata_encoding`` field in the entry point metadata document (described below). @@TODO more explanation needed? +Todo questions : How frequent are non-JSON metadata documents ? and how store specific are those. +If we allow arbitrary mimetypes, doe that precude the sections on arrays and group metadata to be normative ? Stores @@ -1008,7 +1028,7 @@ operations: ``list_prefix`` - Retrieve all keys with a given prefix. | Parameters: `prefix` - | Output: set of `keys` with the given `prefix` + | Output: set of `keys` with the given `prefix`, For example, if a store contains the keys "a/b", "a/c/d" and "e/f/g", then ``list_prefix("a/")`` would return "a/b" and "a/c/d". @@ -1023,7 +1043,7 @@ which do not contain the character "/" after the given prefix. "a/f/g", then ``list_dir("a/")`` would return keys "a/b" and "a/c" and prefixes "a/d/" and "a/f/". - @@TODO@@ more general list with depth XXX +@TODO : should store return trailing slashes when possible ? Store implementations @@ -1096,14 +1116,14 @@ identify ancestor nodes in the hierarchy. For example, the path Storage keys ------------ -The entry point metadata document is stored under the key "zarr.json". +The entry point metadata document is stored under the key ``zarr.json``. For a group at a non-root hierarchy path `P`, the metadata key for the -group metadata document is formed by concatenating "meta/root", `P`, -and ".group". +group metadata document is formed by concatenating ``meta/root``, `P`, +and ``.group``. 
-For example, for a group at hierarchy path "/foo/bar", the -corresponding metadata key is "meta/root/foo/bar.group". +For example, for a group at hierarchy path ``/foo/bar``, the +corresponding metadata key is ``meta/root/foo/bar.group``. For an array at a non-root hierarchy path `P`, the metadata key for the array metadata document is formed by concatenating "meta", `P`, @@ -1112,7 +1132,7 @@ and ".array". The data key for array chunks is formed by concatenating grid layout. For example, for an array at hierarchy path "/foo/baz", the -corresponding metadata key is "meta/root/foo/baz.array". If the array +corresponding metadata key is ``meta/root/foo/baz.array``. If the array has two dimensions and a regular chunk grid, the data key for the chunk with grid coordinates (0, 0) is "data/foo/baz/0.0". From 3ce33b66da4b7ba3b5825fee3a9839ce6c752597 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Thu, 21 May 2020 23:41:30 -0700 Subject: [PATCH 06/34] Some extra modification and questions on the spec Mostly driven by a side implementation I'm working on and will publish on another repository. --- docs/protocol/core/v3.0.rst | 105 ++++++++++++++++++++++++++++++++--- docs/protocol/extensions.rst | 5 ++ 2 files changed, 103 insertions(+), 7 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 505109d3..786d13cc 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -1006,6 +1006,18 @@ bytes. It is assumed that the store holds (`key`, `value`) pairs, with only one such pair for any given `key`. I.e., a store is a mapping from keys to values. +A store can make the following assumption on the structures of the keys it will receive: + +- A key always: + - start with ``meta/`` + - starts with ``data/`` + - is exactly ``zarr.json``. + + + +- List operations ``list_dir`` will always be passed keys ending with a trailing slash, that is to say it will only be asked to work with complete node names. 
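The key-construction rules quoted in this hunk can be sketched in a few lines of Python. This is only an illustration, not part of the spec text: ``group_meta_key`` is the name used in the "Create a group" operation, while ``array_meta_key``, ``chunk_data_key`` and ``is_store_key`` are hypothetical helper names chosen here. The sketch assumes the ``meta/root`` prefix for array metadata, to agree with the worked ``meta/root/foo/baz.array`` example, and the default ``.`` chunk separator.

```python
def group_meta_key(path: str) -> str:
    """Metadata key for a group at hierarchy path `path`, e.g. "/foo/bar"."""
    if path == "/":
        return "meta/root.group"
    return "meta/root" + path + ".group"


def array_meta_key(path: str) -> str:
    """Metadata key for an array (hypothetical helper name)."""
    if path == "/":
        return "meta/root.array"
    return "meta/root" + path + ".array"


def chunk_data_key(path: str, grid_index: tuple, separator: str = ".") -> str:
    """Data key for the chunk at `grid_index` (hypothetical helper name)."""
    chunk_id = separator.join(str(i) for i in grid_index)
    if path == "/":
        # Root array: "data/" followed by the chunk identifier.
        return "data/" + chunk_id
    return "data" + path + "/" + chunk_id


def is_store_key(key: str) -> bool:
    """Check the key shapes a store may assume (hypothetical helper name)."""
    return key == "zarr.json" or key.startswith(("meta/", "data/"))


print(group_meta_key("/foo/bar"))
print(array_meta_key("/foo/baz"))
print(chunk_data_key("/foo/baz", (0, 0)))
```

Every key produced by these helpers satisfies ``is_store_key``, which is the assumption the proposed text lets a store rely on.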
+ + The store operations are grouped into three sets of capabilities: **readable**, **writeable** and **listable**. It is not necessary for a store implementation to support all of these capabilities. @@ -1029,6 +1041,14 @@ A **writeable store** supports the following operations: | Parameters: `key` | Output: none +``delete_prefix`` - Delete all keys with the given prefix from the store, include the prefix itself if it exists as a key: + + | Parameter: `key` + | Output: None + + + clients of delete_prefix should pay attention to pass a trailing slash on the key to delete a complete dataset or group, otherwise the store may delete similar keys. + A **listable store** supports any one or more of the following operations: @@ -1038,6 +1058,16 @@ operations: | Output: set of `keys` +TODO: the 2 spec below is ill-defined – or unclear if the prefix doe not have a +trailing slash. I'm assuming this is meant to only list after a slash to mimic +v2, but is equivalent to a glob. + + - /group/dataset + /group2/dataset + + - querying list_prefix or list_dir /group (no trailing slash) does list group2 + + ``list_prefix`` - Retrieve all keys with a given prefix. | Parameters: `prefix` @@ -1046,6 +1076,9 @@ operations: For example, if a store contains the keys "a/b", "a/c/d" and "e/f/g", then ``list_prefix("a/")`` would return "a/b" and "a/c/d". + Note that ``list_prefix`` may be passed keys that do not end in in slashes + in order to find all node names starting with a prefix. + ``list_dir`` - Retrieve all keys and prefixes with a given prefix and which do not contain the character "/" after the given prefix. @@ -1056,6 +1089,25 @@ which do not contain the character "/" after the given prefix. "a/f/g", then ``list_dir("a/")`` would return keys "a/b" and "a/c" and prefixes "a/d/" and "a/f/". + + TODO: i'm thinking on enforcing this one and have a SHOULD or MUST. + Stores implementing list dir CAN assume that all keys passed as parameter + do end in a trailing slash. 
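The trailing-slash question raised in this hunk is easier to see with a toy model. The sketch below treats a store as a plain Python set of keys, which is an assumption made for illustration only; the function names follow the ``list_prefix`` and ``list_dir`` operations, but returning a pair of sets from ``list_dir`` is a choice made here, not something the draft prescribes.

```python
def list_prefix(keys, prefix):
    """Return every key that starts with `prefix`."""
    return {k for k in keys if k.startswith(prefix)}


def list_dir(keys, prefix):
    """Return (keys, prefixes) directly below `prefix`.

    A key is returned when nothing after the prefix contains "/";
    otherwise only its next path component is returned, with a trailing "/".
    """
    child_keys, child_prefixes = set(), set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            # The key equals the prefix itself; nothing lies below it.
            continue
        head, sep, _ = rest.partition("/")
        if sep:
            child_prefixes.add(prefix + head + "/")
        else:
            child_keys.add(key)
    return child_keys, child_prefixes


# Keys chosen to be consistent with the list_dir example in the text.
keys = {"a/b", "a/c", "a/d/e", "a/f/g"}
print(list_dir(keys, "a/"))

# Without a trailing slash, a prefix query also matches sibling names.
siblings = {"group/dataset", "group2/dataset"}
print(sorted(list_prefix(siblings, "group")))
print(sorted(list_prefix(siblings, "group/")))
```

The last two calls show the ambiguity the TODO describes: ``"group"`` matches both ``group`` and ``group2``, while ``"group/"`` matches only the intended node.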
+ + + Store user must ensure that the queries to list_dir do endup in + + Note: if we lift the requirement on trailing slashes, then doing a prefix + search can lead to returning keys in the same hierarchy level but longer name. e.g: + + - /meta/foo + - /meta/foo/dataset + - /meta/foobar + + list_dir('foo') == 'foo'&'foobar' + list_dir('foo/') == 'foo/dataset' + + @TODO : should store return trailing slashes when possible ? @@ -1149,17 +1201,21 @@ corresponding metadata key is ``meta/root/foo/baz.array``. If the array has two dimensions and a regular chunk grid, the data key for the chunk with grid coordinates (0, 0) is "data/foo/baz/0.0". -If the root node is a group, the metadata key is "meta/root.group". If +If the root node is a group, the metadata key is ``meta/root.group``. If the root node is an array, the metadata key is "meta/root.array", and the data keys are formed by concatenating "data/" and the chunk identifier. + .. list-table:: Metadata Storage Key example :header-rows: 1 * - Type - Path "P" - Key for Metadata at path `P` + * - Entry-Point metadata (Zarr.json) + - `n/a` + - `zarr.json` * - Array (Root) - `/` - `meta/root.array` @@ -1167,10 +1223,16 @@ identifier. - `/` - `meta/root.group` * - Group - - `/foo/bar` + - `/foo` + - `meta/root/foo.group` + * - Array + - `/foo` + - `meta/root/foo.array` + * - Group + - `/foo` - `meta/root/foo/bar.group` * - Array - - `foo/baz` + - `/foo/baz` - `meta/root/foo/baz.array` @@ -1184,7 +1246,7 @@ identifier. - Path `P` of Chunck - Data Key * - Chunk - - `foo/baz/0.0` + - `/foo/baz/0.0` - `data/foo/baz/0.0` @@ -1202,9 +1264,18 @@ with grid coordinates (`i`, `j`, ...). Let "+" be the string concatenation operator. +TODO: This kinda force creating a group or an array to check whether the same +key is also a group or an array, which may create some latencies. I'm assuming +we don't wan to allow both groups and array to be at the same path P right ? 
+Also have to think about how to handle that for implicit groups, That is to say +if /g1/ is a array and we want to create /g1/g2/g3/g4/dataset, we need to check +all the parents to make sure they are not datasets. + +I would suggest to tell store they can assume clients will not try to do the wrong tihng. + **Create a group** - To create a group at hierarchy path `P`, perform + To create an explict group at hierarchy path `P`, perform ``set(group_meta_key(P), value)``, where `value` is the serialisation of a valid group metadata document. @@ -1261,11 +1332,31 @@ Let "+" be the string concatenation operator. **Delete a group or array** - @@TODO + To delete an array it is necessary to + - delete the metadata document for the array, (meta/P.array) + - delete all keys which prefix have path pointing to this to this array. (data/P/\*) + + To delete a implicit group. + - delete all arrays under this group + - it should be sufficient to deleta all the keys starting with prefix meta/P/ and data/P/ + + To delete an explict group. + - delete all arrays under this group, + - delete all keys with meta/P/ prefix, meta/P/groups all keys with /data/P prefix, + + Note that store implmentation may decide to reify implicit groups and thus protocol implmentation should attempt to delete the .meta/P/.group file if they really wish to delte an empty implicit group. + + Store implmentation are also allowed to delete any implicit parent of a deleted implicit groups, so a protocol implmentation should make sure to reify a parent group if they need to keep it. + + + **Determine if a node exists** - @@TODO + To determine if a node exists at path `P`, perform `exist()` + exists is tricky, you can querry `meta/P.*`, as P might be implicit, + so you need to querry both `meta/P.*` and `meta/P/*`. + @@TODO need to describe/explain implicit groups more? 
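The existence and deletion rules sketched in the hunk above can be tried out against a dictionary standing in for a store. This is a rough illustration under several assumptions: the store is an in-memory ``dict``, paths are non-root and written like ``/foo/bar``, metadata keys use the ``meta/root`` prefix, and ``node_exists`` and ``delete_node`` are hypothetical names rather than operations defined by the draft.

```python
def node_exists(store: dict, path: str) -> bool:
    """True if `path` is an explicit node or an implicit group."""
    base = "meta/root" + path
    # Explicit node: its own metadata document is present.
    explicit = (base + ".array") in store or (base + ".group") in store
    # Implicit group: some metadata key lives below this path.
    implicit = any(key.startswith(base + "/") for key in store)
    return explicit or implicit


def delete_node(store: dict, path: str) -> None:
    """Delete the node at `path` and everything below it."""
    meta_base = "meta/root" + path
    data_base = "data" + path
    own_docs = (meta_base + ".array", meta_base + ".group")
    doomed = [
        key
        for key in store
        if key in own_docs
        # The trailing "/" keeps similarly named siblings untouched.
        or key.startswith(meta_base + "/")
        or key.startswith(data_base + "/")
    ]
    for key in doomed:
        del store[key]


store = {
    "zarr.json": b"{}",
    "meta/root/foo/bar.array": b"{}",
    "data/foo/bar/0.0": b"\x00",
    "meta/root/foobar.array": b"{}",
    "data/foobar/0.0": b"\x00",
}
print(node_exists(store, "/foo"))
delete_node(store, "/foo")
print(sorted(store))
```

Here ``/foo`` exists only as an implicit group, and deleting it leaves ``/foobar`` intact because every prefix comparison ends in a slash, which is the precaution the ``delete_prefix`` note asks clients to take.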
diff --git a/docs/protocol/extensions.rst b/docs/protocol/extensions.rst index 139fb4c3..a5fb63e7 100644 --- a/docs/protocol/extensions.rst +++ b/docs/protocol/extensions.rst @@ -12,3 +12,8 @@ Under construction. extensions/complex-dtypes/v1.0 extensions/datetime-dtypes/v1.0 + +A number of other features might be included in the core protocol v3, but are currently considered as extensions. + + - Dimensions names for arrays: see https://github.com/zarr-developers/zarr-specs/issues/73 + - Ability to provide soft / hard ? links, and region references. From 4b23f779e521405edac04933bf9940be487d7044 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Wed, 3 Jun 2020 11:57:04 -0700 Subject: [PATCH 07/34] feddback from high bandwidth discussion --- docs/protocol/core/v3.0.rst | 93 +++++++++++++++++++------------------ 1 file changed, 49 insertions(+), 44 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 786d13cc..ef7e9528 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -1,5 +1,7 @@ .. This file is in restructured text format: http://docutils.sourceforge.net/rst.html + + ================================== Zarr core protocol (version 3.0) ================================== @@ -404,6 +406,14 @@ understood as fallback types of respectively 1 and 13 bytes longs, and implementation should ensure to pass or accept those types unchanged when using extension fallback. +Note: currently only fixed size elements are supported as a core data type. +There are many request for variable length element encoding. There are many +way to encode variable length and we want to keep flexibility. While we seem +to agree that for random access the most likely contender is to have two +arrays, one with the actual variable length data and one with fixed size +(pointer + length) to the variable size data we do nto want to commit to such +a structure. 
+ Chunk grids =========== @@ -672,30 +682,48 @@ containing the following names: extension. If an implementation of this specification encounters an extension - that it does not recognise, but the value of ``must_understand`` + that it does not recognize, but the value of ``must_understand`` is ``false``, then the extension may be ignored and processing may - continue. If the extension is not recognised and the value of + continue. If the extension is not recognized and the value of ``must_understand`` is ``true`` then processing must terminate and an appropriate error raised. - TODO: can we have must-understand to "true" but limitted to a subset of + TODO: can we have must-understand to "true" but limited to a subset of datasets / groups using this extension ? - suggestion: - affects : "root", "groups", "arrays" + TODO : We seem to agree that "must_understand" is too restrictive,and will need to be relaxed. + + Note: All extensions that are used in a zarr hierarchy must be stored in + the entry point metadata and implementation should expect to not + encounter any extensions not defined in the metadata entry points. + Implementation must ensure that any data written using an extension have + this extension added to the top level entry point. The entry point + metadata may contain extensions that are not in use in the hierarchy and + it is considered a viable trade off that removing extensions from the + entrypoint metadata requires a full rescan of the hierarchy. + + + + - affects: root : This extension is necessary to understand the complete layout of this hierarchy. For example the groups themselves have a specific structure. - affects: groups: This extension may only affect some groups, implementation may decide to only bail out when encountering groups that need this extension to be understood, or lazily load this extension when entering those groups. Example Multiresolution convention in groups. 
- affects: arrays: This extension will only affect understanding the arras storage or layout, browsing the hierachy still make sens. +.. suggestion: +.. affects : "root", "groups", "arrays" +.. +.. affects: root : This extension is necessary to understand the complete layout of this hierarchy. For example the groups themselves have a specific structure. +.. affects: groups: This extension may only affect some groups, implementation may decide to only bail out when encountering groups that need this extension to be understood, or lazily load this extension when entering those groups. Example Multiresolution convention in groups. +.. affects: arrays: This extension will only affect understanding the arras storage or layout, browsing the hierachy still make sens. - Each extension can also have a field ``requires``, that contain an unordered - list of URI to other extensions specification on which they rely for proper - behavior. Implementation MAY rely on the value of these requires to decide - in which order to load extensions and resolve types. If an extension - requires another one, then implementation should ensure that the required - extension is listed in the top-level extensions field. +.. Each extension can also have a field ``requires``, that contain an unordered +.. list of URI to other extensions specification on which they rely for proper +.. behavior. Implementation MAY rely on the value of these requires to decide +.. in which order to load extensions and resolve types. If an extension +.. requires another one, then implementation should ensure that the required +.. extension is listed in the top-level extensions field. + +.. it seem unclear whether the dependency should be in the spec or handled by implementations, +.. there are pro and cons to both. It seem like listing the dependencies in the spec is not necessary. 
For example, below is an entry point metadata document, specifying that JSON is being used for encoding of group and array metadata:: @@ -725,34 +753,6 @@ ignored if not understood:: } - -For example, below is an entry point metadata document as above, but also -specifying that many protocol extensions are being used; two of which are not required to be understood, one of which is necessary to be able to interprete arrays encoding:: - - { - "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0", - "metadata_encoding": "application/json", - "extensions": [ - { - "extension": "http://example.org/zarr/extension/multiverse/1.0", - "must_understand": false, - "requires": ["http://example.org/zarr/extension/complex_datetime/1.0"] - }, - { - "extension": "http://example.org/zarr/extension/complex_datetime/1.0", - "requires": ["http://example.org/zarr/extension/complex/1.0"] - "must_understand": false, - }, - { - "extension": "http://example.org/zarr/extension/complex/1.0", - "must_understand": true, - "affects": "arrays", - } - ] - } - - - Array metadata -------------- @@ -867,7 +867,7 @@ following names: For core ``data_type`` which ``fill_value`` are not permitted in JSON or for which decimal representation could be lossy, a string representing of - the binary (starting with ```0b``) or hexadecimal value (starting with + the binary (starting with ``0b``) or hexadecimal value (starting with ``0x``) is accepted. This string must include all leading or trailing zeroes necessary to match the given type size. The string values ``"NaN"``, ``"+Infinity"`` and ``"-Infinity"`` are also understood for floating point datatypes. @@ -1094,8 +1094,10 @@ which do not contain the character "/" after the given prefix. Stores implementing list dir CAN assume that all keys passed as parameter do end in a trailing slash. + TODO: how does it behave on non-existing path ? 
+ - Store user must ensure that the queries to list_dir do endup in + Store user must ensure that the queries to list_dir do end up in Note: if we lift the requirement on trailing slashes, then doing a prefix search can lead to returning keys in the same hierarchy level but longer name. e.g: @@ -1372,11 +1374,14 @@ in mostly 3 categories: - Core Datatypes Extensions – for example adding ability store fixed size types like complex and datetime in chunks. + - this seem to split into two categories: Purely declarative, and more complex. - Arrays Extensions – Non rectilinear grids, and variable length types. - Group extensions – Better understanding of a group layout : example pyramidal resolution. + see https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions + References ========== From b57bfade7aec01b9ba7f590eb8482f4a157bbc40 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Fri, 5 Jun 2020 13:03:41 -0700 Subject: [PATCH 08/34] rework some changes --- docs/protocol/core/v3.0.rst | 72 +++++++++++++++++++++++++++---------- 1 file changed, 53 insertions(+), 19 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index ef7e9528..b555f637 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -54,6 +54,20 @@ or by making a pull request against the This document was produced by the `Zarr core development team `_. +Goal of v3 spec +=============== + +Zarr spec v2 was originally designed around local filesystem, but Zarr has +grown and is now also often deployed on cloud / object storage. Those kind of +storage have characteristics, capabilities and usage patterns that can widely +differ from the assumptions of spec v2. V3 will be design to consider online +stores in particular we want to achieve the following. + + - no assumption that the underlying store have locking ability. + - ability to do concurrent writes with the assumption that writes from clients will be consistent. 
+ + + Document conventions ==================== @@ -236,13 +250,16 @@ conceptual model underpinning the Zarr protocol. *Metadata document* - Each node in a hierarchy_ is represented by a metadata document, + Each hierarchy_ and array_ is represented by a metadata document, which is a machine-readable document containing essential processing information about the node. For example, an array_ metadata document will specify the number of dimensions_, shape_, `data type`_, grid_, `memory layout`_ and compressor_ for that array_. + Groups can have a optional metadata document which provide extra + information about a group. + .. _store: .. _stores: @@ -693,14 +710,16 @@ containing the following names: TODO : We seem to agree that "must_understand" is too restrictive,and will need to be relaxed. - Note: All extensions that are used in a zarr hierarchy must be stored in - the entry point metadata and implementation should expect to not - encounter any extensions not defined in the metadata entry points. - Implementation must ensure that any data written using an extension have - this extension added to the top level entry point. The entry point - metadata may contain extensions that are not in use in the hierarchy and - it is considered a viable trade off that removing extensions from the - entrypoint metadata requires a full rescan of the hierarchy. +.. This seem to be difficult to achieve if there are no locks. +.. +.. Note: All extensions that are used in a zarr hierarchy must be stored in +.. the entry point metadata and implementation should expect to not +.. encounter any extensions not defined in the metadata entry points. +.. Implementation must ensure that any data written using an extension have +.. this extension added to the top level entry point. The entry point +.. metadata may contain extensions that are not in use in the hierarchy and +.. it is considered a viable trade off that removing extensions from the +.. 
entrypoint metadata requires a full rescan of the hierarchy. @@ -874,9 +893,10 @@ following names: ``extensions`` - The value of the extension field should be a list of URI of extensions used - to write/read the chunks present in this layout. Extension listed in this - field must be provided in the entry point metadata document. + See the top level metadata extension section for the time being. + + TODO: Add the fact that extensions only affects arrays, and expand on the must-understand. + TODO: Example with named dimensions. ``attributes`` @@ -949,33 +969,45 @@ chunking as above, but using an extension data type:: Group metadata -------------- -A Zarr group metadata object (``*.group``) must contain the ``extensions`` and -``attributes`` names as defined above in the `Array metadata`_ section. All +A Zarr group metadata object (``*.group``) must contain the +``attributes`` name as defined above in the `Array metadata`_ section. All other names are reserved for future versions of this specification. See also the section on `Protocol extensions`_ below. -For example, the JSON document below defines a group:: +For example, the JSON document below defines an explicit group:: { - "extensions": [], "attributes": { "spam": "ham", "eggs": 42, } } +.. note:: + + Groups cannot have extensions attached to them as of spec v3.0 Allowing + groups to have extensions would force any implementation to sequentially + traverse the store hierarchy in order to check for extensions, which would + defeat the purpose of a flat namespace and concurrent access. + + For the time being groups can only have attributes. + +.. note:: + + A group does not need a metadata document to exists, see implicit groups. + Metadata encoding ----------------- -The entry point metadata document must be encoded as JSON. The array (``*.arrays``) and -group metadata documents (``*.groups``) must be encoded as per the MIME type given in +The entry point metadata document must be encoded as JSON. 
The array (``*.array``s) and +group metadata documents (``*.group``s) must be encoded as per the MIME type given in the ``metadata_encoding`` field in the entry point metadata document (described below). @@TODO more explanation needed? Todo questions : How frequent are non-JSON metadata documents ? and how store specific are those. -If we allow arbitrary mimetypes, doe that precude the sections on arrays and group metadata to be normative ? +If we allow arbitrary mimetypes, does that precude the sections on arrays and group metadata to be normative ? Stores @@ -1017,6 +1049,8 @@ A store can make the following assumption on the structures of the keys it will - List operations ``list_dir`` will always be passed keys ending with a trailing slash, that is to say it will only be asked to work with complete node names. +Store implementation can assume they will only be given trailing slashes, and protocol implementation Must pass trailing slashes to underlying stores. + The store operations are grouped into three sets of capabilities: **readable**, **writeable** and **listable**. It is not necessary for From 9004089ccfcb267a9d13d45a64adbe4e616e40b8 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Sun, 14 Jun 2020 11:24:09 -0700 Subject: [PATCH 09/34] fix couple of issues on zspec --- docs/protocol/core/v3.0.rst | 95 ++++++++++++++++++++++++------------- 1 file changed, 61 insertions(+), 34 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index b555f637..da4cb357 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -54,19 +54,24 @@ or by making a pull request against the This document was produced by the `Zarr core development team `_. -Goal of v3 spec -=============== +Goal of v3 spec and main difference from v2 +=========================================== Zarr spec v2 was originally designed around local filesystem, but Zarr has -grown and is now also often deployed on cloud / object storage. 
Those kind of +grown and is now regularly deployed on cloud / object storage. Those kind of storage have characteristics, capabilities and usage patterns that can widely -differ from the assumptions of spec v2. V3 will be design to consider online +differ from the assumptions of spec v2. V3 is design to consider online stores in particular we want to achieve the following. - - no assumption that the underlying store have locking ability. - - ability to do concurrent writes with the assumption that writes from clients will be consistent. + - No assumption that the underlying store have locking ability. + - Ability to do concurrent writes with the assumption that writes from clients will be consistent, but non atomic. +Unlike Zarr spec version 2, the spec V3 have mainly the following difference: + - V3 is a flat key-value store instead of an hierarchical store. Hierarchy is implied. + - V3 has an explicit root, while v2 roots and group could not be distinguished. + - Separation of the data and metadata key space. + - Explicit support for extensions. Document conventions @@ -328,6 +333,8 @@ identify a data type. Core data types --------------- +.. endianess disapear + .. 
list-table:: Data types :header-rows: 1 @@ -343,27 +350,27 @@ Core data types - signed integer - 1 - None - * - ``i2`` + * - ``i2`` - signed integer - 2 - big-endian - * - ``i4`` + * - ``>i4`` - signed integer - 4 - big-endian - * - ``i8`` + * - ``>i8`` - signed integer - 8 - big-endian @@ -371,39 +378,39 @@ Core data types - signed integer - 1 - None - * - ``u2`` + * - ``f2`` - half precision float: sign bit, 5 bits exponent, 10 bits mantissa - 2 - big-endian - * - ``f4`` + * - ``>f4`` - single precision float: sign bit, 8 bits exponent, 23 bits mantissa - 4 - big-endian - * - ``f8`` + * - ``>f8`` - double precision float: sign bit, 11 bits exponent, 52 bits mantissa - 8 - big-endian @@ -504,15 +511,15 @@ is contained within the chunk at grid index (1, 7, 2) and has coordinates (2, 10, 100) within that chunk. -The identifier for chunk with grid index (`i`, `j`, `k`, ...) is +The identifier for chunk with grid index (``i``, ```j``, ``k``, ...) is formed by joining together ASCII string representations of each index using a separator. The default value for the separator is the period -character ("."), but this may be configured by providing a ``separator`` +character ( by default ``/``), but this may be configured by providing a ``separator`` value within the ``chunk_grid`` metadata object, see the section on `Array metadata`_ below. For example, in a 3 dimensional array, the identifier for the chunk at -grid index (1, 23, 45) is the string "1.23.45". +grid index (1, 23, 45) is the string "1/23/45". Note that this specification does not consider the case where the chunk grid and the array space are not aligned at the origin vertices @@ -523,6 +530,11 @@ origin vertex of the array may occur at an arbitrary position within any chunk, which is required to allow arrays to be extended by an arbitrary length in a "negative" direction along any dimension. +.. 
note:: A main difference with spec v2 is the default chunk separator + changed from ``.`` to ``/`` this help with compatibility with N5 as well as + decrease the maximum number of items in hierarchical stores like directory + stores. + Chunk memory layouts ==================== @@ -706,7 +718,7 @@ containing the following names: an appropriate error raised. TODO: can we have must-understand to "true" but limited to a subset of - datasets / groups using this extension ? + datasets / groups using this extension ? Should we also precise read/writes ? Example checksums ? TODO : We seem to agree that "must_understand" is too restrictive,and will need to be relaxed. @@ -770,6 +782,8 @@ ignored if not understood:: } ] } + +TODO: clarify / test: do we allow the root to also be a group/array. I don't see any issue with doing so as the entrypoint md has a different name. Array metadata @@ -965,6 +979,14 @@ chunking as above, but using an extension data type:: "attributes": {} } +.. note:: + comparison with spec v2, + ``dtype`` have been renamed to ``data_type``, + ``chunk`` have been renamed to ``chunk_grid``, + ``order`` have been renamed to ``chunk_memory_layout``, + ``filters`` have been removed, + ``zarr_format`` have been removed, + Group metadata -------------- @@ -997,6 +1019,11 @@ For example, the JSON document below defines an explicit group:: A group does not need a metadata document to exists, see implicit groups. +Chunks Metadata +--------------- + +Do we want to allow per-chunk metadata ? example checksum (concurrent read/writes). + Metadata encoding ----------------- @@ -1249,7 +1276,7 @@ identifier. 
* - Type - Path "P" - Key for Metadata at path `P` - * - Entry-Point metadata (Zarr.json) + * - Entry-Point metadata (zarr.json) - `n/a` - `zarr.json` * - Array (Root) @@ -1311,9 +1338,9 @@ I would suggest to tell store they can assume clients will not try to do the wro **Create a group** - To create an explict group at hierarchy path `P`, perform + To create an explicit group at hierarchy path `P`, perform ``set(group_meta_key(P), value)``, where `value` is the - serialisation of a valid group metadata document. + serialization of a valid group metadata document. If `P` is a non-root path then it is **not** necessary to create or check for the existence of metadata documents for groups at any @@ -1374,15 +1401,15 @@ I would suggest to tell store they can assume clients will not try to do the wro To delete a implicit group. - delete all arrays under this group - - it should be sufficient to deleta all the keys starting with prefix meta/P/ and data/P/ + - it should be sufficient to delete all the keys starting with prefix meta/P/ and data/P/ - To delete an explict group. + To delete an explicit group. - delete all arrays under this group, - delete all keys with meta/P/ prefix, meta/P/groups all keys with /data/P prefix, - Note that store implmentation may decide to reify implicit groups and thus protocol implmentation should attempt to delete the .meta/P/.group file if they really wish to delte an empty implicit group. + Note that store implementation may decide to reify implicit groups and thus protocol implementation should attempt to delete the .meta/P/.group file if they really wish to delte an empty implicit group. - Store implmentation are also allowed to delete any implicit parent of a deleted implicit groups, so a protocol implmentation should make sure to reify a parent group if they need to keep it. 
+ Store implementation are also allowed to delete any implicit parent of a deleted implicit groups, so a protocol implementation should make sure to reify a parent group if they need to keep it. @@ -1390,8 +1417,8 @@ I would suggest to tell store they can assume clients will not try to do the wro **Determine if a node exists** To determine if a node exists at path `P`, perform `exist()` - exists is tricky, you can querry `meta/P.*`, as P might be implicit, - so you need to querry both `meta/P.*` and `meta/P/*`. + exists is tricky, you can query `meta/P.*`, as P might be implicit, + so you need to query both `meta/P.*` and `meta/P/*`. From 04c81f13009947addf9442d84d50c9831d357aca Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Sun, 14 Jun 2020 18:58:31 -0700 Subject: [PATCH 10/34] typo --- docs/protocol/core/v3.0.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index da4cb357..1deeac12 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -982,7 +982,7 @@ chunking as above, but using an extension data type:: .. 
note:: comparison with spec v2, ``dtype`` have been renamed to ``data_type``, - ``chunk`` have been renamed to ``chunk_grid``, + ``chunks`` have been renamed to ``chunk_grid``, ``order`` have been renamed to ``chunk_memory_layout``, ``filters`` have been removed, ``zarr_format`` have been removed, From 7bdf4f708181ca3aed61bd0f5d0df38457da1701 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Sun, 14 Jun 2020 22:13:48 -0700 Subject: [PATCH 11/34] list optional separator --- docs/protocol/core/v3.0.rst | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 1deeac12..716d57f9 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -511,7 +511,7 @@ is contained within the chunk at grid index (1, 7, 2) and has coordinates (2, 10, 100) within that chunk. -The identifier for chunk with grid index (``i``, ```j``, ``k``, ...) is +The identifier for chunk with grid index (``i``, ``j``, ``k``, ...) is formed by joining together ASCII string representations of each index using a separator. The default value for the separator is the period character ( by default ``/``), but this may be configured by providing a ``separator`` @@ -838,10 +838,10 @@ following names: The chunk grid of the Zarr array. If the chunk grid is a regular chunk grid as defined in this specification, then the value must be an object with the names ``type`` and ``chunk_shape``. The - value of ``type`` must be the string "regular", and the value of + value of ``type`` must be the string ``"regular"``, and the value of ``chunk_shape`` must be an array of integers providing the lengths of the chunk along each dimension of the array. For example, - ``{"type": "regular", "chunk_shape": [2, 5]}`` means a regular + ``{"type": "regular", "chunk_shape": [2, 5], "separator":"/"}`` means a regular grid where the chunks have length 2 along the first dimension and length 5 along the second dimension. 
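
As a rough illustration of how the ``chunk_grid`` object described above might
be used, the sketch below maps the coordinates of an array element to a chunk
identifier. It is not part of the specification: the function name
``chunk_identifier`` is hypothetical, only regular grids aligned with the array
origin are handled, and the fallback separator of ``/`` is an assumption based
on the default proposed in this patch series.

.. code-block:: python

    def chunk_identifier(coords, chunk_grid):
        """Return the identifier of the chunk containing the element at coords.

        Hypothetical helper: assumes a regular chunk grid aligned with the
        array origin.
        """
        if chunk_grid["type"] != "regular":
            raise ValueError("only regular chunk grids are handled in this sketch")
        chunk_shape = chunk_grid["chunk_shape"]
        if len(coords) != len(chunk_shape):
            raise ValueError("coords and chunk_shape must have the same length")
        # Assumed default: "/" when no separator is given in the metadata.
        separator = chunk_grid.get("separator", "/")
        # Floor division gives the grid index along each dimension.
        grid_index = [c // s for c, s in zip(coords, chunk_shape)]
        return separator.join(str(i) for i in grid_index)


    grid = {"type": "regular", "chunk_shape": [5, 20, 400], "separator": "/"}
    print(chunk_identifier((7, 150, 900), grid))  # prints 1/7/2

With the chunk shape (5, 20, 400) used in the regular grid example, the element
at (7, 150, 900) falls in the chunk at grid index (1, 7, 2), hence the
identifier ``1/7/2``.
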
@@ -935,7 +935,8 @@ compressed using gzip compression prior to storage:: "data_type": " Date: Wed, 29 Jul 2020 12:35:53 -0700 Subject: [PATCH 12/34] Apply suggestions from code review Co-authored-by: Alistair Miles --- docs/protocol/core/v3.0.rst | 28 ++++++++++++---------------- 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 716d57f9..5c80c0cf 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -255,7 +255,7 @@ conceptual model underpinning the Zarr protocol. *Metadata document* - Each hierarchy_ and array_ is represented by a metadata document, + Each array_ in a hierarchy_ is represented by a metadata document, which is a machine-readable document containing essential processing information about the node. For example, an array_ metadata document will specify the number of dimensions_, shape_, @@ -642,7 +642,7 @@ Metadata This section defines the structure of metadata documents for Zarr hierarchies, which consists of three types metadata documents: an entry point metadata -document ( e.g. ``zarr.json``), array metadata documents, and group metadata +document (``zarr.json``), array metadata documents, and group metadata documents. Each type of metadata document is described in the following subsections. @@ -790,7 +790,7 @@ Array metadata -------------- Each Zarr array in a hierarchy must have an array metadata -document (``*.array``). This document must contain a single object with the +document. This document must contain a single object with the following names: ``shape`` @@ -993,7 +993,7 @@ chunking as above, but using an extension data type:: Group metadata -------------- -A Zarr group metadata object (``*.group``) must contain the +A Zarr group metadata object must contain the ``attributes`` name as defined above in the `Array metadata`_ section. All other names are reserved for future versions of this specification. 
See also the section on `Protocol extensions`_ below. @@ -1021,10 +1021,6 @@ For example, the JSON document below defines an explicit group:: A group does not need a metadata document to exists, see implicit groups. -Chunks Metadata ---------------- - -Do we want to allow per-chunk metadata ? example checksum (concurrent read/writes). Metadata encoding ----------------- @@ -1264,7 +1260,7 @@ grid layout. For example, for an array at hierarchy path "/foo/baz", the corresponding metadata key is ``meta/root/foo/baz.array``. If the array has two dimensions and a regular chunk grid, the data key for the -chunk with grid coordinates (0, 0) is "data/foo/baz/0.0". +chunk with grid coordinates (0, 0) is "data/foo/baz/0/0". If the root node is a group, the metadata key is ``meta/root.group``. If the root node is an array, the metadata key is "meta/root.array", and @@ -1294,7 +1290,7 @@ identifier. - `/foo` - `meta/root/foo.array` * - Group - - `/foo` + - `/foo/bar` - `meta/root/foo/bar.group` * - Array - `/foo/baz` @@ -1307,12 +1303,12 @@ identifier. .. 
list-table:: Data Storage Key example :header-rows: 1 - * - Type - - Path `P` of Chunck - - Data Key - * - Chunk - - `/foo/baz/0.0` - - `data/foo/baz/0.0` + * - Path `P` of array + - Chunk grid indices + - Data key + * - `/foo/baz` + - `(0, 0)` + - `data/foo/baz/0/0` From ab94946bf84cc71e0492dfd46fd27a27c0f3ab69 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Wed, 29 Jul 2020 13:04:56 -0700 Subject: [PATCH 13/34] some changes --- docs/protocol/core/v3.0.rst | 40 ++++++++++++++++++++++++++++++++++--- 1 file changed, 37 insertions(+), 3 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 5c80c0cf..0aea5093 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -1072,9 +1072,29 @@ A store can make the following assumption on the structures of the keys it will -- List operations ``list_dir`` will always be passed keys ending with a trailing slash, that is to say it will only be asked to work with complete node names. +- List operations ``list_dir`` will always be passed keys ending with a trailing + slash, that is to say it will only be asked to work with complete node names. -Store implementation can assume they will only be given trailing slashes, and protocol implementation Must pass trailing slashes to underlying stores. +Store implementation can assume they will only be given trailing slashes, and +protocol implementation MUST pass trailing slashes to underlying stores. + +For example, a store containing the following keys: + + - ``meta/2018/.group`` + - ``meta/2018-01/.group`` + - ``meta/2018/bar/.array`` + - ``data/2018/bar/0.0`` + +The following queries are invalid: + - ``list_dir('201')`` is invalid as ``"201"`` is not an existing node. 
+ - ``list_dir('2018')`` is invalid querries as ``"2018"`` does not ends with a ``/``, + +This is valid: + - ``list_dir('2018/')`` + - ``list_dir('2018-01/')`` + +This allows store implementation to avoid having to check for trailing slashes, +and avoid issues like "list_dir('2018')" to return values likes ``-01`` The store operations are grouped into three sets of capabilities: @@ -1171,6 +1191,20 @@ which do not contain the character "/" after the given prefix. @TODO : should store return trailing slashes when possible ? +Like would ``list_dir('mydir')`` returns: + - path1 + - path2 + - path3/ + - path4/ + +In which case you _know_ that 3,4 are dir or + + - path1 + - path2 + - path3 + - path4 + +Similar to ``ls -p`` Store implementations --------------------- @@ -1332,7 +1366,7 @@ Also have to think about how to handle that for implicit groups, That is to say if /g1/ is a array and we want to create /g1/g2/g3/g4/dataset, we need to check all the parents to make sure they are not datasets. -I would suggest to tell store they can assume clients will not try to do the wrong tihng. +I would suggest to tell store they can assume clients will not try to do the wrong thing. **Create a group** From 83c8f40762a566a6d06d1bc47ea19f44dd1a4327 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Wed, 29 Jul 2020 14:32:08 -0700 Subject: [PATCH 14/34] refine discussions on bytes; bit and trailing slashes --- docs/protocol/core/v3.0.rst | 133 ++++++++++++++++++++++-------------- 1 file changed, 82 insertions(+), 51 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 0aea5093..232f7e85 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -293,8 +293,6 @@ node names: * must not be a string composed only of period characters, e.g. "." or ".." 
-TODO: likely want to reserver names that start with just a single ``.`` - * must be at most 255 characters long Note that node names are used to form storage keys, and that some @@ -415,27 +413,44 @@ Core data types - 8 - big-endian * - ``r*`` (Optional) - - raw bits/bytes, use for extension type fallbacks - - variable, given by ``*``, multiple of 8 + - raw bits, use for extension type fallbacks + - variable, given by ``*``, is not limited to be a multiple of 8. - N/A Floating point types correspond to basic binary interchange formats as defined by IEEE 754-2008. -Additionally to these base types, an implementation should also handle the raw -pass through type designated by the lowercase letter ``r`` followed by the -number of bits as a multiple of 8. For example, ``r8``, ``r104`` should be -understood as fallback types of respectively 1 and 13 bytes longs, and -implementation should ensure to pass or accept those types unchanged when using -extension fallback. +Additionally to these base types, an implementation should also handle the +raw/opaque pass through type designated by the lowercase letter ``r`` followed +by the number of bits. For example, ``r3``, ``r11`` should be understood as +fallback types of respectively 3 and 11 bytes longs, and implementation should +ensure to pass or accept those types unchanged when using extension fallback. +For array whose total length is not a multiple of 8, zarr implementation can +expect expects a buffer which length is a multiple of 8bits, padded with +appropriate value at the end, and will in the same way return buffers padded to +the next multiple of 8 bytes. + +For example, here is a array of 5 elements of opaque type r3 will be stored as + + - ``AAABBBCC CDDDEEE_``, 3*5 = 15 bits, rounded to 2 bytes, with arbitrary + last bit. + + - 103*r11 items would take 1136 bits (142 bytes), with the 3 last bits + undefined. 
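
The byte counts above follow from rounding the total number of bits up to a
whole number of bytes. The sketch below shows that arithmetic and one possible
packing. The helper names ``packed_size`` and ``pack_bits`` are hypothetical,
and two details are assumptions rather than requirements of this draft:
elements are laid out most significant bit first (matching the
``AAABBBCC CDDDEEE_`` picture), and the padding bits are set to zero.

.. code-block:: python

    def packed_size(n_items, bits_per_item):
        """Return (n_bytes, n_padding_bits) for n_items of an ``r<bits>`` type."""
        total_bits = n_items * bits_per_item
        n_bytes = -(-total_bits // 8)  # ceiling division
        return n_bytes, n_bytes * 8 - total_bits


    def pack_bits(values, bits_per_item):
        """Pack integer values MSB-first into bytes (assumed layout, zero padding)."""
        mask = (1 << bits_per_item) - 1
        acc = 0
        for value in values:
            # Append each element's bits after the bits already accumulated.
            acc = (acc << bits_per_item) | (value & mask)
        n_bytes, padding = packed_size(len(values), bits_per_item)
        # Shift left so that the padding bits sit at the end of the last byte.
        return (acc << padding).to_bytes(n_bytes, "big")


    print(packed_size(5, 3))     # (2, 1): 15 data bits, 1 padding bit
    print(packed_size(103, 11))  # (142, 3): 1133 data bits, 3 padding bits
    print(pack_bits([0b111, 0b000, 0b111, 0b000, 0b111], 3).hex())  # e38e
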
+ +This is meant to help replacing the packbits filter from numcodecs, as well as +storing categorical data via extension in Zarr v3 + + + Note: currently only fixed size elements are supported as a core data type. There are many request for variable length element encoding. There are many way to encode variable length and we want to keep flexibility. While we seem to agree that for random access the most likely contender is to have two arrays, one with the actual variable length data and one with fixed size -(pointer + length) to the variable size data we do nto want to commit to such +(pointer + length) to the variable size data we do not want to commit to such a structure. @@ -693,9 +708,9 @@ containing the following names: ``metadata_encoding`` - A string containing the MIME type that has been used for encoding + A string containing the URI that has been used for encoding group and array metadata documents. If JSON encoding is used, this - value must be "application/json". + value must be ``"https://purl.org/zarr/spec/metadata/json/3.0"``. ``extensions`` @@ -1137,16 +1152,6 @@ operations: | Output: set of `keys` -TODO: the 2 spec below is ill-defined – or unclear if the prefix doe not have a -trailing slash. I'm assuming this is meant to only list after a slash to mimic -v2, but is equivalent to a glob. - - - /group/dataset - /group2/dataset - - - querying list_prefix or list_dir /group (no trailing slash) does list group2 - - ``list_prefix`` - Retrieve all keys with a given prefix. | Parameters: `prefix` @@ -1155,31 +1160,34 @@ v2, but is equivalent to a glob. For example, if a store contains the keys "a/b", "a/c/d" and "e/f/g", then ``list_prefix("a/")`` would return "a/b" and "a/c/d". - Note that ``list_prefix`` may be passed keys that do not end in in slashes - in order to find all node names starting with a prefix. 
+ Note behavior of ``list_prefix`` is undefined if ``prefix`` does not ends + with a trailing slash ``/`` and store can assume there is as least one key + that stars with prefix. + + + TODO : Unlike ``list_dir`` all keys in list prefix are (likely?) to end up with no + slash as they are all terminal (unless you have a dangling implicit group + when a store does not cleanup after itself?); also not this in list. ``list_dir`` - Retrieve all keys and prefixes with a given prefix and which do not contain the character "/" after the given prefix. - | Parameters: `prefix` + | Parameters: `prefix`, ends with a trailing slash ``/`` | Output: set of `keys` and set of `prefixes` For example, if a store contains the keys "a/b", "a/c", "a/d/e", "a/f/g", then ``list_dir("a/")`` would return keys "a/b" and "a/c" and prefixes "a/d/" and "a/f/". + On non-existing prefix, store may return the empty set. - TODO: i'm thinking on enforcing this one and have a SHOULD or MUST. - Stores implementing list dir CAN assume that all keys passed as parameter - do end in a trailing slash. - - TODO: how does it behave on non-existing path ? - Store user must ensure that the queries to list_dir do end up in - Note: if we lift the requirement on trailing slashes, then doing a prefix - search can lead to returning keys in the same hierarchy level but longer name. e.g: + Note: The requirement on trailing slashes is to avoid + search returning keys in the same hierarchy level but longer name, and + potentially expensive logic testing for the present of trailing slash on + each query. e.g: - /meta/foo - /meta/foo/dataset @@ -1189,22 +1197,25 @@ which do not contain the character "/" after the given prefix. list_dir('foo/') == 'foo/dataset' -@TODO : should store return trailing slashes when possible ? + Stores Must return trailing slashes in key responses when those + have prefix of other keys. 
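
The listing operations described above can be sketched over a plain set of
keys. This is an illustration only, not a store implementation: the functions
below are hypothetical, they take the set of keys as an explicit argument, and
raising an error for a prefix without a trailing slash is just one possible
choice, since the draft leaves that case undefined.

.. code-block:: python

    def list_prefix(keys, prefix):
        """Return all keys starting with prefix (which must end with "/")."""
        if not prefix.endswith("/"):
            raise ValueError("prefix must end with a trailing slash")
        return {key for key in keys if key.startswith(prefix)}


    def list_dir(keys, prefix):
        """Return (keys, prefixes) found directly under prefix."""
        if not prefix.endswith("/"):
            raise ValueError("prefix must end with a trailing slash")
        direct_keys, sub_prefixes = set(), set()
        for key in keys:
            if not key.startswith(prefix):
                continue
            # Split the remainder at the first "/" to see whether the key
            # lives directly under the prefix or deeper in the hierarchy.
            head, sep, _ = key[len(prefix):].partition("/")
            if sep:
                sub_prefixes.add(prefix + head + "/")
            else:
                direct_keys.add(key)
        return direct_keys, sub_prefixes


    store_keys = {"a/b", "a/c", "a/d/e", "a/f/g"}
    found_keys, found_prefixes = list_dir(store_keys, "a/")
    print(sorted(found_keys))      # ['a/b', 'a/c']
    print(sorted(found_prefixes))  # ['a/d/', 'a/f/']

In this sketch a prefix that matches nothing yields empty sets, which is in
line with the note that a store may return the empty set for a non-existing
prefix.
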
+ + Like would ``list_dir('mydir')`` returns: + - ``path1`` + - ``path2`` + - ``path3/`` + - ``path4/`` + + Thus we know that path1, and path2 are terminal objects with data, and that + ``path3`` and ``path4`` -Like would ``list_dir('mydir')`` returns: - - path1 - - path2 - - path3/ - - path4/ -In which case you _know_ that 3,4 are dir or + Similar to ``ls -p`` - - path1 - - path2 - - path3 - - path4 + Note: In practice I believe this means that this means keys always ends in + ``/``, ``.json``, ``.array``, ``.group``, or otherwise they will be chunks + data. -Similar to ``ls -p`` Store implementations --------------------- @@ -1423,7 +1434,9 @@ I would suggest to tell store they can assume clients will not try to do the wro **Discover all nodes in a hierarchy** - @@TODO + To discover all nodes in a hierarchy, one can call ``list("meta/")``. + - all keys represent either explicit group or arrays. + - all intermediate prefixes ending in a ``/`` are implicit groups. **Delete a group or array** @@ -1439,9 +1452,29 @@ I would suggest to tell store they can assume clients will not try to do the wro - delete all arrays under this group, - delete all keys with meta/P/ prefix, meta/P/groups all keys with /data/P prefix, - Note that store implementation may decide to reify implicit groups and thus protocol implementation should attempt to delete the .meta/P/.group file if they really wish to delte an empty implicit group. + Note that store implementation may decide to reify implicit groups and thus + protocol implementation should attempt to delete the .meta/P/.group file if + they really wish to delete an empty implicit group. + + Store implementation are also allowed to delete any implicit parent of a + deleted implicit groups, so a protocol implementation should make sure to + reify a parent group if they need to keep it. 
For example assuming the + following: + + >>> z = new_dataset() + >>> z.create_array('/path/to/array') + + >>> z.delete_array('/path/to/array') + + This may not be sufficient to delete the group ``/path/to/``, as a store + implementation, and thus removing ``/path/to/`` may need an implmentation + to explicitly call + + >>> z.delete_group('/path/to/') + + Even if an explicit group was not explicitly created. + - Store implementation are also allowed to delete any implicit parent of a deleted implicit groups, so a protocol implementation should make sure to reify a parent group if they need to keep it. @@ -1454,8 +1487,6 @@ I would suggest to tell store they can assume clients will not try to do the wro -@@TODO need to describe/explain implicit groups more? - Protocol extensions =================== From c85f5eba5631b89f8c432cfa970c253b58388f9a Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Wed, 29 Jul 2020 14:41:30 -0700 Subject: [PATCH 15/34] Add notes on non-nnormative for non-json metadata --- docs/protocol/core/v3.0.rst | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 232f7e85..a73b174b 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -341,7 +341,7 @@ Core data types - Size (no. bytes) - Byte order * - ``bool`` - - Boolean, with False encoded as ``x00`` and True encoded as ``x01`` + - Boolean, with False encoded as ``\\x00`` and True encoded as ``\\x01`` - 1 - None * - ``i1`` @@ -671,6 +671,14 @@ name/value pairs. This section also defines how metadata documents are encoded for storage. 
+Only the top level metadata document ``zarr.json`` is guarantied to be json, and +can be used to defined other format to array-level and group-level metadata +document; in the case where non-json metadata document are use in a zarr +hierarchy the following sections on group and array level metadata are +non-normative; but other metadata format as expected to define some equivalence +relations with the JSON documents. + + Entry point metadata -------------------- @@ -841,7 +849,7 @@ following names: metadata ? I'm going to guess a number of things can't won't be able to dynamically load extensions so need to know immediately. That would also allow to define datatypes only once on top-level hierarchy. - Do we allow multiple version of the same extenions ? + Do we allow multiple version of the same extensions ? Restrict extensions types charset/format to not block the core types to be extended ? Allow different extension to have same named type in multiple @@ -1045,11 +1053,6 @@ group metadata documents (``*.group``s) must be encoded as per the MIME type giv the ``metadata_encoding`` field in the entry point metadata document (described below). -@@TODO more explanation needed? -Todo questions : How frequent are non-JSON metadata documents ? and how store specific are those. -If we allow arbitrary mimetypes, does that precude the sections on arrays and group metadata to be normative ? 
- - Stores ====== From da35be708413f35cd99c3dd163638736c46fe3bd Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Tue, 4 Aug 2020 10:37:11 -0700 Subject: [PATCH 16/34] fix bit/bytes & clarification --- docs/protocol/core/v3.0.rst | 66 ++++++++++++++++++++++--------------- 1 file changed, 40 insertions(+), 26 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index a73b174b..6089fff6 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -72,6 +72,11 @@ Unlike Zarr spec version 2, the spec V3 have mainly the following difference: - V3 has an explicit root, while v2 roots and group could not be distinguished. - Separation of the data and metadata key space. - Explicit support for extensions. + - chunk separator is ``/`` by default. + +This means that a store cannot be opened at an arbitrary point, and need to be +opened at the root. User facing convenience functions could walk up a given +hierarchy and return a sub-group, but this is not part of the API. Document conventions @@ -146,6 +151,11 @@ conceptual model underpinning the Zarr protocol. "/foo/bar" identifies a node named "bar", whose parent is named "foo", whose parent is the root of the hierarchy. The path "/" identifies the root node. + + A path always stars with ``/``. + + QUESTION: do we want to codify that group ``path`` end, can end, or must end + in ``/`` ? .. _dimension: .. _dimensions: @@ -424,14 +434,14 @@ defined by IEEE 754-2008. Additionally to these base types, an implementation should also handle the raw/opaque pass through type designated by the lowercase letter ``r`` followed by the number of bits. For example, ``r3``, ``r11`` should be understood as -fallback types of respectively 3 and 11 bytes longs, and implementation should +fallback types of respectively 3 and 11 bits longs, and implementation should ensure to pass or accept those types unchanged when using extension fallback. 
-For array whose total length is not a multiple of 8, zarr implementation can +For array whose total length is not a multiple of 8, Zarr implementation can expect expects a buffer which length is a multiple of 8bits, padded with appropriate value at the end, and will in the same way return buffers padded to -the next multiple of 8 bytes. +the next multiple of 8 bits. -For example, here is a array of 5 elements of opaque type r3 will be stored as +For example, here is a array of 5 elements of opaque type ``r3`` will be stored as - ``AAABBBCC CDDDEEE_``, 3*5 = 15 bits, rounded to 2 bytes, with arbitrary last bit. @@ -443,8 +453,6 @@ This is meant to help replacing the packbits filter from numcodecs, as well as storing categorical data via extension in Zarr v3 - - Note: currently only fixed size elements are supported as a core data type. There are many request for variable length element encoding. There are many way to encode variable length and we want to keep flexibility. While we seem @@ -1049,7 +1057,7 @@ Metadata encoding ----------------- The entry point metadata document must be encoded as JSON. The array (``*.array``s) and -group metadata documents (``*.group``s) must be encoded as per the MIME type given in +group metadata documents (``*.group``s) must be encoded as per the type given in the ``metadata_encoding`` field in the entry point metadata document (described below). @@ -1085,9 +1093,12 @@ A store can make the following assumption on the structures of the keys it will - A key always: - start with ``meta/`` - - starts with ``data/`` + - start with ``data/`` - is exactly ``zarr.json``. +- Most of the keys: + - start with ``meta/root`` + - start with ``data/root`` - List operations ``list_dir`` will always be passed keys ending with a trailing @@ -1105,7 +1116,7 @@ For example, a store containing the following keys: The following queries are invalid: - ``list_dir('201')`` is invalid as ``"201"`` is not an existing node. 
- - ``list_dir('2018')`` is invalid querries as ``"2018"`` does not ends with a ``/``, + - ``list_dir('2018')`` is invalid queries as ``"2018"`` does not ends with a ``/``, This is valid: - ``list_dir('2018/')`` @@ -1144,7 +1155,9 @@ A **writeable store** supports the following operations: | Output: None - clients of delete_prefix should pay attention to pass a trailing slash on the key to delete a complete dataset or group, otherwise the store may delete similar keys. + Clients of delete_prefix should pay attention to pass a trailing slash on + the key to delete a complete dataset or group, otherwise the store may + delete similar keys. A **listable store** supports any one or more of the following operations: @@ -1185,8 +1198,6 @@ which do not contain the character "/" after the given prefix. On non-existing prefix, store may return the empty set. - - Note: The requirement on trailing slashes is to avoid search returning keys in the same hierarchy level but longer name, and potentially expensive logic testing for the present of trailing slash on @@ -1196,27 +1207,27 @@ which do not contain the character "/" after the given prefix. - /meta/foo/dataset - /meta/foobar - list_dir('foo') == 'foo'&'foobar' - list_dir('foo/') == 'foo/dataset' + list_dir('/meta/foo') == '/meta/foo'&'/meta/foobar' + list_dir('/meta/foo/') == '/meta/foo/dataset' Stores Must return trailing slashes in key responses when those - have prefix of other keys. + are prefix of other keys. - Like would ``list_dir('mydir')`` returns: - - ``path1`` - - ``path2`` - - ``path3/`` - - ``path4/`` + Like would ``list_dir('/meta/mydir')`` returns: + - ``/meta/path1`` + - ``/meta/path2`` + - ``/meta/path3/`` + - ``/meta/path4/`` - Thus we know that path1, and path2 are terminal objects with data, and that - ``path3`` and ``path4`` + Thus we know that ``path1``, and ``path2`` are terminal objects with data, + and that ``/meta/path3`` and ``/meta/path4``. 
- Similar to ``ls -p`` + This is similar to ``ls -p`` on Unix systems. - Note: In practice I believe this means that this means keys always ends in - ``/``, ``.json``, ``.array``, ``.group``, or otherwise they will be chunks + Note: In practice this means that this means most returned keys always ends in + ``/``, ``.json``, ``.array``, ``.group``, they will otherwise be chunks data. @@ -1305,6 +1316,9 @@ and ".array". The data key for array chunks is formed by concatenating "data", `P`, "/", and the chunk identifier as defined by the chunk grid layout. +To get the path ``P`` from a key, either remove the trailing ``.array`` or +``.group`` as well as the ``meta/root`` prefix. + For example, for an array at hierarchy path "/foo/baz", the corresponding metadata key is ``meta/root/foo/baz.array``. If the array has two dimensions and a regular chunk grid, the data key for the @@ -1377,7 +1391,7 @@ TODO: This kinda force creating a group or an array to check whether the same key is also a group or an array, which may create some latencies. I'm assuming we don't wan to allow both groups and array to be at the same path P right ? Also have to think about how to handle that for implicit groups, That is to say -if /g1/ is a array and we want to create /g1/g2/g3/g4/dataset, we need to check +if path /g1/ is a array and we want to create /g1/g2/g3/g4/dataset, we need to check all the parents to make sure they are not datasets. I would suggest to tell store they can assume clients will not try to do the wrong thing. 
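
The key rules touched by the patch above (a metadata key is ``meta/root`` + `P` + ``.array`` or ``.group``, and `P` is recovered by stripping that prefix and suffix) can be sketched in Python. This is only a minimal illustration under stated assumptions, not code from the spec or from any Zarr implementation: the helper names ``meta_key`` and ``path_from_meta_key`` are hypothetical, and the root group key ``meta/root.group`` is assumed by analogy with the ``meta/root.array`` key that the text gives for a root array.

```python
META_ROOT = "meta/root"          # metadata key prefix used by the spec text
SUFFIXES = (".array", ".group")  # suffixes of array / group metadata keys


def meta_key(path: str, kind: str) -> str:
    """Build the metadata key for a node at hierarchy path `path`.

    `kind` is "array" or "group". The root path "/" is treated as a
    special case and maps to "meta/root.array" or "meta/root.group"
    (the latter is an assumption, see the lead-in).
    """
    if kind not in ("array", "group"):
        raise ValueError(f"unknown node kind: {kind!r}")
    if not path.startswith("/"):
        raise ValueError("a hierarchy path always starts with '/'")
    node = "" if path == "/" else path
    return META_ROOT + node + "." + kind


def path_from_meta_key(key: str) -> str:
    """Recover the hierarchy path `P` from an array or group metadata key."""
    if not key.startswith(META_ROOT):
        raise ValueError(f"not a metadata key: {key!r}")
    for suffix in SUFFIXES:
        if key.endswith(suffix):
            # Strip the "meta/root" prefix and the ".array"/".group" suffix.
            node = key[len(META_ROOT):-len(suffix)]
            if node and not node.startswith("/"):
                raise ValueError(f"malformed metadata key: {key!r}")
            return node or "/"
    raise ValueError(f"key has no .array or .group suffix: {key!r}")


print(meta_key("/foo/baz", "array"))                  # meta/root/foo/baz.array
print(path_from_meta_key("meta/root/foo/bar.group"))  # /foo/bar
print(path_from_meta_key("meta/root.array"))          # /
```

Keeping the two helpers as exact inverses makes it easy to check that every key returned by a listing under ``meta/`` maps back to exactly one hierarchy path.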
From 009e15f1ec1752c40f4b798ac1463bb0ca46c916 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Tue, 4 Aug 2020 10:39:07 -0700 Subject: [PATCH 17/34] update to json spec URI --- docs/protocol/core/v3.0.rst | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 6089fff6..f0502bff 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -436,8 +436,8 @@ raw/opaque pass through type designated by the lowercase letter ``r`` followed by the number of bits. For example, ``r3``, ``r11`` should be understood as fallback types of respectively 3 and 11 bits longs, and implementation should ensure to pass or accept those types unchanged when using extension fallback. -For array whose total length is not a multiple of 8, Zarr implementation can -expect expects a buffer which length is a multiple of 8bits, padded with +For array whose total length is not a multiple of 8 bits, Zarr implementation can +expect a buffer which length is a multiple of 8 bits, padded with appropriate value at the end, and will in the same way return buffers padded to the next multiple of 8 bits. @@ -726,7 +726,7 @@ containing the following names: A string containing the URI that has been used for encoding group and array metadata documents. If JSON encoding is used, this - value must be ``"https://purl.org/zarr/spec/metadata/json/3.0"``. + value must be ``"https://tools.ietf.org/html/rfc8259"``. 
``extensions`` @@ -792,7 +792,7 @@ JSON is being used for encoding of group and array metadata:: { "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0", - "metadata_encoding": "application/json", + "metadata_encoding": "https://tools.ietf.org/html/rfc8259", "extensions": [] } @@ -802,7 +802,7 @@ ignored if not understood:: { "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0", - "metadata_encoding": "application/json", + "metadata_encoding": "https://tools.ietf.org/html/rfc8259", "extensions": [ { "extension": "http://example.org/zarr/extension/foo", @@ -1056,8 +1056,8 @@ For example, the JSON document below defines an explicit group:: Metadata encoding ----------------- -The entry point metadata document must be encoded as JSON. The array (``*.array``s) and -group metadata documents (``*.group``s) must be encoded as per the type given in +The entry point metadata document must be encoded as JSON. The array (``*.array`` s) and +group metadata documents (``*.group`` s) must be encoded as per the type given in the ``metadata_encoding`` field in the entry point metadata document (described below). From 9a74accff1a9c0cf55ed1bc0839ee5afe2c3dd36 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Sat, 8 Aug 2020 16:09:11 -0700 Subject: [PATCH 18/34] update ref links --- docs/protocol/core/v3.0.rst | 130 +++++++++++++++++------------------- 1 file changed, 60 insertions(+), 70 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index f0502bff..54d73a33 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -68,9 +68,9 @@ stores in particular we want to achieve the following. Unlike Zarr spec version 2, the spec V3 have mainly the following difference: - - V3 is a flat key-value store instead of an hierarchical store. Hierarchy is implied. - - V3 has an explicit root, while v2 roots and group could not be distinguished. - - Separation of the data and metadata key space. 
+ - V3 is a flat key-value store instead of an hierarchical store. Hierarchy is implied. + - V3 has an explicit root, while v2 roots and group could not be distinguished. + - Separation of the data and metadata key space. - Explicit support for extensions. - chunk separator is ``/`` by default. @@ -151,8 +151,8 @@ conceptual model underpinning the Zarr protocol. "/foo/bar" identifies a node named "bar", whose parent is named "foo", whose parent is the root of the hierarchy. The path "/" identifies the root node. - - A path always stars with ``/``. + + A path always starts with ``/``. QUESTION: do we want to codify that group ``path`` end, can end, or must end in ``/`` ? @@ -439,7 +439,7 @@ ensure to pass or accept those types unchanged when using extension fallback. For array whose total length is not a multiple of 8 bits, Zarr implementation can expect a buffer which length is a multiple of 8 bits, padded with appropriate value at the end, and will in the same way return buffers padded to -the next multiple of 8 bits. +the next multiple of 8 bits. For example, here is a array of 5 elements of opaque type ``r3`` will be stored as @@ -527,7 +527,7 @@ An element of an array with coordinates (`a`, `b`, `c`, ...) will occur within the chunk at grid index (`a` // `dx`, `b` // `dy`, `c` // `dz`, ...), where "//" is the floor division operator. The element will have coordinates (`a` % `dx`, `b` % `dy`, `c` % `dz`, ...) within -that chunk, where "%" is the modulo operator. For example, if a +that chunk, where "%" is the modulo operator. For example, if a 3 dimensional array has shape (10, 200, 3000), and has chunk shape (5, 20, 400), then the element of the array with coordinates (7, 150, 900) is contained within the chunk at grid index (1, 7, 2) and has coordinates @@ -612,8 +612,6 @@ An array may be configured with a compressor, which a codec used to transform the binary representation of each chunk prior to storage, and to reverse the transformation during retrieval. 
-@@TODO better explanation of chunk encoding - A codec is defined as a pair of algorithms named *encode* and *decode*. Both of these algorithms transform a sequence of bytes (input) into another sequence of bytes (output). The *decode* @@ -724,9 +722,11 @@ containing the following names: ``metadata_encoding`` - A string containing the URI that has been used for encoding - group and array metadata documents. If JSON encoding is used, this - value must be ``"https://tools.ietf.org/html/rfc8259"``. + A string containing the URI pointing to a document describing the method + used for encoding group and array metadata documents. + + For document using the default JSON encoding and format describe in this document + then the value must be ``"https://purl.org/zarr/spec/protocol/core/3.0``. ``extensions`` @@ -751,10 +751,10 @@ containing the following names: TODO: can we have must-understand to "true" but limited to a subset of datasets / groups using this extension ? Should we also precise read/writes ? Example checksums ? - TODO : We seem to agree that "must_understand" is too restrictive,and will need to be relaxed. + TODO : We seem to agree that "must_understand" is too restrictive,and will need to be relaxed. .. This seem to be difficult to achieve if there are no locks. -.. +.. .. Note: All extensions that are used in a zarr hierarchy must be stored in .. the entry point metadata and implementation should expect to not .. encounter any extensions not defined in the metadata entry points. @@ -770,10 +770,10 @@ containing the following names: .. suggestion: .. affects : "root", "groups", "arrays" -.. -.. affects: root : This extension is necessary to understand the complete layout of this hierarchy. For example the groups themselves have a specific structure. -.. 
affects: groups: This extension may only affect some groups, implementation may decide to only bail out when encountering groups that need this extension to be understood, or lazily load this extension when entering those groups. Example Multiresolution convention in groups. -.. affects: arrays: This extension will only affect understanding the arras storage or layout, browsing the hierachy still make sens. +.. +.. affects: root : This extension is necessary to understand the complete layout of this hierarchy. For example the groups themselves have a specific structure. +.. affects: groups: This extension may only affect some groups, implementation may decide to only bail out when encountering groups that need this extension to be understood, or lazily load this extension when entering those groups. Example Multiresolution convention in groups. +.. affects: arrays: This extension will only affect understanding the arras storage or layout, browsing the hierachy still make sens. @@ -784,7 +784,7 @@ containing the following names: .. requires another one, then implementation should ensure that the required .. extension is listed in the top-level extensions field. -.. it seem unclear whether the dependency should be in the spec or handled by implementations, +.. it seem unclear whether the dependency should be in the spec or handled by implementations, .. there are pro and cons to both. It seem like listing the dependencies in the spec is not necessary. 
For example, below is an entry point metadata document, specifying that @@ -792,7 +792,7 @@ JSON is being used for encoding of group and array metadata:: { "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0", - "metadata_encoding": "https://tools.ietf.org/html/rfc8259", + "metadata_encoding": "https://purl.org/zarr/spec/protocol/core/3.0", "extensions": [] } @@ -802,7 +802,7 @@ ignored if not understood:: { "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0", - "metadata_encoding": "https://tools.ietf.org/html/rfc8259", + "metadata_encoding": "https://purl.org/zarr/spec/protocol/core/3.0", "extensions": [ { "extension": "http://example.org/zarr/extension/foo", @@ -813,8 +813,6 @@ ignored if not understood:: } ] } - -TODO: clarify / test: do we allow the root to also be a group/array. I don't see any issue with doing so as the entrypoint md has a different name. Array metadata @@ -855,13 +853,13 @@ following names: TODO: should data_type extension be also available in the entry-point metadata ? I'm going to guess a number of things can't won't be able to - dynamically load extensions so need to know immediately. + dynamically load extensions so need to know immediately. That would also allow to define datatypes only once on top-level hierarchy. - Do we allow multiple version of the same extensions ? + Do we allow multiple version of the same extensions ? Restrict extensions types charset/format to not block the core types to be extended ? Allow different extension to have same named type in multiple - context and dispatch depending on extension ? + context and dispatch depending on extension ? ``chunk_grid`` @@ -940,7 +938,7 @@ following names: See the top level metadata extension section for the time being. - TODO: Add the fact that extensions only affects arrays, and expand on the must-understand. + TODO: Add the fact that extensions only affects arrays, and expand on the must-understand. TODO: Example with named dimensions. 
``attributes`` @@ -966,7 +964,7 @@ compressed using gzip compression prior to storage:: "data_type": ">> z.delete_group('/path/to/') Even if an explicit group was not explicitly created. - - - - - **Determine if a node exists** - To determine if a node exists at path `P`, perform `exist()` - exists is tricky, you can query `meta/P.*`, as P might be implicit, - so you need to query both `meta/P.*` and `meta/P/*`. - + To determine if a node exists at path `P`, you need to check the existence + of one of ``get("meta/root"+P+".array")``, ``get("meta/root"+P+".group")`` + or ``get("meta/root"+P+"/")``. + TODO: this is up to 3 requests, can we decrease this ? Protocol extensions =================== -@@TODO define different types of protocol extension. - Many types of extensions can exists for a Zarr Protocol, they can be regrouped -in mostly 3 categories: +in mostly 3 categories: - Core Datatypes Extensions – for example adding ability store fixed size - types like complex and datetime in chunks. + types like complex and datetime in chunks. - this seem to split into two categories: Purely declarative, and more complex. - Arrays Extensions – Non rectilinear grids, and - variable length types. - - Group extensions – Better understanding of a group layout : example pyramidal - resolution. + variable length types. + - There are no group extensions as as Zarr v3.0 see https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions From dcdcf8d6b90fe6b5ee6722e984c786cdd0d3fb05 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Sat, 8 Aug 2020 16:10:26 -0700 Subject: [PATCH 19/34] rst syntax --- docs/protocol/core/v3.0.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 54d73a33..c59a7584 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -443,7 +443,7 @@ the next multiple of 8 bits. 
For example, here is a array of 5 elements of opaque type ``r3`` will be stored as - - ``AAABBBCC CDDDEEE_``, 3*5 = 15 bits, rounded to 2 bytes, with arbitrary + - ``AAABBBCC CDDDEEE_``, 3*5 = 15 bits, rounded to 2 bytes, with arbitrary last bit. - 103*r11 items would take 1136 bits (142 bytes), with the 3 last bits From 3347f67486b426d138435f1da2e80cd01c275f14 Mon Sep 17 00:00:00 2001 From: Matthias BUSSONNIER Date: Sat, 8 Aug 2020 23:25:49 +0000 Subject: [PATCH 20/34] cleanup some TODOs --- docs/protocol/core/v3.0.rst | 31 +++++++------------------------ 1 file changed, 7 insertions(+), 24 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index c59a7584..431dd1c4 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -851,17 +851,6 @@ following names: as the data type. For fallback types that do not correspond to base known types, extensions can fallback on on a raw number of bytes using - TODO: should data_type extension be also available in the entry-point - metadata ? I'm going to guess a number of things can't won't be able to - dynamically load extensions so need to know immediately. - That would also allow to define datatypes only once on top-level hierarchy. - Do we allow multiple version of the same extensions ? - - Restrict extensions types charset/format to not block the core types to be - extended ? Allow different extension to have same named type in multiple - context and dispatch depending on extension ? - - ``chunk_grid`` The chunk grid of the Zarr array. If the chunk grid is a regular @@ -937,7 +926,7 @@ following names: ``extensions`` See the top level metadata extension section for the time being. - + TODO: Add the fact that extensions only affects arrays, and expand on the must-understand. TODO: Example with named dimensions. @@ -1179,10 +1168,6 @@ operations: that stars with prefix. - TODO : Unlike ``list_dir`` all keys in list prefix are (likely?) 
to end up with no - slash as they are all terminal (unless you have a dangling implicit group - when a store does not cleanup after itself?); also not this in list. - ``list_dir`` - Retrieve all keys and prefixes with a given prefix and which do not contain the character "/" after the given prefix. @@ -1385,14 +1370,11 @@ with grid coordinates (`i`, `j`, ...). Let "+" be the string concatenation operator. -TODO: This kinda force creating a group or an array to check whether the same -key is also a group or an array, which may create some latencies. I'm assuming -we don't wan to allow both groups and array to be at the same path P right ? -Also have to think about how to handle that for implicit groups, That is to say -if path /g1/ is a array and we want to create /g1/g2/g3/g4/dataset, we need to check -all the parents to make sure they are not datasets. +.. note:: -I would suggest to tell store they can assume clients will not try to do the wrong thing. + Store and implmentation can assume that a client will not try to + create both an *array* and *group* at the same path, and thus + may skip check of existance of a group/array of the same name. **Create a group** @@ -1495,7 +1477,8 @@ I would suggest to tell store they can assume clients will not try to do the wro of one of ``get("meta/root"+P+".array")``, ``get("meta/root"+P+".group")`` or ``get("meta/root"+P+"/")``. - TODO: this is up to 3 requests, can we decrease this ? + .. note:: + For listable store, ``listdir(parent(P))`` can be an alternative. Protocol extensions From 3160d30ebe992fd47fcb4334eb99c25dd03b73e1 Mon Sep 17 00:00:00 2001 From: Matthias BUSSONNIER Date: Sat, 8 Aug 2020 23:44:33 +0000 Subject: [PATCH 21/34] Add info on how to add/retrieve array elements. 
--- docs/protocol/core/v3.0.rst | 52 +++++++++++++++++-------------------- 1 file changed, 24 insertions(+), 28 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 431dd1c4..12e17503 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -452,14 +452,20 @@ For example, here is a array of 5 elements of opaque type ``r3`` will be stored This is meant to help replacing the packbits filter from numcodecs, as well as storing categorical data via extension in Zarr v3 +.. note:: + + We are explicitely looking for more feedback and prototypes of code using the ``r*``, + raw bits, for various endianess and wether the spec coudl be made clearer. + +.. note:: -Note: currently only fixed size elements are supported as a core data type. -There are many request for variable length element encoding. There are many -way to encode variable length and we want to keep flexibility. While we seem -to agree that for random access the most likely contender is to have two -arrays, one with the actual variable length data and one with fixed size -(pointer + length) to the variable size data we do not want to commit to such -a structure. + currently only fixed size elements are supported as a core data type. + There are many request for variable length element encoding. There are many + way to encode variable length and we want to keep flexibility. While we seem + to agree that for random access the most likely contender is to have two + arrays, one with the actual variable length data and one with fixed size + (pointer + length) to the variable size data we do not want to commit to such + a structure. Chunk grids @@ -751,7 +757,7 @@ containing the following names: TODO: can we have must-understand to "true" but limited to a subset of datasets / groups using this extension ? Should we also precise read/writes ? Example checksums ? - TODO : We seem to agree that "must_understand" is too restrictive,and will need to be relaxed. 
+ TODO : We seem to agree that "must_understand" is too restrictive, and will need to be relaxed. .. This seem to be difficult to achieve if there are no locks. .. @@ -764,29 +770,13 @@ containing the following names: .. it is considered a viable trade off that removing extensions from the .. entrypoint metadata requires a full rescan of the hierarchy. - - - - .. suggestion: -.. affects : "root", "groups", "arrays" +.. affects : "root", "arrays" .. .. affects: root : This extension is necessary to understand the complete layout of this hierarchy. For example the groups themselves have a specific structure. -.. affects: groups: This extension may only affect some groups, implementation may decide to only bail out when encountering groups that need this extension to be understood, or lazily load this extension when entering those groups. Example Multiresolution convention in groups. .. affects: arrays: This extension will only affect understanding the arras storage or layout, browsing the hierachy still make sens. - -.. Each extension can also have a field ``requires``, that contain an unordered -.. list of URI to other extensions specification on which they rely for proper -.. behavior. Implementation MAY rely on the value of these requires to decide -.. in which order to load extensions and resolve types. If an extension -.. requires another one, then implementation should ensure that the required -.. extension is listed in the top-level extensions field. - -.. it seem unclear whether the dependency should be in the spec or handled by implementations, -.. there are pro and cons to both. It seem like listing the dependencies in the spec is not necessary. - For example, below is an entry point metadata document, specifying that JSON is being used for encoding of group and array metadata:: @@ -1372,7 +1362,7 @@ Let "+" be the string concatenation operator. .. 
note:: - Store and implmentation can assume that a client will not try to + Store and implementation can assume that a client will not try to create both an *array* and *group* at the same path, and thus may skip check of existance of a group/array of the same name. @@ -1400,11 +1390,17 @@ Let "+" be the string concatenation operator. **Store element values in an array** - @@TODO + To store element in an array at path `P` and coordinate (`i`, `j`, ...) + perform ``set(data_key(P, i, j, ...), value)``, where `value` is the + serialisation of the corresponding chunk following the metadata that is + or will be stored in ``array_meta_key(P)``. **Retrieve element values in an array** - @@TODO + To retrieve element in an array at path `P` and coordinate (`i`, `j`, ...) + perform ``get(data_key(P, i, j, ...), value)``, where `value` is the + serialisation of the corresponding chunk following the metadata stored at + ``array_meta_key(P)``. **Discover children of a group** From 55b16653aa3ed83e0947f2edb9b7e6b5ad7510af Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Wed, 12 Aug 2020 10:31:15 -0700 Subject: [PATCH 22/34] some more cleanup --- docs/protocol/core/v3.0.rst | 53 +++++++++++-------------------------- 1 file changed, 16 insertions(+), 37 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index c59a7584..2c02b493 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -78,6 +78,18 @@ This means that a store cannot be opened at an arbitrary point, and need to be opened at the root. User facing convenience functions could walk up a given hierarchy and return a sub-group, but this is not part of the API. +Questions that still need to be resolved +---------------------------------------- + + - https://github.com/zarr-developers/zarr-specs/issues/72 to potentially split large metadata documents. + - extensions and ``must_understand = True`` might be too restrictive. 
Work a draft implementation with extensions ans + see how far we can go. List of extensions to implement: + - Boolean + - Complex + - Datetime + - Named dimensions + + Document conventions ==================== @@ -753,40 +765,6 @@ containing the following names: TODO : We seem to agree that "must_understand" is too restrictive,and will need to be relaxed. -.. This seem to be difficult to achieve if there are no locks. -.. -.. Note: All extensions that are used in a zarr hierarchy must be stored in -.. the entry point metadata and implementation should expect to not -.. encounter any extensions not defined in the metadata entry points. -.. Implementation must ensure that any data written using an extension have -.. this extension added to the top level entry point. The entry point -.. metadata may contain extensions that are not in use in the hierarchy and -.. it is considered a viable trade off that removing extensions from the -.. entrypoint metadata requires a full rescan of the hierarchy. - - - - - -.. suggestion: -.. affects : "root", "groups", "arrays" -.. -.. affects: root : This extension is necessary to understand the complete layout of this hierarchy. For example the groups themselves have a specific structure. -.. affects: groups: This extension may only affect some groups, implementation may decide to only bail out when encountering groups that need this extension to be understood, or lazily load this extension when entering those groups. Example Multiresolution convention in groups. -.. affects: arrays: This extension will only affect understanding the arras storage or layout, browsing the hierachy still make sens. - - - -.. Each extension can also have a field ``requires``, that contain an unordered -.. list of URI to other extensions specification on which they rely for proper -.. behavior. Implementation MAY rely on the value of these requires to decide -.. in which order to load extensions and resolve types. If an extension -.. 
requires another one, then implementation should ensure that the required -.. extension is listed in the top-level extensions field. - -.. it seem unclear whether the dependency should be in the spec or handled by implementations, -.. there are pro and cons to both. It seem like listing the dependencies in the spec is not necessary. - For example, below is an entry point metadata document, specifying that JSON is being used for encoding of group and array metadata:: @@ -1502,16 +1480,17 @@ Protocol extensions =================== Many types of extensions can exists for a Zarr Protocol, they can be regrouped -in mostly 3 categories: +in mostly 2 categories: - Core Datatypes Extensions – for example adding ability store fixed size types like complex and datetime in chunks. - this seem to split into two categories: Purely declarative, and more complex. - Arrays Extensions – Non rectilinear grids, and variable length types. - - There are no group extensions as as Zarr v3.0 - see https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions +There are no group extensions as as Zarr v3.0 + +See https://github.com/zarr-developers/zarr-specs/issues/49 for a list of potential extensions References From 84daacb634b198be86032f76cad38b634fe9a259 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Wed, 12 Aug 2020 11:09:43 -0700 Subject: [PATCH 23/34] please sphinx --- docs/protocol/core/v3.0.rst | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 05a631bd..0fc3c29e 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -1,7 +1,5 @@ .. 
This file is in restructured text format: http://docutils.sourceforge.net/rst.html - - ================================== Zarr core protocol (version 3.0) ================================== @@ -84,6 +82,7 @@ Questions that still need to be resolved - https://github.com/zarr-developers/zarr-specs/issues/72 to potentially split large metadata documents. - extensions and ``must_understand = True`` might be too restrictive. Work a draft implementation with extensions ans see how far we can go. List of extensions to implement: + - Boolean - Complex - Datetime @@ -353,8 +352,6 @@ identify a data type. Core data types --------------- -.. endianess disapear - .. list-table:: Data types :header-rows: 1 From c5a4bf7f00611958aa83ac464cd7efa46f079cd5 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Mon, 24 Aug 2020 09:10:00 -0700 Subject: [PATCH 24/34] remove todo --- docs/protocol/core/v3.0.rst | 12 +----------- 1 file changed, 1 insertion(+), 11 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 0fc3c29e..ac679d38 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -666,8 +666,7 @@ zarr-specs GitHub repository. However, codec specifications may be maintained by any group or organisation and published in any location on the Web. For further details of the process for contributing a codec specification to the zarr-specs GitHub repository, see the Zarr -community process specification @@TODO Add link once -https://github.com/zarr-developers/zarr-specs/pull/12 is merged@@. +community process specification. Further details of how a compressor is configured for an array are given in the section below on `Array metadata`_. @@ -763,11 +762,6 @@ containing the following names: ``must_understand`` is ``true`` then processing must terminate and an appropriate error raised. - TODO: can we have must-understand to "true" but limited to a subset of - datasets / groups using this extension ? 
Should we also precise read/writes ? Example checksums ? - - TODO : We seem to agree that "must_understand" is too restrictive, and will need to be relaxed. - For example, below is an entry point metadata document, specifying that JSON is being used for encoding of group and array metadata:: @@ -908,16 +902,12 @@ following names: See the top level metadata extension section for the time being. - TODO: Add the fact that extensions only affects arrays, and expand on the must-understand. - TODO: Example with named dimensions. ``attributes`` The value must be an object. The object may contain any name/value pairs. -TODO see https://github.com/zarr-developers/zarr-specs/issues/72 to potentially split large metadata documents. - All other names within the array metadata object are reserved for future versions of this specification. From d489d90c42748387ccb1fb6cbd61ee28528805b7 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Mon, 24 Aug 2020 09:21:30 -0700 Subject: [PATCH 25/34] cleanup raw types --- docs/protocol/core/v3.0.rst | 22 +++++----------------- 1 file changed, 5 insertions(+), 17 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index ac679d38..d2ebdb7a 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -433,7 +433,7 @@ Core data types - big-endian * - ``r*`` (Optional) - raw bits, use for extension type fallbacks - - variable, given by ``*``, is not limited to be a multiple of 8. + - variable, given by ``*``, is limited to be a multiple of 8. - N/A @@ -442,24 +442,12 @@ defined by IEEE 754-2008. Additionally to these base types, an implementation should also handle the raw/opaque pass through type designated by the lowercase letter ``r`` followed -by the number of bits. For example, ``r3``, ``r11`` should be understood as -fallback types of respectively 3 and 11 bits longs, and implementation should -ensure to pass or accept those types unchanged when using extension fallback. 
-For array whose total length is not a multiple of 8 bits, Zarr implementation can -expect a buffer which length is a multiple of 8 bits, padded with -appropriate value at the end, and will in the same way return buffers padded to -the next multiple of 8 bits. +by the number of bits, multiple of 8. For example, ``r8``, ``r16``, ``r24`` +should be understood as fallback types of respectively 1, 2, and 3 bytes long. -For example, here is a array of 5 elements of opaque type ``r3`` will be stored as +Zarr v3.0 is limited to types lenght that are multiple of 8 bits but may open +other values in later version of the spec. - - ``AAABBBCC CDDDEEE_``, 3*5 = 15 bits, rounded to 2 bytes, with arbitrary - last bit. - - - 103*r11 items would take 1136 bits (142 bytes), with the 3 last bits - undefined. - -This is meant to help replacing the packbits filter from numcodecs, as well as -storing categorical data via extension in Zarr v3 .. note:: From 9c69fb3178fffebd8df0ef8500dd31bf86e72cd6 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Fri, 28 Aug 2020 08:01:03 -0700 Subject: [PATCH 26/34] Apply suggestions from code review Co-authored-by: David Brochart --- docs/protocol/core/v3.0.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index d2ebdb7a..ad473979 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -52,22 +52,22 @@ or by making a pull request against the This document was produced by the `Zarr core development team `_. -Goal of v3 spec and main difference from v2 +Goal of v3 spec and main difference with v2 =========================================== Zarr spec v2 was originally designed around local filesystem, but Zarr has grown and is now regularly deployed on cloud / object storage. Those kind of storage have characteristics, capabilities and usage patterns that can widely -differ from the assumptions of spec v2. 
V3 is design to consider online -stores in particular we want to achieve the following. +differ from the assumptions of spec v2. V3 is designed to consider online +stores, in particular we want to achieve the following: - - No assumption that the underlying store have locking ability. - - Ability to do concurrent writes with the assumption that writes from clients will be consistent, but non atomic. + - No assumption that the underlying store has locking ability. + - Ability to do concurrent writes with the assumption that writes from clients will be consistent, but not atomic. -Unlike Zarr spec version 2, the spec V3 have mainly the following difference: - - V3 is a flat key-value store instead of an hierarchical store. Hierarchy is implied. - - V3 has an explicit root, while v2 roots and group could not be distinguished. +Unlike Zarr spec v2, the spec v3 has mainly the following differences: + - V3 is a flat key-value store instead of a hierarchical store. Hierarchy is implied. + - V3 has an explicit root, while v2 roots and groups could not be distinguished. - Separation of the data and metadata key space. - Explicit support for extensions. - chunk separator is ``/`` by default. 
From 62cbe4c08d919c174509b6dcb05966024318f7f3 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Fri, 28 Aug 2020 08:04:43 -0700 Subject: [PATCH 27/34] typo --- docs/protocol/core/v3.0.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index ad473979..251113a8 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -392,7 +392,7 @@ Core data types - 8 - big-endian * - ``u1`` - - signed integer + - unsigned integer - 1 - None * - `` Date: Fri, 28 Aug 2020 08:08:25 -0700 Subject: [PATCH 28/34] Update docs/protocol/core/v3.0.rst Co-authored-by: David Brochart --- docs/protocol/core/v3.0.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 251113a8..b19ac603 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -72,8 +72,8 @@ Unlike Zarr spec v2, the spec v3 has mainly the following differences: - Explicit support for extensions. - chunk separator is ``/`` by default. -This means that a store cannot be opened at an arbitrary point, and need to be -opened at the root. User facing convenience functions could walk up a given +This means that a store cannot be opened at an arbitrary point, but needs to be +opened at the root. User facing convenience functions could walk a given hierarchy and return a sub-group, but this is not part of the API. 
Questions that still need to be resolved From dedd8164972acb691f6a8fa13dd54629c5f577be Mon Sep 17 00:00:00 2001 From: David Brochart Date: Fri, 28 Aug 2020 17:16:45 +0200 Subject: [PATCH 29/34] Fix typo --- docs/protocol/core/v3.0.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index b19ac603..d0472ee8 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -302,7 +302,7 @@ Node names ========== Except for the root node, each node in a hierarchy must have a name, -which is a string characters. To ensure consistent behaviour +which is a string of characters. To ensure consistent behaviour across different storage systems, the following constraints apply to node names: From 18c56af09d045e48651bda076ee764b647509f3e Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Tue, 1 Sep 2020 08:56:30 -0700 Subject: [PATCH 30/34] comments from david --- docs/protocol/core/v3.0.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 251113a8..540a804e 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -139,6 +139,9 @@ conceptual model underpinning the Zarr protocol. elements_ in an array conform to the same `data type`_. An array may not have child nodes. + TODO: clarify how arrays with 0 dimension are store, and in particular what should be done WRT chunk naming, and how + a 1d array of 1 elelement differs from a scalar. + .. _name: .. 
_names: From daad54b064f690449c0d540c42c07bc422e6f9b7 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Fri, 4 Sep 2020 10:38:30 -0700 Subject: [PATCH 31/34] Fold changes from september 4: - Empty array and chunk keys, chunk keys now start with 'c' unconditionally - data/ now mirrors meta/ by having root --- docs/protocol/core/v3.0.rst | 47 +++++++++++++++++++------------------ 1 file changed, 24 insertions(+), 23 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 01d9dfee..50706696 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -80,14 +80,13 @@ Questions that still need to be resolved ---------------------------------------- - https://github.com/zarr-developers/zarr-specs/issues/72 to potentially split large metadata documents. - - extensions and ``must_understand = True`` might be too restrictive. Work a draft implementation with extensions ans + - extensions and ``must_understand = True`` might be too restrictive. Work a draft implementation with extensions and see how far we can go. List of extensions to implement: - Boolean - Complex - Datetime - Named dimensions - Document conventions @@ -139,9 +138,6 @@ conceptual model underpinning the Zarr protocol. elements_ in an array conform to the same `data type`_. An array may not have child nodes. - TODO: clarify how arrays with 0 dimension are store, and in particular what should be done WRT chunk naming, and how - a 1d array of 1 elelement differs from a scalar. - .. _name: .. _names: @@ -189,7 +185,8 @@ conceptual model underpinning the Zarr protocol. The shape of an array_ is the tuple of dimension_ lengths. For example, if an array_ has 2 dimensions_, where the length of the first dimension_ is 100 and the length of the second dimension_ is - 20, then the shape of the array_ is (100, 20). + 20, then the shape of the array_ is (100, 20). A shape can be the empty + tuple in the case of zero-dimension arrays (scalar) .. _element: .. 
_elements: @@ -224,7 +221,10 @@ conceptual model underpinning the Zarr protocol. hyperrectangle defined by a tuple of intervals, one for each dimension_ of the array_. The chunk shape is the tuple of interval lengths, and the chunk size (i.e., number of elements_ contained - within the chunk) is the product of its interval lengths. + within the chunk) is the product of its interval lengths. + + The chunk shape elements are non-zero when the corresponding dimensions of + the arrays are of non-zero length. .. _grid: .. _grids: @@ -448,8 +448,8 @@ raw/opaque pass through type designated by the lowercase letter ``r`` followed by the number of bits, multiple of 8. For example, ``r8``, ``r16``, ``r24`` should be understood as fallback types of respectively 1, 2, and 3 bytes long. -Zarr v3.0 is limited to types lenght that are multiple of 8 bits but may open -other values in later version of the spec. +Zarr v3.0 is limited to types length that are multiple of 8 bits but may open +other values in later version of this specification. .. note:: @@ -542,13 +542,13 @@ is contained within the chunk at grid index (1, 7, 2) and has coordinates The identifier for chunk with grid index (``i``, ``j``, ``k``, ...) is formed by joining together ASCII string representations of each index -using a separator. The default value for the separator is the period +using a separator and prefixed with the string `'c'`. The default value for the separator is the period character ( by default ``/``), but this may be configured by providing a ``separator`` value within the ``chunk_grid`` metadata object, see the section on `Array metadata`_ below. For example, in a 3 dimensional array, the identifier for the chunk at -grid index (1, 23, 45) is the string "1/23/45". +grid index (1, 23, 45) is the string "c1/23/45". 
Note that this specification does not consider the case where the chunk grid and the array space are not aligned at the origin vertices @@ -564,6 +564,9 @@ arbitrary length in a "negative" direction along any dimension. decrease the maximum number of items in hierarchical stores like directory stores. +.. note:: Arrays may have 0 dimension (when for example representing scalars), + in which case the coordinate of a chunk is the empty tuple, and the chunk key + will consist of the string `'c'` Chunk memory layouts ==================== @@ -1058,10 +1061,10 @@ protocol implementation MUST pass trailing slashes to underlying stores. For example, a store containing the following keys: - - ``meta/2018/.group`` - - ``meta/2018-01/.group`` - - ``meta/2018/bar/.array`` - - ``data/2018/bar/0.0`` + - ``meta/root/2018/.group`` + - ``meta/root/2018-01/.group`` + - ``meta/root/2018/bar/.array`` + - ``data/root/2018/bar/0.0`` The following queries are invalid: - ``list_dir('201')`` is invalid as ``"201"`` is not an existing node. @@ -1072,7 +1075,7 @@ This is valid: - ``list_dir('2018-01/')`` This allows store implementation to avoid having to check for trailing slashes, -and avoid issues like "list_dir('2018')" to return values likes ``-01`` +and avoid issues like "list_dir('2018')" returning values likes ``-01`` The store operations are grouped into three sets of capabilities: @@ -1222,8 +1225,6 @@ create a store implementation spec and contribute it to the `zarr-specs GitHub r For an example of a store implementation spec, see the :ref:`file-system-store-v1` specification. -@@TODO more info and example. - Storage protocol ================ @@ -1267,11 +1268,11 @@ To get the path ``P`` from a key, either remove the trailing ``.array`` or For example, for an array at hierarchy path "/foo/baz", the corresponding metadata key is ``meta/root/foo/baz.array``. 
If the array has two dimensions and a regular chunk grid, the data key for the -chunk with grid coordinates (0, 0) is "data/foo/baz/0/0". +chunk with grid coordinates (0, 0) is "data/root/foo/baz/c0/0". If the root node is a group, the metadata key is ``meta/root.group``. If the root node is an array, the metadata key is "meta/root.array", and -the data keys are formed by concatenating "data/" and the chunk +the data keys are formed by concatenating "data/root/" and the chunk identifier. @@ -1315,7 +1316,7 @@ identifier. - Data key * - `/foo/baz` - `(0, 0)` - - `data/foo/baz/0/0` + - `data/root/foo/baz/c0/0` @@ -1456,8 +1457,8 @@ Many types of extensions can exists for a Zarr Protocol, they can be regrouped in mostly 2 categories: - Core Datatypes Extensions – for example adding ability store fixed size - types like complex and datetime in chunks. - - this seem to split into two categories: Purely declarative, and more complex. + types like complex and datetime in chunks. These are directly declared in the + array metadata ``data_type`` keys. - Arrays Extensions – Non rectilinear grids, and variable length types. From 17902f7aafe3992ddd6ead5730823c5cc54b76d5 Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Fri, 11 Sep 2020 09:00:18 -0700 Subject: [PATCH 32/34] Apply david suggestions --- docs/protocol/core/v3.0.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 50706696..6808ae90 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -542,8 +542,8 @@ is contained within the chunk at grid index (1, 7, 2) and has coordinates The identifier for chunk with grid index (``i``, ``j``, ``k``, ...) is formed by joining together ASCII string representations of each index -using a separator and prefixed with the string `'c'`. 
The default value for the separator is the period -character ( by default ``/``), but this may be configured by providing a ``separator`` +using a separator and prefixed with the string `'c'`. The default value for the separator is the slash +character (by default ``/``), but this may be configured by providing a ``separator`` value within the ``chunk_grid`` metadata object, see the section on `Array metadata`_ below. From d1cc620848af4ab400ca592460e9db3f91e451ea Mon Sep 17 00:00:00 2001 From: Alistair Miles Date: Fri, 18 Sep 2020 17:55:10 +0100 Subject: [PATCH 33/34] bump date --- docs/protocol/core/v3.0.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 6808ae90..73e42948 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -3,9 +3,9 @@ ================================== Zarr core protocol (version 3.0) ================================== ------------------------------ - Editor's draft 25 April 2020 ------------------------------ +--------------------------------- + Editor's draft 18 September 2020 +--------------------------------- Specification URI: https://purl.org/zarr/spec/protocol/core/3.0 From 40f4944337bd79dbc7a776ed67934654443d466d Mon Sep 17 00:00:00 2001 From: Matthias Bussonnier Date: Fri, 25 Sep 2020 09:10:37 -0700 Subject: [PATCH 34/34] make compressor optional (#94) --- docs/protocol/core/v3.0.rst | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/docs/protocol/core/v3.0.rst b/docs/protocol/core/v3.0.rst index 73e42948..bd3ae0fb 100644 --- a/docs/protocol/core/v3.0.rst +++ b/docs/protocol/core/v3.0.rst @@ -789,7 +789,7 @@ Array metadata Each Zarr array in a hierarchy must have an array metadata document. 
This document must contain a single object with the -following names: +following mandatory names: ``shape`` @@ -856,15 +856,6 @@ following names: human-readable representation of the specification. The ``type`` is required and the value is defined by the protocol extension. -``compressor`` - - Specifies a codec to be used for encoding and decoding chunks. The - value must be an object containing the name ``codec`` whose value - is a URI that identifies a codec and dereferences to a human - readable representation of the codec specification. The codec - object may also contain a ``configuration`` name whose value is - defined by the corresponding codec specification. - ``fill_value`` Provides an element value to use for uninitialised portions of the @@ -902,6 +893,19 @@ following names: The value must be an object. The object may contain any name/value pairs. +The following names are optional: + +``compressor`` + + Specifies a codec to be used for encoding and decoding chunks. The + value must be an object containing the name ``codec`` whose value + is a URI that identifies a codec and dereferences to a human + readable representation of the codec specification. The codec + object may also contain a ``configuration`` name whose value is + defined by the corresponding codec specification. When the key for this is + absent, this signifies that no compressor has been used. + + All other names within the array metadata object are reserved for future versions of this specification.
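As a reading aid for the state of the text after this series, the sketch below illustrates three rules stated in the patches above: chunk identifiers are prefixed with ``c`` and joined with the separator (``/`` by default), data keys live under ``data/root``, and a missing ``compressor`` name means no compressor was used. The function names ``chunk_identifier``, ``data_key`` and ``compressor_of`` are hypothetical helpers chosen for illustration; they are not defined by the specification, and the sketch assumes a regular chunk grid and array metadata already decoded into a Python dictionary.

```python
def chunk_identifier(grid_index, separator="/"):
    """Return 'c' followed by the grid indices joined by the separator.

    A zero-dimensional array has the empty tuple as its only grid index,
    so its single chunk identifier is just 'c'.
    """
    return "c" + separator.join(str(i) for i in grid_index)


def data_key(path, grid_index, separator="/"):
    """Return the data key for one chunk of the array at hierarchy path `path`.

    `path` is a hierarchy path such as '/foo/baz'; '/' denotes a root array.
    """
    chunk_id = chunk_identifier(grid_index, separator)
    if path == "/":
        # Root array: concatenate "data/root/" and the chunk identifier.
        return "data/root/" + chunk_id
    # Non-root array: "data/root" + P + "/" + chunk identifier.
    return "data/root" + path + "/" + chunk_id


def compressor_of(array_metadata):
    """Return the compressor object, or None when the optional name is absent."""
    return array_metadata.get("compressor")


print(chunk_identifier((1, 23, 45)))      # c1/23/45
print(data_key("/foo/baz", (0, 0)))       # data/root/foo/baz/c0/0
print(compressor_of({"shape": [10, 20]}))  # None
```

Returning ``None`` for a missing ``compressor`` is one possible convention for "no compressor has been used"; an implementation could equally substitute a no-op codec object.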