Why do storage transformers need "type" separate from "configuration" #191

rabernat · 2022-12-01T15:09:55Z

This is what V3 currently says about how to specify storage transformers

zarr-specs/docs/core/v3.0.rst

Lines 1165 to 1174 in b509f14

    
    Specifies a stack of `storage transformers`_. Each value in the list must
    be an object containing the names ``extension`` and ``type``.
    The ``extension`` is required and the value must be a URI that identifies
    the extension and dereferences to a human-readable representation
    of the specification.  The ``type`` is required and the value is
    defined by the extension. The
    object may also contain a ``configuration`` object which consists of the
    parameter names and values as defined by the corresponding storage transformer
    specification. When the ``storage_transformers`` name is absent no storage
    transformer is used, same for an empty list.

Why can't we just put type inside configuration? That just seems simpler. Plus, it may not make sense to define type for some storage transformers. That means type is a transformer-specific configuration parameter anyway.
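To make the suggestion concrete, here is a minimal sketch of what "putting type inside configuration" would mean for a single entry. The helper name `fold_type_into_configuration`, the placeholder URI, and the parameter values are all assumptions made for illustration; none of them come from the spec.

```python
import copy


def fold_type_into_configuration(entry):
    """Return a copy of a storage transformer entry with ``type`` moved
    into its ``configuration`` object (hypothetical helper)."""
    result = copy.deepcopy(entry)
    if "type" in result:
        # Create the configuration object if the entry has none yet.
        config = result.setdefault("configuration", {})
        if "type" in config:
            raise ValueError("configuration already defines 'type'")
        config["type"] = result.pop("type")
    return result


# Illustrative entry in the current layout; the URI is a placeholder.
current = {
    "extension": "https://example.invalid/sharding",
    "type": "indexed",
    "configuration": {"chunks_per_shard": [2, 2]},
}
proposed = fold_type_into_configuration(current)
```

With this layout a transformer that has no meaningful type simply omits the key from its configuration, and the top-level object needs one fewer required field.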

cc @jstriebel


jstriebel · 2022-12-06T18:44:14Z

Good point, I just took this design over from other extension points. This also touches on the question of whether all extension points (e.g. dtype, transformers) also need an entry in the extensions section.

These are the different points I have in mind, roughly ordered:

1. Dropping the type field for transformers.
   👍 IMO
2. Dropping type for data-types, chunk-grid, chunk-memory-layout and metadata_encoding (the latter two might get removed, see "v3 spec: Consider removing metadata encoding" #174 and "Replace chunk_memory_layout with transpose codec" #189).
   Unsure about this, since type is useful for the default (but maybe not so much if there's only a single default, e.g. regular for chunk-grid). Maybe we can unify the type and extension fields into one? E.g. type may be an id of an extension?
3. Should all extension points be listed in the extensions list of the respective metadata file? Then configuration would be placed just there. E.g. if a datetime data-type extension is used, there would be a respective data_type entry, but would there also be an entry in the array's top-level extension field?

joshmoore · 2022-12-07T09:17:32Z

As an aside, I again get the feeling that I could use lots of examples to help make these types of decisions.

jstriebel · 2023-02-09T09:29:17Z

Here's a proposal for a more coherent terminology and config:

Zarr has extension points, which allow adding new functionality without changing the core specification. These are:

for arrays:
- data-type
- chunk-grid
- codecs
- storage-transformers
- extensions
for groups:
- storage-transformers
- extensions

Additionally, there will be metadata conventions (zarr-developers/zeps#28), also for groups and arrays, which do not contain functionality needed by Zarr implementations themselves, but by higher-level libraries and applications.

Specific extension points should be preferred over the more generic "extensions", which can be used if none of the others match.

An array config with all different types of extension points could look like this:

{
    "shape": [10000, 1000],
    "data_type": {  // string or extension point object with fallback
        "name": "datetime",
        "configuration": {
            "unit": "ns"
        },
        "fallback": "int64"
    },
    "chunk_grid": {
        "type": {  // string or extension point object
            "name": "hexagonal",
            "configuration": {
                "origin": "…"
            }
        },
        "chunk_shape": [1000, 100],
        "separator" : "/"
    },
    "codecs": [  // list of extension point objects
        {
            "name": "gzip",
            "configuration": {
                "level": 1
            }
        }
    ],
    "storage_transformers": [  // list of extension point objects
        {
          "name": "sharding",
          "configuration": {
            "type": "indexed",
            "chunks_per_shard": [2, 2]
          }
        }
    ],
    "fill_value": null,
    "extensions": [  // list of extension objects with must_understand
        {
            "name": "my_extension",
            "must_understand": false,
            "configuration": {
                "foo": "bar"
            }
        }
    ],
    "attributes": {}
}

This is slightly different from the current version, but uses more coherent extension point objects. The name refers to a name in the spec, and configuration is used consistently in such objects. For the chunk-grid, I'm wondering if the following version is nicer:

{
    "chunk_grid": {
        "name": "hexagonal",
        "configuration": {
            "chunk_shape": [1000, 100],
            "origin": "…",
            "separator" : "/"
        }
    }
}

This would remove one level and might make more sense if any future chunk-grids do not have a chunk_shape as currently defined.
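One consequence of a uniform shape is that a reader can treat every extension point the same way. The sketch below assumes the "string or object with name and optional configuration" form from the example above; `normalize_extension_point` is a hypothetical helper, not part of any Zarr implementation.

```python
def normalize_extension_point(value):
    """Return ``(name, configuration)`` for a plain string or an
    extension point object (hypothetical helper)."""
    if isinstance(value, str):
        # Short form: just the name, no parameters.
        return value, {}
    if isinstance(value, dict) and isinstance(value.get("name"), str):
        configuration = value.get("configuration", {})
        if not isinstance(configuration, dict):
            raise TypeError("'configuration' must be an object")
        return value["name"], dict(configuration)
    raise TypeError(f"not an extension point: {value!r}")


# Fragments taken from the example metadata in this comment.
data_type = {"name": "datetime", "configuration": {"unit": "ns"}, "fallback": "int64"}
chunk_grid = {
    "name": "hexagonal",
    "configuration": {"chunk_shape": [1000, 100], "origin": "…", "separator": "/"},
}
codecs = [{"name": "gzip", "configuration": {"level": 1}}]

parsed_data_type = normalize_extension_point(data_type)
parsed_chunk_grid = normalize_extension_point(chunk_grid)
parsed_codecs = [normalize_extension_point(codec) for codec in codecs]
```

With the flattened chunk-grid variant, `chunk_grid` goes through the same function as every other extension point. The nested variant would need a special case for its `type` key.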

The behavior when an extension point is required in order to read array chunk data is the same as now:

| extension | metadata | is extension required |
| --- | --- | --- |
| data type | `data_type` | only if no `fallback` is given |
| chunk grid | `chunk_grid` | always |
| codecs | `codecs` | always |
| storage transformer | `storage_transformers` | always |
| array extension | `extensions` | only if `must_understand` is true |
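The rules in the table could be checked along these lines. This is a rough sketch under several assumptions: the flattened `chunk_grid` form is used, `must_understand` is treated as true when absent, a given `fallback` is taken to be readable, and `missing_required` and the `supported` mapping are hypothetical names.

```python
def _name(value):
    # An extension point is either a plain string or an object with "name".
    return value if isinstance(value, str) else value["name"]


def missing_required(metadata, supported):
    """List ``(metadata key, name)`` pairs that a reader needs in order to
    read chunk data but does not support (hypothetical helper)."""
    missing = []

    def check(key, value):
        name = _name(value)
        if name not in supported.get(key, set()):
            missing.append((key, name))

    # data_type: required only when no fallback is given.
    data_type = metadata["data_type"]
    has_fallback = isinstance(data_type, dict) and "fallback" in data_type
    if not has_fallback:
        check("data_type", data_type)

    # chunk_grid, codecs and storage_transformers: always required.
    check("chunk_grid", metadata["chunk_grid"])
    for codec in metadata.get("codecs", []):
        check("codecs", codec)
    for transformer in metadata.get("storage_transformers", []):
        check("storage_transformers", transformer)

    # extensions: required only when must_understand is true
    # (assumed to default to true when the key is absent).
    for extension in metadata.get("extensions", []):
        if extension.get("must_understand", True):
            check("extensions", extension)
    return missing


metadata = {
    "data_type": {"name": "datetime", "configuration": {"unit": "ns"}, "fallback": "int64"},
    "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [1000, 100]}},
    "codecs": [{"name": "gzip", "configuration": {"level": 1}}],
    "storage_transformers": [
        {"name": "sharding", "configuration": {"chunks_per_shard": [2, 2]}}
    ],
    "extensions": [
        {"name": "my_extension", "must_understand": False, "configuration": {"foo": "bar"}}
    ],
}
supported = {"data_type": {"int64"}, "chunk_grid": {"regular"}, "codecs": {"gzip"}}
print(missing_required(metadata, supported))
```

Here the unknown `datetime` data type and the optional `my_extension` do not block reading, while the unsupported `sharding` storage transformer does.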

For groups, the storage_transformers and extensions keys would look similar. I'm not sure if we should differentiate on an array level between must_understand to read metadata and must_understand to read just the array chunk data of included arrays. However, any storage_transformer needed to read chunk data should probably be listed in the array's zarr.json rather than in a parent group's.

What do you think about this @jbms @joshmoore @rabernat @WardF @jakirkham? Happy to discuss this later in the ZEP meeting.

@jbms brought up that we might also use the top-level object of the metadata for extensions instead of the extensions key, and encode must_understand differently, e.g. via the name. I personally find the current version clearer, and it helps to differentiate between core and non-core features when looking at a metadata document without knowing the spec details. Let's also clarify whether this should be changed.

jbms · 2023-02-09T11:52:30Z

Thanks, I think this is moving in the right direction. I think it would be nice to be as consistent as possible, so that it is easier to remember when writing manually. In particular, rather than have a mix of "type" and "name", just always use one key.

jstriebel · 2023-02-09T12:10:21Z

> In particular, rather than have a mix of "type" and "name", just always use one key.

Yep, meant to use name also for the codecs, corrected this now in the example above.

jbms · 2023-02-09T12:13:32Z

I think for storage transformers it might be better to just use a single name, like "name", rather than both name and type.

jstriebel · 2023-02-09T12:18:26Z

> I think for storage transformers it might be better to just use a single name, like "name", rather than both name and type.

Yep, in this case the type is just a part of the transformer-specific config, but I agree that we can probably drop it.

jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Dec 5, 2022

jstriebel added this to ZEP1 Dec 5, 2022

jstriebel moved this to Todo in ZEP1 Dec 5, 2022

jstriebel mentioned this issue Dec 6, 2022

Issue overview: finalize and verify extension, store and codec mechanisms #169

Closed

6 tasks

jstriebel moved this from Todo to In Discussion in ZEP1 Feb 9, 2023

jstriebel mentioned this issue Feb 13, 2023

Restructure docs, clarify extension points and configs, use short ids #204

Merged

jstriebel moved this from In Discussion to In Review in ZEP1 Feb 13, 2023

jstriebel closed this as completed in #204 Feb 21, 2023

github-project-automation bot moved this from In Review to Done in ZEP1 Feb 21, 2023
