Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why do storage transformers need "type" separate from "configuration" #191

Closed
rabernat opened this issue Dec 1, 2022 · 7 comments · Fixed by #204
Closed

Why do storage transformers need "type" separate from "configuration" #191

rabernat opened this issue Dec 1, 2022 · 7 comments · Fixed by #204
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec

Comments

@rabernat
Copy link
Contributor

rabernat commented Dec 1, 2022

This is what V3 currently says about how to specify storage transformers

zarr-specs/docs/core/v3.0.rst

Lines 1165 to 1174 in b509f14

Specifies a stack of `storage transformers`_. Each value in the list must
be an object containing the names ``extension`` and ``type``.
The ``extension`` is required and the value must be a URI that identifies
the extension and dereferences to a human-readable representation
of the specification. The ``type`` is required and the value is
defined by the extension. The
object may also contain a ``configuration`` object which consists of the
parameter names and values as defined by the corresponding storage transformer
specification. When the ``storage_transformers`` name is absent no storage
transformer is used, same for an empty list.

Why can't we just put type inside configuration? That just seems simpler. Plus, it may not make sense to define type for some storage transformers. That means type is a transformer-specific configuration parameter anyway.

cc @jstriebel

@jstriebel jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Dec 5, 2022
@jstriebel jstriebel added this to ZEP1 Dec 5, 2022
@jstriebel jstriebel moved this to Todo in ZEP1 Dec 5, 2022
@jstriebel
Copy link
Member

Good point, I just took this design over from other extension points. This also touches on the question if all extension-points (e.g. dtype, transformers) also need an entry in the extensions section.

Those are the different points in my mind, trying to order it a bit:

  • dropping type field for transformers
    👍 IMO
  • drop type for data-types, chunk-grid, chunk-memory-layout and metadata_encoding (the latter two might get removed, see v3 spec: Consider removing metadata encoding #174 & Replace chunk_memory_layout with transpose codec #189)
    Unsure about this, since type is useful for the default (but maybe not so much if there's only a single default, e.g. regular for chunk-grid). Maybe we can unify the type and extension field into one? E.g. type may be an id of an extension?
  • Should all extension points be listed in the extensions list of the respective metadata file? Then configuration would be place just there. E.g. if a datetime data-type extension is used, there would be a respective data_type entry, but would there also be an entry in the array's top-level extension field?

@joshmoore
Copy link
Member

As an aside, I again get the feeling that I could use lots of examples to help make these types of decisions.

@jstriebel
Copy link
Member

jstriebel commented Feb 9, 2023

Here's a proposal for a more coherent terminology and config:

Zarr has extension points, which allow to add new functionality without changing the core specification. Those are

  • for arrays:
    • data-type
    • chunk-grid
    • codecs
    • storage-transformers
    • extensions
  • for groups:
    • storage-transformers
    • extensions

Additionally, there will be metadata conventions (zarr-developers/zeps#28) (also for groups and arrays), which do not contain functionality needed by zarr implementations itself, but higher-level libs and apps.

Preferably specific extension points should be used over the more generic "extensions", which can be used if non of the others match.

An array config with all different types of extension points could look like this:

{
    "shape": [10000, 1000],
    "data_type": {  // string or extension point object with fallback
        "name": "datetime",
        "configuration": {
            "unit": "ns"
        },
        "fallback": "int64"
    },
    "chunk_grid": {
        "type": {  // string or extension point object
            "name": "hexagonal",
            "configuration": {
                "origin": "…"
            }
        },
        "chunk_shape": [1000, 100],
        "separator" : "/"
    },
    "codecs": [  // list of extension point objects
        {
            "name": "gzip",
            "configuration": {
                "level": 1
            }
        }
    ],
    "storage_transformers": [  // list of extension point objects
        {
          "name": "sharding",
          "configuration": {
            "type": "indexed",
            "chunks_per_shard": [2, 2]
          }
        }
    ],
    "fill_value": null,
    "extensions": [  // list of extension objects with must_understand
        {
            "name": "my_extension",
            "must_understand": false,
            "configuration": {
                "foo": "bar"
            }
        }
    ],
    "attributes": {}
}

This is slightly different from the current version, but uses more coherent extension point objects. The name refers to a name in the spec, and configuration is coherently used in such objects. For the chunk-grid, I'm wondering if the following version is nicer:

{
    "chunk_grid": {
        "name": "hexagonal",
        "configuration": {
            "chunk_shape": [1000, 100],
            "origin": "…",
            "separator" : "/"
        }
    }
}

This would remove one level and might make more sense if any future chunk-grids do not have a chunk_shape as currently defined.

The behavior when an extension point is needed to be able to read array chunk data is the same as now:

extension metadata is extension required
data type data_type no fallback
chunk grid chunk_grid always
codecs codecs always
storage transformer storage_transformers_ always
array extension extensions must_understand

For groups the storage_transformers and extensions keys would look similar. I'm not sure if we should differentiate on an array level betweend must_understand to read metadata or just array chunk data of included arrays. However, probably any storage_transformer needed to read chunk data should be listed rather in the array's zarr.json instead of a parent group's.

What do you think about this @jbms @joshmoore @rabernat @WardF @jakirkham? Happy to discuss this later in the ZEP meeting.

@jbms Brought up that we might also use the top level object of the metadata for extensions instead of the extensions key and encode the must_understand differently, e.g. via the name. I personally find the current version more clear, and it helps to differentiate between core and non-core features when looking at a metadata document, without knowing the spec details. Let's clarify as well if this should be changed.

@jbms
Copy link
Contributor

jbms commented Feb 9, 2023

Thanks, I think this is moving in the right direction. I think it would be nice to be as consistent as possible, so that it is easier to remember when writing manually. In particular, rather than have a mix of "type" and "name", just always use one key.

@jstriebel
Copy link
Member

In particular, rather than have a mix of "type" and "name", just always use one key.

Yep, meant to use name also for the codecs, corrected this now in the example above.

@jbms
Copy link
Contributor

jbms commented Feb 9, 2023

I think for storage transformers it might be better to just use a single name, like "name", rather than both name and type.

@jstriebel
Copy link
Member

I think for storage transformers it might be better to just use a single name, like "name", rather than both name and type.

Yep, in this case the type is just a part of the transformer-specific config, but I agree that we can probably drop it.

@jstriebel jstriebel moved this from Todo to In Discussion in ZEP1 Feb 9, 2023
@jstriebel jstriebel moved this from In Discussion to In Review in ZEP1 Feb 13, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in ZEP1 Feb 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants