Extension for separate user attributes #72

alimanfoo · 2020-05-20T20:39:42Z

In the current v3 protocol spec draft, an array metadata document contains both core metadata (like array data type, shape, chunk shape, etc.) and user attributes. E.g.:

{
    "shape": [10000, 1000],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The same is true for groups, i.e., both core metadata and user attributes are stored together in a single group metata document, e.g.:

{
    "extensions": [],
    "attributes": {
        "spam": "ham",
        "eggs": 42,
    }
}

The zarr v3 approach of storing core metadata and user attributes together is similar to N5.

Note that this is different from the zarr v2 protocol, where the core metadata and user attributes are stored in separate documents.

Raising this issue to surface any comments or discussion on this approach.

Some possible reasons for bringing core metadata and user attributes together into a single document:

Creating an array with user attributes requires only a single storage request.
Reading all metadata for an array requires only a single storage request.

Some possible reasons for separating core metadata and user attributes into different documents:

If user attributes for an array are reasonably large and rarely used, then there is some overhead when reading core metadata. I.e., you can't read just the core metadata without reading the user attributes too.

joshmoore · 2020-06-17T19:41:53Z

Some discussion around this on today's community meeting. Further discussion on the spec meeting needed.

DennisHeimbigner · 2022-02-21T21:20:19Z

I want to put in a word for keeping the attributes in a separate document.
One persistent problem is that the spec developers do not seem to consider
that data sets exist with very large amounts of metadata, including many groups,
many variables, and many attributes.
The netcdf group at unidata has been grappling with this for some time. We currently
support lazy reading of groups, variables, and attributes just because of this.
The more you combine the metadata, the slower it becomes to open a dataset.
This issue has also come up in the context of version 2 where there is informal
support for the ultimate case of putting all the metadata into a single object.
Bottom line: when building this spec, always consider the case where the object
you create is very large.
merging all the metadata

jstriebel · 2022-11-21T10:30:26Z

I'm wondering if it should be a core feature to have the attributes separate, or if it could be an extension. Any opinions on that? If it's supposed to be a core feature, should it always be in a separate document, or can it be inlined into the other metadata as it currently is, too?

jstriebel · 2022-11-28T14:50:19Z

If there are no objections, I'd leave this as an extension (TBD) for now.

rabernat · 2022-11-28T15:17:00Z

@DennisHeimbigner - do you think that your issues could be addressed via an extension?

DennisHeimbigner · 2023-10-25T21:33:30Z

The meaning of "extension" is completely undefined, so I have no idea if it is possible or not.
My guess is that attributes as a separate document is not doable within any reasonable and limited
notion of "extension".

rabernat · 2023-10-25T21:58:23Z

Thanks for your feedback Dennis. Are you saying that the extension points in the V3 spec are not well defined enough? Any suggestions on how to make that better?

DennisHeimbigner · 2023-10-25T22:19:54Z

Short answer: no they are not well defined enough.
But lets try to think about an extension for moving attributes to a separate object.
As near as I can tell, non of the listed extension points try to create a new separate object.
My first thoughts are that such an extension would need, at least, something like this:

an indication of where the object can be situated in the meta-data JSON tree.
a specification of the contents of that object
If, as in this case, the new object is some combination of information from elsewhere in the metadata tree, then a mapping of some sort between the the new object and the original metadata and an indication of whether or not the original metadata is to be deleted.

In the worst case, we are talking about an arbitrary rewrite of the JSON metadata, which is not desirable, so we also need to somehow limit what kind of transformations are needed.
Perhaps we need something similar to XSLT.

Needs more thought.

DennisHeimbigner · 2023-10-25T22:27:34Z

Another possibility to address this.
We define a notion of a "pointer" and then we allow dict key values be converted to a separate object in the zarr file tree.
Then we allow a new kind of JSON object, more or less a "pointer" to another object in the zarr tree.
The semantics of the pointer would be that in effect the contents of the pointed-to object is in effect
inserted into the JSON tree in place of the pointer object.

DennisHeimbigner · 2023-10-25T22:40:02Z

Another thought about extensions.
It seems like all of current set that address metadata could be subsumed by specifying
an XSLT(?) like path template to a dictionary plus a template specifying a new key plus value that can
be inserted at all points in the metadata that match the path template. So we could write a template
that covers all zarr.json objects with node_type == array and then insert a new key into that zarr.json object
along with some JSON value describing the specific extension.
This approach would also allow extensions to build on other extensions.

DennisHeimbigner · 2023-10-25T22:48:04Z

This might be relevant: https://softwareengineering.stackexchange.com/questions/264706/xslt-equivalent-for-json

jbms · 2023-10-26T15:40:34Z

Agreed that none of the existing "extension points" are applicable for this purpose. However, at least as I see it, an "extension" more generally is really any future change to the spec that maintains some amount of compatibility with the existing spec (as otherwise we would just consider it an independent spec rather than an extension), and that would certainly be able to support this feature.

It sounds like you are proposing a more general mechanism for moving arbitrary subtrees of the JSON to separate keys in the underlying store; that is an interesting possibility. I can also imagine a more limited feature specifically to allow user-defined attributes to be specified separately.

DennisHeimbigner · 2023-10-26T17:22:35Z

My worry is that the notion of "extension" has no current definition. It needs at least a semi-formal
(i.e. prose) definition.
Perhaps we can start by saying who can implement new extensions.
Do they have to be approved by the ZEP committee?

jbms · 2023-10-26T18:31:44Z

Yes, perhaps it should be clarified.

It seems to me that a typical approach for creating an extension would be to first create an (experimental) implementation, then create a ZEP proposal to standardize it.

For extensions outside the realm of the indicated "extension points", any changes to the zarr.json metadata file should take into account interoperability issues:

If you want other implementations that don't implement the extension to fail to open the array/group, then add a new metadata field without {"must_understand": false}.
If you want other implementations that don't implement the extension to ignore it, then add a new metadata field with {"must_understand": true}.

DennisHeimbigner · 2023-10-26T18:38:11Z

Actually, this raises a question. Suppose I want to add a non-standard key to, say, a configuration dictionary.
Can I insert "must_understand: false" in that configuration dictionary, or must the must_understand occur
at the top-level of zarr.json object?

jbms · 2023-10-26T21:31:51Z

That is a good point. Currently the spec does not specify this, and therefore it is left up to the definition of the specific configuration, and currently those aren't very explicit about this. Tensorstore currently only supports must_understand at the top level, but probably it would be better to support this at all levels. This would be an easy, backwards-compatible change.

alimanfoo added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label May 20, 2020

rabernat mentioned this issue Aug 10, 2022

ZEP0001 - Core v3.0 spec for review #149

Closed

jstriebel mentioned this issue Nov 23, 2022

Support for inf, nan, binary data in attributes #141

Open

jstriebel added the protocol-extension Protocol extension related issue label Nov 28, 2022

jstriebel changed the title ~~Core metadata and user attributes: together or separate documents?~~ Extension for separate user attributes Feb 10, 2023

jstriebel removed the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Feb 10, 2023

sbesson mentioned this issue Oct 27, 2023

Codec: clarify expectations regarding unknown metadata fields #270

Open

d-v-b mentioned this issue Aug 12, 2024

Entrypoint for storing metadata that scales with number of chunks #305

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extension for separate user attributes #72

Extension for separate user attributes #72

alimanfoo commented May 20, 2020

joshmoore commented Jun 17, 2020

DennisHeimbigner commented Feb 21, 2022

jstriebel commented Nov 21, 2022

jstriebel commented Nov 28, 2022

rabernat commented Nov 28, 2022

DennisHeimbigner commented Oct 25, 2023

rabernat commented Oct 25, 2023

DennisHeimbigner commented Oct 25, 2023 •

edited

Loading

DennisHeimbigner commented Oct 25, 2023

DennisHeimbigner commented Oct 25, 2023 •

edited

Loading

DennisHeimbigner commented Oct 25, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

Extension for separate user attributes #72

Extension for separate user attributes #72

Comments

alimanfoo commented May 20, 2020

joshmoore commented Jun 17, 2020

DennisHeimbigner commented Feb 21, 2022

jstriebel commented Nov 21, 2022

jstriebel commented Nov 28, 2022

rabernat commented Nov 28, 2022

DennisHeimbigner commented Oct 25, 2023

rabernat commented Oct 25, 2023

DennisHeimbigner commented Oct 25, 2023 • edited Loading

DennisHeimbigner commented Oct 25, 2023

DennisHeimbigner commented Oct 25, 2023 • edited Loading

DennisHeimbigner commented Oct 25, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 26, 2023

jbms commented Oct 26, 2023

DennisHeimbigner commented Oct 25, 2023 •

edited

Loading

DennisHeimbigner commented Oct 25, 2023 •

edited

Loading