Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extension for separate user attributes #72

Open
alimanfoo opened this issue May 20, 2020 · 16 comments
Open

Extension for separate user attributes #72

alimanfoo opened this issue May 20, 2020 · 16 comments
Labels
protocol-extension Protocol extension related issue

Comments

@alimanfoo
Copy link
Member

In the current v3 protocol spec draft, an array metadata document contains both core metadata (like array data type, shape, chunk shape, etc.) and user attributes. E.g.:

{
    "shape": [10000, 1000],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "regular",
        "chunk_shape": [1000, 100]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The same is true for groups, i.e., both core metadata and user attributes are stored together in a single group metata document, e.g.:

{
    "extensions": [],
    "attributes": {
        "spam": "ham",
        "eggs": 42,
    }
}

The zarr v3 approach of storing core metadata and user attributes together is similar to N5.

Note that this is different from the zarr v2 protocol, where the core metadata and user attributes are stored in separate documents.

Raising this issue to surface any comments or discussion on this approach.

Some possible reasons for bringing core metadata and user attributes together into a single document:

  • Creating an array with user attributes requires only a single storage request.
  • Reading all metadata for an array requires only a single storage request.

Some possible reasons for separating core metadata and user attributes into different documents:

  • If user attributes for an array are reasonably large and rarely used, then there is some overhead when reading core metadata. I.e., you can't read just the core metadata without reading the user attributes too.
@alimanfoo alimanfoo added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label May 20, 2020
@joshmoore
Copy link
Member

Some discussion around this on today's community meeting. Further discussion on the spec meeting needed.

@DennisHeimbigner
Copy link

I want to put in a word for keeping the attributes in a separate document.
One persistent problem is that the spec developers do not seem to consider
that data sets exist with very large amounts of metadata, including many groups,
many variables, and many attributes.
The netcdf group at unidata has been grappling with this for some time. We currently
support lazy reading of groups, variables, and attributes just because of this.
The more you combine the metadata, the slower it becomes to open a dataset.
This issue has also come up in the context of version 2 where there is informal
support for the ultimate case of putting all the metadata into a single object.
Bottom line: when building this spec, always consider the case where the object
you create is very large.
merging all the metadata

@jstriebel
Copy link
Member

I'm wondering if it should be a core feature to have the attributes separate, or if it could be an extension. Any opinions on that? If it's supposed to be a core feature, should it always be in a separate document, or can it be inlined into the other metadata as it currently is, too?

@jstriebel
Copy link
Member

If there are no objections, I'd leave this as an extension (TBD) for now.

@rabernat
Copy link
Contributor

@DennisHeimbigner - do you think that your issues could be addressed via an extension?

@jstriebel jstriebel changed the title Core metadata and user attributes: together or separate documents? Extension for separate user attributes Feb 10, 2023
@jstriebel jstriebel removed the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Feb 10, 2023
@DennisHeimbigner
Copy link

The meaning of "extension" is completely undefined, so I have no idea if it is possible or not.
My guess is that attributes as a separate document is not doable within any reasonable and limited
notion of "extension".

@rabernat
Copy link
Contributor

Thanks for your feedback Dennis. Are you saying that the extension points in the V3 spec are not well defined enough? Any suggestions on how to make that better?

@DennisHeimbigner
Copy link

DennisHeimbigner commented Oct 25, 2023

Short answer: no they are not well defined enough.
But lets try to think about an extension for moving attributes to a separate object.
As near as I can tell, non of the listed extension points try to create a new separate object.
My first thoughts are that such an extension would need, at least, something like this:

  1. an indication of where the object can be situated in the meta-data JSON tree.
  2. a specification of the contents of that object
  3. If, as in this case, the new object is some combination of information from elsewhere in the metadata tree, then a mapping of some sort between the the new object and the original metadata and an indication of whether or not the original metadata is to be deleted.

In the worst case, we are talking about an arbitrary rewrite of the JSON metadata, which is not desirable, so we also need to somehow limit what kind of transformations are needed.
Perhaps we need something similar to XSLT.

Needs more thought.

@DennisHeimbigner
Copy link

Another possibility to address this.
We define a notion of a "pointer" and then we allow dict key values be converted to a separate object in the zarr file tree.
Then we allow a new kind of JSON object, more or less a "pointer" to another object in the zarr tree.
The semantics of the pointer would be that in effect the contents of the pointed-to object is in effect
inserted into the JSON tree in place of the pointer object.

@DennisHeimbigner
Copy link

DennisHeimbigner commented Oct 25, 2023

Another thought about extensions.
It seems like all of current set that address metadata could be subsumed by specifying
an XSLT(?) like path template to a dictionary plus a template specifying a new key plus value that can
be inserted at all points in the metadata that match the path template. So we could write a template
that covers all zarr.json objects with node_type == array and then insert a new key into that zarr.json object
along with some JSON value describing the specific extension.
This approach would also allow extensions to build on other extensions.

@DennisHeimbigner
Copy link

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

Agreed that none of the existing "extension points" are applicable for this purpose. However, at least as I see it, an "extension" more generally is really any future change to the spec that maintains some amount of compatibility with the existing spec (as otherwise we would just consider it an independent spec rather than an extension), and that would certainly be able to support this feature.

It sounds like you are proposing a more general mechanism for moving arbitrary subtrees of the JSON to separate keys in the underlying store; that is an interesting possibility. I can also imagine a more limited feature specifically to allow user-defined attributes to be specified separately.

@DennisHeimbigner
Copy link

My worry is that the notion of "extension" has no current definition. It needs at least a semi-formal
(i.e. prose) definition.
Perhaps we can start by saying who can implement new extensions.
Do they have to be approved by the ZEP committee?

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

Yes, perhaps it should be clarified.

It seems to me that a typical approach for creating an extension would be to first create an (experimental) implementation, then create a ZEP proposal to standardize it.

For extensions outside the realm of the indicated "extension points", any changes to the zarr.json metadata file should take into account interoperability issues:

  • If you want other implementations that don't implement the extension to fail to open the array/group, then add a new metadata field without {"must_understand": false}.
  • If you want other implementations that don't implement the extension to ignore it, then add a new metadata field with {"must_understand": true}.

@DennisHeimbigner
Copy link

Actually, this raises a question. Suppose I want to add a non-standard key to, say, a configuration dictionary.
Can I insert "must_understand: false" in that configuration dictionary, or must the must_understand occur
at the top-level of zarr.json object?

@jbms
Copy link
Contributor

jbms commented Oct 26, 2023

That is a good point. Currently the spec does not specify this, and therefore it is left up to the definition of the specific configuration, and currently those aren't very explicit about this. Tensorstore currently only supports must_understand at the top level, but probably it would be better to support this at all levels. This would be an easy, backwards-compatible change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
protocol-extension Protocol extension related issue
Projects
None yet
Development

No branches or pull requests

6 participants