ZEP 4: Metadata Conventions #28

Merged on Jun 29, 2023 (2 commits)

rabernat
Contributor

This ZEP describes how communities can standardize conventions around metadata and layout of Zarr data
using user-defined attributes in order to meet domain-specific application needs without changes to the
core data model and specification, and without specification extensions.

@rabernat rabernat changed the title First draft of ZEP 4 First draft of ZEP 4: Metadata Conventions Jan 12, 2023
@rabernat rabernat changed the title First draft of ZEP 4: Metadata Conventions ZEP 4: Metadata Conventions Jan 12, 2023
@jakirkham
Member

cc @briannapagan (in case this is of interest to you)

@rabernat
Contributor Author

This is directly relevant to the forthcoming geozarr work, so that's why I wanted to push it out in draft form

@jstriebel
Member

Awesome, thanks a lot! As mentioned in zarr-developers/zarr-specs#169, this is also relevant for the issues zarr-developers/zarr-specs#139 and zarr-developers/zarr-specs#144.

I like that this is becoming a separate ZEP; it never occurred to me to separate this from ZEP 1.

@jbms

jbms commented Jan 12, 2023

The current proposal just allows a group/array to have a single convention. Perhaps for some use cases that makes sense. But the example in the proposal is "units", which could easily interoperate with numerous other possible conventions. Instead the naming of attributes could be done to allow multiple conventions to be used at once, for example:

{"zarr_convention": {"units-v1": {"units": "m^2"}, ...}}

or

{"units-v1": {"units": "m^2"}, ...}

or

{"units-v1": "m^2", ...}
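
A minimal sketch of how a reader might consume the first layout above, assuming a plain dict stands in for the array's attributes. The `labels-v2` convention and the helper name `convention_payload` are hypothetical, used only to show two conventions coexisting under one reserved key.

```python
import json

# Attributes following the first proposed layout: one reserved key whose value
# maps each convention identifier to that convention's payload.
# "labels-v2" is a made-up second convention used only for illustration.
attrs = json.loads("""
{
  "zarr_convention": {
    "units-v1": {"units": "m^2"},
    "labels-v2": {"names": ["background", "cell"]}
  }
}
""")


def convention_payload(attrs, identifier):
    """Return the payload stored for one convention, or None if it is absent."""
    return attrs.get("zarr_convention", {}).get(identifier)


units = convention_payload(attrs, "units-v1")
labels = convention_payload(attrs, "labels-v2")
missing = convention_payload(attrs, "crs-v1")  # not declared on this array
```

A reader that only understands `units-v1` can ignore every other entry, which is the interoperability property the comment is asking for.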

@rabernat
Contributor Author

rabernat commented Jan 12, 2023

Very good point Jeremy.

TBH, I'm on the fence about whether the convention even needs to be explicitly identified. Like, maybe it could be enough to say

> Arrays with the units attribute set are assumed to be using this convention

Of the proposals above, I definitely favor the first one because it doesn't touch the name of the actual attribute. I could also imagine

{
    "zarr_convention_units-v1": True,
    "zarr_convention_foobar-v2": True
}
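
A rough sketch of how such flag keys could be read back, assuming the key shape `zarr_convention_<name>-v<N>` inferred from the example above (the thread does not define it formally). The helper `declared_conventions` is hypothetical, and a plain dict stands in for the attributes.

```python
PREFIX = "zarr_convention_"


def declared_conventions(attrs):
    """Collect (name, version) pairs from keys like zarr_convention_units-v1 set to True."""
    found = []
    for key, value in attrs.items():
        if not key.startswith(PREFIX) or value is not True:
            continue
        # Split "units-v1" into "units" and "1" at the last "-v".
        name, sep, version = key[len(PREFIX):].rpartition("-v")
        if sep and version.isdigit():
            found.append((name, int(version)))
    return sorted(found)


attrs = {
    "units": "m^2",  # the actual attribute keeps its original name
    "zarr_convention_units-v1": True,
    "zarr_convention_foobar-v2": True,
}
conventions = declared_conventions(attrs)
```

The point of this layout is that `units` itself is untouched; the flags sit next to it.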

@martindurant
Member

Strongly support this concept.

Question:
You mention the currently uncodified (by zarr) conventions in the wild. Is there something to be done about conventions that arise organically and are not described in the zarr docs?

@martindurant
Member

I do believe it's useful that, once a convention is listed and given a name, it is explicitly mentioned in the attributes of the data that uses it.

@rabernat
Contributor Author

> Is there something to be done about conventions that arise organically and are not described in the zarr docs?

I think that it's natural for conventions to arise organically. Once there is sufficient alignment and adoption, they can be proposed as conventions.

For conventions created in the wild, or borrowed from other formats (e.g. CF Conventions), it could be hard to require the presence of the zarr_conventions attribute. (I'm thinking about e.g. converting from NetCDF to Zarr.) There needs to be a way to simply document existing conventions, without prescribing new attributes to be present.

@martindurant
Member

> it could be hard to require the presence

Recommended, but not required?

Member

@joshmoore joshmoore left a comment

Thanks for this, @rabernat! I very much look forward to having the conventions list on zarr-specs. 👍


### Updating a Convention

Conventions should be versioned using incremental integers, starting from 1.
Member

For me, reading these two sentences leaves a certain tension. Can we clarify if the first is really a SHOULD or is it more a MAY?

draft/ZEP0004.md (outdated)
'example_with_units.zarr', mode='w', shape=(10000, 10000), chunks=(1000, 1000), dtype='f4'
)
z.attrs['units'] = 'm^2'
z.attrs['zarr_convention'] = "units-v1"
Member

For me, this is the most (or really only) potentially contentious line in the ZEP. Similar to the ZEP1 discussion last night, I could see discussing at least:

  • if zarr_ is a generally special prefix or if zarr_convention is a special string
  • if it's possible to have more than one convention active in a zgroup/zarray at a time
  • whether other values (like URLs) are acceptable as the values of the convention string

Member

I'm also wondering if this is needed at all. Instead, I'd rather just specify that the field units has a metadata convention. Nobody is forced to follow this convention, and existing arrays with such a field would potentially follow it automatically.

@jstriebel
Member

I'd propose to remove the zarr_convention key, and simply have a document which defines metadata convention (+ the process to add them as laid out in this ZEP ❤️). The user attributes could follow the metadata conventions (or not^^), e.g.:

{
  "units": "m^2",
  "writer": "zarr-python",
  "origin": [12300, 45600],
  "convention-key": "valuenotfollowingtheconvention",
  "some-other-key": "foo",
  
}

I see the conventions as a good place to discuss & establish standards between implementations, not as a strict mechanism that must be enforced. Also, IMO it can't be enforced: having the zarr_convention key doesn't prevent misuse of such conventions (e.g. other interpretations, using inches when only metric units are defined, bugs, …).

@rabernat rabernat marked this pull request as draft January 19, 2023 17:05
@ivirshup
Contributor

ivirshup commented Jan 23, 2023

About specifying the convention: I think it's really quite useful to know which specific convention is being used and to version them.

For example, if two groups want to use the units field, how should I know how to interpret that? What if you want to update the convention?

In anndata, we specify conventions for our data with an encoding-type and encoding-version field in .attrs (both in hdf5 and zarr). We used to not, and it kinda sucked to figure out people's IO errors or make any updates.

I'd also agree with the point made above that only allowing a single convention may be limiting. Maybe instead conventions could be stored like:

z.attrs["conventions"] = {"convention": "version", ...}
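
To illustrate why a name-to-version mapping helps a reader, here is a small sketch under stated assumptions: a plain dict stands in for `z.attrs`, versions are strings as in the line above, and the `SUPPORTED` table and `check_conventions` helper are hypothetical.

```python
# Hypothetical table of what this reader understands: name -> accepted versions.
SUPPORTED = {"units": {"1"}, "labels": {"1", "2"}}


def check_conventions(attrs):
    """Split declared conventions into those this reader supports and those it does not."""
    declared = attrs.get("conventions", {})
    supported, unsupported = {}, {}
    for name, version in declared.items():
        target = supported if version in SUPPORTED.get(name, set()) else unsupported
        target[name] = version
    return supported, unsupported


attrs = {"conventions": {"units": "1", "labels": "3"}, "units": "m^2"}
ok, unknown = check_conventions(attrs)
```

With an explicit version on disk, a reader can report "labels version 3 is newer than I understand" instead of failing somewhere deep in its IO code.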

@joshmoore
Member

I tend to agree with Isaac. Maybe it helps to think through (and/or specify) what processing looks like. I've been debating whether to bring up my favorite soapbox (JSON-LD) as a way of specifying such metadata that already exists, has processing rules, etc. e.g.:

| Context                                          | Field    | Interpretation                |
|--------------------------------------------------|----------|-------------------------------|
| N/A                                              | units    | this-file#units               |
| {@context: {units: example.com/}}                | units    | example.com/units             |
| {@context: http://some-file.jsonld}              | units    | whatever-some-file-says/units |
| {@context: {ns: https://some-other-file.jsonld}} | ns:units | some-other-namespace/units    |

Obviously, there are a lot of different edge cases there, but I do like the idea of not building our own.

@rabernat
Contributor Author

From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers. A big use case for us is transcoding NetCDF / HDF5 data that already exists into Zarr. This data was written 10+ years ago and the metadata is what it is.

I think the way forward is for me to put together the template referred to above. This should have a section on "How to identify this convention".

@jbms

jbms commented Jan 24, 2023

> From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers. A big use case for us is transcoding NetCDF / HDF5 data that already exists into Zarr. This data was written 10+ years ago and the metadata is what it is.

Is it not an option to convert the metadata at the same time as the data conversion happens? It seems that during this data conversion is when you would have the most context for decoding any metadata.

@ivirshup
Contributor

> From a practical point of view, it may be simply impossible to impose a hard requirement for convention identifiers.

Totally fair. Without any sort of identifier or format requirement this seems to me like a listing of conventions used with zarr.

If this is the direction, I wonder if even the "consensus" requirement for new conventions could be softened to "noteworthy" or removed. If no namespace is being reserved, then I don't think the zarr team needs to take on the responsibility of figuring out if a field has consensus on a file format. Especially since it's so easy to break consensus. For instance, among the given examples, aren't Xarray Zarr, GDAL, and GeoZarr competing?

I think there's definitely value in collecting lists of conventions building on top of zarr, and making that visible. However, I wonder if doing more than that (like establishing credentials based on consensus/use) is something better left to standards repositories like fairsharing.org?

@rabernat
Contributor Author

Really good points @ivirshup. Yes, we definitely don't want to give ourselves (zarr developers) the job of mediating standards in different scientific domains. The only intention here is to provide a means to document an existing convention, not do any sort of evaluation or approval. I'll modify to reflect that.

@jbms

jbms commented Jan 25, 2023

On my end, I'm most interested in attributes that are relevant to general purpose tools like Neuroglancer, e.g. things like units, different types of labels. If there is no way to unambiguously identify the metadata then it is much more complicated to make use of it.

@normanrz
Member

I think namespacing would be a good idea. In the OME-Zarr context, we are thinking about wrapping all the OME-specific metadata under the ome key in the attributes (ome/ngff#182). I think that would be useful to allow metadata from different metadata conventions to exist in the same group/array.
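
A small sketch of the collision that namespacing avoids, assuming a plain dict in place of the attributes. The `geo` namespace, the payloads, and the helper `read_namespaced` are made up for illustration and are not actual OME-Zarr or GeoZarr metadata.

```python
# Two hypothetical conventions that both define a "units" field with different
# meanings. In a flat attribute dict they would overwrite each other; wrapped
# under separate top-level keys they can coexist on the same group/array.
attrs = {
    "ome": {"units": "micrometer"},
    "geo": {"units": "m"},
}


def read_namespaced(attrs, namespace, field):
    """Look up a field inside one convention's namespace; None if absent."""
    return attrs.get(namespace, {}).get(field)


ome_units = read_namespaced(attrs, "ome", "units")
geo_units = read_namespaced(attrs, "geo", "units")
other_units = read_namespaced(attrs, "cf", "units")  # namespace not present
```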

@christophenoel

christophenoel commented Mar 30, 2023

Is there any existing work on that?
I don't see item ZEP4 in https://github.com/zarr-developers/zeps/tree/main/draft

I didn't realise it was a pull request, sorry :) :)

@christophenoel

My main concern is the lack of a concept for grouping conventions for a specific purpose, profile, or topic. This would provide greater flexibility for client applications to selectively support conventions, while still enabling interoperability with other Zarr implementations.

In some domains (e.g. Earth Observation), there may be hundreds of conventions, and client applications may only address subsets of those conventions. Similar to how OGC APIs Implementation Standards are written, I suggest introducing "requirement-classes" (or "convention-classes") as a means of decoupling a set of domain conventions into groups that can be advertised in Zarr as supported or not. For example:

conventions-classes: ["eo-core", "eo-multispectral", "eo-multiscale", "eo-quicklook", "eo-symbology"]
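
As a rough sketch of how a client might use such a list, assuming the attribute is stored under the key `conventions-classes` exactly as written above. The `CLIENT_SUPPORTS` set and the helper `split_classes` are hypothetical.

```python
# Hypothetical subset of convention classes this client implements.
CLIENT_SUPPORTS = {"eo-core", "eo-multiscale"}


def split_classes(attrs):
    """Return (usable, ignored) convention classes, preserving the advertised order."""
    advertised = attrs.get("conventions-classes", [])
    usable = [c for c in advertised if c in CLIENT_SUPPORTS]
    ignored = [c for c in advertised if c not in CLIENT_SUPPORTS]
    return usable, ignored


attrs = {
    "conventions-classes": [
        "eo-core", "eo-multispectral", "eo-multiscale", "eo-quicklook", "eo-symbology",
    ]
}
usable, ignored = split_classes(attrs)
```

A client can then act on the classes it knows and skip the rest, rather than needing to understand every convention in the domain.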

Furthermore, I do not see a clear indication in the process of how the conventions become listed on https://zarr-specs.readthedocs.io/ under a specific domain section.

Regards,


### New Convention Process

New conventions are proposed via a pull-request to the `zarr-specs` repo which adds a new conventions document.

Not clear what the pull request contains (a domain-specific convention document?). If it contains a convention document, should a template describe its structure?

Contributor Author

Yes, a template is needed. I need to finish this PR.

@rabernat
Contributor Author

rabernat commented Jun 1, 2023

Thanks for everyone's patience, and apologies for being slow to finish up this draft.

I plan to prioritize this over the next few weeks. I'll respond to the comments above and push a new draft that incorporates the feedback.

@rabernat rabernat marked this pull request as ready for review June 29, 2023 15:47
@rabernat
Contributor Author

I have finally updated this ZEP. Thanks everyone for the patience. In my update, I incorporated the following changes:

  • Distinguish between a "legacy convention" (to accommodate existing data) and an "explicit convention"
  • For explicit conventions, zarr_conventions should be an array of strings, allowing multiple conventions to be composed together
  • Addressed the possibility of namespacing
  • Addressed the possibility of versioning
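
A minimal sketch of how a reader could handle both kinds of convention, under stated assumptions: the identifier shape `<name>-v<N>` is borrowed from the `units-v1` example earlier in this thread (the ZEP text is the authority on the real format), the `LEGACY_MARKERS` table is hypothetical, and a plain dict stands in for the attributes.

```python
import re

# Assumed identifier shape "<name>-v<N>", taken from the "units-v1" example
# earlier in the thread; not confirmed here as the ZEP's required format.
IDENTIFIER = re.compile(r"^(?P<name>.+)-v(?P<version>[1-9][0-9]*)$")

# Hypothetical legacy conventions, recognised only by the presence of an attribute.
LEGACY_MARKERS = {"units": "units"}


def detect_conventions(attrs):
    """Return {name: version}; version is an int if explicit, None if only legacy."""
    found = {}
    # Explicit conventions: listed by identifier in the zarr_conventions array.
    for identifier in attrs.get("zarr_conventions", []):
        match = IDENTIFIER.match(identifier)
        if match:
            found[match["name"]] = int(match["version"])
    # Legacy conventions: no identifier, inferred from a marker attribute.
    for name, marker in LEGACY_MARKERS.items():
        if marker in attrs:
            found.setdefault(name, None)
    return found


explicit = detect_conventions(
    {"zarr_conventions": ["units-v1", "foobar-v2"], "units": "m^2"}
)
legacy = detect_conventions({"units": "m^2"})
```

Data transcoded from NetCDF or HDF5 without any identifier still resolves through the legacy path, while newly written data can declare several conventions at once.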

My goal here is to include the very good ideas that have been proposed in the discussion above as recommended best practices while retaining the ability to support legacy conventions and practices already in use in the community.

Co-authored-by: Norman Rzepka <code@normanrz.com>
@MSanKeys963
Member

MSanKeys963 commented Jun 29, 2023

Thanks for completing it, @rabernat, and everyone for reviewing this.
Merging this as discussed in the ZEP Meeting today.

ZEP0004 is live here: https://zarr.dev/zeps/draft/ZEP0004.html.

@MSanKeys963 MSanKeys963 merged commit 54394ba into zarr-developers:main Jun 29, 2023
@ivirshup
Contributor

@MSanKeys963, did a discussion get opened for this?

@rabernat
Contributor Author

Hi Isaac! I believe it's on me to open a PR where the discussion will happen. I will try to do that today.

@ivirshup
Contributor

👍

Is it not meant to be a Discussion (as opposed to a PR/Issue)? I think this kind of discussion heavily benefits from threading.

Maybe this changed: #27 ?

@rabernat
Contributor Author

TBH I think the ZEP process still has a lot of details to be ironed out. I agree a discussion makes sense.

@tasansal

This idea is excellent; I would love to help push this forward. What is the best way to collaborate?

@rabernat
Contributor Author

@tasansal - the discussion is continuing in zarr-developers/zarr-specs#262

The best way to collaborate would be to share your use cases there.
