v3 spec: Consider removing metadata encoding #174

Closed
jstriebel opened this issue Nov 24, 2022 · 7 comments
Labels
core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec

Comments

@jstriebel
Member

In #171 I consolidated metadata_key_suffix and metadata_encoding into a single object, which can also be an extension point. @jbms raised the question of whether this explicit extension point is needed, since global extensions could define the same behavior (see the conversation here). I'll try to summarize the pros and cons of removing it:

Pros:

  • We can still later add a metadata_encoding attribute if we want, rather than adding it under "extensions" --- since any volume using a different metadata_encoding will only be usable by implementations that support it, the default behavior of failing when encountering any unknown metadata members will do the right thing.

  • The extension point is untested so far and might need further changes later on.

Cons:

@jstriebel jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 24, 2022
@jbms
Contributor

jbms commented Nov 24, 2022

Assuming we keep metadata_encoding at all, I think that metadata_key_suffix can be skipped altogether, rather than embedded as a member of metadata_encoding. The specification for each metadata encoding type would simply state what the key suffix is. An implementation that does not know about a given metadata_encoding does not benefit from knowing the extension, since it still can't decode it.

As for #37, the original request was for being able to encode metadata as hdf5 attributes which have a different data model. I'm not clear exactly how that request fits in with zarr v3, but I don't think metadata_encoding is actually helpful for that purpose.

As for #141, I also don't think metadata_encoding is a great solution if regular JSON with its limitations remains the default; in fact, I think that if the representable values vary depending on the metadata_encoding, that would be confusing and problematic.

@rabernat
Contributor

What if we made metadata encoding up to the store?

For some stores (filesystems, object stores), JSON is a natural choice. For other stores (e.g. document databases), native storage of dictionaries would be more natural.

@joshmoore
Member

> What if we made metadata encoding up to the store?

How would calling code know which store to use?

@jbms
Contributor

jbms commented Jan 18, 2023

> What if we made metadata encoding up to the store?

> How would calling code know which store to use?

I think you might have intended to ask: "How would the calling code know which encoding to use?"

Conceptually we could say that the store just decodes and re-encodes when writing, and does the reverse when reading, based on the key requested.

Implementations of zarr could choose to make this more efficient by adding to their store abstraction an interface for reading and writing in-memory JSON values directly, in order to avoid the extra encoding and decoding. That would also be necessary if the value cannot be represented as JSON (e.g. nan, infinity, specific nan values), though we might want to avoid such values precisely to ensure consistency across metadata encodings.

@rabernat
Contributor

rabernat commented Jan 18, 2023

Right, so the interface might look something like

from typing import Any


class Store:

    def store_meta(self, key: str, metadata: dict[str, Any]) -> None:
        ...

    def store_bytes(self, key: str, data: bytes) -> None:
        ...

i.e. you would pass a native dict to the store.
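
For illustration only, a minimal sketch of two hypothetical stores implementing this interface; the class names and encoding choices are assumptions for this sketch, not anything prescribed by the spec. A filesystem-style store would fall back to encoding the dict as JSON bytes, while a document-style store could keep the dict natively:

import json
from pathlib import Path
from typing import Any


class FilesystemStore:
    """Hypothetical store: metadata ends up as JSON bytes on disk."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def store_bytes(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def store_meta(self, key: str, metadata: dict[str, Any]) -> None:
        # No native dict representation here, so encode to JSON bytes.
        self.store_bytes(key, json.dumps(metadata).encode("utf-8"))


class InMemoryDocumentStore:
    """Hypothetical stand-in for a document database: dicts kept natively."""

    def __init__(self) -> None:
        self.documents: dict[str, Any] = {}

    def store_meta(self, key: str, metadata: dict[str, Any]) -> None:
        # No JSON round-trip; the dict is stored as-is.
        self.documents[key] = metadata

    def store_bytes(self, key: str, data: bytes) -> None:
        self.documents[key] = data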

@joshmoore
Member

> I think you might have intended to ask: "How would the calling code know which encoding to use?"

... Maybe. I guess, as always, concrete examples would help. For a database, I agree it's not really our business to say how the data goes in. I was more concerned about the middle ground where someone is trying to achieve zarr.json.gz or zarr.bson, etc., i.e. we're still more or less in file space but getting towards binary representations. As long as the cases where there's no JSON require code in order to configure the store, I can see a path forward, but we should be really upfront about that in the spec.

@jstriebel
Member Author

We removed the metadata encoding for now, since similar behavior can be specified via group extensions or group storage transformers. If the data is still stored in a JSON-like form, it might be handled at the level of the store, as indicated here:

> In general, a value is a sequence of bytes. Specific stores may choose more specific storage formats, which must be stated in the specification of the respective store. E.g. a database store might encode values of *.json keys with a database-native json type.
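
As a hedged illustration of that store-level behavior (SqliteStore and its schema are made up for this sketch, and an SQLite build with the JSON1 functions is assumed), a store could special-case *.json keys and keep them in a form its database can query natively:

import json
import sqlite3


class SqliteStore:
    """Hypothetical store that keeps *.json keys as queryable JSON text."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS chunks (key TEXT PRIMARY KEY, value BLOB)")
        self.db.execute("CREATE TABLE IF NOT EXISTS metadata (key TEXT PRIMARY KEY, value TEXT)")

    def set(self, key: str, value: bytes) -> None:
        if key.endswith(".json"):
            # Validate the bytes as JSON and store them through SQLite's json()
            # function, so json_extract() etc. can query the metadata in place.
            doc = json.loads(value)
            self.db.execute(
                "INSERT OR REPLACE INTO metadata (key, value) VALUES (?, json(?))",
                (key, json.dumps(doc)),
            )
        else:
            self.db.execute(
                "INSERT OR REPLACE INTO chunks (key, value) VALUES (?, ?)",
                (key, value),
            )
        self.db.commit()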

I'm closing this issue for now; I'd propose moving to #37 to discuss other encodings further.
