Support for inf, nan, binary data in attributes #141

Open
jbms opened this issue May 4, 2022 · 12 comments
Labels
core-protocol-v3.0 (Issue relates to the core protocol version 3.0 spec), protocol-extension (Protocol extension related issue)

Comments

@jbms
Contributor

jbms commented May 4, 2022

In both v2 and v3 specs, attributes are stored as JSON which means they are subject to the limitations of JSON:

  • Infinity and NaN cannot be represented, although currently zarr-python supports these via a non-standard extension to the JSON format (directly encoding them as unquoted Infinity or NaN within the JSON output; see the sketch after this list). This support may be accidental rather than intentional, though.
  • Byte strings (as opposed to Unicode strings) cannot be represented.
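For what it's worth, this is the default behavior of Python's standard-library json module, so the non-standard output may simply fall out of that default rather than being a deliberate zarr-python feature. A minimal demonstration:

    import json

    # By default, json.dumps accepts non-finite floats and emits the
    # unquoted tokens Infinity / -Infinity / NaN, which are not valid JSON.
    print(json.dumps({"a": float("inf"), "b": float("nan")}))
    # {"a": Infinity, "b": NaN}

    # A strict encoder refuses instead:
    json.dumps({"a": float("inf")}, allow_nan=False)  # raises ValueError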

The v3 spec suggests the possibility of using a binary encoding like CBOR for metadata which would solve these issues. However my understanding is that CBOR would not be the default.

@jakirkham
Member

jakirkham commented Jun 15, 2022

Also related ( #81 )

Somewhat related more general Zarr Python discussion ( zarr-developers/zarr-python#216 )

@jakirkham
Member

Here's the current code in Zarr Python for handling these values. One thought might be to graduate this to inclusion in the spec as-is. I don't know if there are other approaches worth considering.

For more complex objects Base64 encoding has come up as an idea.

@jbms
Contributor Author

jbms commented Jun 15, 2022

One difficulty with the current zarr-python approach is that it means the "JSON" metadata is not actually spec-compliant JSON and cannot be parsed by the JavaScript JSON.parse function or by the popular C++ nlohmann json library. However, other libraries like RapidJSON can handle the Infinity/NaN extension.

For binary data it is certainly possible to encode it as a string containing a base64 encoding, etc. However, there is a similar problem as with encoding Infinity/NaN as JSON strings: you lose the type information, so even if such values are converted automatically when encoding to JSON, there won't be a way to decode them automatically, since they aren't distinguishable from ordinary strings.
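To make the ambiguity concrete (plain Python, attribute names invented for the example):

    import base64
    import json

    attrs = {
        "checksum": base64.b64encode(b"\x00\xff\x10").decode("ascii"),  # was bytes
        "note": "AP8Q",                                                 # a genuine string
    }
    decoded = json.loads(json.dumps(attrs))

    # Both values come back as plain str; the byte-string type is lost.
    print(type(decoded["checksum"]), type(decoded["note"]))  # <class 'str'> <class 'str'>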

In general I think it would be valuable for zarr v3 attributes to be able to "round trip" any supported zarr v3 data type, since storing a data value as an attribute is a very common use case and it would be nice for every individual use case not to have to invent its own convention.

@jakirkham
Member

Being able to parse with normal JSON libraries makes sense.

Well the type information of the fill_value should match the data_type of the array, right?

Or is the concern the fill_value wouldn't be understood to be base64 encoded? In comment ( zarr-developers/zarr-python#216 (comment) ), the syntax below was proposed. Maybe there are alternative ways to capture this?

    "fill_value": {
        "base64": "..."
    }

Are there other concerns I've missed here?

@jbms
Contributor Author

jbms commented Jun 15, 2022

For fill_value the data type is already known so there isn't an issue there. zarr-python for v2 already uses a different encoding for fill_value: infinity is encoded as "Infinity" rather than Infinity.
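For reference, a v2 .zarray document then looks roughly like this (other required fields elided), and the dtype next to it makes the quoted string unambiguous:

    {
        ...,
        "dtype": "<f8",
        "fill_value": "Infinity"
    }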

I think it would be nice if fill_value can be expressed in a readable way but that isn't critical.

I intended this issue to be about user-defined attributes, though. For example, a common use case might be to store the min/max range of an array, or the bucket partitions for a histogram as attributes.

@jakirkham
Member

Ah sorry, I thought from our conversation earlier that this was fill_value focused.

That being said, the issue is more or less the same in either case. We are storing values in JSON that may not be representable using standard JSON types.

In these cases we do know how we would encode the data to write it in a binary file. That same strategy could be leveraged to binary-encode those values as if they were going into a chunk, then convert to base64 before storing them in JSON as a string (possibly with an object indicating the encoding, like above).

Understand the concern about human readability. With things like NaN and Infinity, we could quote them if that is sufficient. Zarr Python should be doing this currently; if it is not, that is a bug. We could also attempt to JSON-encode these values in a readable way where possible, only falling back to base64 when that is not an option.
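A rough sketch of that strategy, assuming numpy and the "base64" object wrapper proposed above; the value is serialized exactly as it would be in a chunk and then base64-encoded for JSON:

    import base64
    import json
    import numpy as np

    def encode_attr(value: np.ndarray) -> dict:
        # Serialize the value as chunk bytes would be, then wrap it in an
        # object that signals the encoding (the proposed "base64" key).
        return {"base64": base64.b64encode(value.tobytes()).decode("ascii")}

    def decode_attr(obj: dict, dtype: str) -> np.ndarray:
        # Decoding still needs a dtype from somewhere, e.g. the array metadata.
        return np.frombuffer(base64.b64decode(obj["base64"]), dtype=dtype)

    encoded = encode_attr(np.array([np.inf, np.nan, 1.5]))
    print(json.dumps(encoded))          # {"base64": "..."}
    print(decode_attr(encoded, "<f8"))  # [inf nan 1.5]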

@jbms
Contributor Author

jbms commented Jun 16, 2022

The difference between fill_value and user-defined attributes is that for fill_value, the data type is specified elsewhere in the metadata and can be used to decode whatever representation is used.

For example, we probably want to distinguish between the user storing an attribute "a" with a value of the string "Infinity" and the user storing an attribute "a" with a value of the floating-point number Infinity. If both are encoded as {"a": "Infinity"} then that is not possible.

For user-defined attributes, if we want to support arbitrary data values, then we need to decide on the data model and there needs to be a way to unambiguously encode them.

For example, we might choose a data model of JSON augmented with support for +/-Infinity, NaN, and byte strings.

If we want to stick with encoding attributes as valid JSON, we could define an escaping mechanism so that the following Python attribute dictionary:

dict(a="Infinity", b=b"some binary representation", c=float("inf")}

is encoded as the following JSON metadata representation:

{..., "attribute_escape_key": "xxx", "attributes": {"a": "Infinity", "b": {"xxx": "binary", "value": "base64 representation..."}, "c": {"xxx": "Infinity"}}

When encoding, the implementation must choose an escape key "xxx" that does not occur as an actual key in any attribute value that is a dictionary.
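A minimal sketch of the encoding side (the decoder would reverse these substitutions using the stored attribute_escape_key, and a real implementation would also verify the chosen escape key against existing keys):

    import base64
    import math

    def encode_attrs(attrs: dict, escape_key: str) -> dict:
        # Assumes the caller has already checked that escape_key does not
        # occur as a dict key anywhere inside attrs.
        def encode(value):
            if isinstance(value, bytes):
                return {escape_key: "binary",
                        "value": base64.b64encode(value).decode("ascii")}
            if isinstance(value, float) and math.isinf(value):
                return {escape_key: "Infinity" if value > 0 else "-Infinity"}
            if isinstance(value, float) and math.isnan(value):
                return {escape_key: "NaN"}
            if isinstance(value, dict):
                return {k: encode(v) for k, v in value.items()}
            if isinstance(value, list):
                return [encode(v) for v in value]
            return value
        return {k: encode(v) for k, v in attrs.items()}

    attrs = dict(a="Infinity", b=b"some binary representation", c=float("inf"))
    print(encode_attrs(attrs, escape_key="xxx"))
    # {'a': 'Infinity',
    #  'b': {'xxx': 'binary', 'value': '...base64...'},
    #  'c': {'xxx': 'Infinity'}}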

@joshmoore
Member

I generally like the idea of encoding things as objects if necessary. If possible, I'd prefer to have one clear way of identifying them, e.g. {"@type": "...", ...}.
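For example, one hypothetical shape this could take (attribute names and type strings purely illustrative):

    {
        "attributes": {
            "valid_max": {"@type": "float64", "value": "Infinity"},
            "thumbnail": {"@type": "bytes", "base64": "..."}
        }
    }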

@rabernat
Contributor

It would be great to get the Unidata perspective here. NetCDF supports typed attributes, so they have had to confront this already with NCZarr. cc @WardF and @DennisHeimbigner.

@DennisHeimbigner

First, note that netcdf-c does not allow multiple occurrences of an attribute with the same name, even if the types are different.

I am not sure if it is relevant, but NCZarr adds an extra attribute named _nczarr_attr whose value is a JSON dictionary that contains extra information about the other attributes defined by .zattrs. Specifically, it contains a key called "types" that maps each attribute name to a type such as "<i4".
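Based on that description, an illustrative (not authoritative) example of how it might sit alongside ordinary attributes in .zattrs, with attribute names invented for the example:

    {
        "valid_min": 0,
        "valid_max": 100,
        "_nczarr_attr": {
            "types": {"valid_min": "<i4", "valid_max": "<i4"}
        }
    }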

@joshmoore
Member

Good discussion this evening around this topic during the community meeting. Present company seems to be leaning towards a JSON-object-style encoding of one form or another. See the notes for more.

@jstriebel jstriebel added the core-protocol-v3.0 label Nov 16, 2022
@jstriebel jstriebel added this to ZEP1 Nov 16, 2022
@jstriebel jstriebel moved this to In Discussion in ZEP1 Nov 16, 2022
@jstriebel
Member

jstriebel commented Nov 23, 2022

I'm proposing to go forward with JSON encoding as the default for v3 for now, explicitly allowing arbitrary JSON values for a given key in the user attributes, where the key itself must be a string. This is effectively what the spec already defines, since JSON object literals must have strings as keys. See #173.

This currently does not allow good representations of inf/nan or byte strings in user attributes. (As mentioned above, there is special handling for fill values.) If somebody creates a PR to add this to the current spec, I think it would be valid to include it.

Alternative encodings to JSON may be added by extensions later, especially if user attributes may be stored in a separate document (see #72).

Just as a side note: YAML would support inf/nan/binary, as well as custom type definitions. It might be a useful extension, and IMO it is more human-readable than CBOR.
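A small sketch of that side note using PyYAML's yaml.safe_load, which resolves .inf/.nan to floats and the !!binary tag to bytes (attribute names are just examples):

    import yaml  # PyYAML

    doc = """
    maximum: .inf
    missing_value: .nan
    preview: !!binary |
      c29tZSBiaW5hcnkgZGF0YQ==
    """

    attrs = yaml.safe_load(doc)
    print(attrs["maximum"])        # inf
    print(attrs["missing_value"])  # nan
    print(attrs["preview"])        # b'some binary data'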
