Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add minimal drafts of a few more dtype extensions #135

Conversation

grlee77
Copy link
Contributor

@grlee77 grlee77 commented Mar 2, 2022

This adds draft extensions for the remaining dtypes supported for Zarr v2 arrays in zarr-python that are not part of the core v3 spec. Merging this would at least give us a URL to point to for these dtypes in zarr-developers/zarr-python#898.

The language here will eventually need updating to be more spec-like and more implementation neutral (e.g. less NumPy/Python specific). What is there now, at least describes the current behavior in zarr-python.

In the last community call @jbms (from from the tensorstore team) mentioned that NumPy's fixed width unicode arrays are not very desirable and variable length strings are likely of more interest. There is not a variable length string dtype in NumPy, but variable length strings can be stored as object arrays (H5Py for example does this). I know Pandas 1.0 also introduced their own StringDType, but I haven't looked into the specifics of that.

is to have an array where each element is another array (and each array can
have a different length). Another use case is to store an array of variable
length strings. It is important to note that such an array actually just stores the references to the Python objects and not the objects themselves. Accessing
an element of the array returns the Python object it refers to.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should specify how this data type is actually encoded in the store --- presumably via pickling.

In general this sort of encoding issue might fall under the domain of "filters", which are not yet part of zarr v3 but might be important to add.

In general I have a number of concerns about this data type:

  • The pickle format by its nature is inherently tied to Python and cannot be supported by non-Python implementations.
  • By the nature of the format, the Python unpickle operation can lead to arbitrary code execution and therefore requires that the source data be fully trusted. I think if it is supported it is likely to be used out of convenience even in situations that don't call for that level of trust. In particular, if it is available by default in zarr-python then opening any untrusted zarr array becomes a potential security vulnerability.
  • In cases where the intent is not to store an arbitrary Python object, but e.g. just a variable length string or json value, just using the "object" type leaves the actual type unspecified.

I think it would be much better instead to decouple the zarr v3 data types from numpy data types --- and define as explicit zarr v3 data types:

  • variable length unicode and byte strings
  • json values

I think it would be best to avoid supporting a general Python object type at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: These variable length strings and json values when stored in a numpy array would have to be stored using the numpy object data type. But that does not need to match the data type of the zarr array.

Copy link
Contributor Author

@grlee77 grlee77 Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that the extension should not be specifically for general Python objects and needs to be revised substantially. We actually did not have any tests in core zarr-python involving object arrays, but I found that they were used in the Xarray test suite. Specifically this method on the DatasetIOBase class that the Zarr backend tests are a subclass of.

-------------------------------------------

These are fixed width strings corresponding to NumPy dtypes with `kind` 'S'.
For backward compatibility with Python 2's ``str`` these are zero-terminated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Zero-terminated strings normally refers to the C null-terminated string convention where the length is not explicitly indicated and instead the end of the string is indicated by a '\0' value. I would call this zero-padded rather than zero-terminated, since there is no zero terminator if the string is exactly equal to the specified length.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was a bit confused regarding this as the NumPy dtype docs refer to this type as "zero-terminated bytes".

- Numerical type
- Size (no. bytes)
- Byte order
* - list of (<name>, <type>) tuples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NumPy (and zarr v2) supports a third variant, where you have (name, type, shape). The shape specifies the dimensions of an "inner" array of elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had forgotten to list that one and will update it here to describe an optional [, shape]. I think we do test it already.

Copy link
Member

@joshmoore joshmoore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to have these discussions started, @grlee77! Also 👍 for @jbms' comments, but I could also see getting this into the dev branch before they are resolved as with zarr-developers/zarr-python#898 so that we have more people working from the same starting point.

@@ -11,6 +11,9 @@ Under construction.
extensions/filters/v1.0
extensions/complex-dtypes/v1.0
extensions/datetime-dtypes/v1.0
extensions/object-dtypes/v1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly as a side note as we work out the process, we may want to lower these numbers depending in our confidence and the intended interpretation.

@jbms jbms mentioned this pull request May 4, 2022
@joshmoore
Copy link
Member

joshmoore commented May 6, 2022

Wonderfully, this has no conflicts despite the recent updates! I'm merging and will leave the comments above unresolved for capturing as the editing of v3 moves forward. (If anyone feels moved to create issues, please just cross-link)

Note: as mentioned in gitter, merging PRs into core-protocol-v3.0-dev doesn't imply that a feature is final.

@joshmoore joshmoore merged commit e889419 into zarr-developers:core-protocol-v3.0-dev May 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants