-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add minimal drafts of a few more dtype extensions #135
Add minimal drafts of a few more dtype extensions #135
Conversation
is to have an array where each element is another array (and each array can | ||
have a different length). Another use case is to store an array of variable | ||
length strings. It is important to note that such an array actually just stores the references to the Python objects and not the objects themselves. Accessing | ||
an element of the array returns the Python object it refers to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should specify how this data type is actually encoded in the store --- presumably via pickling.
In general this sort of encoding issue might fall under the domain of "filters", which are not yet part of zarr v3 but might be important to add.
In general I have a number of concerns about this data type:
- The pickle format by its nature is inherently tied to Python and cannot be supported by non-Python implementations.
- By the nature of the format, the Python unpickle operation can lead to arbitrary code execution and therefore requires that the source data be fully trusted. I think if it is supported it is likely to be used out of convenience even in situations that don't call for that level of trust. In particular, if it is available by default in zarr-python then opening any untrusted zarr array becomes a potential security vulnerability.
- In cases where the intent is not to store an arbitrary Python object, but e.g. just a variable length string or json value, just using the "object" type leaves the actual type unspecified.
I think it would be much better instead to decouple the zarr v3 data types from numpy data types --- and define as explicit zarr v3 data types:
- variable length unicode and byte strings
- json values
I think it would be best to avoid supporting a general Python object type at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: These variable length strings and json values when stored in a numpy array would have to be stored using the numpy object data type. But that does not need to match the data type of the zarr array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that the extension should not be specifically for general Python objects and needs to be revised substantially. We actually did not have any tests in core zarr-python involving object arrays, but I found that they were used in the Xarray test suite. Specifically this method on the DatasetIOBase
class that the Zarr backend tests are a subclass of.
------------------------------------------- | ||
|
||
These are fixed width strings corresponding to NumPy dtypes with `kind` 'S'. | ||
For backward compatibility with Python 2's ``str`` these are zero-terminated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zero-terminated strings normally refers to the C null-terminated string convention where the length is not explicitly indicated and instead the end of the string is indicated by a '\0'
value. I would call this zero-padded rather than zero-terminated, since there is no zero terminator if the string is exactly equal to the specified length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was a bit confused regarding this as the NumPy dtype docs refer to this type as "zero-terminated bytes".
- Numerical type | ||
- Size (no. bytes) | ||
- Byte order | ||
* - list of (<name>, <type>) tuples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NumPy (and zarr v2) supports a third variant, where you have (name, type, shape)
. The shape specifies the dimensions of an "inner" array of elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I had forgotten to list that one and will update it here to describe an optional [, shape]. I think we do test it already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to have these discussions started, @grlee77! Also 👍 for @jbms' comments, but I could also see getting this into the dev branch before they are resolved as with zarr-developers/zarr-python#898 so that we have more people working from the same starting point.
@@ -11,6 +11,9 @@ Under construction. | |||
extensions/filters/v1.0 | |||
extensions/complex-dtypes/v1.0 | |||
extensions/datetime-dtypes/v1.0 | |||
extensions/object-dtypes/v1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly as a side note as we work out the process, we may want to lower these numbers depending in our confidence and the intended interpretation.
Wonderfully, this has no conflicts despite the recent updates! I'm merging and will leave the comments above unresolved for capturing as the editing of v3 moves forward. (If anyone feels moved to create issues, please just cross-link) Note: as mentioned in gitter, merging PRs into core-protocol-v3.0-dev doesn't imply that a feature is final. |
This adds draft extensions for the remaining dtypes supported for Zarr v2 arrays in zarr-python that are not part of the core v3 spec. Merging this would at least give us a URL to point to for these dtypes in zarr-developers/zarr-python#898.
The language here will eventually need updating to be more spec-like and more implementation neutral (e.g. less NumPy/Python specific). What is there now, at least describes the current behavior in zarr-python.
In the last community call @jbms (from from the tensorstore team) mentioned that NumPy's fixed width unicode arrays are not very desirable and variable length strings are likely of more interest. There is not a variable length string dtype in NumPy, but variable length strings can be stored as object arrays (H5Py for example does this). I know Pandas 1.0 also introduced their own
StringDType
, but I haven't looked into the specifics of that.