can't have numpy datatypes in attributes #156

Open
rabernat opened this issue Oct 8, 2017 · 11 comments
Labels
enhancement New features or improvements

Comments

@rabernat
Contributor

rabernat commented Oct 8, 2017

We are working on the zarr backend for xarray (pydata/xarray#1528). xarray likes to put all kinds of weird stuff into attributes, including numpy datatypes and even numpy arrays. This is because the netCDF data model allows attributes to have all of the same types as variables.

In zarr, by contrast, attributes have to be JSON-serializable, so this doesn't work:

import numpy as np
import zarr
za = zarr.create(shape=(1,), store='tmp_file')
za.attrs['foo'] = np.float32(0)  # NumPy scalar, not a native Python float

It raises TypeError: Object of type 'float32' is not JSON serializable.

We will need some sort of workaround for this in order to make zarr work as a store for xarray.
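
One possible caller-side workaround, sketched under assumptions: convert NumPy scalars and arrays to native Python objects before assigning them as attributes. The helper name `to_jsonable` is hypothetical, and `json.dumps` stands in here for the JSON serialization that zarr applies to attributes. The conversion is lossy, since the original dtype (for example float32) is not recorded.

```python
import json

import numpy as np


def to_jsonable(value):
    """Recursively convert NumPy scalars and arrays to plain Python objects."""
    if isinstance(value, np.generic):    # NumPy scalar such as np.float32(0)
        return value.item()
    if isinstance(value, np.ndarray):    # NumPy array -> (nested) list
        return value.tolist()
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value                         # already a plain Python object


# Hypothetical attribute dictionary of the kind xarray might produce.
attrs = {"foo": np.float32(0), "levels": np.arange(3)}
encoded = json.dumps(to_jsonable(attrs))
```

Values such as datetime64 are not handled by this sketch and would need their own encoding.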

@alimanfoo
Member

alimanfoo commented Oct 8, 2017 via email

alimanfoo added the enhancement (New features or improvements) label Nov 21, 2017

@jakirkham
Member

Is this still of interest, @rabernat?

@chairmank

chairmank commented Jun 21, 2018

I am very interested in this issue. I need to store exact binary values and datetime objects as attributes. To work around the limitations of JSON, I currently encode these attributes as strings and put the burden on the consumer of the data to correctly decode them to the actual data types. This is not ideal. Ideally, any data type that is valid for an array ought to be valid for an attribute (like it is in the netCDF model).
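
The string-encoding workaround described above might look like the following minimal sketch. The attribute name `acquired_at` and both helper functions are hypothetical; the point is that the reader must know, out of band, which attributes to decode and how.

```python
import json
from datetime import datetime, timezone


def encode_datetime(value: datetime) -> str:
    """Store a datetime as an ISO 8601 string, which JSON can hold."""
    return value.isoformat()


def decode_datetime(text: str) -> datetime:
    """The consumer has to know this attribute is a datetime and parse it."""
    return datetime.fromisoformat(text)


stamp = datetime(2018, 6, 21, 12, 30, tzinfo=timezone.utc)

# Writer side: the attribute is stored as an ordinary JSON string.
attrs_json = json.dumps({"acquired_at": encode_datetime(stamp)})

# Reader side: nothing in the JSON says the string is a datetime.
restored = decode_datetime(json.loads(attrs_json)["acquired_at"])
```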

This issue seems to be related to #244 and #216.

One approach that might address both issues is to allow .zarray and .zattrs to use a binary serialization format (e.g. using numcodecs.MsgPack), the same way that arbitrary variable-length array elements can be encoded.

@jakirkham
Member

> Ideally, any data type that is valid for an array ought to be valid for an attribute (like it is in the netCDF model)

Could you please elaborate on this point a bit? What sorts of things are you imagining storing here?

@chairmank

chairmank commented Jun 21, 2018

I would like to store as attributes any of the data types described in the "Data Type Encoding" section of the Zarr specification.

Specifically, in my real-world usage, I have encountered inconvenience with attribute values that are

  • datetime64 and timedelta
  • Floating-point numbers that I need to represent with exact precision (e.g. f8 versus f4), which JSON doesn't distinguish
    • A special problem is NaN, which has an exact representation as a Zarr/NumPy floating-point value but cannot be represented in JSON
  • Structs like [('R','u1'), ('G','u1'), ('B','u1'), ('A','u1')]
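
The floating-point items in the list above can be illustrated with the standard library alone. This is a sketch, not a proposal for zarr's format: the `{"dtype": ..., "base64": ...}` layout, the `FORMATS` table, and the helper names are assumptions made for illustration. It shows that strict JSON rejects NaN, and that carrying the raw little-endian bytes alongside a type code preserves an `f4` value exactly.

```python
import base64
import json
import math
import struct

# Strict JSON has no NaN literal; Python only emits one as a non-standard
# extension, so allow_nan=False makes the encoder refuse it.
try:
    json.dumps(float("nan"), allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True

# Hypothetical mapping from type codes to struct formats (little-endian).
FORMATS = {"<f4": "<f", "<f8": "<d"}


def encode_float(value: float, dtype: str) -> dict:
    """Pack the value at the stated precision and keep the exact bytes."""
    raw = struct.pack(FORMATS[dtype], value)
    return {"dtype": dtype, "base64": base64.b64encode(raw).decode("ascii")}


def decode_float(doc: dict) -> float:
    """Rebuild the value from its bytes, using the recorded type code."""
    raw = base64.b64decode(doc["base64"])
    return struct.unpack(FORMATS[doc["dtype"]], raw)[0]


# Round-trip through real JSON text for each case.
as_f4 = decode_float(json.loads(json.dumps(encode_float(0.1, "<f4"))))
as_f8 = decode_float(json.loads(json.dumps(encode_float(0.1, "<f8"))))
nan_back = decode_float(json.loads(json.dumps(encode_float(math.nan, "<f4"))))
```

Here `as_f4` and `as_f8` differ because 0.1 rounds differently at the two precisions, which is exactly the distinction a bare JSON number cannot record.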

I am also excited by the possibility of storing attributes that are arbitrary objects, such as JSON documents, although I haven't expressly encountered this requirement yet.

It is worth noting that, in NetCDF, attribute values are really 1-dimensional arrays:

An attribute has an associated variable (the null "global variable" for a global or group-level attribute), a name, a data type, a length, and a value. The current version treats all attributes as vectors; scalar values are treated as single-element vectors.

@jakirkham
Member

Sorry for the very long delay.

> It is worth noting that, in NetCDF, attribute values are really 1-dimensional arrays...

This is a really great point. It does raise a question, though: would the best way to represent this data be an array whose attributes hold array values, or a group containing many arrays?

@jewfro-cuban

jewfro-cuban commented Dec 27, 2018

I'm using xarray/zarr and find the attribute handling constraining as well.
I would like to suggest json-tricks:

https://json-tricks.readthedocs.io/en/latest/

It uses the same API as json and covers many of the common use cases.

@alimanfoo
Member

Thanks @jewfro-cuban, I didn't know about json-tricks, looks nice. The encoding format seems generally very sensible, although I guess we'd want to avoid supporting arbitrary class instances, since that is a potential security issue.

Is there a way we could just depend on json-tricks, but with __instance_type__ disabled?

@jakirkham
Member

> Is there a way we could just depend on json-tricks, but with __instance_type__ disabled?

We could always check if that shows up in the result and error out if so.
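
The check suggested above could be sketched with the standard library's `json` module, independent of json-tricks itself. The function name `loads_without_instances` is hypothetical; the only fact taken from the discussion is that the `__instance_type__` key marks an encoded class instance.

```python
import json

FORBIDDEN_KEY = "__instance_type__"


def _reject_instances(obj: dict) -> dict:
    """object_hook: called for every decoded JSON object, innermost first."""
    if FORBIDDEN_KEY in obj:
        raise ValueError(f"attribute JSON must not contain {FORBIDDEN_KEY!r}")
    return obj


def loads_without_instances(text: str):
    """Parse JSON, erroring out if any object carries the forbidden key."""
    return json.loads(text, object_hook=_reject_instances)


# Ordinary nested attributes pass through unchanged.
safe = loads_without_instances('{"units": "m/s^2", "nested": {"n": 1}}')

# A document that tries to smuggle in a class instance is refused,
# even when the marker is nested inside another object.
try:
    loads_without_instances('{"a": {"__instance_type__": ["os", "system"]}}')
    rejected = False
except ValueError:
    rejected = True
```

Because the hook runs on every object during parsing, the marker is caught at any nesting depth without a separate walk over the result.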

@nritsche

nritsche commented Apr 12, 2021

> I would like to store as attributes any of the data types described in the "Data Type Encoding" section of the Zarr specification.
>
> Specifically, in my real-world usage, I have encountered inconvenience with attribute values that are
>
>   • datetime64 and timedelta

I encountered the same problem, and I would like to add that for me it would be enough if I could pass a custom JSONDecoder to zarr. zarr would just need to accept it as an argument to open_group etc. (see https://github.com/zarr-developers/zarr-python/pull/533/files).
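
As a rough illustration of what such a decoder could look like, here is a minimal sketch of a `json.JSONEncoder`/`json.JSONDecoder` pair for `numpy.datetime64` and `numpy.timedelta64`. The class names and the `__numpy__` tag are invented for this example, and nothing here assumes that zarr accepts a decoder argument; the pair is exercised with plain `json.dumps` and `json.loads`.

```python
import json

import numpy as np


class NumpyTimeEncoder(json.JSONEncoder):
    """Encode datetime64/timedelta64 scalars as tagged JSON objects."""

    def default(self, o):
        if isinstance(o, np.datetime64):
            # dtype.str (e.g. '<M8[D]') records the unit; str(o) is ISO text.
            return {"__numpy__": o.dtype.str, "value": str(o)}
        if isinstance(o, np.timedelta64):
            # Store the integer count; the unit lives in the dtype string.
            return {"__numpy__": o.dtype.str, "value": int(o.astype("int64"))}
        return super().default(o)


class NumpyTimeDecoder(json.JSONDecoder):
    """Turn the tagged objects back into NumPy scalars."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, object_hook=self._hook, **kwargs)

    @staticmethod
    def _hook(obj):
        tag = obj.get("__numpy__")
        if tag is None:
            return obj
        dtype = np.dtype(tag)
        if dtype.kind == "M":   # datetime64
            return np.datetime64(obj["value"]).astype(dtype)
        if dtype.kind == "m":   # timedelta64
            unit = np.datetime_data(dtype)[0]
            return np.timedelta64(obj["value"], unit)
        return obj


original = {"start": np.datetime64("2017-10-08"), "step": np.timedelta64(90, "s")}
text = json.dumps(original, cls=NumpyTimeEncoder)
restored = json.loads(text, cls=NumpyTimeDecoder)
```

The tag makes the attribute self-describing, so a reader with the matching decoder does not need to know in advance which attributes hold times.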

@miccoli

miccoli commented Aug 6, 2022

I was recently hit by this very same problem with HDF5 files, which also allow array-valued attributes.

For example, h5dump shows

         ATTRIBUTE "data channels" {
            DATATYPE  H5T_STD_I64LE
            DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
            DATA {
            (0): 1, 2, 3, 4
            }
         }
         ATTRIBUTE "data units" {
            DATATYPE  H5T_STRING {
               STRSIZE 8;
               STRPAD H5T_STR_NULLPAD;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
            DATA {
            (0): "N       ", "m/s^2   ", "m/s^2   ", "m/s^2   "
            }
         }

which are rendered by h5py as

>>> data.attrs['data channels']
array([1, 2, 3, 4])
>>> data.attrs['data units']
array([b'N       ', b'm/s^2   ', b'm/s^2   ', b'm/s^2   '], dtype='|S8')

When converting from HDF5 to Zarr, zarr.copy_all fails with

TypeError: Object of type ndarray is not JSON serializable

Since I have a bunch of files to convert, I implemented a quick fix in miccoli/zarr-python@380ee7c07.

I'm not sure if this is of general interest, but if there is enough interest I can open a PR.

Open question:

  • Should the np.ndarray -> list mapping simply be hardcoded, or would it be better to let the user override the default JSONEncoder?
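
Both options in the open question can be sketched together: a `json.JSONEncoder` subclass that hardcodes the `np.ndarray` -> list mapping, and which a user could subclass in turn to override the behaviour. This is an illustrative sketch and not the code in the linked commit; the class name `NumpyArrayEncoder` is hypothetical, and decoding fixed-width byte strings as ASCII is an assumption based on the h5dump output above.

```python
import json

import numpy as np


class NumpyArrayEncoder(json.JSONEncoder):
    """Serialize NumPy arrays as (nested) JSON lists."""

    def default(self, o):
        if isinstance(o, np.ndarray):
            if o.dtype.kind == "S":
                # Fixed-width byte strings: decode to text first, because
                # JSON has no bytes type. ASCII is assumed here.
                o = np.char.decode(o, "ascii")
            return o.tolist()
        # Anything else falls through to the standard TypeError.
        return super().default(o)


# Values modelled on the h5py attributes shown above.
channels = np.array([1, 2, 3, 4])
units = np.array([b"N       ", b"m/s^2   "], dtype="S8")

encoded = json.dumps(
    {"data channels": channels, "data units": units},
    cls=NumpyArrayEncoder,
)
```

The mapping is one-way: on reading, the attributes come back as plain lists, and the original dtype and the bytes/text distinction are lost.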

See also #933 and #533

7 participants