Object dtype convenience API; datetime64/timedelta64 support #215

Merged
alimanfoo merged 15 commits into master from object-convenience-20171206
Dec 8, 2017

@alimanfoo alimanfoo commented Dec 6, 2017

Numcodecs 0.5.0 adds new codecs for variable-length text (unicode) strings (VLenUTF8), variable-length byte strings (VLenBytes) and variable-length arrays of primitive numpy types (VLenArray). These codecs all use the Parquet format byte array encoding. They perform well, produce compact encoded data, and provide a platform-independent encoding for these common cases.
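
For readers unfamiliar with that byte array encoding, here is a minimal pure-Python sketch of the idea. It assumes a layout of a 4-byte little-endian item count followed, for each item, by a 4-byte little-endian length and the UTF-8 bytes. The function names are hypothetical; the real VLenUTF8 codec lives in numcodecs and may differ in detail.

```python
import struct

def encode_vlen_utf8(strings):
    """Length-prefixed encoding: item count, then (length, bytes) per item."""
    parts = [struct.pack('<I', len(strings))]
    for s in strings:
        data = s.encode('utf-8')
        parts.append(struct.pack('<I', len(data)))  # byte length, not char count
        parts.append(data)
    return b''.join(parts)

def decode_vlen_utf8(buf):
    """Inverse of encode_vlen_utf8 (illustrative only)."""
    (n_items,) = struct.unpack_from('<I', buf, 0)
    offset = 4
    out = []
    for _ in range(n_items):
        (length,) = struct.unpack_from('<I', buf, offset)
        offset += 4
        out.append(buf[offset:offset + length].decode('utf-8'))
        offset += length
    return out

greetings = ['¡Hola mundo!', 'こんにちは世界']
encoded = encode_vlen_utf8(greetings)
decoded = decode_vlen_utf8(encoded)
```

Because every item carries its own length prefix, items of different sizes can share one contiguous buffer, and the result does not depend on the platform's native string representation.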

This PR leverages these new codecs to propose a convenience API for Zarr object arrays.

dtype=str

If dtype=str (or dtype=unicode on PY2) is provided, this is treated as shorthand for an array with dtype=object and object_codec=numcodecs.VLenUTF8(). E.g.:

In [1]: import zarr

In [2]: from numcodecs.tests.common import greetings

In [3]: z = zarr.array(greetings, dtype=str)

In [4]: z[:]
Out[4]: 
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
       'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
       '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

In [5]: z
Out[5]: <zarr.core.Array (12,) object>

In [6]: z.filters
Out[6]: [VLenUTF8()]

dtype=bytes

If dtype=bytes (or dtype=str on PY2) is provided, this is treated as shorthand for an array with dtype=object and object_codec=numcodecs.VLenBytes(). E.g.:

In [8]: z = zarr.array(greetings_bytes, dtype=bytes)

In [9]: z[:]
Out[9]: 
array([b'\xc2\xa1Hola mundo!', b'Hej V\xc3\xa4rlden!', b'Servus Woid!',
       b'Hei maailma!', b'Xin ch\xc3\xa0o th\xe1\xba\xbf gi\xe1\xbb\x9bi',
       b'Njatjeta Bot\xc3\xab!',
       b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\xba\xcf\x8c\xcf\x83\xce\xbc\xce\xb5!',
       b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe4\xb8\x96\xe7\x95\x8c',
       b'\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x8c\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x81',
       b'Hell\xc3\xb3, vil\xc3\xa1g!', b'Zdravo svete!',
       b'\xe0\xb9\x80\xe0\xb8\xae\xe0\xb8\xa5\xe0\xb9\x82\xe0\xb8\xa5\xe0\xb9\x80\xe0\xb8\xa7\xe0\xb8\xb4\xe0\xb8\xa5\xe0\xb8\x94\xe0\xb9\x8c'], dtype=object)

In [10]: z
Out[10]: <zarr.core.Array (12,) object>

In [11]: z.filters
Out[11]: [VLenBytes()]

dtype='array:T'

If dtype='array:T' is provided, this is treated as shorthand for an array with dtype=object and object_codec=numcodecs.VLenArray('T'). E.g.:

In [12]: z = zarr.array([[1, 5, 7], [4], [2, 9]], dtype='array:int')

In [13]: z[:]
Out[13]: array([array([1, 5, 7]), array([4]), array([2, 9])], dtype=object)

In [14]: z
Out[14]: <zarr.core.Array (3,) object>

In [15]: z.filters
Out[15]: [VLenArray(dtype='<i8')]

Extensibility/configuration

This is not something the average user will need to know, but the mapping from object types to object codecs can be updated to modify the default behaviour.

E.g., to change the default codec for str objects:

In [6]: zarr.util.object_codecs['str'] = 'msgpack'

In [7]: z = zarr.array(greetings, dtype=str)

In [8]: z
Out[8]: <zarr.core.Array (12,) object>

In [9]: z[:]
Out[9]: 
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
       'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
       '世界,你好!', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

In [10]: z.filters
Out[10]: [MsgPack(encoding='utf-8')]

To give another example, if dtype=object is provided without an object_codec, this raises an error by default:

In [11]: z = zarr.array([42, 'foo', ['bar', 'baz'], {'a': 'b'}], dtype=object)
---------------------------------------------------------------------------
...
ValueError: missing object_codec for object array

However, a default codec for the object class (or any class) can be set, e.g.:

In [12]: zarr.util.object_codecs['object'] = 'json'

In [13]: z = zarr.array([42, 'foo', ['bar', 'baz'], {'a': 'b'}], dtype=object)

In [14]: z[:]
Out[14]: array([42, 'foo', list(['bar', 'baz']), {'a': 'b'}], dtype=object)

In [15]: z
Out[15]: <zarr.core.Array (4,) object>

In [16]: z.filters
Out[16]: 
[JSON(encoding='utf-8', allow_nan=True, check_circular=True, ensure_ascii=True,
      indent=None, separators=(',', ':'), skipkeys=False, sort_keys=True,
      strict=True)]
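
The lookup behaviour shown in this section can be sketched in plain Python. This is a hypothetical stand-in, not zarr's actual code: the function name `resolve_object_codec`, the `registry` argument and the default codec ids are assumptions made for illustration.

```python
# Hypothetical registry mapping a type name to a codec id (assumed defaults).
object_codecs = {
    'str': 'vlen-utf8',
    'bytes': 'vlen-bytes',
    'array': 'vlen-array',
}

def resolve_object_codec(dtype, object_codec=None, registry=object_codecs):
    """Return (codec_id, codec_args) for an object-like dtype shorthand."""
    if object_codec is not None:
        return object_codec, ()            # an explicit codec always wins
    if isinstance(dtype, type):
        key, args = dtype.__name__, ()     # e.g. str -> 'str', object -> 'object'
    else:
        # 'array:int' -> key 'array', item type 'int'
        key, _, item = str(dtype).partition(':')
        args = (item,) if item else ()
    try:
        return registry[key], args
    except KeyError:
        raise ValueError('missing object_codec for object array') from None

registry = dict(object_codecs)             # work on a copy of the defaults
str_codec = resolve_object_codec(str, registry=registry)
array_codec = resolve_object_codec('array:int', registry=registry)

registry['str'] = 'msgpack'                # override the default for str
registry['object'] = 'json'                # add a default for plain objects
overridden = resolve_object_codec(str, registry=registry)
object_default = resolve_object_codec(object, registry=registry)
```

Keeping the mapping in an ordinary dict is what makes the defaults adjustable at runtime: changing one entry changes the codec chosen for every array created afterwards.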

TODO

  • Fix PY2 test failures.
  • Update tutorial.
  • Release notes.

Resolves #206.

@alimanfoo alimanfoo added the in progress Someone is currently working on this label Dec 6, 2017
@alimanfoo alimanfoo added this to the v2.2 milestone Dec 6, 2017
@alimanfoo
Member Author

alimanfoo commented Dec 6, 2017

cc @jakirkham, @shoyer, @rabernathy, @jcrist, @mrocklin, @martindurant, any comments welcome as always.

@martindurant
Member

I wonder if JSON should be used for "anything else" (not just arrays/lists), although this will be a rare and poor-performing path.

I notice the datetime warning in the code (it moved rather than changed) - is there any reason this isn't automatically stored as int64? Seems like xarrays may well want this.

@alimanfoo
Member Author

alimanfoo commented Dec 6, 2017 via email

@martindurant
Member

Oh no, the bytes and utf8 encodings are good. I meant JSON for everything else: although there are very good encodings that could be used, I doubt in practice that there's much data out there to make use of them.

I would imagine the to/from datetimes could be automated and would be one less thing for users to worry about. xarrays do store times, I imagine, especially for the coordinates.

@shoyer
Contributor

shoyer commented Dec 6, 2017

This looks fantastic, thank you!

I don't have a strong need for encoding datetime64 in zarr. We already have logic on the xarray side for encoding/decoding datetime64. This would potentially make sense to port into numcodecs but it already works OK.

@jakirkham
Member

jakirkham commented Dec 6, 2017

I think it would be a good idea to get that datetime logic into numcodecs and supported in zarr. That said, I gathered from past discussions that this was tricky to do correctly, so I would think it should be saved for subsequent PRs.

@shoyer
Contributor

shoyer commented Dec 6, 2017

The tricky thing about datetime decoding is that in xarray we need to be highly flexible at reading datetime metadata according to conventions used for existing netCDF files. If we were creating a codec for datetime64 from scratch, we could make something much simpler and more robust.

@jakirkham
Member

Sure. Makes sense. Maybe we should start a new issue to figure out what a datetime codec should look like over at Numcodecs?

@alimanfoo
Member Author

Re support for datetime64 and timedelta64, I previously decided to raise an exception for these because there was a technical issue regarding buffer access. However, since then I added a workaround for this in numcodecs. So it is actually now straightforward to add native support for these dtypes to zarr, without needing any special encoding. I know I'm overloading this PR a bit but it's a relatively small change and so I've gone ahead and done it. E.g., the following now works:

In [1]: import zarr

In [2]: z = zarr.array(['2017-09-12', '2015-12-22', '2017-01-02'], dtype='M8[D]')

In [3]: z[:]
Out[3]: array(['2017-09-12', '2015-12-22', '2017-01-02'], dtype='datetime64[D]')

In [4]: z
Out[4]: <zarr.core.Array (3,) datetime64[D]>

In [5]: z[0]
Out[5]: numpy.datetime64('2017-09-12')

In [6]: z[0] = '1999-12-31'

In [7]: z[:]
Out[7]: array(['1999-12-31', '2015-12-22', '2017-01-02'], dtype='datetime64[D]')
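
One way to see why no special encoding is needed: NumPy already holds datetime64 values as 64-bit integers counting units since 1970-01-01, so the raw bytes round-trip as long as the dtype string, including its units, is recorded alongside them. The following is a NumPy-only sketch of that idea and does not show zarr's or numcodecs' internals.

```python
import numpy as np

dates = np.array(['2017-09-12', '2015-12-22', '1970-01-02'], dtype='M8[D]')

# Reinterpret the same memory as int64: for unit 'D' this is days since the epoch.
as_ints = dates.view('i8')

# Simulate storage: keep the raw bytes plus the dtype string (units included).
raw = dates.tobytes()
dtype_str = dates.dtype.str            # '<M8[D]' on little-endian machines

# Simulate loading: read the bytes as int64, then view them with the stored dtype.
restored = np.frombuffer(raw, dtype='i8').view(dtype_str)
```

The units are part of what gives the integers meaning, which is why they have to travel with the data.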

I've also edited the storage spec to state that the units MUST be specified. Zarr raises a ValueError if you try to create an array with generic units:

In [9]: z = zarr.zeros(10, dtype='M8')
---------------------------------------------------------------------------
...
ValueError: datetime64 and timedelta64 dtypes with generic units are not supported, please specify units (e.g., "M8[ns]")
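
A check of this kind can be written with NumPy alone. The sketch below uses a hypothetical helper name, `check_time_units`, and is not necessarily how zarr implements it; it relies on `np.datetime_data`, which reports the unit of a datetime64 or timedelta64 dtype.

```python
import numpy as np

def check_time_units(dtype):
    """Reject datetime64/timedelta64 dtypes that carry no explicit unit."""
    dtype = np.dtype(dtype)
    if dtype.kind in 'Mm':                     # 'M' datetime64, 'm' timedelta64
        unit, _count = np.datetime_data(dtype)
        if unit == 'generic':
            raise ValueError(
                'datetime64 and timedelta64 dtypes with generic units are not '
                'supported, please specify units (e.g., "M8[ns]")'
            )
    return dtype

ok = check_time_units('M8[ns]')                # explicit units pass through
```

Other dtypes pass through unchanged, so the helper can sit in front of any dtype normalisation step.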

I guess this won't completely solve things for xarray where there may still be encoding issues to deal with, but still maybe useful functionality to have.

@alimanfoo alimanfoo changed the title Object dtype convenience API Object dtype convenience API; datetime64/timedelta64 support Dec 7, 2017
@alimanfoo
Member Author

alimanfoo commented Dec 7, 2017

Found some fill value issues which required a fix in numcodecs. Have added some tests and bumped the numcodecs requirement, tests are passing locally, will wait for CI. There is still an issue with fill values for object arrays in that currently only JSON-encodable fill values will work (#216), but it's hard to see a complete fix for that without modifying the storage spec and/or breaking compatibility with previous data, so I think it's better to just live with that for the moment. If CI looks OK and no objections I'll merge this tomorrow.
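
To make the fill value limitation concrete: array metadata is stored as JSON, so a fill value has to survive `json.dumps`. The helper below is a hypothetical illustration using only the standard library, not a function provided by zarr.

```python
import json

def is_json_encodable(value):
    """Return True if the value can be serialised by the standard json module."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True

text_ok = is_json_encodable('')            # a text fill value is fine
bytes_ok = is_json_encodable(b'')          # bytes are not JSON-serialisable
```

A fill value such as a bytes object or an arbitrary Python instance would need some extra encoding step in the metadata, which is where the compatibility concern comes from.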

@alimanfoo alimanfoo added the enhancement New features or improvements label Dec 7, 2017
@alimanfoo alimanfoo merged commit 89f4107 into master Dec 8, 2017
@alimanfoo alimanfoo deleted the object-convenience-20171206 branch December 8, 2017 08:43
@alimanfoo alimanfoo removed the in progress Someone is currently working on this label Dec 8, 2017