Object dtype convenience API; datetime64/timedelta64 support #215

alimanfoo · 2017-12-06T12:49:04Z

Numcodecs 0.5.0 adds some new codecs for variable length text (unicode) strings (VLenUTF8), variable length byte strings (VLenBytes) and variable length arrays of primitive numpy types (VLenArray). These codecs all use the Parquet format byte array encoding. The codecs have good performance and encoded data size and provide a platform-independent encoding for these common cases.
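To make the idea of a length-prefixed byte array encoding concrete, here is a minimal pure-Python sketch. It is not the numcodecs implementation. The function names are hypothetical, and the leading item-count header and little-endian 4-byte length prefixes are assumptions chosen for illustration.

```python
import struct

def encode_vlen_utf8(strings):
    """Encode a list of str as: item count, then (byte length, UTF-8 bytes) per item."""
    parts = [struct.pack("<I", len(strings))]  # assumed 4-byte little-endian item count
    for s in strings:
        data = s.encode("utf-8")
        parts.append(struct.pack("<I", len(data)))  # 4-byte length prefix
        parts.append(data)
    return b"".join(parts)

def decode_vlen_utf8(buf):
    """Invert encode_vlen_utf8."""
    (n,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    out = []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        out.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return out

greetings = ["¡Hola mundo!", "Hej Världen!", "こんにちは世界"]
encoded = encode_vlen_utf8(greetings)
roundtrip = decode_vlen_utf8(encoded)
```

Because every item carries its own length, the encoding does not depend on platform string representation or padding, which is what makes it portable.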

This PR leverages these new codecs to propose a convenience API for Zarr object arrays.

`dtype=str`

If dtype=str (or dtype=unicode on PY2) is provided, this is treated as a short-hand for an array with dtype=object and object_codec=numcodecs.VLenUTF8(). E.g.:

In [1]: import zarr

In [2]: from numcodecs.tests.common import greetings

In [3]: z = zarr.array(greetings, dtype=str)

In [4]: z[:]
Out[4]: 
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
       'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
       '世界，你好！', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

In [5]: z
Out[5]: <zarr.core.Array (12,) object>

In [6]: z.filters
Out[6]: [VLenUTF8()]

`dtype=bytes`

If dtype=bytes (or dtype=str on PY2) is provided, this is treated as a short-hand for an array with dtype=object and object_codec=numcodecs.VLenBytes(). E.g.:

In [8]: z = zarr.array(greetings_bytes, dtype=bytes)

In [9]: z[:]
Out[9]: 
array([b'\xc2\xa1Hola mundo!', b'Hej V\xc3\xa4rlden!', b'Servus Woid!',
       b'Hei maailma!', b'Xin ch\xc3\xa0o th\xe1\xba\xbf gi\xe1\xbb\x9bi',
       b'Njatjeta Bot\xc3\xab!',
       b'\xce\x93\xce\xb5\xce\xb9\xce\xb1 \xcf\x83\xce\xbf\xcf\x85 \xce\xba\xcf\x8c\xcf\x83\xce\xbc\xce\xb5!',
       b'\xe3\x81\x93\xe3\x82\x93\xe3\x81\xab\xe3\x81\xa1\xe3\x81\xaf\xe4\xb8\x96\xe7\x95\x8c',
       b'\xe4\xb8\x96\xe7\x95\x8c\xef\xbc\x8c\xe4\xbd\xa0\xe5\xa5\xbd\xef\xbc\x81',
       b'Hell\xc3\xb3, vil\xc3\xa1g!', b'Zdravo svete!',
       b'\xe0\xb9\x80\xe0\xb8\xae\xe0\xb8\xa5\xe0\xb9\x82\xe0\xb8\xa5\xe0\xb9\x80\xe0\xb8\xa7\xe0\xb8\xb4\xe0\xb8\xa5\xe0\xb8\x94\xe0\xb9\x8c'], dtype=object)

In [10]: z
Out[10]: <zarr.core.Array (12,) object>

In [11]: z.filters
Out[11]: [VLenBytes()]

`dtype='array:T'`

If dtype='array:T' is provided, this is treated as a short-hand for an array with dtype=object and object_codec=numcodecs.VLenArray('T'). E.g.:

In [12]: z = zarr.array([[1, 5, 7], [4], [2, 9]], dtype='array:int')

In [13]: z[:]
Out[13]: array([array([1, 5, 7]), array([4]), array([2, 9])], dtype=object)

In [14]: z
Out[14]: <zarr.core.Array (3,) object>

In [15]: z.filters
Out[15]: [VLenArray(dtype='<i8')]
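As a rough illustration of how the 'array:T' shorthand could be interpreted, the sketch below splits the string and builds the ragged object array with plain NumPy. `parse_array_shorthand` is a hypothetical helper, not a zarr function, and the actual parsing in this PR may differ.

```python
import numpy as np

def parse_array_shorthand(dtype):
    """Hypothetical helper: return the item dtype for 'array:T', or None otherwise."""
    if isinstance(dtype, str) and dtype.startswith("array:"):
        return np.dtype(dtype[len("array:"):])
    return None

item_dtype = parse_array_shorthand("array:int")
print(item_dtype.str)  # platform-dependent byte order and item size

# Fill an object array element by element so NumPy never tries to
# broadcast the ragged rows into a 2-D array.
ragged = [[1, 5, 7], [4], [2, 9]]
data = np.empty(len(ragged), dtype=object)
for i, row in enumerate(ragged):
    data[i] = np.array(row, dtype=item_dtype)
```

The resolved item dtype is what ends up as the `dtype` argument of the codec, which is why the filter above reports a concrete dtype such as '<i8' rather than 'int'.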

Extensibility/configuration

This is not something the average user will need to know, but the mapping from object types to object codecs can be updated to modify the default behaviour.

E.g., to change the default codec for str objects:

In [6]: zarr.util.object_codecs['str'] = 'msgpack'

In [7]: z = zarr.array(greetings, dtype=str)

In [8]: z
Out[8]: <zarr.core.Array (12,) object>

In [9]: z[:]
Out[9]: 
array(['¡Hola mundo!', 'Hej Världen!', 'Servus Woid!', 'Hei maailma!',
       'Xin chào thế giới', 'Njatjeta Botë!', 'Γεια σου κόσμε!', 'こんにちは世界',
       '世界，你好！', 'Helló, világ!', 'Zdravo svete!', 'เฮลโลเวิลด์'], dtype=object)

In [10]: z.filters
Out[10]: [MsgPack(encoding='utf-8')]

To give another example, if dtype=object is provided but no object_codec is provided, this raises an error by default:

In [11]: z = zarr.array([42, 'foo', ['bar', 'baz'], {'a': 'b'}], dtype=object)
---------------------------------------------------------------------------
...
ValueError: missing object_codec for object array

However, a default codec for the object class (or any class) can be set, e.g.:

In [12]: zarr.util.object_codecs['object'] = 'json'

In [13]: z = zarr.array([42, 'foo', ['bar', 'baz'], {'a': 'b'}], dtype=object)

In [14]: z[:]
Out[14]: array([42, 'foo', list(['bar', 'baz']), {'a': 'b'}], dtype=object)

In [15]: z
Out[15]: <zarr.core.Array (4,) object>

In [16]: z.filters
Out[16]: 
[JSON(encoding='utf-8', allow_nan=True, check_circular=True, ensure_ascii=True,
      indent=None, separators=(',', ':'), skipkeys=False, sort_keys=True,
      strict=True)]
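The lookup behaviour described in this section can be pictured with a small stand-alone sketch. The dictionary and `resolve_object_codec` below are hypothetical stand-ins, and the default codec identifiers are assumptions; the real mapping is `zarr.util.object_codecs`, whose contents are not shown in this thread.

```python
# Hypothetical registry mapping a type name to a codec identifier.
object_codecs = {
    "str": "vlen-utf8",     # assumed default id for text strings
    "bytes": "vlen-bytes",  # assumed default id for byte strings
}

def resolve_object_codec(type_name, explicit_codec=None):
    """Pick a codec: an explicit argument wins, then the registry, else fail."""
    if explicit_codec is not None:
        return explicit_codec
    try:
        return object_codecs[type_name]
    except KeyError:
        raise ValueError("missing object_codec for object array")

default_for_str = resolve_object_codec("str")

# Overriding a default, as in the msgpack example above.
object_codecs["str"] = "msgpack"
overridden_for_str = resolve_object_codec("str")

# Registering a default for plain object arrays, as in the json example.
object_codecs["object"] = "json"
default_for_object = resolve_object_codec("object")
```

Keeping the mapping as a plain mutable dictionary means no new API is needed to change defaults, at the cost of the change being global to the process.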

TODO

Fix PY2 test failures.
Update tutorial.
Release notes.

Resolves #206.

alimanfoo · 2017-12-06T12:58:51Z

cc @jakirkham, @shoyer, @rabernathy, @jcrist, @mrocklin, @martindurant, any comments welcome as always.

martindurant · 2017-12-06T15:36:35Z

I wonder if JSON should be used for "anything else" (not just arrays/lists), although this will be a rare and poor-performing path.

I notice the datetime warning in the code (it moved rather than changed) - is there any reason this isn't automatically stored as int64? Seems like xarrays may well want this.

alimanfoo · 2017-12-06T16:30:12Z

> I wonder if JSON should be used for "anything else" (not just arrays/lists), although this will be a rare and poor-performing path.

Do you mean use JSON as the default codec if dtype=object? I did wonder about that, although if an array contains objects that cannot be JSON encoded (e.g., byte strings) then the user will get a fairly cryptic error about not being able to encode a value as JSON, and it might not be very obvious they need to solve their problem by choosing a different object codec. But I'm open to discussion.
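The kind of failure mentioned here is easy to reproduce with the standard library `json` module, used below as a stand-in for the numcodecs JSON codec (an assumption; the codec's own error may be worded differently).

```python
import json

message = None
try:
    # A byte string has no JSON representation, so encoding fails.
    json.dumps([42, "foo", b"bar"])
except TypeError as exc:
    message = str(exc)

print(message)
```

The error names the offending type but says nothing about object codecs, which is the usability concern raised above.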

> I notice the datetime warning in the code (it moved rather than changed) - is there any reason this isn't automatically stored as int64? Seems like xarrays may well want this.

FWIW there is a way to get a view of a zarr int64 array as datetime64, see http://zarr.readthedocs.io/en/master/tutorial.html#datetimes-and-timedeltas. I'm open to other ways of handling this if it's an important requirement and this doesn't cut the mustard.
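For readers unfamiliar with the int64 view approach, the underlying idea can be shown with NumPy alone. This is a sketch of the reinterpretation, not of the zarr API described in the linked tutorial.

```python
import numpy as np

dates = np.array(["2017-09-12", "2015-12-22", "2017-01-02"], dtype="M8[D]")

# datetime64[D] values are 64-bit counts of days since 1970-01-01, so the
# same memory can be reinterpreted as int64 and back without copying.
as_ints = dates.view("i8")
as_dates = as_ints.view("M8[D]")
```

The units are not stored in the integer data, so whoever reads the int64 array has to know to view it as 'M8[D]' rather than, say, 'M8[ns]'.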


martindurant · 2017-12-06T18:32:54Z

Oh no, the bytes and utf8 encodings are good, I meant for everything else, because although there are very good encodings that could be used, I doubt in practice that there's much data out there to make use of them.

I would imagine the to/from datetimes could be automated and would be one less thing for users to worry about. xarrays do store times, I imagine, especially for the coordinates.

shoyer · 2017-12-06T18:57:21Z

This looks fantastic, thank you!

I don't have a strong need for encoding datetime64 in zarr. We already have logic on the xarray side for encoding/decoding datetime64. This would potentially make sense to port into numcodecs but it already works OK.

jakirkham · 2017-12-06T19:16:05Z

I think it would be a good idea to get that datetime logic into numcodecs and supported in zarr. That said, I gathered from past discussions that this was tricky to do correctly. So would think it should be saved for subsequent PRs.

shoyer · 2017-12-06T19:24:50Z

The tricky thing about datetime decoding is that in xarray we need to be highly flexible at reading datetime metadata according to conventions used for existing netCDF files. If we were creating a codec for datetime64 from scratch, we could make something much simpler and more robust.

jakirkham · 2017-12-06T19:37:20Z

Sure. Makes sense. Maybe we should start a new issue to figure out what a datetime codec should look like over at Numcodecs?

alimanfoo · 2017-12-07T01:46:52Z

Re support for datetime64 and timedelta64, I previously decided to raise an exception for these because there was a technical issue regarding buffer access. However, since then I added a workaround for this in numcodecs. So it is actually now straightforward to add native support for these dtypes to zarr, without needing any special encoding. I know I'm overloading this PR a bit but it's a relatively small change and so I've gone ahead and done it. E.g., the following now works:

In [1]: import zarr

In [2]: z = zarr.array(['2017-09-12', '2015-12-22', '2017-01-02'], dtype='M8[D]')

In [3]: z[:]
Out[3]: array(['2017-09-12', '2015-12-22', '2017-01-02'], dtype='datetime64[D]')

In [4]: z
Out[4]: <zarr.core.Array (3,) datetime64[D]>

In [5]: z[0]
Out[5]: numpy.datetime64('2017-09-12')

In [6]: z[0] = '1999-12-31'

In [7]: z[:]
Out[7]: array(['1999-12-31', '2015-12-22', '2017-01-02'], dtype='datetime64[D]')

I've also edited the storage spec to state that the units MUST be specified. Zarr raises a ValueError if you try to create an array with generic units:

In [9]: z = zarr.zeros(10, dtype='M8')
---------------------------------------------------------------------------
...
ValueError: datetime64 and timedelta64 dtypes with generic units are not supported, please specify units (e.g., "M8[ns]")
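A check along these lines can be written with `numpy.datetime_data`, which reports the unit of a datetime64 or timedelta64 dtype. `check_time_units` is a hypothetical helper for illustration, and the check in zarr itself may be implemented differently.

```python
import numpy as np

def check_time_units(dtype):
    """Return the normalized dtype, rejecting date/time dtypes without units."""
    dtype = np.dtype(dtype)
    if dtype.kind in "mM":  # 'm' is timedelta64, 'M' is datetime64
        unit, _count = np.datetime_data(dtype)
        if unit == "generic":
            raise ValueError(
                "datetime64 and timedelta64 dtypes with generic units are "
                'not supported, please specify units (e.g., "M8[ns]")'
            )
    return dtype

ok = check_time_units("M8[ns]")

try:
    check_time_units("M8")
    rejected = False
except ValueError:
    rejected = True
```

Requiring explicit units avoids storing values whose meaning depends on how a reader chooses to interpret them.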

I guess this won't completely solve things for xarray where there may still be encoding issues to deal with, but still maybe useful functionality to have.

alimanfoo · 2017-12-07T12:41:58Z

Found some fill value issues which required a fix in numcodecs. Have added some tests and bumped the numcodecs requirement, tests are passing locally, will wait for CI. There is still an issue with fill values for object arrays in that currently only JSON-encodable fill values will work (#216), but it's hard to see a complete fix for that without modifying the storage spec and/or breaking compatibility with previous data, so I think it's better to just live with that for the moment. If CI looks OK and no objections I'll merge this tomorrow.

alimanfoo added 3 commits December 6, 2017 09:15

bump numcodecs

116d168

add vlen tests

e584f68

add object dtype convenience API

ee4a515

alimanfoo added the "in progress" label Dec 6, 2017

alimanfoo added this to the v2.2 milestone Dec 6, 2017

alimanfoo added 8 commits December 6, 2017 23:09

fix categorize warnings

3865229

fix categorize warnings

7a680e2

comment

31b67e8

support datetime64 and timedelta64

e59822a

require date/time units

bf376e8

numcodecs version bump

4c5e164

modify tutorial for datetime support

31d3c6d

finish up datetime support

e385d27

alimanfoo added 4 commits December 7, 2017 09:27

edit release notes

6748354

add refs

2bb9f79

edit tutorial on strings and objects; bump numcodecs

f380374

extra object tests

f272f88

alimanfoo changed the title ~~Object dtype convenience API~~ Object dtype convenience API; datetime64/timedelta64 support Dec 7, 2017

alimanfoo added the "enhancement" label Dec 7, 2017

alimanfoo merged commit 89f4107 into master Dec 8, 2017

alimanfoo deleted the object-convenience-20171206 branch December 8, 2017 08:43

alimanfoo removed the "in progress" label Dec 8, 2017
