Object dtype convenience API; datetime64/timedelta64 support #215
cc @jakirkham, @shoyer, @rabernathy, @jcrist, @mrocklin, @martindurant, any comments welcome as always. |
I wonder if JSON should be used for "anything else" (not just arrays/lists), although this will be a rare and poor-performing path. I notice the datetime warning in the code (it moved rather than changed) - is there any reason this isn't automatically stored as int64? Seems like xarrays may well want this. |
> I wonder if JSON should be used for "anything else" (not just
> arrays/lists), although this will be a rare and poor-performing path.
Do you mean use JSON as the default codec if dtype=object? I did wonder
about that, although if an array contains objects that cannot be JSON
encoded (e.g., byte strings) then the user will get a fairly cryptic error
about not being able to encode a value as JSON, and it might not be very
obvious they need to solve their problem by choosing a different object
codec. But I'm open to discussion.
> I notice the datetime warning in the code (it moved rather than changed) -
> is there any reason this isn't automatically stored as int64? Seems like
> xarrays may well want this.
FWIW there is a way to get a view of a zarr int64 array as datetime64, see
http://zarr.readthedocs.io/en/master/tutorial.html#datetimes-and-timedeltas
I'm open to other ways of handling this if it's an important requirement
and this doesn't cut the mustard.
…--
Alistair Miles
|
Oh no, the bytes and utf8 encodings are good, I meant for everything else, because although there are very good encodings that could be used, I doubt in practice that there's much data out there to make use of them. I would imagine the to/from datetimes could be automated and would be one less thing for users to worry about. xarrays do store times, I imagine, especially for the coordinates. |
This looks fantastic, thank you! I don't have a strong need for encoding datetime64 in zarr. We already have logic on the xarray side for encoding/decoding datetime64. This would potentially make sense to port into numcodecs but it already works OK. |
I think it would be a good idea to get that datetime logic into numcodecs and supported in zarr. That said, I gathered from past discussions that this was tricky to do correctly, so I would think it should be saved for subsequent PRs. |
The tricky thing about datetime decoding is that in xarray we need to be highly flexible at reading datetime metadata according to conventions used for existing netCDF files. If we were creating a codec for datetime64 from scratch, we could make something much simpler and more robust. |
Sure. Makes sense. Maybe we should start a new issue to figure out what a datetime codec should look like over at Numcodecs? |
Re support for datetime64 and timedelta64, I previously decided to raise an exception for these because there was a technical issue regarding buffer access. However, since then I added a workaround for this in numcodecs. So it is actually now straightforward to add native support for these dtypes to zarr, without needing any special encoding. I know I'm overloading this PR a bit but it's a relatively small change and so I've gone ahead and done it. E.g., the following now works:
In [1]: import zarr
In [2]: z = zarr.array(['2017-09-12', '2015-12-22', '2017-01-02'], dtype='M8[D]')
In [3]: z[:]
Out[3]: array(['2017-09-12', '2015-12-22', '2017-01-02'], dtype='datetime64[D]')
In [4]: z
Out[4]: <zarr.core.Array (3,) datetime64[D]>
In [5]: z[0]
Out[5]: numpy.datetime64('2017-09-12')
In [6]: z[0] = '1999-12-31'
In [7]: z[:]
Out[7]: array(['1999-12-31', '2015-12-22', '2017-01-02'], dtype='datetime64[D]')
I've also edited the storage spec to state that the units MUST be specified. Zarr raises a value error if you try to create an array with generic units:
In [9]: z = zarr.zeros(10, dtype='M8')
---------------------------------------------------------------------------
...
ValueError: datetime64 and timedelta64 dtypes with generic units are not supported, please specify units (e.g., "M8[ns]")
I guess this won't completely solve things for xarray where there may still be encoding issues to deal with, but still maybe useful functionality to have. |
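One way to see why explicit units matter: a datetime64 value is stored as an integer count since the Unix epoch, and the same date maps to different integers under different units. The following is a minimal stdlib sketch of that idea only. The helper `to_epoch_count` and the `UNITS_PER_DAY` table are hypothetical names for illustration and are not part of zarr, numcodecs or numpy.

```python
from datetime import date

EPOCH = date(1970, 1, 1)

# Hypothetical table of fixed-length units, expressed as units per day.
UNITS_PER_DAY = {"D": 1, "h": 24, "m": 24 * 60, "s": 24 * 60 * 60}


def to_epoch_count(iso_date, unit):
    """Return the integer count of `unit` between the Unix epoch and `iso_date`."""
    if unit not in UNITS_PER_DAY:
        # Without a unit the integer cannot be interpreted, which is the
        # motivation for rejecting generic units.
        raise ValueError("units must be specified, e.g. 'D' or 's'; got %r" % (unit,))
    days = (date.fromisoformat(iso_date) - EPOCH).days
    return days * UNITS_PER_DAY[unit]


print(to_epoch_count("1970-01-02", "D"))  # 1
print(to_epoch_count("1970-01-02", "s"))  # 86400
```

The same calendar date yields 1 or 86400 depending on the unit, so an integer stored without its unit is ambiguous.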
Found some fill value issues which required a fix in numcodecs. Have added some tests and bumped the numcodecs requirement, tests are passing locally, will wait for CI. There is still an issue with fill values for object arrays in that currently only JSON-encodable fill values will work (#216), but it's hard to see a complete fix for that without modifying the storage spec and/or breaking compatibility with previous data, so I think it's better to just live with that for the moment. If CI looks OK and no objections I'll merge this tomorrow. |
Numcodecs 0.5.0 adds some new codecs for variable length text (unicode) strings (VLenUTF8), variable length byte strings (VLenBytes) and variable length arrays of primitive numpy types (VLenArray). These codecs all use the Parquet format byte array encoding. The codecs have good performance and encoded data size and provide a platform-independent encoding for these common cases.
This PR leverages these new codecs to propose a convenience API for Zarr object arrays.
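To make the idea of a length-prefixed byte array encoding concrete, here is a minimal stdlib sketch. It assumes a layout of a 4-byte little-endian item count followed by, for each item, a 4-byte little-endian length and the UTF-8 bytes. The function names are hypothetical and the exact byte layout used by numcodecs may differ, so treat this as an illustration of the technique rather than a compatible implementation.

```python
import struct


def encode_vlen_utf8(items):
    """Encode a list of str as: item count, then (length, bytes) per item."""
    encoded = [s.encode("utf-8") for s in items]
    parts = [struct.pack("<I", len(encoded))]  # 4-byte item count
    for b in encoded:
        parts.append(struct.pack("<I", len(b)))  # 4-byte length prefix
        parts.append(b)
    return b"".join(parts)


def decode_vlen_utf8(buf):
    """Inverse of encode_vlen_utf8."""
    (n_items,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(n_items):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return items


data = ["foo", "", "héllo"]
assert decode_vlen_utf8(encode_vlen_utf8(data)) == data
```

Because every item carries its own length, strings of different sizes can share one buffer without padding, and the encoding does not depend on platform byte order.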
dtype=str
If dtype=str (or dtype=unicode on PY2) is provided, this is treated as a short-hand for an array with dtype=object and object_codec=numcodecs.VLenUTF8().
dtype=bytes
If dtype=bytes (or dtype=str on PY2) is provided, this is treated as a short-hand for an array with dtype=object and object_codec=numcodecs.VLenBytes().
dtype='array:T'
If dtype='array:T' is provided, this is treated as a short-hand for an array with dtype=object and object_codec=numcodecs.VLenArray('T').
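The three short-hands can be thought of as one normalization step that turns the requested dtype into an object dtype plus a codec choice. The sketch below shows that mapping in plain Python on PY3. The function `resolve_object_dtype` and the config dictionaries it returns are hypothetical stand-ins for illustration and are not the code in this PR.

```python
def resolve_object_dtype(dtype):
    """Map a dtype short-hand to (dtype, codec config or None).

    Hypothetical helper: the returned dicts only stand in for the
    VLenUTF8, VLenBytes and VLenArray codecs named above.
    """
    if dtype is str:
        return "object", {"id": "vlen-utf8"}
    if dtype is bytes:
        return "object", {"id": "vlen-bytes"}
    if isinstance(dtype, str) and dtype.startswith("array:"):
        item_type = dtype[len("array:"):]
        if not item_type:
            raise ValueError("missing item type in %r" % (dtype,))
        return "object", {"id": "vlen-array", "dtype": item_type}
    # Anything else is passed through unchanged with no object codec.
    return dtype, None


print(resolve_object_dtype(str))
print(resolve_object_dtype("array:i4"))
print(resolve_object_dtype("f8"))
```

Keeping the short-hand resolution in one place means the rest of the array creation path only ever sees dtype=object together with an explicit codec.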
Extensibility/configuration
This is not something the average user will need to know, but the mapping from object types to object codecs can be updated to modify the default behaviour.
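As a rough illustration of an updatable mapping from object types to default codecs, consider the sketch below. The class `ObjectCodecRegistry`, its methods and the codec name strings are all hypothetical placeholders. The actual name and location of the mapping in zarr are not shown in this thread, so none of this should be read as the real API.

```python
class ObjectCodecRegistry:
    """Hypothetical registry of default codec names keyed by Python class."""

    def __init__(self):
        # Assumed starting defaults, mirroring the short-hands described above.
        self._defaults = {str: "vlen-utf8", bytes: "vlen-bytes"}

    def set_default(self, cls, codec_name):
        """Register or replace the default codec for a class."""
        self._defaults[cls] = codec_name

    def lookup(self, cls, object_codec=None):
        """Return the explicit codec if given, else the default for cls."""
        if object_codec is not None:
            return object_codec
        if cls not in self._defaults:
            raise ValueError("no object_codec given and no default for %s" % cls.__name__)
        return self._defaults[cls]


registry = ObjectCodecRegistry()

# Change the default codec for str objects (placeholder codec name).
registry.set_default(str, "json")

# With no default registered for the object class, lookup fails.
try:
    registry.lookup(object)
except ValueError as exc:
    print(exc)

# After registering a default for object, the same lookup succeeds.
registry.set_default(object, "pickle")
print(registry.lookup(object))
```

An explicit object_codec always wins over the registered default in this sketch, so changing the mapping only affects callers that leave the codec unspecified.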
E.g., the default codec for str objects can be changed. To give another example, if dtype=object is provided but no object_codec is provided, this usually raises an error. However, a default codec for the object class (or any class) can be set.
TODO
Resolves #206.