Description
Basically the title, but what I'm referring to is this:
```python
In [2]: da = xr.DataArray([[0, 1], [2, 3]], name='foo').chunk(1)

In [3]: ds = da.to_dataset()

In [4]: da.chunks
Out[4]: ((1, 1), (1, 1))

In [5]: ds.chunks
Out[5]: Frozen({'dim_0': (1, 1), 'dim_1': (1, 1)})
```
Why does `DataArray.chunks` return a tuple while `Dataset.chunks` returns a frozen dictionary?
This seems a bit silly, for a few reasons:

- it means that some perfectly reasonable code might fail unnecessarily if passed a DataArray instead of a Dataset or vice versa, such as

  ```python
  def is_core_dim_chunked(obj, core_dim):
      return len(obj.chunks[core_dim]) > 1
  ```

  which will work as intended for a Dataset but raises a `TypeError` for a DataArray.
- it breaks the pattern we use for `.sizes`, where

  ```python
  In [14]: da.sizes
  Out[14]: Frozen({'dim_0': 2, 'dim_1': 2})

  In [15]: ds.sizes
  Out[15]: Frozen({'dim_0': 2, 'dim_1': 2})
  ```
- if you want the chunks as a tuple they are always accessible via `da.data.chunks`, which is a more sensible place to look to find the chunks without dimension names.
- it's an undocumented difference: the docstrings for `ds.chunks` and `da.chunks` both only say `"""Block dimensions for this dataset's data or None if it's not a dask array."""`, which doesn't tell me anything about the return type, or warn me that the return types are different.
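For reference, a dimension-name-aware variant of the helper above can tolerate both return types. This is only a workaround sketch, not a proposed fix; it assumes the object exposes `.dims` and `.chunks` the way DataArray/Dataset do:

```python
from collections.abc import Mapping

def is_core_dim_chunked(obj, core_dim):
    """Like the helper above, but tolerant of both return types.

    Assumes `obj` has `.dims` and `.chunks`; the tuple branch papers
    over the DataArray/Dataset inconsistency described in this issue.
    """
    chunks = obj.chunks
    if not isinstance(chunks, Mapping):
        # DataArray case: a tuple of chunk-tuples aligned with obj.dims
        chunks = dict(zip(obj.dims, chunks))
    return len(chunks[core_dim]) > 1
```

Having to write this normalization in user code is exactly the friction the inconsistency creates.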
EDIT: In fact `DataArray.chunk` doesn't even appear to be listed on the API docs page at all.
In our codebase this difference is mostly washed out by us using `._to_temp_dataset()` all the time, and also by the way that the `.chunk()` method accepts both the tuple and dict forms, so both of these invariants hold (but in different ways):

```python
ds == ds.chunk(ds.chunks)
da == da.chunk(da.chunks)
```
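The reason both invariants can hold is that a tuple spec plus the dimension order carries the same information as the dict spec. A minimal sketch of that equivalence, with xarray left out and the dimension names assumed known:

```python
def normalize_chunk_spec(spec, dims):
    """Canonicalize a chunk spec to a {dim_name: chunks} dict.

    `spec` may be a tuple aligned with `dims` (the DataArray.chunks
    style) or a mapping keyed by dimension name (the Dataset.chunks
    style); both describe the same chunking.
    """
    if isinstance(spec, dict):
        return {d: tuple(spec[d]) for d in dims}
    return {d: tuple(c) for d, c in zip(dims, spec)}

dims = ("dim_0", "dim_1")
# the two forms from the session at the top of this issue
tuple_form = ((1, 1), (1, 1))
dict_form = {"dim_0": (1, 1), "dim_1": (1, 1)}
assert normalize_chunk_spec(tuple_form, dims) == normalize_chunk_spec(dict_form, dims)
```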
I'm not sure whether making this consistent is worth the effort of a significant breaking change though 😕
(Sort of related to #2103)