Storing Dask Arrays directly to Zarr Group fails #251

Closed
jakirkham opened this issue Mar 28, 2018 · 0 comments · Fixed by #2393

Comments

@jakirkham
Member

The gist of the issue is that when an array-like object is stored to a Zarr Group, Zarr reads the object's chunks attribute to decide how to chunk the new array. On a Dask Array, chunks is not a single chunk shape that applies to the whole array. Instead it lists the size of every chunk along each dimension, and those sizes need not be uniform. Zarr understandably stumbles over this, as the format is not what it expects.

Even if Zarr could parse the Dask chunking format, there is still the question of what to do with non-uniform chunk sizes. There are two main options to consider: support non-uniform chunking in Zarr ( https://github.com/zarr-developers/zarr/issues/245 ) and/or rechunk Dask Arrays to be uniform ( dask/dask#3302 ). So there are things to think about on both fronts. This issue should give both of those issues more context.
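
To make the uniform case concrete, here is a minimal sketch of how Dask-style chunks could be collapsed into the single chunk shape Zarr expects. The helper name uniform_chunks is hypothetical and is not part of Zarr or Dask. It assumes a dimension counts as uniform when all chunks match the first one, with a shorter trailing chunk allowed, and it raises otherwise (the case that would need rechunking or non-uniform chunk support).

```python
def uniform_chunks(dask_chunks):
    """Collapse Dask-style per-dimension chunk sizes into one chunk shape.

    Hypothetical helper, not part of Zarr or Dask. `dask_chunks` is a tuple
    with one tuple of chunk sizes per dimension, e.g. ((10, 10, 5), (20,)).
    Raises ValueError when a dimension is not uniformly chunked.
    """
    shape = []
    for dim, sizes in enumerate(dask_chunks):
        if not sizes:
            raise ValueError(f"dimension {dim} has no chunks")
        first, *rest = sizes
        # Every interior chunk must match the first; the last may be shorter.
        interior, last = (rest[:-1], rest[-1]) if rest else ([], first)
        if any(s != first for s in interior) or last > first:
            raise ValueError(f"dimension {dim} is not uniformly chunked: {sizes}")
        shape.append(first)
    return tuple(shape)


# Chunks of the array from the example in this issue: shape (100, 110), chunks=10.
print(uniform_chunks(((10,) * 10, (10,) * 11)))
```

A result like this could be passed as the chunks argument when creating the Zarr array, while a ValueError would signal that the Dask Array has to be rechunked first.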

cc @mrocklin


Minimal, reproducible code sample, a copy-pastable example if possible

In [1]: import zarr

In [2]: import dask.array as da

In [3]: z = zarr.open_group("test.zarr")

In [4]: a = da.random.random((100, 110), chunks=10)

In [5]: z["a"] = a
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-142459babd67> in <module>()
----> 1 z["a"] = a

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in __setitem__(self, item, value)
    335 
    336     def __setitem__(self, item, value):
--> 337         self.array(item, value, overwrite=True)
    338 
    339     def __delitem__(self, item):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in array(self, name, data, **kwargs)
    908         """Create an array. Keyword arguments as per
    909         :func:`zarr.creation.array`."""
--> 910         return self._write_op(self._array_nosync, name, data, **kwargs)
    911 
    912     def _array_nosync(self, name, data, **kwargs):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in _write_op(self, f, *args, **kwargs)
    628 
    629         with lock:
--> 630             return f(*args, **kwargs)
    631 
    632     def create_group(self, name, overwrite=False):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/hierarchy.py in _array_nosync(self, name, data, **kwargs)
    915         kwargs.setdefault('cache_attrs', self.attrs.cache)
    916         return array(data, store=self._store, path=path, chunk_store=self._chunk_store,
--> 917                      **kwargs)
    918 
    919     def empty_like(self, name, data, **kwargs):

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/creation.py in array(data, **kwargs)
    336 
    337     # instantiate array
--> 338     z = create(**kwargs)
    339 
    340     # fill with data

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/creation.py in create(shape, chunks, dtype, compressor, fill_value, order, store, synchronizer, overwrite, path, chunk_store, filters, cache_metadata, cache_attrs, read_only, object_codec, **kwargs)
    117     init_array(store, shape=shape, chunks=chunks, dtype=dtype, compressor=compressor,
    118                fill_value=fill_value, order=order, overwrite=overwrite, path=path,
--> 119                chunk_store=chunk_store, filters=filters, object_codec=object_codec)
    120 
    121     # instantiate array

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/storage.py in init_array(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    311                          order=order, overwrite=overwrite, path=path,
    312                          chunk_store=chunk_store, filters=filters,
--> 313                          object_codec=object_codec)
    314 
    315 

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/storage.py in _init_array_metadata(store, shape, chunks, dtype, compressor, fill_value, order, overwrite, path, chunk_store, filters, object_codec)
    332     shape = normalize_shape(shape)
    333     dtype, object_codec = normalize_dtype(dtype, object_codec)
--> 334     chunks = normalize_chunks(chunks, shape, dtype.itemsize)
    335     order = normalize_order(order)
    336     fill_value = normalize_fill_value(fill_value, dtype)

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/util.py in normalize_chunks(chunks, shape, typesize)
    122     # handle None in chunks
    123     chunks = tuple(s if c is None else int(c)
--> 124                    for s, c in zip(shape, chunks))
    125 
    126     return chunks

/zopt/conda2/envs/test/lib/python3.6/site-packages/zarr/util.py in <genexpr>(.0)
    122     # handle None in chunks
    123     chunks = tuple(s if c is None else int(c)
--> 124                    for s, c in zip(shape, chunks))
    125 
    126     return chunks

TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'
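
The failure can be reproduced without Dask or Zarr installed. The sketch below copies the expression from zarr/util.py shown in the traceback into a stand-in function (normalize_chunks_excerpt is a name made up for this illustration) and feeds it the nested tuple that Dask reports as a.chunks for the array above.

```python
shape = (100, 110)

# What Dask reports as `a.chunks` for shape (100, 110) with chunks=10:
# one tuple of chunk sizes per dimension.
dask_chunks = ((10,) * 10, (10,) * 11)


def normalize_chunks_excerpt(chunks, shape):
    # Stand-in for the failing lines of zarr.util.normalize_chunks (Zarr 2.2.0),
    # copied from the traceback. The function name is made up for this sketch.
    return tuple(s if c is None else int(c) for s, c in zip(shape, chunks))


try:
    normalize_chunks_excerpt(dask_chunks, shape)
    error = None
except TypeError as exc:
    # int() is handed a whole tuple of chunk sizes instead of a single size.
    error = exc

print(type(error).__name__)

# A plain chunk shape, one int per dimension, passes through unchanged.
print(normalize_chunks_excerpt((10, 10), shape))
```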

Problem description

Ideally, we would be able to store Dask Arrays to Zarr with little more than __setitem__. In practice this doesn't work. That said, this borders more on a feature request than a bug report. Given that Dask Arrays are really not NumPy arrays, we may need a from_dask_array method.
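
As a rough illustration of what such a method would need, the sketch below turns Dask-style chunks into the index region covered by each block. No from_dask_array method exists in Zarr at the time of writing, and block_slices is a hypothetical helper, not an API of either library. The assumption is that a block-wise writer would compute each block and assign it to the matching region of the target array.

```python
from itertools import accumulate, product


def block_slices(dask_chunks):
    """Return one tuple of slices per block, in row-major block order.

    Hypothetical helper for illustration only. `dask_chunks` holds one tuple
    of chunk sizes per dimension, and the sizes may be non-uniform.
    """
    per_dim = []
    for sizes in dask_chunks:
        stops = list(accumulate(sizes))   # running end offset of each chunk
        starts = [0] + stops[:-1]         # each chunk starts where the previous ended
        per_dim.append([slice(a, b) for a, b in zip(starts, stops)])
    return list(product(*per_dim))


blocks = block_slices(((10,) * 10, (10,) * 11))
print(len(blocks))
print(blocks[0])
print(blocks[-1])
```

Because the regions are derived from the per-chunk sizes rather than from a single chunk shape, the same loop would cover non-uniform chunks as well, although writes that do not line up with the target's chunk grid would then need synchronization.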

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.2.0
  • Value of numcodecs.__version__: 0.5.4
  • Version of Python interpreter: 3.6.4
  • Operating system (Linux/Windows/Mac): Mac
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): conda

Also, if you think it might be relevant, please provide the output from pip freeze or
conda env export depending on which was used to install Zarr.

name: test
channels:
- conda-forge
- defaults
dependencies:
- appnope=0.1.0=py36_0
- asciitree=0.3.3=py36_1
- blas=1.1=openblas
- bokeh=0.12.14=py36_1
- ca-certificates=2018.1.18=0
- certifi=2018.1.18=py36_0
- click=6.7=py_1
- cloudpickle=0.5.2=py_0
- cytoolz=0.9.0.1=py36_0
- dask=0.17.2=py_0
- dask-core=0.17.2=py_0
- decorator=4.2.1=py36_0
- distributed=1.21.4=py36_0
- fasteners=0.14.1=py36_2
- heapdict=1.0.0=py36_0
- ipython=6.2.1=py36_1
- ipython_genutils=0.2.0=py36_0
- jedi=0.11.1=py36_0
- jinja2=2.10=py36_0
- libgfortran=3.0.0=0
- locket=0.2.0=py36_1
- markupsafe=1.0=py36_0
- monotonic=1.4=py36_0
- msgpack-python=0.5.5=py36_0
- ncurses=5.9=10
- numcodecs=0.5.4=py36_0
- numpy=1.14.2=py36_blas_openblas_200
- openblas=0.2.20=7
- openssl=1.0.2n=0
- packaging=17.1=py_0
- pandas=0.22.0=py36_0
- parso=0.1.1=py_0
- partd=0.3.8=py36_0
- pexpect=4.4.0=py36_0
- pickleshare=0.7.4=py36_0
- prompt_toolkit=1.0.15=py36_0
- psutil=5.4.3=py36_0
- ptyprocess=0.5.2=py36_0
- pygments=2.2.0=py36_0
- pyparsing=2.2.0=py36_0
- python=3.6.4=0
- python-dateutil=2.7.1=py_0
- pytz=2018.3=py_0
- pyyaml=3.12=py36_1
- readline=7.0=0
- setuptools=39.0.1=py36_0
- simplegeneric=0.8.1=py36_0
- six=1.11.0=py36_1
- sortedcontainers=1.5.9=py36_0
- sqlite=3.20.1=2
- tblib=1.3.2=py36_0
- tk=8.6.7=0
- toolz=0.9.0=py_0
- tornado=5.0.1=py36_1
- traitlets=4.3.2=py36_0
- wcwidth=0.1.7=py36_0
- xz=5.2.3=0
- yaml=0.1.7=0
- zarr=2.2.0=py_1
- zict=0.1.3=py_0
- zlib=1.2.11=0