Concatenate arrays with varchunks #374
I would really like to see some success examples, even if based on POC on POCs, to help justify the whole idea! One thing I have been meaning to check: I believe that passing a zarr array with complex chunks to dask will do the right thing, since it just reads the |
I've got virtual concatenation of sparse arrays working. Dataframes should be easier. Unfortunately I can't use this with existing stores (data has to be rewritten) since the chunk boundaries need to be exact. Dask does not seem to work with this out of the box; traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[21], line 1
----> 1 da.from_zarr(result_group["data"], chunks=result_group["data"].chunks)
File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3600, in from_zarr(url, component, storage_options, chunks, name, inline_array, **kwargs)
3598 if name is None:
3599 name = "from-zarr-" + tokenize(z, component, storage_options, chunks, **kwargs)
-> 3600 return from_array(z, chunks, name=name, inline_array=inline_array)
File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3483, in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta, inline_array)
3479 asarray = not hasattr(x, "__array_function__")
3481 previous_chunks = getattr(x, "chunks", None)
-> 3483 chunks = normalize_chunks(
3484 chunks, x.shape, dtype=x.dtype, previous_chunks=previous_chunks
3485 )
3487 if name in (None, True):
3488 token = tokenize(x, chunks, lock, asarray, fancy, getitem, inline_array)
File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3098, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
3095 chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
3097 if shape is not None:
-> 3098 chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))
3100 if chunks and shape is not None:
3101 chunks = sum(
3102 (
3103 blockdims_from_blockshape((s,), (c,))
(...)
3108 (),
3109 )
File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3098, in <genexpr>(.0)
3095 chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
3097 if shape is not None:
-> 3098 chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))
3100 if chunks and shape is not None:
3101 chunks = sum(
3102 (
3103 blockdims_from_blockshape((s,), (c,))
(...)
3108 (),
3109 )
TypeError: unhashable type: 'list'
But we can do something like:
da.from_array(
    zarr_array,
    chunks=tuple(tuple(c) if isinstance(c, list) else c for c in zarr_array.chunks),
)
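To make the failure and the workaround concrete, here is a minimal sketch in plain Python, with no dask or zarr required. The helper name `to_hashable_chunks` and the sample chunk specs are hypothetical, chosen for illustration. The sketch assumes a variable-chunked array reports its chunks as a tuple whose entries are either an int (regular chunking) or a list of chunk lengths, as in the comment above.

```python
def to_hashable_chunks(chunks):
    """Turn list-valued chunk entries into tuples; leave ints untouched.

    Hypothetical helper mirroring the one-liner in the comment above.
    """
    return tuple(tuple(c) if isinstance(c, list) else c for c in chunks)


# Assumed shape of a variable-chunk spec: one entry per axis.
variable_spec = ([2, 2, 3], 5)

# The traceback shows dask testing each entry with `c not in {None, -1}`.
# Set membership hashes the entry, and a list is unhashable.
try:
    [c not in {None, -1} for c in variable_spec]
    membership_failed = False
except TypeError:
    membership_failed = True

converted = to_hashable_chunks(variable_spec)
# After conversion the same membership test runs without error.
checks = [c not in {None, -1} for c in converted]
```

A tuple of tuples is also the explicit form that dask accepts for irregular chunks, which is why the conversion is enough to get past `normalize_chunks`.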
I was just looking at (cc @rsignell-usgs ) |
I can run the example from the gist. Now trying to understand how I can adapt this to my use-case. @ivirshup What does |
@NikosAlexandris, that's the |
Updated to allow inference of variable chunked output from input with fixed chunking. E.g. can now concatenate arrays like:
[zarr.ones(4, chunks=(2,)), zarr.ones(3, chunks=(3,))]
# result has chunking ([2, 2, 3],)
Should there be a switch so users can turn this off? It would probably be better to error on this input if you know downstream consumers won't be able to handle the output.
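The inference described here can be sketched in plain Python for the 1-D case. This is an illustration and not the PR's implementation: the function name `concat_chunks_1d` and the `allow_variable` flag are hypothetical, the flag standing in for the "switch" asked about above. The sketch also assumes, following the earlier remark that chunk boundaries need to be exact, that an input whose length is not a multiple of its chunk size is rejected.

```python
def concat_chunks_1d(lengths, chunk_sizes, allow_variable=True):
    """Chunk lengths along the concatenation axis for fixed-chunk 1-D inputs.

    lengths:      length of each input array
    chunk_sizes:  fixed chunk size of each input array
    allow_variable: hypothetical switch; when False, raise instead of
                    returning a layout with differing chunk lengths.
    """
    out = []
    for n, c in zip(lengths, chunk_sizes):
        if n % c != 0:
            # Assumption of this sketch: a partial trailing chunk would not
            # line up with an exact chunk boundary, so it is refused.
            raise ValueError(f"length {n} is not a multiple of chunk size {c}")
        out.extend([c] * (n // c))
    if not allow_variable and len(set(out)) > 1:
        raise ValueError("inputs would produce variable-length chunks")
    return out


# Same inputs as the example above: lengths 4 and 3, chunk sizes 2 and 3.
layout = concat_chunks_1d([4, 3], [2, 3])
print(layout)  # [2, 2, 3]
```

With `allow_variable=False` the same call raises, which is one way a user could opt out when downstream consumers cannot read variable chunks.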
It would fix my problem if this works!! I tried to apply your approach, but I think I missed something and I cannot apply it to my workflow... @martindurant any thoughts?
Allow kerchunk-based concatenation of zarr arrays with variable-length chunks. This is mostly to allow me to play around with some downstream use cases.
This is basically feature complete at the moment. Remaining tasks are largely polish (error handling, further testing) or upstream (ZEP 3 approval).
concatenate_arrays with (slightly) different array shapes #305 (mostly)
Works off of zarr-developers/zarr-python#1483
TODO
Upstream tasks
This PR