Description
I may be missing something, but zarr currently seems to offer a high-level interface to chunked data that largely hides the fact that the underlying data is chunked (much like HDF5). I've got a few use cases where I'd like to be able to operate directly on the underlying chunks rather than on the whole large dataset.
My target application is multi-dimensional image processing on large 3D biomedical datasets, where there are many situations in which it would make sense to perform operations on individual chunks in parallel.
Whilst it would be possible to a) read the chunk size of an array and b) work out how to slice the dataset in multiples of that chunk size, direct chunk-level access might be easier and more efficient. In essence I'm suggesting something which exposes a simplified version of Array._chunk_getitem and Array._chunk_setitem that would only ever get or set a complete chunk.
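For reference, the slicing workaround in a) and b) can be sketched roughly as follows. This is only a minimal illustration, assuming a regular chunk grid; `chunk_slices` is a hypothetical helper name, not part of zarr, and it works purely from a shape and a chunk shape.

```python
from itertools import product
from math import ceil


def chunk_slices(shape, chunks):
    """Yield (chunk_id, selection) for every chunk of a regular chunk grid.

    Hypothetical helper: `shape` and `chunks` are plain tuples of ints,
    such as the ones reported by an array's shape and chunk shape.
    """
    # Number of chunks along each axis; the last chunk may be partial.
    grid = [ceil(s / c) for s, c in zip(shape, chunks)]
    for chunk_id in product(*(range(n) for n in grid)):
        # Clip the stop index so edge chunks do not run past the array bounds.
        selection = tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(chunk_id, chunks, shape)
        )
        yield chunk_id, selection
```

With a zarr array `z`, something like `for chunk_id, sel in chunk_slices(z.shape, z.chunks): block = z[sel]` would then read one chunk-aligned region at a time, though it still goes through the normal indexing path rather than touching the stored chunk directly.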
If you extended this concept somewhat by having the functions return a lightweight Chunk object (essentially a reference to the location of the array, a chunk ID, and a .data property giving the data for that chunk) and additionally exposed an iterator in the Array class, you could conceivably write code like:
    import multiprocessing
    import zarr

    def do_something(chunk):
        res = some_processing_function(chunk.data)
        # save_chunk and chunk_id are part of the proposed API, not existing zarr calls
        with zarr.open('output_file_uri') as z1:
            z1.save_chunk(res, chunk.chunk_id)

    with multiprocessing.Pool() as pool:
        pool.map(do_something, array.chunk_iterator)

In the longer term (and I'm not sure how to go about this; I might be better off aiming for API compatibility with zarr rather than inclusion in zarr) I'd want to enable data-local processing of individual chunks of an array that is chunked across a distributed file system. I currently have a Python solution for this for 2.5-dimensional data (xyt), but it's pretty specific to one use case, and I would like to avoid duplicating other efforts as we make it more general.
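To make the proposed lightweight Chunk object a little more concrete, here is a rough sketch of what it might look like. Everything in it is an assumption for illustration: `Chunk`, `selection`, and `iter_chunks` are hypothetical names, and the only things expected of `array` are `.shape` and `.chunks` attributes plus indexing by a tuple of slices.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil
from typing import Any, Iterator, Tuple


@dataclass(frozen=True)
class Chunk:
    """Hypothetical lightweight handle: an array reference plus a chunk ID."""

    array: Any  # assumed to expose .shape, .chunks and tuple-of-slices indexing
    chunk_id: Tuple[int, ...]

    @property
    def selection(self) -> Tuple[slice, ...]:
        # Region of the full array covered by this chunk, clipped at the edges.
        return tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(self.chunk_id, self.array.chunks, self.array.shape)
        )

    @property
    def data(self) -> Any:
        # Data is only read when requested, so the handle itself stays small.
        return self.array[self.selection]


def iter_chunks(array: Any) -> Iterator[Chunk]:
    """Yield one Chunk handle per chunk of `array`, in row-major chunk order."""
    grid = [ceil(s / c) for s, c in zip(array.shape, array.chunks)]
    for chunk_id in product(*(range(n) for n in grid)):
        yield Chunk(array, chunk_id)
```

Passing such handles to a `multiprocessing.Pool` assumes the underlying array reference can be pickled, which would depend on the storage backend; sending only the chunk ID and reopening the array in each worker is another option.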