API for direct block access #543

@clbarnes

There are internal methods for retrieving individual blocks, but in some circumstances addressing data one block at a time is helpful for end users too. A block-level API would save users from writing their own pipeline of chunk size -> block indices -> slicing, only for zarr to then go from slicing -> block indices again.
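
For illustration, the user-side pipeline described above might look like the following minimal sketch. The helper names (`chunk_grid_shape`, `iter_chunk_indices`, `chunk_to_slices`) are hypothetical and are not part of zarr; the sketch assumes a regular chunk grid where only the trailing chunk along each axis may be partial.

```python
import itertools
from typing import Iterator, Tuple


def chunk_grid_shape(shape: Tuple[int, ...], chunks: Tuple[int, ...]) -> Tuple[int, ...]:
    """Number of chunks along each axis, counting partial edge chunks."""
    return tuple((s + c - 1) // c for s, c in zip(shape, chunks))


def iter_chunk_indices(
    shape: Tuple[int, ...], chunks: Tuple[int, ...]
) -> Iterator[Tuple[int, ...]]:
    """Yield every chunk index in C (row-major) order."""
    grid = chunk_grid_shape(shape, chunks)
    return itertools.product(*(range(n) for n in grid))


def chunk_to_slices(
    shape: Tuple[int, ...], chunks: Tuple[int, ...], chunk_idx: Tuple[int, ...]
) -> Tuple[slice, ...]:
    """Array-space slices covered by one chunk, clipped at the array edge."""
    return tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(chunk_idx, chunks, shape)
    )
```

With an array `arr` exposing `shape` and `chunks` (as zarr arrays do), `arr[chunk_to_slices(arr.shape, arr.chunks, idx)]` would then read one block, at which point the slices are translated back into block indices internally.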

I envision something like

from dataclasses import dataclass
from typing import Iterator, Tuple

import numpy as np


@dataclass
class ChunkWrapper:
    chunk_idx: Tuple[int, ...]  # position of the chunk in the chunk grid
    chunk_slice: Tuple[slice, ...]  # or an offset-shape pair, or a start-stop pair
    data: np.ndarray


class Array:
    ...

    def get_chunk(self, chunk_idx: Tuple[int, ...]) -> ChunkWrapper:
        ...

    def set_chunk(self, chunk_idx: Tuple[int, ...], data: np.ndarray) -> None:
        # check data is the right shape, handling edge blocks
        ...

    def iter_chunk_idxs(self) -> Iterator[Tuple[int, ...]]:
        # yields chunk indices, which can be passed to get_chunk / set_chunk
        ...

Then e.g. a blockwise operation could be trivially implemented with

for idx in my_array.iter_chunk_idxs():
    chunk = my_array.get_chunk(idx)
    my_array.set_chunk(idx, chunk.data * 2)
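
The shape check that `set_chunk` would need for edge blocks could be sketched roughly as follows. Both function names are hypothetical, and the sketch assumes that edge chunks are passed in at their clipped size rather than padded to the full chunk shape; a real implementation might choose differently.

```python
from typing import Tuple


def expected_chunk_shape(
    shape: Tuple[int, ...], chunks: Tuple[int, ...], chunk_idx: Tuple[int, ...]
) -> Tuple[int, ...]:
    """Shape of the data held by one chunk, clipped at the array edge."""
    out = []
    for size, chunk, idx in zip(shape, chunks, chunk_idx):
        start = idx * chunk
        if idx < 0 or start >= size:
            raise IndexError(f"chunk index {chunk_idx} is outside the chunk grid")
        # Interior chunks are full-sized; the last chunk on an axis may be smaller.
        out.append(min(chunk, size - start))
    return tuple(out)


def check_chunk_shape(
    shape: Tuple[int, ...],
    chunks: Tuple[int, ...],
    chunk_idx: Tuple[int, ...],
    data_shape: Tuple[int, ...],
) -> None:
    """Raise ValueError if data_shape does not match the chunk at chunk_idx."""
    expected = expected_chunk_shape(shape, chunks, chunk_idx)
    if tuple(data_shape) != expected:
        raise ValueError(
            f"chunk {chunk_idx} expects shape {expected}, got {tuple(data_shape)}"
        )
```

For a `(10, 7)` array with `(4, 3)` chunks, interior chunks would be `(4, 3)` while the corner chunk at index `(2, 2)` would be `(2, 1)`.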

Obviously in this particular case, you could use dask, but the principle is useful elsewhere. My use case is that I have an array of labels which I want to relate to point annotations: I want to get a chunk, see which point annotations exist inside it, and find the relationships, preferably without chunk-mangling boilerplate 😁
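
As a rough sketch of that use case, point annotations can be bucketed by the chunk that contains them, so that each chunk only needs to be fetched once. The function name `group_points_by_chunk` is hypothetical, and the sketch assumes the points are given in array (voxel) coordinates and lie within the array bounds.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def group_points_by_chunk(
    points: Iterable[Sequence[float]], chunks: Tuple[int, ...]
) -> Dict[Tuple[int, ...], List[Tuple[float, ...]]]:
    """Map each chunk index to the points that fall inside that chunk."""
    groups: Dict[Tuple[int, ...], List[Tuple[float, ...]]] = defaultdict(list)
    for point in points:
        # Floor-divide each coordinate by the chunk size along that axis.
        idx = tuple(int(coord // size) for coord, size in zip(point, chunks))
        groups[idx].append(tuple(point))
    return dict(groups)
```

Each key of the result is a chunk index that could be handed to something like the proposed `get_chunk`, and the label values at the grouped points could then be looked up within that single block.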

This would give tools that implement their own parallelism (dask being one example, but many others are imaginable) much easier access to the blocked nature of the underlying arrays.

Labels: enhancement
