Optional extensions to the storage API for batch operations and transactions #384

Open
alimanfoo opened this issue Jan 6, 2019 · 4 comments
Labels
enhancement New features or improvements


@alimanfoo
Member

Currently the zarr storage API comprises the MutableMapping interface from the Python standard library, plus some optional methods listdir, rmdir and getsize. This issue is intended for discussion of possible further optional extensions to the storage API to support two different but related functionalities. I personally don't have immediate concrete use cases for this, but I expect these will arise at some point, so this issue is intended just to park some initial thoughts.

Batch operations

The first functionality is supporting batch operations. For example, when a user is writing data to a region of a zarr array that spans more than one chunk, e.g.:

import zarr

store = dict()  # any zarr-compatible store; a plain dict is used here
z = zarr.create(shape=100, chunks=10, store=store)
z[:] = 42

...currently the zarr core module will communicate the modification for each chunk via a separate API call to the storage layer. In concrete terms, this means multiple calls to the __setitem__() method on the store, one for each key/value pair comprising the new encoded data for each modified chunk.

There are a number of scenarios where a store implementation might be able to improve performance or provide some other useful functionality if it were made aware that multiple keys were being updated as part of the same high-level operation. For example, a cloud store, or any store where network communication is involved, might be able to reduce latency overheads by batching multiple key/value updates into a single request.

Similarly, when a user is reading data from a region of a zarr array that spans more than one chunk, currently the data for each chunk is retrieved from the storage layer via a separate API call to the __getitem__() method on the store. A store might be able to provide some optimisation if it were passed all keys to retrieve in a single API call.

In order to support batch writes, the MutableMapping interface already provides the update() method, and this would provide a possible implementation path. To make use of this, stores that could perform batch writes would provide an appropriate implementation of update(). We could then modify the zarr core module to make use of update() rather than making multiple __setitem__() calls.
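To make the update() path a bit more concrete, here is a minimal sketch. `BatchWriteStore` and its `requests` counter are hypothetical names invented for illustration, not part of zarr; the counter simply stands in for network round trips.

```python
from collections.abc import MutableMapping


class BatchWriteStore(MutableMapping):
    """Hypothetical store that counts simulated round trips to a backend."""

    def __init__(self):
        self._data = {}
        self.requests = 0  # stand-in for network requests (assumption)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self.requests += 1  # one simulated request per key
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def update(self, *args, **kwargs):
        # Override the MutableMapping default, which calls __setitem__ per key.
        batch = dict(*args, **kwargs)
        self.requests += 1  # the whole batch goes out as one simulated request
        self._data.update(batch)


chunks = {str(i): bytes(10) for i in range(10)}

per_key = BatchWriteStore()
for key, value in chunks.items():
    per_key[key] = value  # one __setitem__ call per chunk, as described above

batched = BatchWriteStore()
batched.update(chunks)  # a single update() call carrying every chunk
```

In this toy setup the per-key store records ten requests and the batched store records one. A store that does not override update() still works, since the inherited default just loops over `__setitem__()`.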

In order to support batch reads, there is no appropriate method on the MutableMapping interface, so we'd have to define an optional method. Again, some modification to the zarr core module would be needed to take advantage of it.
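One possible shape for such an optional method is sketched below. The method name `getitems`, the class `BatchReadStore` and the `requests` counter are all assumptions made up for this example. The sketch also assumes that missing keys are silently omitted from the result, on the grounds that an absent chunk is not necessarily an error.

```python
class BatchReadStore(dict):
    """Hypothetical dict-backed store with an optional batch-read method."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.requests = 0  # stand-in for network requests (assumption)

    def __getitem__(self, key):
        self.requests += 1  # one simulated request per key
        return super().__getitem__(key)

    def getitems(self, keys):
        # Hypothetical optional method: fetch every key in one simulated request.
        self.requests += 1
        # Missing keys are skipped rather than raising KeyError (assumption).
        return {k: dict.__getitem__(self, k) for k in keys if k in self}


store = BatchReadStore({"0": b"a", "1": b"b", "3": b"d"})
found = store.getitems(["0", "1", "2"])
```

Here `found` holds the two keys that exist and the store records a single request, where three separate `__getitem__()` calls would have recorded three.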

Transactions

The second functionality is supporting transactions. In general, a user might want to use a store that supports transactions, and have full control over the granularity (i.e., beginning and end) of transactions, to fit whatever is the logic of the application. Where a store supports transactions, it might be useful to have a standard API, so a user could do something like:

store = ...  # placeholder for some store that supports transactions
root = zarr.group(store=store)
with store.transaction():
    a = root.create('foo', shape=100, chunks=10)
    a[:] = 42
    b = root.create('spam/eggs', shape=1000000, chunks=100000)
    b[:500] = 1
    # and anything else that should be committed together
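On the store side, a `transaction()` method could be a context manager that buffers writes and applies them on exit. The following is only a sketch under simplifying assumptions: `TransactionalStore` is a hypothetical dict-backed class, only writes are buffered (deletes are not), reads inside the block do not see pending writes, and nested transactions are not handled.

```python
from contextlib import contextmanager


class TransactionalStore(dict):
    """Hypothetical store: writes inside transaction() are buffered."""

    def __init__(self):
        super().__init__()
        self._pending = None  # buffer of uncommitted writes, or None

    def __setitem__(self, key, value):
        if self._pending is not None:
            self._pending[key] = value  # inside a transaction: buffer only
        else:
            super().__setitem__(key, value)

    @contextmanager
    def transaction(self):
        self._pending = {}
        try:
            yield self
        except BaseException:
            self._pending = None  # roll back: discard buffered writes
            raise
        else:
            pending, self._pending = self._pending, None
            super().update(pending)  # commit all buffered writes together


store = TransactionalStore()
with store.transaction():
    store["foo/0"] = b"x"
    store["foo/1"] = b"y"
    visible_during = len(store)  # buffered writes are not yet in the store

try:
    with store.transaction():
        store["bar/0"] = b"z"
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass  # the write to "bar/0" was discarded
```

After this runs, the store contains only the two `foo` keys. A real backend would presumably map the enter and exit steps onto its own begin, commit and rollback operations.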

Discussion

This is just an initial sketch of some thoughts for discussion; please feel free to comment and share ideas. I'm aware the language above is also a bit vague, so please suggest clarifications or cleaner ways of thinking about these related issues.

@jakirkham
Member

For getting multiple items, there is operator.itemgetter. It may or may not be helpful for our use case, but is probably worth looking into to start.
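For reference, a small sketch of what operator.itemgetter does with a mapping, using a plain dict as the store. Two behaviours are worth keeping in mind: it returns a bare value rather than a tuple when given a single key, and it still looks up each key individually, so on its own it does not give the store a chance to batch.

```python
from operator import itemgetter

store = {"0": b"aa", "1": b"bb", "2": b"cc"}
keys = ["0", "2"]

# With several keys, itemgetter returns a tuple of values in key order.
values = itemgetter(*keys)(store)

# With exactly one key, it returns the bare value, not a 1-tuple.
single = itemgetter("1")(store)
```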

@alimanfoo
Member Author

alimanfoo commented Jan 7, 2019 via email

@jakirkham
Member

It may make sense for a fallback. For example, how do we handle dict?

@alimanfoo
Member Author

> It may make sense for a fallback. For example, how do we handle dict?

I think the same way we currently do for listdir(), rmdir(), etc. I.e., we have code that looks for the presence of a particular method on the store, and if not found falls back to a default implementation.
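A sketch of that pattern applied to batch reads might look like the following. The helper `getitems`, the optional store method of the same name, and `BatchStore` are hypothetical names for illustration, not existing zarr API.

```python
def getitems(store, keys):
    """Fetch several keys, using the store's batch method when available."""
    method = getattr(store, "getitems", None)  # hypothetical optional method
    if callable(method):
        return method(keys)
    # Fallback for plain mappings such as dict: one lookup per key.
    return {k: store[k] for k in keys if k in store}


class BatchStore(dict):
    """Hypothetical store that provides the optional batch method."""

    def getitems(self, keys):
        # Count calls so the dispatch path can be observed.
        self.batch_calls = getattr(self, "batch_calls", 0) + 1
        return {k: dict.__getitem__(self, k) for k in keys if k in self}


from_dict = getitems({"a": 1, "b": 2}, ["a", "c"])  # fallback path

batch_store = BatchStore(a=1, b=2)
from_batch = getitems(batch_store, ["a", "b"])  # optional-method path
```

A plain dict goes through the per-key fallback, while a store that defines the method is called once with all the keys.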
