Optional extensions to the storage API for batch operations and transactions #384
Comments
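For getting multiple items, there is operator.itemgetter. It may or may not be helpful for our use case, but is probably worth looking into to start.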
I don't think that works for the batch read use case unfortunately, e.g.,
after g = itemgetter(2, 5, 3), the call g(store) returns (store[2],
store[5], store[3]). So the store's __getitem__ method is still being
called multiple times, one for each key.
We'd need something more like a "getitems(keys)" method, where you pass in
a sequence of keys and get back a sequence of key/value pairs.
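The point about itemgetter can be checked with a small sketch. CountingStore, BatchStore and the getitems() helper are hypothetical names used only for illustration and are not part of zarr; the fallback path shows how a caller might cope with stores that do not offer a batch read.

```python
from operator import itemgetter


class CountingStore(dict):
    """Dict-backed toy store that counts __getitem__ calls (illustration only)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.getitem_calls = 0

    def __getitem__(self, key):
        self.getitem_calls += 1
        return super().__getitem__(key)


class BatchStore(CountingStore):
    """Toy store offering a hypothetical optional getitems(keys) method."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.batch_calls = 0

    def getitems(self, keys):
        # A real store could fetch all keys in one round trip here.
        self.batch_calls += 1
        return [(key, dict.__getitem__(self, key)) for key in keys]


def getitems(store, keys):
    """Use the store's batch read if it has one, else fall back to per-key reads."""
    batch = getattr(store, "getitems", None)
    if batch is not None:
        return batch(keys)
    return [(key, store[key]) for key in keys]


data = {2: b"a", 5: b"b", 3: b"c"}

plain = CountingStore(data)
values = itemgetter(2, 5, 3)(plain)   # still one __getitem__ call per key
calls_after_itemgetter = plain.getitem_calls

batched = BatchStore(data)
pairs = getitems(batched, [2, 5, 3])  # a single call into the store
```

With the plain store, itemgetter triggers three separate __getitem__ calls; with the batch-capable store, the helper makes one call and __getitem__ is never invoked.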
On Sun, 6 Jan 2019 at 22:37, jakirkham wrote:
For getting multiple items, there is operator.itemgetter <https://docs.python.org/3/library/operator.html#operator.itemgetter>. It may or may not be helpful for our use case, but is probably worth looking into to start.
--
Alistair Miles
It may make sense for a fallback. For example, how do we handle |
I think the same way we currently do for |
Currently the zarr storage API comprises the MutableMapping interface from the Python standard library, plus some optional methods: listdir, rmdir and getsize. This issue is intended for discussion of possible further optional extensions to the storage API to support two different but related functionalities. I personally don't have immediate concrete use cases for this, but I expect these will arise at some point, so this issue is intended just to park some initial thoughts.
Batch operations
The first functionality is supporting batch operations. For example, when a user writes data to a region of a zarr array that spans more than one chunk, the zarr core module currently communicates the modification for each chunk via a separate API call to the storage layer. In concrete terms, this means multiple calls to the __setitem__() method on the store, one for each key/value pair comprising the new encoded data for each modified chunk.
There are a number of scenarios where a store implementation might be able to improve performance or provide some other useful functionality if it were made aware that multiple keys were being updated as part of the same high-level operation. For example, a cloud store or any store where network communication is involved might be able to reduce latency overheads by batching multiple key/value updates into a single request.
Similarly, when a user is reading data from a region of a zarr array that spans more than one chunk, currently the data for each chunk is retrieved from the storage layer via a separate API call to the __getitem__() method on the store. A store might be able to provide some optimisation if it were passed all keys to retrieve in a single API call.
In order to support batch writes, the MutableMapping interface already provides the update() method, and this would provide a possible implementation path. To make use of this, stores that could perform batch writes would provide an appropriate implementation of update(). We could then modify the zarr core module to make use of update() rather than making multiple __setitem__() calls.
In order to support batch reads, there is no appropriate method on the MutableMapping interface, so we'd have to define an optional method. Again, some modification to the zarr core module would be needed to take advantage of it.
Transactions
The second functionality is supporting transactions. In general, a user might want to use a store that supports transactions, and have full control over the granularity (i.e., beginning and end) of transactions, to fit whatever is the logic of the application. Where a store supports transactions, it might be useful to have a standard API for beginning and ending them.
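As a purely hypothetical sketch of what user-controlled transaction boundaries could look like, the toy store below exposes a transaction() context manager that restores the previous contents if the block raises. The class name, the method name and the snapshot-based rollback are all assumptions made for illustration; a real store would delegate to whatever transaction mechanism its backend provides.

```python
from contextlib import contextmanager


class TransactionalDictStore(dict):
    """Toy in-memory store with a hypothetical transaction() context manager."""

    @contextmanager
    def transaction(self):
        # Snapshot the current contents so they can be restored on failure.
        snapshot = dict(self)
        try:
            yield self
        except Exception:
            # Roll back: discard everything written inside the block.
            self.clear()
            self.update(snapshot)
            raise


store = TransactionalDictStore()

# The user decides where the transaction begins and ends.
with store.transaction():
    store["0.0"] = b"x"
    store["0.1"] = b"y"

# A failure inside the block leaves the store as it was before the block.
try:
    with store.transaction():
        store["0.0"] = b"changed"
        store["1.0"] = b"z"
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
```

After the second block the store still holds only the two values written in the first one.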
Discussion
This is just an initial sketch of some thoughts for discussion, please feel free to comment and share ideas. I'm aware the language above is also a bit vague, so please feel free to suggest clarifications or cleaner ways of thinking about these related issues.