Requirements of store data #349
Thanks @jakirkham, very good idea to raise this. FWIW I think, as a minimum, a store:
...where a "buffer-like object" is an object exporting (PY2 and PY3) the new-style buffer interface or (PY2 only) the old-style buffer interface. Optionally, a store:
I know this doesn't directly answer your question about tests involving object arrays, but maybe it gives a bit more context to that discussion. At least this helps to clarify for me that we shouldn't really be using
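A minimal sketch of what such a minimum requirement might look like, assuming (since the bullet lists above did not survive in full) that the minimum is a mutable-mapping interface whose values are buffer-like objects. This is an illustrative class, not one of zarr's shipped stores, and the `memoryview()` check covers only the new-style (PY3) buffer interface:

```python
from collections.abc import MutableMapping


class MinimalStore(MutableMapping):
    """Illustrative store: string keys, buffer-like values only."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, key, value):
        # Reject any value that does not export the buffer interface;
        # memoryview() raises TypeError for non-buffer objects.
        self._data[key] = bytes(memoryview(value))

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```

With this in place, writing a raw `object` (or anything else without the buffer interface) fails loudly at the store boundary instead of silently corrupting later comparisons.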
Thanks for outlining this so clearly. Based on our discussion here and in ( #348 ), I am leaning towards hardening the requirement that DictStore must hold bytes ( #350 ) (or at least bytes-like data) and that Array should use DictStore for in-memory storage to enforce this requirement ( #351 ).
Something to think about in the larger context is how we validate a store. Should we have a function that is able to run through a store and make sure it is spec conforming?
+1
Interesting question. We have a class zarr.tests.test_storage.StoreTests which can be sub-classed to create a set of unit tests for a store class. Is that enough, or do we want something that could be run more dynamically?
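As a rough illustration of the reusable-suite pattern mentioned above: zarr's actual zarr.tests.test_storage.StoreTests works this way, with subclasses supplying a store factory and inheriting a large battery of conformance tests. The `create_store()` hook name mirrors zarr's real one, but the two checks below are placeholders, not the real suite:

```python
import unittest


class StoreTests:
    """Conformance checks inherited by every store's test class."""

    def create_store(self):
        # Subclasses override this to return a fresh store instance.
        raise NotImplementedError

    def test_set_get(self):
        store = self.create_store()
        store["foo"] = b"bar"
        assert store["foo"] == b"bar"

    def test_delete(self):
        store = self.create_store()
        store["foo"] = b"bar"
        del store["foo"]
        assert "foo" not in store


class TestDictStore(StoreTests, unittest.TestCase):
    def create_store(self):
        # A plain dict stands in for zarr's DictStore here.
        return {}
```

Running the suite dynamically against a user-supplied store would amount to calling these same checks at runtime rather than under a test runner.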
That's an interesting idea. Was thinking about it in the context of validating stores for use with
FWIW I'd be happy if we provided developer support so store class developers can thoroughly test a store class implementation, but then at runtime trust that users provide something sensible as a store. We're already pretty defensive, e.g., we normalise all storage paths above the storage layer. We could also check that the result of the chunk encoding pipeline is an object supporting the buffer protocol before passing it on to storage. In other words, guarantee that we'll provide valid keys and values to the storage layer. But after that I think we can just trust stores to do something reasonable with keys and values.
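A minimal sketch of the two defensive measures described above, with hypothetical function names (zarr's internal helpers are named and implemented differently):

```python
def normalize_key(path):
    """Collapse empty path segments so the storage layer always sees
    canonical keys, e.g. '/foo//bar/' -> 'foo/bar'."""
    return "/".join(segment for segment in path.split("/") if segment)


def ensure_buffer(value):
    """Verify the output of the chunk encoding pipeline exports the
    buffer protocol before it is handed to a store."""
    try:
        return memoryview(value)
    except TypeError:
        raise TypeError(
            "chunk encoding produced a value of type %r that does not "
            "support the buffer protocol" % type(value).__name__
        )
```

Checks like these sit above the storage layer, so individual store classes never need to re-validate keys or values themselves.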
I just reviewed this thread. I'm thinking ahead towards questions of language interoperability, and I'm concerned that our definition of a store is too Python-centric. While it should always be possible to implement a custom store by following the requirements above, perhaps we should also define a spec for a store that does not depend on Python concepts such as mutable-mapping, pickleable, etc. This would make it easier to implement zarr in other languages.
Hi Ryan, FWIW the format spec does try to remain language-agnostic and talks in an abstract way about key/value stores. I think the discussion in this particular issue has been scoped more to the specifics of how store classes are implemented in Python and what is expected of them there. Focusing just on the format spec for the moment, do you think it needs to be made more language-agnostic, or otherwise needs any improvement or clarifications?
Thanks for the clarification. I see how this thread is specific to Python implementations. I guess I worry that the spec is too vague with regards to the implementation of the key/value store and the methods that can be used to query it:
"A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair)."
In terms of operations, "read", "write", and "delete" doesn't seem like a complete enumeration of the operations a store must support. When implementing a store, you also need at least some form of "list" operation; otherwise zarr can't discover what is in the store. (The exception is consolidated metadata stores.) In fact, you have to implement a MutableMapping, which has five abstract methods: __getitem__, __setitem__, __delitem__, __iter__, and __len__.
More generally, how do we ensure that DirectoryStore, ZipStore, or any of the myriad cloud stores that have been developed can truly be read from different implementations of zarr? I wonder if it would be worth explicitly defining a spec for certain commonly used stores that gives more detail about the implementation choices that have already been made in the zarr python code.
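To make concrete why some form of "list" is needed, here is a sketch (not zarr's actual code) of array discovery by scanning keys for `.zarray` metadata documents, which is only possible if the store supports key iteration:

```python
def find_arrays(store):
    """Return the hierarchy paths of all arrays in a key/value store,
    by scanning its keys for '.zarray' metadata documents."""
    found = []
    for key in store:
        if key == ".zarray":
            found.append("")  # an array stored at the hierarchy root
        elif key.endswith("/.zarray"):
            found.append(key[: -len("/.zarray")])
    return sorted(found)
```

Without `__iter__` (or an equivalent listing operation), nothing like this is possible, and a reader must already know every path it wants to open, which is exactly the situation consolidated metadata is designed for.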
Thanks Ryan, good points. We certainly could be more explicit about the set of operations that a storage system must support, and make sure we include everything (e.g., listing all keys). We could also state the optional operations, which are not strictly necessary but allow for some optimisations or additional features, like being able to list all the keys that are children of some hierarchy path (the listdir() method in Python implementations).
We could do this in a language-independent way but still make it clear and concrete how this corresponds to specific operations supported by a file system or a cloud object service or whatever.
I think we could also do this as an update to the format spec, without requiring a new spec version, as these would be clarifications of the existing spec.
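A sketch of the optional listdir() operation as a generic fallback derived from plain key iteration; this is illustrative only (zarr's DirectoryStore maps it onto the file system's own directory listing for speed):

```python
def listdir(store, path=""):
    """List the immediate children of a hierarchy path, one level deep,
    using only key iteration on the store."""
    prefix = path.rstrip("/") + "/" if path else ""
    children = set()
    for key in store:
        if key.startswith(prefix):
            remainder = key[len(prefix):]
            if remainder:
                children.add(remainder.split("/")[0])
    return sorted(children)
```

This shows why the operation is optional: any store satisfying the required interface can fall back to a full key scan, while stores backed by file systems or object services with native prefix listing can do much better.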
In PR ( #789 ) we added a
Subsequent discussion around the v3 spec and storing standardized data from libraries handles the other concerns raised here.
Were there any other things still needing to be addressed here?
Closing now that the
Raising this issue to get an idea of what our requirements are of stores and what can be placed in them.
For instance, in many cases we require `Array`s to have an `object_codec` to allow storing `object` types, and many stores would have difficulty with this data without explicit conversion to some sort of `bytes`-like object; however, we appear to be placing `object`s in a store as a test. Also we seem to expect stores to be easily comparable; however, this doesn't work if the store has NumPy `ndarray`s in it. ( #348 )
Should we set some explicit requirements about what stores require? If so, what would those requirements be? Also how would we enforce them?
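To illustrate the role an object codec plays, here is a sketch using pickle as a stand-in codec in front of a plain dict acting as a bytes-only store. The `PickleCodec` class is hypothetical; in real zarr usage one passes a numcodecs codec (e.g. Pickle, JSON or MsgPack) as `object_codec` rather than anything like this:

```python
import pickle


class PickleCodec:
    """Stand-in object codec: turns arbitrary Python objects into
    bytes, so a bytes-only store never sees raw objects."""

    def encode(self, obj):
        return pickle.dumps(obj)

    def decode(self, buf):
        return pickle.loads(buf)


# A plain dict acting as a bytes-only store: every value written to it
# has passed through the codec, so store comparisons stay well-defined
# and round-trips recover the original objects.
store = {}
codec = PickleCodec()
store["0"] = codec.encode({"a": [1, 2, 3]})
```

With this discipline, the comparability problem above disappears: two stores holding the same encoded bytes compare equal, which is not true of stores holding raw `ndarray`s.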