-
-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure DictStore
contains only bytes
#350
Ensure DictStore
contains only bytes
#350
Conversation
Just for clarity, would be perfectly happy with any option that enforces that |
Just to note, the spec really talks at an abstract level, so "arbitrary sequence of bytes" should be interpreted as an abstract concept. In a given programming language there may be many different types of object that encapsulate an arbitrary sequence of bytes, and the spec does not constrain which should be used. E.g., Python has the buffer protocol, which is a standard way of objects declaring that they encapsulate a sequence of bytes stored in memory. So in a Python implementation, it would be appropriate to require that stores can accept any value that exports the buffer protocol, but that does not have to be a bytes object. That said, there are separate reasons for thinking that the |
Thanks @jakirkham. FWIW I'm happy to go ahead with this, but if we do I think we should back out the changes in numcodecs that modified encode() methods to return ndarray when previously they returned bytes. Otherwise there will be a performance hit for in-memory zarr usage due to the extra memory copy. |
No worries. Completely agree. By "meet that constraint" wasn't trying to suggest that the spec meant Hope this makes more sense. Please let me know if I'm still missing something. |
Completely agree for the same reasons. Am working on that currently. |
By "meet that constraint" wasn't trying to suggest that the spec meant
bytes specifically. Was more meaning that by using ensure_bytes we would
reject anything that was not spec conforming from being stored. Tried to
clarify this with the follow-up comment above. Sorry if that was still
unclear.
Ah, perfect, I get it now, thanks!
|
PR ( zarr-developers/numcodecs#155 ) should handle this. |
As the spec requires that the data in a store be a sequence of `bytes`, make sure that non-`DictStore` input meets this requirement when setting values. This effectively ensures that other `DictStore` meet this requirement as well. So we don't need to go through and check their values too.
As everything in `DictStore` must either be another `DictStore` or `bytes`, there shouldn't be any cases where the size is undefined nor cases that this exception should need handling. Given this go ahead and drop the special casing for unknown sizes in `DictStore`.
While this test case does test a useful subset of the `getsize` API, the contents being added to the store here are non-conforming to our expectations of store contents. Namely the store should only contain values that are an "arbitrary sequence of bytes", which this test case is not.
This creates a non-conforming store to make sure that `getsize` handles its contents in the expected way. Namely that it returns `-1`.
Please let me know if there is anything else needed for this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jakirkham, I'm happy for this to go in as-is. Couple of thoughts but up to your discretion whether to address in this PR:
- Do you want to handle the renaming of DictStore to MemoryStore here? Or better in a separate PR?
- Maybe worth a test to verify that trying to set a non-buffer-like value in a DictStore raises a TypeError?
Add a test to ensure that a non-buffer supporting object when stored into a valid store, will raise a `TypeError` instead of storing it. Disable this checking for generic `MappingStore`s (e.g. `dict`) as they do not perform this sort of checking on the data they accept as values.
Provide a simple test for `DictStore` to ensure that non-`bytes` is coerced to `bytes` before storing it and is retrieved as `bytes`.
IMHO opinion renaming the store is a separate topic. Have raised issue ( #356 ) to track it. Though I agree it is a good idea. Adding some tests seems reasonable for this PR. Have included handling of the non-buffer case (disabled for |
Great, thanks.
Looks good. Just had one further comment re implementation of |
Make sure that users are only able to add data to the `DictStore`. Disallow the storing of a nested `DictStore` though.
Thanks @jakirkham. |
Partially addresses issue ( #348 ) and issue ( #349 ).
As the spec notes stores values must be an "arbitrary sequence of bytes", this change ensures that values in
DictStore
meet that constraint. Of course nesting ofDictStore
s are still allowed per usual. However these really just map to a variety of keys, which is fine.Since the
DictStore
's values are justbytes
, there shouldn't be any cases where the size of these values cannot be determined. So drop handling for unknown sizes inbuffer_size
. Also drop the associated test forDictStore
as this cannot occur.Add a test case for
getsize
with a non-conformingdict
-based store where sizes are unknown to make sure that case is tested and handled appropriately.TODO:
tox -e docs
)