Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure DictStore contains only bytes #350

Merged
merged 8 commits into from
Dec 7, 2018
Merged

Ensure DictStore contains only bytes #350

merged 8 commits into from
Dec 7, 2018

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Dec 3, 2018

Partially addresses issue ( #348 ) and issue ( #349 ).

As the spec notes stores values must be an "arbitrary sequence of bytes", this change ensures that values in DictStore meet that constraint. Of course nesting of DictStores are still allowed per usual. However these really just map to a variety of keys, which is fine.

Since the DictStore's values are just bytes, there shouldn't be any cases where the size of these values cannot be determined. So drop handling for unknown sizes in buffer_size. Also drop the associated test for DictStore as this cannot occur.

Add a test case for getsize with a non-conforming dict-based store where sizes are unknown to make sure that case is tested and handled appropriately.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@jakirkham
Copy link
Member Author

Just for clarity, would be perfectly happy with any option that enforces that bytes-like data is in DictStore. Simply view this as a step in that direction. We can of course change to using memoryviews or anything else internally that makes sense and we are comfortable with.

@alimanfoo
Copy link
Member

As the spec notes stores values must be an "arbitrary sequence of bytes", this change ensures that values in DictStore meet that constraint.

Just to note, the spec really talks at an abstract level, so "arbitrary sequence of bytes" should be interpreted as an abstract concept. In a given programming language there may be many different types of object that encapsulate an arbitrary sequence of bytes, and the spec does not constrain which should be used. E.g., Python has the buffer protocol, which is a standard way of objects declaring that they encapsulate a sequence of bytes stored in memory. So in a Python implementation, it would be appropriate to require that stores can accept any value that exports the buffer protocol, but that does not have to be a bytes object.

That said, there are separate reasons for thinking that the DictStore should internally convert values to bytes objects, e.g., to ensure they are immutable. Sorry for being pedantic, but just wanted to clarify it is not the spec that requires that.

@alimanfoo
Copy link
Member

Thanks @jakirkham. FWIW I'm happy to go ahead with this, but if we do I think we should back out the changes in numcodecs that modified encode() methods to return ndarray when previously they returned bytes. Otherwise there will be a performance hit for in-memory zarr usage due to the extra memory copy.

@jakirkham
Copy link
Member Author

No worries. Completely agree.

By "meet that constraint" wasn't trying to suggest that the spec meant bytes specifically. Was more meaning that by using ensure_bytes we would reject anything that was not spec conforming from being stored. Tried to clarify this with the follow-up comment above. Sorry if that was still unclear.

Hope this makes more sense. Please let me know if I'm still missing something.

@jakirkham
Copy link
Member Author

...but if we do I think we should back out the changes in numcodecs that modified encode() methods to return ndarray when previously they returned bytes. Otherwise there will be a performance hit for in-memory zarr usage due to the extra memory copy.

Completely agree for the same reasons. Am working on that currently.

@alimanfoo
Copy link
Member

alimanfoo commented Dec 3, 2018 via email

@jakirkham
Copy link
Member Author

...we should back out the changes in numcodecs that modified encode() methods to return ndarray when previously they returned bytes.

Am working on that currently.

PR ( zarr-developers/numcodecs#155 ) should handle this.

@jakirkham jakirkham mentioned this pull request Dec 4, 2018
7 tasks
As the spec requires that the data in a store be a sequence of `bytes`,
make sure that non-`DictStore` input meets this requirement when setting
values. This effectively ensures that other `DictStore` meet this
requirement as well. So we don't need to go through and check their
values too.
As everything in `DictStore` must either be another `DictStore` or
`bytes`, there shouldn't be any cases where the size is undefined nor
cases that this exception should need handling. Given this go ahead and
drop the special casing for unknown sizes in `DictStore`.
While this test case does test a useful subset of the `getsize` API, the
contents being added to the store here are non-conforming to our
expectations of store contents. Namely the store should only contain
values that are an "arbitrary sequence of bytes", which this test case
is not.
This creates a non-conforming store to make sure that `getsize` handles
its contents in the expected way. Namely that it returns `-1`.
@jakirkham jakirkham requested a review from alimanfoo December 5, 2018 03:10
@jakirkham
Copy link
Member Author

jakirkham commented Dec 6, 2018

Please let me know if there is anything else needed for this.

Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jakirkham, I'm happy for this to go in as-is. Couple of thoughts but up to your discretion whether to address in this PR:

  • Do you want to handle the renaming of DictStore to MemoryStore here? Or better in a separate PR?
  • Maybe worth a test to verify that trying to set a non-buffer-like value in a DictStore raises a TypeError?

@jakirkham jakirkham mentioned this pull request Dec 6, 2018
Add a test to ensure that a non-buffer supporting object when stored
into a valid store, will raise a `TypeError` instead of storing it.
Disable this checking for generic `MappingStore`s (e.g. `dict`) as they
do not perform this sort of checking on the data they accept as values.
Provide a simple test for `DictStore` to ensure that non-`bytes` is
coerced to `bytes` before storing it and is retrieved as `bytes`.
@jakirkham
Copy link
Member Author

IMHO opinion renaming the store is a separate topic. Have raised issue ( #356 ) to track it. Though I agree it is a good idea.

Adding some tests seems reasonable for this PR. Have included handling of the non-buffer case (disabled for dict). Also have added a test to ensure that DictStore coerces data stored to bytes.

zarr/storage.py Outdated Show resolved Hide resolved
@alimanfoo
Copy link
Member

IMHO opinion renaming the store is a separate topic. Have raised issue ( #356 ) to track it.

Great, thanks.

Adding some tests seems reasonable for this PR. Have included handling of the non-buffer case (disabled for dict). Also have added a test to ensure that DictStore coerces data stored to bytes.

Looks good.

Just had one further comment re implementation of DictStore.__setitem__().

Make sure that users are only able to add data to the `DictStore`.
Disallow the storing of a nested `DictStore` though.
@alimanfoo alimanfoo merged commit 8ebb16c into zarr-developers:master Dec 7, 2018
@alimanfoo
Copy link
Member

Thanks @jakirkham.

@jakirkham jakirkham deleted the ensure_DictStore_contains_only_bytes branch December 7, 2018 14:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants