Ensure `DictStore` contains only `bytes` #350

jakirkham · 2018-12-03T15:57:24Z

Partially addresses issue ( #348 ) and issue ( #349 ).

As the spec notes that store values must be an "arbitrary sequence of bytes", this change ensures that values in DictStore meet that constraint. Of course, nesting of DictStores is still allowed as usual. However, these really just map to a variety of keys, which is fine.

Since the DictStore's values are just bytes, there shouldn't be any cases where the size of these values cannot be determined. So drop handling for unknown sizes in buffer_size. Also drop the associated test for DictStore as this cannot occur.

Add a test case for getsize with a non-conforming dict-based store where sizes are unknown to make sure that case is tested and handled appropriately.
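As a rough illustration of the idea (not the code in this PR), a mapping that coerces every value to `bytes` on assignment could look like the sketch below. The names `to_bytes` and `BytesOnlyStore` are hypothetical stand-ins, and the conversion via `memoryview(...).tobytes()` is an assumption about how such a helper might work.

```python
from collections.abc import MutableMapping


def to_bytes(value):
    """Hypothetical stand-in for an ``ensure_bytes``-style helper."""
    # memoryview() raises TypeError for objects without the buffer protocol,
    # so non-conforming values are rejected before they reach the store.
    return memoryview(value).tobytes()


class BytesOnlyStore(MutableMapping):
    """Illustrative store (not zarr's DictStore) whose values are always bytes."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        # Coerce on the way in so every stored value has a known size.
        self._data[key] = to_bytes(value)

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


store = BytesOnlyStore()
store["chunk"] = bytearray(b"\x00\x01\x02")
```

Because every stored value is a `bytes` object, `len(store[key])` is always defined, which is why the unknown-size handling becomes unnecessary for a store like this.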

TODO:

- Add unit tests and/or doctests in docstrings
- Add docstrings and API docs for any new/modified user-facing classes and functions
- New/modified features documented in docs/tutorial.rst
- Changes documented in docs/release.rst
- Docs build locally (e.g., run tox -e docs)
- AppVeyor and Travis CI passes
- Test coverage is 100% (Coveralls passes)

jakirkham · 2018-12-03T16:17:09Z

Just for clarity, would be perfectly happy with any option that enforces that bytes-like data is in DictStore. Simply view this as a step in that direction. We can of course change to using memoryviews or anything else internally that makes sense and we are comfortable with.

alimanfoo · 2018-12-03T17:09:28Z

> As the spec notes that store values must be an "arbitrary sequence of bytes", this change ensures that values in DictStore meet that constraint.

Just to note, the spec really talks at an abstract level, so "arbitrary sequence of bytes" should be interpreted as an abstract concept. In a given programming language there may be many different types of object that encapsulate an arbitrary sequence of bytes, and the spec does not constrain which should be used. E.g., Python has the buffer protocol, which is a standard way of objects declaring that they encapsulate a sequence of bytes stored in memory. So in a Python implementation, it would be appropriate to require that stores can accept any value that exports the buffer protocol, but that does not have to be a bytes object.

That said, there are separate reasons for thinking that the DictStore should internally convert values to bytes objects, e.g., to ensure they are immutable. Sorry for being pedantic, but just wanted to clarify it is not the spec that requires that.
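To make the buffer protocol point concrete, here is a small sketch using only the standard library. It shows several different object types that all encapsulate the same sequence of bytes, plus a hypothetical helper `exports_buffer` (not part of zarr or numcodecs) that checks whether an object exports the protocol.

```python
import array

# Four different Python types wrapping the same three bytes.
values = [
    b"abc",
    bytearray(b"abc"),
    array.array("B", [97, 98, 99]),
    memoryview(b"abc"),
]

# memoryview() accepts anything that exports the buffer protocol.
as_bytes = [memoryview(v).tobytes() for v in values]


def exports_buffer(obj):
    """Hypothetical check: does ``obj`` export the buffer protocol?"""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True
```

Under this reading, `str` and `int` objects are not valid store values, while `bytes`, `bytearray`, and array-like objects are.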

alimanfoo · 2018-12-03T17:15:11Z

Thanks @jakirkham. FWIW I'm happy to go ahead with this, but if we do I think we should back out the changes in numcodecs that modified encode() methods to return ndarray when previously they returned bytes. Otherwise there will be a performance hit for in-memory zarr usage due to the extra memory copy.
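The extra copy can be illustrated with plain standard-library objects. This sketch assumes the store converts incoming values with something like `bytes(...)`; a `bytearray` stands in for any mutable buffer an encoder might return.

```python
# A mutable buffer, standing in for the output of an encoder.
buf = bytearray(b"abcd")

# A memoryview shares memory with the original buffer (no copy).
view = memoryview(buf)

# Converting to bytes allocates new memory and copies the contents.
copy = bytes(buf)

# Mutating the original is visible through the view but not the copy.
buf[0] = ord("z")
```

After the mutation, `view` reflects the change while `copy` still holds the original contents, which shows that the conversion duplicated the data. If the encoder already returns `bytes`, the store has nothing further to convert.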

jakirkham · 2018-12-03T17:15:34Z

No worries. Completely agree.

By "meet that constraint" wasn't trying to suggest that the spec meant bytes specifically. Was more meaning that by using ensure_bytes we would reject anything that was not spec conforming from being stored. Tried to clarify this with the follow-up comment above. Sorry if that was still unclear.

Hope this makes more sense. Please let me know if I'm still missing something.

jakirkham · 2018-12-03T17:16:28Z

> ...but if we do I think we should back out the changes in numcodecs that modified encode() methods to return ndarray when previously they returned bytes. Otherwise there will be a performance hit for in-memory zarr usage due to the extra memory copy.

Completely agree for the same reasons. Am working on that currently.

alimanfoo · 2018-12-03T17:18:39Z

> By "meet that constraint" wasn't trying to suggest that the spec meant bytes specifically. Was more meaning that by using ensure_bytes we would reject anything that was not spec conforming from being stored. Tried to clarify this with the follow-up comment above. Sorry if that was still unclear.

Ah, perfect, I get it now, thanks!

jakirkham · 2018-12-03T17:25:22Z

> ...we should back out the changes in numcodecs that modified encode() methods to return ndarray when previously they returned bytes.

> Am working on that currently.

PR ( zarr-developers/numcodecs#155 ) should handle this.

As the spec requires that the data in a store be a sequence of `bytes`, make sure that non-`DictStore` input meets this requirement when setting values. This effectively ensures that any nested `DictStore`s meet this requirement as well, so we don't need to go through and check their values too.

As everything in `DictStore` must either be another `DictStore` or `bytes`, there shouldn't be any cases where the size is undefined, or where this exception would need handling. Given this, go ahead and drop the special casing for unknown sizes in `DictStore`.

While this test case does test a useful subset of the `getsize` API, the contents being added to the store here do not conform to our expectations of store contents. Namely, the store should only contain values that are an "arbitrary sequence of bytes", which the values in this test case are not.

This creates a non-conforming store to make sure that `getsize` handles its contents in the expected way. Namely that it returns `-1`.
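A minimal sketch of the behaviour being tested, assuming the size of a generic dict-based store is computed by summing the buffer sizes of its values. The function names `value_size` and `total_size` are hypothetical and this is not zarr's `getsize` implementation.

```python
def value_size(value):
    """Return the size in bytes of a buffer-like value, or -1 if unknown."""
    try:
        return memoryview(value).nbytes
    except TypeError:
        # The value does not export the buffer protocol, so its size
        # cannot be determined.
        return -1


def total_size(store):
    """Sum value sizes in a flat mapping; -1 if any size is unknown."""
    total = 0
    for value in store.values():
        size = value_size(value)
        if size < 0:
            return -1
        total += size
    return total


conforming = {"a": b"abc", "b": bytearray(2)}
non_conforming = {"a": b"abc", "b": object()}
```

Here `total_size(conforming)` is 5, while `total_size(non_conforming)` is -1 because one value has no determinable size.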

jakirkham · 2018-12-06T04:46:36Z

Please let me know if there is anything else needed for this.

alimanfoo

Thanks @jakirkham, I'm happy for this to go in as-is. Couple of thoughts but up to your discretion whether to address in this PR:

- Do you want to handle the renaming of DictStore to MemoryStore here? Or better in a separate PR?
- Maybe worth a test to verify that trying to set a non-buffer-like value in a DictStore raises a TypeError?

Add a test to ensure that storing an object that does not support the buffer protocol in a valid store raises a `TypeError` instead of storing it. Disable this check for generic `MappingStore`s (e.g. `dict`), as they do not perform this sort of checking on the data they accept as values.

Provide a simple test for `DictStore` to ensure that non-`bytes` data is coerced to `bytes` before being stored and is retrieved as `bytes`.
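The two checks described above might be sketched as follows. `CoercingStore` is a hypothetical stand-in used only for this example (it is not zarr's `DictStore`), and the check functions are illustrative rather than the tests added in this PR.

```python
class CoercingStore(dict):
    """Hypothetical store used only for this sketch."""

    def __setitem__(self, key, value):
        # memoryview() raises TypeError for objects lacking the buffer protocol.
        super().__setitem__(key, memoryview(value).tobytes())


def check_rejects_non_buffer(store):
    """Return True if storing a non-buffer object raises TypeError."""
    try:
        store["foo"] = 42
    except TypeError:
        return True
    return False


def check_coerces_to_bytes(store):
    """Return True if a bytearray is stored and retrieved as bytes."""
    store["bar"] = bytearray(b"data")
    return type(store["bar"]) is bytes and store["bar"] == b"data"
```

A plain `dict` accepts the integer without complaint, which is why the non-buffer check has to be skipped for generic mapping-based stores.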

jakirkham · 2018-12-06T17:32:51Z

IMHO renaming the store is a separate topic. Have raised issue ( #356 ) to track it. Though I agree it is a good idea.

Adding some tests seems reasonable for this PR. Have included handling of the non-buffer case (disabled for dict). Also have added a test to ensure that DictStore coerces data stored to bytes.

zarr/storage.py

alimanfoo · 2018-12-07T02:18:34Z

> IMHO renaming the store is a separate topic. Have raised issue ( #356 ) to track it.

Great, thanks.

> Adding some tests seems reasonable for this PR. Have included handling of the non-buffer case (disabled for dict). Also have added a test to ensure that DictStore coerces data stored to bytes.

Looks good.

Just had one further comment re implementation of DictStore.__setitem__().

Make sure that users are only able to add data to the `DictStore`. Disallow the storing of a nested `DictStore` though.
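One way to picture this restriction is a store that builds nested dictionaries internally from `/`-separated keys but refuses mapping values from the caller. The class `NestedBytesStore` below is a hypothetical sketch under that assumption, not the implementation merged here.

```python
class NestedBytesStore:
    """Illustrative store: nesting is internal, user values must be buffers."""

    def __init__(self):
        self._root = {}

    def __setitem__(self, key, value):
        # Dicts and other stores do not export the buffer protocol, so
        # memoryview() raises TypeError and nothing is stored.
        data = memoryview(value).tobytes()
        *parents, name = key.split("/")
        node = self._root
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                # A data value already occupies part of this path.
                raise KeyError(key)
        node[name] = data

    def __getitem__(self, key):
        node = self._root
        for part in key.split("/"):
            if not isinstance(node, dict):
                raise KeyError(key)
            node = node[part]
        if isinstance(node, dict):
            # Internal containers are never handed back to the caller.
            raise KeyError(key)
        return node


store = NestedBytesStore()
store["a/b/c"] = b"xyz"
```

With this shape, the caller can only ever put data in and get data out, so the internal nested dictionaries cannot be mutated from outside.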

alimanfoo · 2018-12-07T09:21:15Z

Thanks @jakirkham.

This was referenced Dec 3, 2018

Comparison of non-trivial, uncompressed, in-memory Zarr Arrays fails #348

Closed

WIP: Make DictStore the default Array store #351

Closed

Requirements of store data #349

Closed

jakirkham mentioned this pull request Dec 4, 2018

Bump Numcodecs requirement to 0.6.1 #347

Closed

jakirkham added 4 commits December 4, 2018 21:51

Test getsize with an unknown size case

425e7c4

jakirkham requested a review from alimanfoo December 5, 2018 03:10

Note DictStore only contains bytes now [ci skip]

726bdc0

alimanfoo approved these changes Dec 6, 2018

jakirkham mentioned this pull request Dec 6, 2018

Renaming DictStore #356

Closed

jakirkham added 2 commits December 6, 2018 12:16

Check that DictStore coerces all data to bytes

3976f96

alimanfoo reviewed Dec 7, 2018

Disallow mutation of the internal DictStore

845de7b

alimanfoo approved these changes Dec 7, 2018

alimanfoo merged commit 8ebb16c into zarr-developers:master Dec 7, 2018

jakirkham deleted the ensure_DictStore_contains_only_bytes branch December 7, 2018 14:37

jakirkham added this to the v2.3 milestone Feb 18, 2019

jakirkham mentioned this pull request Oct 22, 2020

Checking that the store is an instance of dict seem incorrect. #636

Closed
