Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial read #667

Merged
merged 3 commits into from
Jan 25, 2021
Merged

Partial read #667

merged 3 commits into from
Jan 25, 2021

Conversation

andrewfulton9
Copy link
Contributor

@andrewfulton9 andrewfulton9 commented Dec 1, 2020

This PR adds the capability to partially read and decompress chunks in an arrays that are initialized with blosc decompression and an fsspec based backend. Its related to zarr-specs #59

I did some initial benchmarking showing processing time for indexing with various array and chunk shapes filled with random numbers stored on s3:

array_shape=(1000, 1000), chunk_shape=(100, 500)

image

array_shape=(10000, 10000), chunk_shape=(1000, 1000)

image

array_shape=(10000, 10000), chunk_shape=(2000, 5000)

image

array_shape=(10000, 10000), chunk_shape=(5000, 5000)

image

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • AppVeyor and Travis CI passes
  • Test coverage is 100% (Coveralls passes)

@pep8speaks
Copy link

pep8speaks commented Dec 1, 2020

Hello @andrewfulton9! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 476:30: E741 ambiguous variable name 'l'
Line 936:39: E203 whitespace before ':'
Line 941:65: E203 whitespace before ':'

Line 2430:9: E265 block comment should start with '# '

Line 578:21: E203 whitespace before ':'

Comment last updated at 2021-01-20 18:14:08 UTC

@codecov
Copy link

codecov bot commented Dec 2, 2020

Codecov Report

Merging #667 (45e8c25) into master (a8ee708) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff            @@
##           master     #667    +/-   ##
========================================
  Coverage   99.94%   99.94%            
========================================
  Files          28       28            
  Lines       10089    10262   +173     
========================================
+ Hits        10083    10256   +173     
  Misses          6        6            
Impacted Files Coverage Δ
zarr/core.py 100.00% <100.00%> (ø)
zarr/creation.py 100.00% <100.00%> (ø)
zarr/errors.py 100.00% <100.00%> (ø)
zarr/indexing.py 100.00% <100.00%> (ø)
zarr/tests/test_core.py 100.00% <100.00%> (ø)
zarr/tests/test_indexing.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)
zarr/util.py 100.00% <100.00%> (ø)

@Carreau
Copy link
Contributor

Carreau commented Dec 11, 2020

I added some release notes (that are conflicting), removed some ded code (see 7452482), and try to find some test code to increase coverage.

z = self.create_array(shape=(500,500, 500), chunks=(50, 50, 50), dtype='<i4')
z[:,:,:] = 0
z[0,:,0].any() # fails

Fails with

zarr/core.py:578: in __getitem__
    return self.get_basic_selection(selection, fields=fields)
zarr/core.py:703: in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
zarr/core.py:746: in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
zarr/core.py:1041: in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
zarr/core.py:1731: in _chunk_getitems
    self._process_chunk(out, cdatas[ckey], chunk_select, drop_axes,
zarr/core.py:1638: in _process_chunk
    chunk_partial = self._decode_chunk(
zarr/core.py:1851: in _decode_chunk
    chunk = self._compressor.decode(cdata)
numcodecs/blosc.pyx:495: in numcodecs.blosc.Blosc.decode
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   RuntimeError: error during blosc decompression: -1

numcodecs/blosc.pyx:399: RuntimeError

I'll try to dive into this.

I'm wondering if it's not start_points_buffer = self.fs.read_block(self.key_path, 16, int(self.nblocks*4)) where int rounds down instead of up. Running in a meeting will look at that later.

@Carreau
Copy link
Contributor

Carreau commented Dec 15, 2020

I'm going to rebase because of conflicts, I may end up squashing commits for simplicity of rebasing. current commit is/was 3ae237f

zarr/indexing.py Outdated
)

# any selection can not be out of the range of the chunk
self.selection_shape = np.empty(self.arr_shape)[self.selection].shape
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.selection_shape size is determined by self.selection, which later on is muted with self.selection.pop(); which might be a source of bug; though it might be ok as self.selection_shape seem to be unused....

@Carreau
Copy link
Contributor

Carreau commented Dec 15, 2020

This raises another error

from zarr.storage import (
    FSStore,
)

from zarr.storage import (
    init_array,
)
from zarr.core import Array
import atexit
import shutil
from tempfile import mkdtemp, mktemp
from numcodecs import Blosc

path = './tmp-test'
path = mkdtemp()
atexit.register(shutil.rmtree, path)
store = FSStore(path)
init_array(store, compressor=Blosc(), shape=(50, 50, 50), chunks=(5, 5, 5), dtype="<i4")
z = Array(
    store,
    cache_metadata=True,
    cache_attrs=True,
    partial_decompress=True,
)

z[:, :, :] = 0
assert z[:, 0, :].any()
Traceback (most recent call last):
  File "foo.py", line 31, in <module>
    assert z[:, 0, :].any()
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 604, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 729, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 772, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1067, in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1781, in _chunk_getitems
    self._process_chunk(
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1677, in _process_chunk
    chunk_partial = self._decode_chunk(
  File "/Users/bussonniermatthias/dev/zarr-python/zarr/core.py", line 1933, in _decode_chunk
    chunk = chunk.reshape(expected_shape or self._chunks, order=self._order)
ValueError: cannot reshape array of size 125 into shape (1,1,5)

I've also found some simplification of instance variables that can be local variable I'll push that.

@Carreau
Copy link
Contributor

Carreau commented Dec 15, 2020

this second failure seem to be due to me having older numcodec that does not have decode partial for blosc;but still assuming it only partially decoded.

@Carreau
Copy link
Contributor

Carreau commented Dec 15, 2020

So both failure are due to me having an old version numcodec, I've added a check that completely kip the partial reading logic if the compressor does not have a decode_partial method.

@Carreau
Copy link
Contributor

Carreau commented Dec 15, 2020

I've also restored the .coveragerc file that was nuked in one of your merge with master; so that may help with the coverall. I'm going to wait to see where CI brings us.

zarr/indexing.py Outdated
last_dim_slice = None if self.selection[-1].step > 1 else self.selection.pop()
for i, sl in enumerate(self.selection):
dim_chunk_loc_slices = []
for i, x in enumerate(slice_to_range(sl, arr_shape[i])):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you use i in both loop and the argument of the second depends on I; that seem weird.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inner i seem unused as well.

@Carreau
Copy link
Contributor

Carreau commented Dec 15, 2020

I have a couple more cleanup in https://github.com/Carreau/zarr-python/tree/partial_read (d8464d4), but won't push now on this branch as I want to see what the coverage status is.

@Carreau
Copy link
Contributor

Carreau commented Dec 16, 2020

We ran out of time with travi CI. I'm working on migrating to GH actions; Test won't complete without this. Apologies for the delay.

@jakirkham
Copy link
Member

jakirkham commented Dec 16, 2020

cc @rabernat @jhamman @rsignell-usgs (as I think we discussed this last year at SciPy 2019 so may be of interest)

@Carreau
Copy link
Contributor

Carreau commented Jan 4, 2021

Rebased on master, now that master have all the CI migrated to gh action.
Added some extra cleanup, will see what the total coverage is and iterate on this.

@Carreau Carreau force-pushed the partial_read branch 2 times, most recently from b063054 to 3a6a0d8 Compare January 4, 2021 21:16
docs/release.rst Outdated Show resolved Hide resolved
@Carreau
Copy link
Contributor

Carreau commented Jan 4, 2021

@andrewfulton9

I've turned some of your conditional logics into asserts in d098528; so that there is no uncovered branch. Let me know if that's wrong. I've also use np.prod(..., dtype=int) to avoid polluting intermediate results with floating point numbers when the inner iterable was empty and it was returning 1.0 instead of 1.

I've turned some instance variables into local ones when not used outside the function and sprinkled some docs and and release note. (make Partial read opt-in, and added the option to open_array) .

I've left some ? in the docstring of the partial reader as I'm not always sure if the ints are number of bytes or number of elements. Not sure if it matters much.

Before I pushed last commits, all test seemed to be passing, windows was still pending... though coverage was missing in afew spot. You know the code better than I, so it will likely be easier to know how to cover the missing lines..

zarr/indexing.py Outdated Show resolved Hide resolved
zarr/util.py Show resolved Hide resolved
zarr/util.py Show resolved Hide resolved
@andrewfulton9
Copy link
Contributor Author

@Carreau, thanks for the review and the contributions you added. I'll look over the coverage try to get that back up to 100, and fix the doc strings you added the ? to.

zarr/tests/test_core.py Outdated Show resolved Hide resolved
@Carreau
Copy link
Contributor

Carreau commented Jan 6, 2021

Do you think you can rebase ? There seem to be merge commits that make changes hard to follow.
Other than than +1 to get that in.

@joshmoore
Copy link
Member

Note: discussion (plus notes) during the community call yesterday. General call for feedback so this can get merged to move forward the BaseStore work followed by the V3 updates.

zarr/core.py Outdated Show resolved Hide resolved
zarr/indexing.py Outdated
Comment on lines 832 to 833
def int_to_slice(dim_selection):
return slice(dim_selection, dim_selection + 1, 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can just inline this since it is simple enough and used in only a couple places

@jakirkham
Copy link
Member

Thanks Andrew! 😄 Had a couple suggestions above

andrewfulton9 and others added 2 commits January 20, 2021 09:52
Co-authored-by: jakirkham <jakirkham@gmail.com>
@andrewfulton9
Copy link
Contributor Author

Just pushed up those changes. Thanks @jakirkham !

@Carreau
Copy link
Contributor

Carreau commented Jan 25, 2021

Thanks, with @jakirkham feedback and the last two commits I think we can move forward ! I'm approving as well and merging.

@Carreau Carreau merged commit 33ec64a into zarr-developers:master Jan 25, 2021
@jakirkham
Copy link
Member

Thanks all! 😀

@martindurant
Copy link
Member

Great to see this! I think it's important enough that we may want to put a blog somewhere. What do the final benchmarks look like? Am I right in thinking that blosc is still the default compressor and will continue to be?

@jakirkham
Copy link
Member

We do have a blog on the webpage repo. So writing something up there makes sense.

More generally we may want to do this to summarize the work done for the CZI grant. (cc @davidbrochart as well)

@davidbrochart
Copy link

Thanks for pinging me @jakirkham, I would be happy to summarize the work done on the C++ side in this blog. We could also have a meeting and present to each other what we have done, what do you think?

@rabernat
Copy link
Contributor

Pangeo is organizing a new webinar series of short tech talks. It would be great to have a presentation on the latest developments from Zarr. Presentations will be recorded, publicized and assigned a DOI.

If anyone wants to sign up to present, just fill out this short form: https://forms.gle/zuU8XcQHHKS6DBvv8

@jakirkham
Copy link
Member

jakirkham commented Jan 29, 2021

Thanks for pinging me @jakirkham, I would be happy to summarize the work done on the C++ side in this blog. We could also have a meeting and present to each other what we have done, what do you think?

Yeah, @davidbrochart, I think that makes sense. Do you want to go ahead and set up a poll (for the meeting time)? I think we should try to schedule this in mid-February if that works

@davidbrochart
Copy link

Actually I submitted a proposal for a presentation on xtensor-zarr in the Pangeo webinar series. If that is accepted, that could be a better option, since people could see the recording. What do you think?

@jakirkham
Copy link
Member

I think both are useful. The blog is also likely not too different from other write ups we are doing. So hopefully doesn’t take too much effort to produce. Plus it gives us an opportunity to share this work with a broader audience and publicly recognize CZI for their support, which I’m sure they will appreciate 🙂

@andrewfulton9
Copy link
Contributor Author

I can do a blog on the partial read work as well.

@martindurant
Copy link
Member

Very late question here: has partial read been implemented for uncompressed data? It should be even easier.

@rabernat
Copy link
Contributor

rabernat commented Sep 2, 2021

No

@naught101
Copy link

Would it be possible to rename this "Partial read of blosc-compressed chunks" or similar?

@jakirkham
Copy link
Member

We could do that. Though this likely still comes in the commit history (and potentially other search paths)

That said, as there is a whole new sharding codec planned (please see ZEP 2), this is probably best handled via that pathway, which is planned as part of the v3 work ( #1583 )

Separately there has been discussion of moving from Blosc 1 to 2 ( zarr-developers/numcodecs#413 ), which would likely entail other changes as well

So it is probably best to direct attention to more recent issues/PRs at this stage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants