Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v3] fix: zarr v2 compatibility fixes for Dask #2186

Merged
merged 20 commits into from
Oct 1, 2024
Merged

[v3] fix: zarr v2 compatibility fixes for Dask #2186

merged 20 commits into from
Oct 1, 2024

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Sep 14, 2024

  • port normalize_chunks from v2
  • add array.store property
  • default to append in create
  • rename zarr.store to zarr.storage

closes #1953
xref: dask/dask#11388

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

- port normalize_chunks from v2
- add array.store property
- default to append in create
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/core/array.py Outdated Show resolved Hide resolved
src/zarr/core/array.py Show resolved Hide resolved
src/zarr/core/array.py Show resolved Hide resolved
Comment on lines 121 to 125
if all(isinstance(c, (tuple | list)) for c in chunks):
# take first chunk size for each dimension
chunks = (
c[0] for c in chunks
) # TODO: check/error/warn for irregular chunks (e.g. if c[0] != c[1:-1])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except for this block, the rest of this function came from 2.x

@will-moore
Copy link

Hi @jhamman - thanks for looking into this!
I tried to remember where I got to when I was looking into this before...

Here is what we're doing to load remote data with dask and Zarr v2...

# dask==2023.5.0
# dask-image==2022.9.0
# zarr==2.18.2
import dask.array as da
from zarr import Array, Group, open
from zarr.storage import FSStore
import json

url = "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr"
store = FSStore(url)
zattrs = json.loads(store.get(".zattrs"))
array_path = zattrs["multiscales"][0]["datasets"][0]["path"]
array_data = da.from_zarr(store, array_path)
print(array_data)
# dask.array<from-zarr, shape=(2, 236, 275, 271), dtype=uint16, chunksize=(1, 1, 275, 271), chunktype=numpy.ndarray>

With zarr v3 I'm not getting as far as opening with dask as I'm failing to access any remote data, using a couple of different approaches here (using zarr from this branch), both failing with exceptions noted below...

import zarr
from zarr.store import RemoteStore
import s3fs

url = "https://uk1s3.embassy.ebi.ac.uk/idr/share/ome2024-ngff-challenge/0.0.5/6001240.zarr"
store = RemoteStore(url)
# RuntimeError: Timeout context manager should be used inside a task
g = zarr.group(store = store)


s3 = s3fs.S3FileSystem(anon=True, client_kwargs={'endpoint_url': 'https://uk1s3.embassy.ebi.ac.uk'})
store = s3fs.S3Map(root='idr/share/ome2024-ngff-challenge/0.0.5/6001240.zarr', s3=s3, check=False)
# TypeError - only LocalStore or MemoryStore is supported
g = zarr.group(store = store)

Is there some other way that I can try to load remote data?

Many thanks,
Will

This was referenced Sep 17, 2024
@TomAugspurger
Copy link
Contributor

Thanks for trying this out @will-moore. Can you share the full traceback?

In the meantime, I would expect something like

 g = zarr.open_group(store=zarr.store.RemoteStore("s3://idr", endpoint_url="http://uk1s3.embassy.ebi.ac.uk", anon=True), path="zarr/v0.4/idr0062A/6001240.zarr")

Can you let us know if that works? It doesn't for me, but I'm not sure if some network / authentication setup is required.

tests/v3/test_codecs/test_codecs.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
@jhamman jhamman requested a review from d-v-b September 17, 2024 15:46
@will-moore
Copy link

will-moore commented Sep 17, 2024

That sample you tried there is zarr v2 data, and it also seemed to need https rather than http!
So this works for me on another machine, with zarr==3.0.0a0, reading both zarr v2 and v3 data:

store=zarr.store.RemoteStore("s3://idr", endpoint_url="https://uk1s3.embassy.ebi.ac.uk", anon=True)
g = zarr.open_group(store=store, path="share/ome2024-ngff-challenge/idr0048/9846151.zarr")
print('zarr v3 group', g)

g2 = zarr.open_group(store=store, path="zarr/v0.4/idr0062A/6001240.zarr", zarr_version=2)
print('zarr v2', g2)

gives me:

zarr v3 group Group(_async_group=<AsyncGroup Remote fsspec store: S3FileSystem , idr/share/ome2024-ngff-challenge/idr0048/9846151.zarr>)
zarr v2 Group(_async_group=<AsyncGroup Remote fsspec store: S3FileSystem , idr/zarr/v0.4/idr0062A/6001240.zarr>)

On my machine, I unfortunately get:

this error
    g = zarr.open_group(store=store, path="share/ome2024-ngff-challenge/idr0048/9846151.zarr")
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/zarr/api/synchronous.py", line 175, in open_group
    sync(
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/zarr/sync.py", line 92, in sync
    raise return_result
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/zarr/sync.py", line 51, in _runner
    return await coro
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/zarr/api/asynchronous.py", line 523, in open_group
    return await AsyncGroup.open(store_path, zarr_format=zarr_format)
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/zarr/group.py", line 150, in open
    zarr_json_bytes = await (store_path / ZARR_JSON).get()
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/zarr/store/core.py", line 35, in get
    return await self.store.get(self.path, prototype=prototype, byte_range=byte_range)
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/zarr/store/remote.py", line 98, in get
    await (
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/s3fs/core.py", line 1128, in _cat_file
    return await _error_wrapper(_call_and_read, retries=self.retries)
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/s3fs/core.py", line 145, in _error_wrapper
    raise err
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/s3fs/core.py", line 1115, in _call_and_read
    resp = await self._call_s3(
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/s3fs/core.py", line 358, in _call_s3
    await self.set_session()
  File "/Users/wmoore/opt/anaconda3/envs/zarr_v3/lib/python3.10/site-packages/s3fs/core.py", line 519, in set_session
    self.session = aiobotocore.session.AioSession(**self.kwargs)
TypeError: AioSession.__init__() got an unexpected keyword argument '//'

Which is an issue that seems to affect just me, as I already reported at ome/ome2024-ngff-challenge#22 and I can't seem to fix on my machine, which is kinda annoying!
Might have to re-install anaconda, or try some other option...

Thanks for your help!

@TomAugspurger TomAugspurger mentioned this pull request Sep 25, 2024
6 tasks
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/core/array.py Outdated Show resolved Hide resolved
Comment on lines 2394 to 2399
parents = [
AsyncGroup(
metadata=GroupMetadata(zarr_format=node.metadata.zarr_format),
store_path=StorePath(store=node.store_path.store, path=""),
)
]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger - this was needed to enable zarr.create(store={}, path="a") such that the root zarr.json is created in addition to a/zarr.json.

tests/v3/test_codecs/test_codecs.py Show resolved Hide resolved
src/zarr/core/chunk_grids.py Show resolved Hide resolved
src/zarr/core/array.py Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/api/asynchronous.py Outdated Show resolved Hide resolved
src/zarr/core/array.py Outdated Show resolved Hide resolved
src/zarr/core/array.py Outdated Show resolved Hide resolved
src/zarr/core/array.py Outdated Show resolved Hide resolved
Copy link
Contributor

@d-v-b d-v-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good to go i think

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple minor things.

@@ -635,6 +639,8 @@ async def _save_metadata(self, metadata: ArrayMetadata, ensure_parents: bool = F
# To enable zarr.create(store, path="a/b/c"), we need to create all the intermediate groups.
parents = _build_parents(self)

logger.debug("Ensure parents: %s", parents)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep this?

parents = []
store = node.store_path.store
path = node.store_path.path
if not path:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know whether this branch is covered by an existing test? If not, it might be good to add one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this path is highly exercised.

src/zarr/core/array.py Outdated Show resolved Hide resolved
@jhamman jhamman merged commit f3a2e0a into v3 Oct 1, 2024
23 checks passed
@dstansby dstansby deleted the fix/dask-compat branch October 13, 2024 07:31
@dstansby dstansby restored the fix/dask-compat branch October 13, 2024 07:31
@jhamman jhamman deleted the fix/dask-compat branch October 15, 2024 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v3] Array constructor API compatibility
4 participants