Zarr reader #271

Open
wants to merge 70 commits into
base: main
70 commits
26a94df
wip toward zarr v2 reader
norlandrhagen Oct 24, 2024
cfb7b8d
removed _ARRAY_DIMENSIONS and trimmed down attrs
norlandrhagen Oct 24, 2024
2f26f03
WIP for zarr reader
norlandrhagen Oct 24, 2024
eab87a6
adding in the key piece, the reader
norlandrhagen Oct 24, 2024
13db375
virtual dataset is returned! Now to deal with fill_value
norlandrhagen Oct 31, 2024
cc30ad7
Merge branch 'main' into zarr_reader
norlandrhagen Nov 12, 2024
a047ff9
Update virtualizarr/readers/zarr.py
norlandrhagen Nov 12, 2024
072bead
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen Nov 12, 2024
f7c9a3f
replace fsspec ls with zarr.getsize
norlandrhagen Nov 15, 2024
2024606
lint
norlandrhagen Nov 15, 2024
443435b
wip test_zarr
norlandrhagen Nov 15, 2024
50fd8b5
removed pdb
norlandrhagen Nov 15, 2024
d93c932
zarr import in type checking
norlandrhagen Nov 19, 2024
39be1c5
moved get_chunk_paths & get_chunk_size async funcs outside of constru…
norlandrhagen Nov 19, 2024
e718240
added a few notes from PR review.
norlandrhagen Nov 19, 2024
bbcd473
removed array encoding
norlandrhagen Nov 19, 2024
ed9f2b4
v2 passing, v3 skipped for now
norlandrhagen Nov 19, 2024
db89da7
added missed staged files
norlandrhagen Nov 19, 2024
e3d4318
fixed merge conflicts with main
norlandrhagen Nov 19, 2024
410b2a3
missing return
norlandrhagen Nov 19, 2024
8a69963
add network
norlandrhagen Nov 19, 2024
3fca8e6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 19, 2024
34053b0
conftest fix
norlandrhagen Nov 19, 2024
5c26b1f
naming
norlandrhagen Nov 19, 2024
fb784dc
comment out integration test for now
norlandrhagen Nov 19, 2024
0444fd4
refactored test_dataset_from_zarr ZArray tests
norlandrhagen Nov 20, 2024
66fd456
adds zarr v3 req opt
norlandrhagen Nov 20, 2024
13fce09
zarr_v3 decorator
norlandrhagen Nov 20, 2024
c36962d
add more tests
norlandrhagen Nov 20, 2024
4be4906
wip
norlandrhagen Nov 21, 2024
ca5ff32
adds missing await
norlandrhagen Nov 21, 2024
88cbeca
more tests
norlandrhagen Nov 21, 2024
1fbdc9c
wip
norlandrhagen Nov 21, 2024
370621f
wip on v3
norlandrhagen Nov 21, 2024
9bb0653
add note + xfail v3
norlandrhagen Nov 21, 2024
7e03ea5
tmp run network
norlandrhagen Nov 21, 2024
5c1e331
revert
norlandrhagen Nov 21, 2024
9404625
update construct_virtual_array ordering
norlandrhagen Nov 22, 2024
1a5a960
merge
norlandrhagen Dec 3, 2024
cc7d68c
updated ABC after merge
norlandrhagen Dec 3, 2024
ac105ea
wip
norlandrhagen Dec 9, 2024
7b57bd0
Merge branch 'main' into zarr_reader
norlandrhagen Dec 9, 2024
ff01c92
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 9, 2024
4f2470a
working for v2 and v3, but only local
norlandrhagen Dec 10, 2024
0c1ff82
merge
norlandrhagen Dec 10, 2024
05d4050
cleanup test_zarr reader test
norlandrhagen Dec 11, 2024
f40ba28
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Dec 11, 2024
b5fb802
cleanup after zarr-python issue report
norlandrhagen Dec 12, 2024
be5280f
Merge branch 'zarr_reader' of https://github.com/zarr-developers/Virt…
norlandrhagen Dec 12, 2024
690ffee
temp disabled validate_and_normalize_path_to_uri due to issue in zarr…
norlandrhagen Dec 16, 2024
98600e7
Merge branch 'main' into zarr_reader
norlandrhagen Dec 16, 2024
31a1b94
marked zarr integration test skipped b/c of zarr-v3 and kerchunk inco…
norlandrhagen Dec 16, 2024
795c428
fixes some async behavior, reading from s3 seems to work
norlandrhagen Dec 17, 2024
c0004c6
lint + uri_fmt
norlandrhagen Dec 17, 2024
60b8912
adds to releases.rst
norlandrhagen Dec 17, 2024
8240997
nit
norlandrhagen Dec 17, 2024
816e696
cleanup, comments and nits
norlandrhagen Dec 17, 2024
31aacf9
progress on mypy
norlandrhagen Dec 17, 2024
5d14b20
make mypy happy
norlandrhagen Dec 17, 2024
fb844b6
adds option for AsyncArray to _is_zarr_array
norlandrhagen Dec 18, 2024
421f53f
big async rewrite
norlandrhagen Dec 19, 2024
cedad11
merge w/ main
norlandrhagen Dec 19, 2024
1c5e42d
fixes merge conflict
norlandrhagen Dec 19, 2024
89d8555
bit of restructure
norlandrhagen Dec 19, 2024
c1a5218
nit
norlandrhagen Dec 19, 2024
6af84b4
WIP on ChunkManifest.from_arrays
norlandrhagen Dec 20, 2024
349386f
v2/v3 c chunk fix + build ChunkManifest from numpy arrays
norlandrhagen Dec 21, 2024
c776ab9
removed method of creating ChunkManifests from dicts
norlandrhagen Dec 21, 2024
fb6fff7
cleanup
norlandrhagen Dec 21, 2024
87c74d4
adds xfails to TestOpenVirtualDatasetZarr due to local filesystem zar…
norlandrhagen Dec 21, 2024
1 change: 1 addition & 0 deletions ci/upstream.yml
@@ -26,6 +26,7 @@ dependencies:
- pytest
- pooch
- fsspec
- dask
- pip
- pip:
- icechunk>=0.1.0a7 # Installs zarr v3 as dependency
16 changes: 16 additions & 0 deletions conftest.py
@@ -24,6 +24,22 @@ def pytest_runtest_setup(item):
)


def _xarray_subset():
ds = xr.tutorial.open_dataset("air_temperature", chunks={})
return ds.isel(time=slice(0, 10), lat=slice(0, 9), lon=slice(0, 18)).chunk(
{"time": 5}
)


@pytest.fixture(params=[2, 3])
def zarr_store(tmpdir, request):
ds = _xarray_subset()
filepath = f"{tmpdir}/air.zarr"
ds.to_zarr(filepath, zarr_format=request.param)
ds.close()
return filepath


@pytest.fixture
def empty_netcdf4_file(tmpdir):
# Set up example xarray dataset
3 changes: 3 additions & 0 deletions docs/releases.rst
@@ -9,6 +9,9 @@ v1.2.1 (unreleased)
New Features
~~~~~~~~~~~~

- Adds a Zarr reader to ``open_virtual_dataset``, which allows opening Zarr V2 and V3 stores as virtual datasets.
(:pull:`271`) By `Raphael Hagen <https://github.com/norlandrhagen>`_.

- Added a ``.nbytes`` accessor method which displays the bytes needed to hold the virtual references in memory.
(:issue:`167`, :pull:`227`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

7 changes: 3 additions & 4 deletions virtualizarr/backend.py
@@ -17,15 +17,15 @@
KerchunkVirtualBackend,
NetCDF3VirtualBackend,
TIFFVirtualBackend,
- ZarrV3VirtualBackend,
+ ZarrVirtualBackend,
)
from virtualizarr.readers.common import VirtualBackend
from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions

# TODO add entrypoint to allow external libraries to add to this mapping
VIRTUAL_BACKENDS = {
"kerchunk": KerchunkVirtualBackend,
- "zarr_v3": ZarrV3VirtualBackend,
+ "zarr": ZarrVirtualBackend,
"dmrpp": DMRPPVirtualBackend,
# all the below call one of the kerchunk backends internally (https://fsspec.github.io/kerchunk/reference.html#file-format-backends)
"hdf5": HDF5VirtualBackend,
@@ -72,8 +72,7 @@ def automatically_determine_filetype(

# TODO how do we handle kerchunk json / parquet here?
if Path(filepath).suffix == ".zarr":
- # TODO we could imagine opening an existing zarr store, concatenating it, and writing a new virtual one...
- raise NotImplementedError()
+ return FileType.zarr

# Read magic bytes from local or remote file
fpath = _FsspecFSFromFilepath(
4 changes: 2 additions & 2 deletions virtualizarr/codecs.py
@@ -65,9 +65,9 @@ def _get_manifestarray_codecs(
def _is_zarr_array(array: object) -> bool:
"""Check if the array is an instance of Zarr Array."""
try:
- from zarr import Array
+ from zarr import Array, AsyncArray

- return isinstance(array, Array)
+ return isinstance(array, (Array, AsyncArray))
except ImportError:
return False
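
The check above returns `False` whenever the combined import fails, which would include an installed zarr that exports `Array` but not `AsyncArray`. A minimal sketch of a more tolerant variant follows; `is_zarr_array` is a hypothetical name used for illustration, not the function in this PR, and it only assumes that the array classes, where present, are exposed as top-level attributes of the `zarr` package.

```python
def is_zarr_array(array: object) -> bool:
    """Sketch: like `_is_zarr_array`, but tolerant of a zarr without AsyncArray."""
    try:
        import zarr
    except ImportError:
        # zarr is not installed, so nothing can be a zarr array.
        return False

    # Collect whichever array classes this zarr installation exposes.
    classes = []
    for name in ("Array", "AsyncArray"):
        cls = getattr(zarr, name, None)
        if isinstance(cls, type):
            classes.append(cls)

    # isinstance with an empty tuple is valid and always returns False.
    return isinstance(array, tuple(classes))
```

Plain Python objects such as lists are never reported as zarr arrays, whether or not zarr is installed.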

15 changes: 2 additions & 13 deletions virtualizarr/manifests/manifest.py
@@ -48,7 +48,6 @@ def with_validation(
"""

# note: we can't just use `__init__` or a dataclass' `__post_init__` because we need `fs_root` to be an optional kwarg

path = validate_and_normalize_path_to_uri(path, fs_root=fs_root)
validate_byte_range(offset=offset, length=length)
return ChunkEntry(path=path, offset=offset, length=length)
@@ -84,7 +83,8 @@ def validate_and_normalize_path_to_uri(path: str, fs_root: str | None = None) ->
return urlunparse(components)

elif any(path.startswith(prefix) for prefix in VALID_URI_PREFIXES):
- if not PosixPath(path).suffix:
+ # Question: this feels fragile. Is there a better way to identify a Zarr store?
+ if not PosixPath(path).suffix and "zarr" not in path:
raise ValueError(
f"entries in the manifest must be paths to files, but this path has no file suffix: {path}"
)
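
To make the fragility raised in the code comment concrete, here is a minimal sketch of the new condition in isolation. `passes_suffix_check` is a hypothetical stand-in, not a VirtualiZarr function, and it uses `PurePosixPath` instead of `PosixPath` so it runs on any platform; the example paths are made up.

```python
from pathlib import PurePosixPath


def passes_suffix_check(path: str) -> bool:
    """Hypothetical stand-in for the condition added in this hunk."""
    # Mirrors: `if not PosixPath(path).suffix and "zarr" not in path: raise`
    return bool(PurePosixPath(path).suffix) or "zarr" in path


# A Zarr v3-style chunk key has no suffix but sits under a ".zarr" store.
print(passes_suffix_check("s3://bucket/air.zarr/air/c/0/0/0"))  # True
# A suffix-less object outside any store is still rejected.
print(passes_suffix_check("s3://bucket/data/chunk"))  # False
# The substring test also accepts unrelated paths that merely contain "zarr".
print(passes_suffix_check("s3://bucket/zarr-notes/chunk"))  # True
```

The last case shows why the substring test is fragile: any path containing "zarr" passes, whether or not it points into a Zarr store.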
@@ -357,17 +357,6 @@ def shape_chunk_grid(self) -> tuple[int, ...]:
def __repr__(self) -> str:
return f"ChunkManifest<shape={self.shape_chunk_grid}>"

- @property
- def nbytes(self) -> int:
- """
- Size required to hold these references in memory in bytes.
-
- Note this is not the size of the referenced chunks if they were actually loaded into memory,
- this is only the size of the pointers to the chunk locations.
- If you were to load the data into memory it would be ~1e6x larger for 1MB chunks.
- """
- return self._paths.nbytes + self._offsets.nbytes + self._lengths.nbytes

def __getitem__(self, key: ChunkKey) -> ChunkEntry:
indices = split(key)
path = self._paths[indices]
6 changes: 4 additions & 2 deletions virtualizarr/readers/__init__.py
@@ -5,7 +5,9 @@
from virtualizarr.readers.kerchunk import KerchunkVirtualBackend
from virtualizarr.readers.netcdf3 import NetCDF3VirtualBackend
from virtualizarr.readers.tiff import TIFFVirtualBackend
- from virtualizarr.readers.zarr_v3 import ZarrV3VirtualBackend
+ from virtualizarr.readers.zarr import (
+ ZarrVirtualBackend,
+ )

__all__ = [
"DMRPPVirtualBackend",
@@ -15,5 +17,5 @@
"KerchunkVirtualBackend",
"NetCDF3VirtualBackend",
"TIFFVirtualBackend",
- "ZarrV3VirtualBackend",
+ "ZarrVirtualBackend",
]
18 changes: 13 additions & 5 deletions virtualizarr/readers/common.py
@@ -45,16 +45,24 @@ def maybe_open_loadable_vars_and_indexes(
# TODO Really we probably want a dedicated backend that iterates over all variables only once
# TODO See issue #124 for a suggestion of how to avoid calling xarray here.

fpath = _FsspecFSFromFilepath(
filepath=filepath, reader_options=reader_options
).open_file()
fpath = _FsspecFSFromFilepath(filepath=filepath, reader_options=reader_options)

# Update the xarray open_dataset arguments when the input is a Zarr store

if fpath.upath.suffix == ".zarr":
engine = "zarr"
xr_input = fpath.filepath

else:
engine = None
xr_input = fpath.open_file() # type: ignore

# fpath can be `Any` thanks to fsspec.filesystem(...).open() returning Any.
ds = open_dataset(
fpath, # type: ignore[arg-type]
xr_input, # type: ignore[arg-type]
drop_variables=drop_variables,
group=group,
decode_times=decode_times,
engine=engine,
)

if indexes is None:
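
The branching added to `virtualizarr/readers/common.py` picks both the xarray engine and the kind of input handed to `open_dataset`. A minimal sketch of that decision on its own follows. `plan_xarray_open` is a hypothetical helper, and it uses `PurePosixPath` where the PR reads `fpath.upath.suffix`, so it is an approximation under those assumptions and not the code in the PR.

```python
from pathlib import PurePosixPath
from typing import Optional


def plan_xarray_open(filepath: str) -> tuple[Optional[str], str]:
    """Hypothetical helper: return (engine, input_kind) for `open_dataset`."""
    if PurePosixPath(filepath).suffix == ".zarr":
        # A Zarr store is a directory of objects, so it is passed by path
        # and the engine is named explicitly.
        return "zarr", "path"
    # Other formats are opened as a file object and xarray picks the engine.
    return None, "file"
```

For example, `plan_xarray_open("s3://bucket/air.zarr")` gives `("zarr", "path")`, while `plan_xarray_open("/tmp/air.nc")` gives `(None, "file")`.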