Zarr-v3 Consolidated Metadata #2113

Merged

Changes from 74 commits (84 commits total)
73d53d7
Fixed MemoryStore.list_dir
TomAugspurger Aug 25, 2024
90940a0
fixup s3
TomAugspurger Aug 25, 2024
8ee89f4
recursive Group.members
TomAugspurger Aug 25, 2024
65a8bd4
Zarr-v3 Consolidated Metadata
TomAugspurger Aug 23, 2024
2515ca3
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 3, 2024
cdaf81f
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 6, 2024
5a86789
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 6, 2024
a839f16
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 10, 2024
5a31390
fixup
TomAugspurger Sep 10, 2024
79bf235
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 11, 2024
fc901eb
read zarr-v2 consolidated metadata
TomAugspurger Sep 11, 2024
3a3eb9d
check writablem
TomAugspurger Sep 12, 2024
78af362
Handle non-root paths
TomAugspurger Sep 12, 2024
750668c
Some error handling
TomAugspurger Sep 12, 2024
63697ab
cleanup
TomAugspurger Sep 12, 2024
5d79274
refactor open
TomAugspurger Sep 12, 2024
0c67972
remove dupe file
TomAugspurger Sep 12, 2024
657ad1e
v2 getitem
TomAugspurger Sep 12, 2024
511ff76
fixup
TomAugspurger Sep 12, 2024
b360eb4
Optimzied members
TomAugspurger Sep 12, 2024
abcdbe6
Impl flatten
TomAugspurger Sep 12, 2024
b9bcfe8
Fixups
TomAugspurger Sep 13, 2024
3575cda
doc
TomAugspurger Sep 13, 2024
7b6bd17
nest the tests
TomAugspurger Sep 13, 2024
500a91e
fixup
TomAugspurger Sep 13, 2024
22d501e
Fixups
TomAugspurger Sep 13, 2024
762cf96
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 13, 2024
d6c6cc7
fixup
TomAugspurger Sep 13, 2024
6755fbc
fixup
TomAugspurger Sep 13, 2024
e406f86
fixup
TomAugspurger Sep 13, 2024
07248ea
fixup
TomAugspurger Sep 13, 2024
bdf15ad
fixup
TomAugspurger Sep 13, 2024
18eb172
consistent open_consolidated handling
TomAugspurger Sep 16, 2024
c11f1ad
fixup
TomAugspurger Sep 16, 2024
f6397f4
make clear that flat_to_nested mutates
TomAugspurger Sep 16, 2024
f55aa37
fixujp
TomAugspurger Sep 16, 2024
123dc60
fixup
TomAugspurger Sep 16, 2024
34c7720
fixup
TomAugspurger Sep 17, 2024
4db042b
Fixup
TomAugspurger Sep 17, 2024
8febba3
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 19, 2024
d730350
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 20, 2024
a1f1ebb
fixup
TomAugspurger Sep 20, 2024
35a3832
fixup
TomAugspurger Sep 20, 2024
c1837fd
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 20, 2024
d03f4bd
fixup
TomAugspurger Sep 20, 2024
cddd01f
fixup
TomAugspurger Sep 20, 2024
9303cd0
added docs
TomAugspurger Sep 20, 2024
87b65f1
fixup
TomAugspurger Sep 20, 2024
ee5d130
Ensure empty dict
TomAugspurger Sep 23, 2024
af9788f
fixed name
TomAugspurger Sep 23, 2024
5a08466
fixup nested
TomAugspurger Sep 23, 2024
d236e53
removed dupe tests
TomAugspurger Sep 23, 2024
2824de6
fixup
TomAugspurger Sep 23, 2024
08a7682
doc fix
TomAugspurger Sep 23, 2024
b8b5f51
fixups
TomAugspurger Sep 24, 2024
ba4fb47
fixup
TomAugspurger Sep 24, 2024
10d062f
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 24, 2024
e6142d8
fixup
TomAugspurger Sep 24, 2024
8ad3738
v2 writer
TomAugspurger Sep 24, 2024
fc94933
fixup
TomAugspurger Sep 24, 2024
79246dd
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 25, 2024
a62240b
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 28, 2024
4bfad1b
fixup
TomAugspurger Sep 28, 2024
ae02bb5
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Sep 30, 2024
3265abd
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Oct 1, 2024
8728440
path fix
TomAugspurger Oct 1, 2024
20c97a4
Fixed v2 use_consolidated=False
TomAugspurger Oct 1, 2024
f7e5b3f
fixupg
TomAugspurger Oct 1, 2024
c31f8a1
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Oct 7, 2024
483681b
Special case object dtype
TomAugspurger Oct 9, 2024
7e76e9e
fixup
TomAugspurger Oct 9, 2024
19b9271
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Oct 9, 2024
418bc6b
Merge branch 'tom/fix/dtype-str-special-case' into user/tom/feature/c…
TomAugspurger Oct 9, 2024
97fa2a0
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Oct 9, 2024
6fab362
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Oct 10, 2024
cbffcbb
docs
TomAugspurger Oct 10, 2024
56d2704
pr review
TomAugspurger Oct 10, 2024
8ade87d
must_understand
TomAugspurger Oct 10, 2024
b5fb721
Updated from_dict checking
TomAugspurger Oct 10, 2024
d17f955
cleanup
TomAugspurger Oct 10, 2024
1d17140
cleanup
TomAugspurger Oct 10, 2024
2b2e3da
Fixed fill_value
TomAugspurger Oct 10, 2024
96b274c
Merge remote-tracking branch 'upstream/v3' into user/tom/feature/cons…
TomAugspurger Oct 10, 2024
c9229d1
fixup
TomAugspurger Oct 10, 2024
74 changes: 74 additions & 0 deletions docs/consolidated_metadata.rst
@@ -0,0 +1,74 @@
Consolidated Metadata
=====================

zarr-python implements the `Consolidated Metadata`_ extension to the Zarr Spec.
Member
Suggested change
zarr-python implements the `Consolidated Metadata_` extension to the Zarr Spec.
Zarr-Python implements the `Consolidated Metadata_` extension to the Zarr Spec.

Consolidated metadata can reduce the time needed to load the metadata for an
entire hierarchy, especially when the metadata is being served over a network.
Consolidated metadata essentially stores all the metadata for a hierarchy in the
metadata of the root Group.
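Concretely, the root metadata document gains a key holding each child's metadata. A minimal sketch in plain Python (the field names follow the v3 spec extension, but this layout is illustrative rather than authoritative):

```python
# Illustrative shape of a root zarr.json carrying consolidated metadata
# for two children. Field names mirror the spec extension; values are
# trimmed-down examples, not real zarr output.
root_zarr_json = {
    "zarr_format": 3,
    "node_type": "group",
    "attributes": {},
    "consolidated_metadata": {
        "kind": "inline",
        "must_understand": False,
        "metadata": {
            "a": {"zarr_format": 3, "node_type": "array", "shape": [1]},
            "child": {"zarr_format": 3, "node_type": "group", "attributes": {}},
        },
    },
}

# A reader that has fetched this single document can enumerate the
# hierarchy without any further requests to the store.
child_names = sorted(root_zarr_json["consolidated_metadata"]["metadata"])
print(child_names)  # ['a', 'child']
```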

Usage
-----

If consolidated metadata is present in a Zarr Group's metadata then it is used
by default. The initial read to open the group will need to communicate with
the store (reading from a file for a :class:`zarr.store.LocalStore`, making a
network request for a :class:`zarr.store.RemoteStore`). After that, any subsequent
metadata reads to get child Group or Array nodes will *not* require reads from the store.

In Python, the consolidated metadata is available on the ``.consolidated_metadata``
attribute of the Group.
Member
Suggested change
attribute of the Group.
attribute of the Group metadata.


.. code-block:: python

>>> import zarr
>>> store = zarr.store.MemoryStore({}, mode="w")
>>> group = zarr.open_group(store=store)
>>> group.create_array(shape=(1,), name="a")
>>> group.create_array(shape=(2, 2), name="b")
>>> group.create_array(shape=(3, 3, 3), name="c")
>>> zarr.consolidate_metadata(store)

If we open that group again, the Group's metadata includes a :class:`zarr.ConsolidatedMetadata`
instance that can be used.

.. code-block:: python

>>> consolidated = zarr.open_group(store=store)
>>> consolidated.metadata.consolidated_metadata.metadata
{'b': ArrayV3Metadata(shape=(2, 2), fill_value=np.float64(0.0), ...),
'a': ArrayV3Metadata(shape=(1,), fill_value=np.float64(0.0), ...),
'c': ArrayV3Metadata(shape=(3, 3, 3), fill_value=np.float64(0.0), ...)}

Operations on the group that access children automatically use the consolidated metadata.

.. code-block:: python

>>> consolidated["a"] # no read / HTTP request to the Store is required
<Array memory://.../a shape=(1,) dtype=float64>

With nested groups, the consolidated metadata is available on the children, recursively.

.. code-block:: python

>>> child = group.create_group("child", attributes={"kind": "child"})
>>> grandchild = child.create_group("child", attributes={"kind": "grandchild"})
>>> consolidated = zarr.consolidate_metadata(store)

>>> consolidated["child"].metadata.consolidated_metadata
ConsolidatedMetadata(metadata={'child': GroupMetadata(attributes={'kind': 'grandchild'}, zarr_format=3, )}, ...)

Synchronization and Concurrency
-------------------------------

Consolidated metadata is intended for read-heavy use cases on slowly changing
hierarchies. For hierarchies where new nodes are constantly being added,
removed, or modified, consolidated metadata may not be desirable.

1. It will add some overhead to each update operation, since the metadata
would need to be re-consolidated to keep it in sync with the store.
2. Readers using consolidated metadata will regularly see a "past" version
   of the metadata, frozen at the time they read the root node with its
   consolidated metadata.
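The second point can be sketched with a plain-Python stand-in for a store: a reader that caches a consolidated snapshot keeps answering from that snapshot even after a writer changes the live store (names here are hypothetical, not zarr API):

```python
# Hypothetical store: one metadata document per node.
store = {"root": {"children": ["a"]}, "a": {"shape": [1]}}

# Reader opens the hierarchy and caches the consolidated view.
snapshot = {key: dict(value) for key, value in store.items()}

# A writer then adds a node but does not re-consolidate.
store["b"] = {"shape": [2, 2]}

# The consolidated reader still sees the hierarchy as of open time.
print("b" in snapshot)  # False: the snapshot predates the write
print("b" in store)     # True: the live store has the new node
```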

.. _Consolidated Metadata: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#consolidated-metadata
Member
These docs are great @TomAugspurger! 👏

1 change: 1 addition & 0 deletions docs/index.rst
@@ -10,6 +10,7 @@ Zarr-Python

getting_started
tutorial
consolidated_metadata
api/index
spec
release
105 changes: 95 additions & 10 deletions src/zarr/api/asynchronous.py
@@ -1,6 +1,7 @@
from __future__ import annotations

import asyncio
import dataclasses
import warnings
from typing import TYPE_CHECKING, Any, Literal, cast

@@ -9,11 +10,18 @@

from zarr.abc.store import Store
from zarr.core.array import Array, AsyncArray, get_array_metadata
from zarr.core.common import JSON, AccessModeLiteral, ChunkCoords, MemoryOrder, ZarrFormat
from zarr.core.buffer import NDArrayLike
from zarr.core.chunk_key_encodings import ChunkKeyEncoding
from zarr.core.common import (
JSON,
AccessModeLiteral,
ChunkCoords,
MemoryOrder,
ZarrFormat,
)
from zarr.core.config import config
from zarr.core.group import AsyncGroup
from zarr.core.metadata.v2 import ArrayV2Metadata
from zarr.core.metadata.v3 import ArrayV3Metadata
from zarr.core.group import AsyncGroup, ConsolidatedMetadata, GroupMetadata
from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata
from zarr.storage import (
StoreLike,
StorePath,
@@ -132,8 +140,61 @@ def _default_zarr_version() -> ZarrFormat:
return cast(ZarrFormat, int(config.get("default_zarr_version", 3)))


async def consolidate_metadata(*args: Any, **kwargs: Any) -> AsyncGroup:
raise NotImplementedError
async def consolidate_metadata(
store: StoreLike,
path: str | None = None,
zarr_format: ZarrFormat | None = None,
) -> AsyncGroup:
"""
Consolidate the metadata of all nodes in a hierarchy.

Upon completion, the metadata of the root node in the Zarr hierarchy will be
updated to include all the metadata of child nodes.

Parameters
----------
store: StoreLike
The store-like object whose metadata you wish to consolidate.
path: str, optional
A path to a group in the store to consolidate at. Only children
below that group will be consolidated.

By default, the root node is used so all the metadata in the
store is consolidated.

TomAugspurger marked this conversation as resolved.
Returns
-------
group: AsyncGroup
The group, with the ``consolidated_metadata`` field set to include
the metadata of each child node.
"""
store_path = await make_store_path(store)

if path is not None:
store_path = store_path / path

group = await AsyncGroup.open(store_path, zarr_format=zarr_format, use_consolidated=False)
group.store_path.store._check_writable()

members_metadata = {k: v.metadata async for k, v in group.members(max_depth=None)}

# While consolidating, we want to be explicit about when child groups
# are empty by inserting an empty dict for consolidated_metadata.metadata
for k, v in members_metadata.items():
if isinstance(v, GroupMetadata) and v.consolidated_metadata is None:
v = dataclasses.replace(v, consolidated_metadata=ConsolidatedMetadata(metadata={}))
members_metadata[k] = v

ConsolidatedMetadata._flat_to_nested(members_metadata)

consolidated_metadata = ConsolidatedMetadata(metadata=members_metadata)
metadata = dataclasses.replace(group.metadata, consolidated_metadata=consolidated_metadata)
group = dataclasses.replace(
group,
metadata=metadata,
)
await group._save_metadata()
TomAugspurger marked this conversation as resolved.
return group
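The ``_flat_to_nested`` step above reshapes the flat ``{path: metadata}`` mapping produced by ``group.members(max_depth=None)`` into per-child nested metadata. A hypothetical sketch of that transform on plain dicts (the real method works on metadata dataclasses and mutates its argument; the names below are illustrative only):

```python
# Turn a mapping keyed by slash-separated paths into nested children.
# Parents happen to be visited before their descendants here; the real
# implementation must also handle arbitrary key ordering.
def flat_to_nested(flat: dict) -> dict:
    nested: dict = {}
    for path, meta in flat.items():
        parts = path.split("/")
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {"children": {}})["children"]
        node[parts[-1]] = {"meta": meta, "children": {}}
    return nested

tree = flat_to_nested({"a": "array-meta", "child": "group-meta",
                       "child/grandchild": "group-meta-2"})
print(tree["child"]["children"]["grandchild"]["meta"])  # group-meta-2
```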


async def copy(*args: Any, **kwargs: Any) -> tuple[int, int, int]:
@@ -251,8 +312,11 @@ async def open(
return await open_group(store=store_path, zarr_format=zarr_format, **kwargs)


async def open_consolidated(*args: Any, **kwargs: Any) -> AsyncGroup:
raise NotImplementedError
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup:
"""
Alias for :func:`open_group` with ``use_consolidated=True``.
"""
return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
Member
Suggested change
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup:
"""
Alias for :func:`open_group` with ``use_consolidated=True``.
"""
return await open_group(*args, use_consolidated=use_consolidated, **kwargs)
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup:
"""
Alias for :func:`open_group` with ``use_consolidated=True``.
"""
return await open_group(*args, use_consolidated=True, **kwargs)

Seems to me that open_consolidated should always use use_consolidated=True

Contributor Author

My goal here is to avoid a user accidentally passing use_consolidated in **kwargs and us silently overwriting it. I'll update this to raise if use_consolidated isn't True.
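That follow-up could look roughly like the sketch below, with a stub standing in for ``open_group`` (the guard and its exception type are assumptions based on the comment, not the merged code):

```python
import asyncio


async def open_group(*args, **kwargs):
    # Stand-in for zarr's real async open_group.
    return ("group", kwargs.get("use_consolidated"))


async def open_consolidated(*args, use_consolidated=True, **kwargs):
    # Raise rather than silently overriding a caller's explicit value.
    if use_consolidated is not True:
        raise TypeError(
            "'use_consolidated' must be True in open_consolidated"
        )
    return await open_group(*args, use_consolidated=True, **kwargs)


result = asyncio.run(open_consolidated())
print(result)  # ('group', True)
```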



async def save(
@@ -542,6 +606,7 @@ async def open_group(
zarr_format: ZarrFormat | None = None,
meta_array: Any | None = None, # not used
attributes: dict[str, JSON] | None = None,
use_consolidated: bool | str | None = None,
) -> AsyncGroup:
"""Open a group using file-mode-like semantics.

@@ -580,6 +645,22 @@
meta_array : array-like, optional
An array instance to use for determining arrays to create and return
to users. Use `numpy.empty(())` by default.
use_consolidated: bool or str, default None
Whether to use consolidated metadata.

By default, consolidated metadata is used if it's present in the
store (in the ``zarr.json`` for Zarr v3 and in the ``.zmetadata`` file
for Zarr v2).

To explicitly require consolidated metadata, set ``use_consolidated=True``,
which will raise an exception if consolidated metadata is not found.

To explicitly *not* use consolidated metadata, set ``use_consolidated=False``,
which will fall back to using the regular, non-consolidated metadata.

Zarr v2 allowed configuring the key storing the consolidated metadata
(``.zmetadata`` by default). Specify the custom key as ``use_consolidated``
to load consolidated metadata from a non-default key.

Returns
-------
@@ -606,7 +687,9 @@
attributes = {}

try:
return await AsyncGroup.open(store_path, zarr_format=zarr_format)
return await AsyncGroup.open(
store_path, zarr_format=zarr_format, use_consolidated=use_consolidated
)
except (KeyError, FileNotFoundError):
return await AsyncGroup.from_store(
store_path,
@@ -768,7 +851,9 @@ async def create(
)
else:
warnings.warn(
"dimension_separator is not yet implemented", RuntimeWarning, stacklevel=2
"dimension_separator is not yet implemented",
RuntimeWarning,
stacklevel=2,
)
if write_empty_chunks:
warnings.warn("write_empty_chunks is not yet implemented", RuntimeWarning, stacklevel=2)
8 changes: 6 additions & 2 deletions src/zarr/api/synchronous.py
@@ -90,8 +90,10 @@ def open(
return Group(obj)


def open_consolidated(*args: Any, **kwargs: Any) -> Group:
return Group(sync(async_api.open_consolidated(*args, **kwargs)))
def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> Group:
return Group(
sync(async_api.open_consolidated(*args, use_consolidated=use_consolidated, **kwargs))
)


def save(
@@ -207,6 +209,7 @@ def open_group(
zarr_version: ZarrFormat | None = None, # deprecated
zarr_format: ZarrFormat | None = None,
meta_array: Any | None = None, # not used in async api
use_consolidated: bool | str | None = None,
) -> Group:
return Group(
sync(
@@ -221,6 +224,7 @@
zarr_version=zarr_version,
zarr_format=zarr_format,
meta_array=meta_array,
use_consolidated=use_consolidated,
)
)
)
10 changes: 9 additions & 1 deletion src/zarr/core/array.py
@@ -35,6 +35,7 @@
ShapeLike,
ZarrFormat,
concurrent_map,
parse_dtype,
parse_shapelike,
product,
)
@@ -157,6 +158,12 @@ async def get_array_metadata(
# V3 arrays are comprised of a zarr.json object
assert zarr_json_bytes is not None
metadata_dict = json.loads(zarr_json_bytes.to_bytes())

if metadata_dict.get("node_type") != "array":
# This KeyError is load bearing for `open`. That currently tries
# to open the node as an `array` and then falls back to opening
# as a group.
raise KeyError
Member

perhaps raise NodeTypeValidationError from #2310

return metadata_dict


@@ -226,7 +233,8 @@ async def create(
if chunks is not None and chunk_shape is not None:
raise ValueError("Only one of chunk_shape or chunks can be provided.")

dtype = np.dtype(dtype)
dtype = parse_dtype(dtype, zarr_format)
# dtype = np.dtype(dtype)
Member

Suggested change
# dtype = np.dtype(dtype)

if chunks:
_chunks = normalize_chunks(chunks, shape, dtype.itemsize)
else:
15 changes: 15 additions & 0 deletions src/zarr/core/common.py
@@ -16,6 +16,10 @@
overload,
)

import numpy as np

from zarr.core.strings import _STRING_DTYPE

if TYPE_CHECKING:
from collections.abc import Awaitable, Callable, Iterator

@@ -24,6 +28,7 @@
ZARRAY_JSON = ".zarray"
ZGROUP_JSON = ".zgroup"
ZATTRS_JSON = ".zattrs"
ZMETADATA_V2_JSON = ".zmetadata"

ByteRangeRequest = tuple[int | None, int | None]
BytesLike = bytes | bytearray | memoryview
@@ -162,3 +167,13 @@ def parse_order(data: Any) -> Literal["C", "F"]:
if data in ("C", "F"):
return cast(Literal["C", "F"], data)
raise ValueError(f"Expected one of ('C', 'F'), got {data} instead.")


def parse_dtype(dtype: Any, zarr_format: ZarrFormat) -> np.dtype[Any]:
if dtype is str or dtype == "str":
if zarr_format == 2:
# special case as object
return np.dtype("object")
else:
return _STRING_DTYPE
return np.dtype(dtype)
Member

this is likely from #2323

Noting because I think we should merge that first then resolve the conflict here.

Contributor Author

Yep, this is me messing up git branches. I'll back it out here as well.
