-
-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zarr-v3 Consolidated Metadata #2113
Changes from 74 commits
73d53d7
90940a0
8ee89f4
65a8bd4
2515ca3
cdaf81f
5a86789
a839f16
5a31390
79bf235
fc901eb
3a3eb9d
78af362
750668c
63697ab
5d79274
0c67972
657ad1e
511ff76
b360eb4
abcdbe6
b9bcfe8
3575cda
7b6bd17
500a91e
22d501e
762cf96
d6c6cc7
6755fbc
e406f86
07248ea
bdf15ad
18eb172
c11f1ad
f6397f4
f55aa37
123dc60
34c7720
4db042b
8febba3
d730350
a1f1ebb
35a3832
c1837fd
d03f4bd
cddd01f
9303cd0
87b65f1
ee5d130
af9788f
5a08466
d236e53
2824de6
08a7682
b8b5f51
ba4fb47
10d062f
e6142d8
8ad3738
fc94933
79246dd
a62240b
4bfad1b
ae02bb5
3265abd
8728440
20c97a4
f7e5b3f
c31f8a1
483681b
7e76e9e
19b9271
418bc6b
97fa2a0
6fab362
cbffcbb
56d2704
8ade87d
b5fb721
d17f955
1d17140
2b2e3da
96b274c
c9229d1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,74 @@ | ||||||
Consolidated Metadata | ||||||
===================== | ||||||
|
||||||
zarr-python implements the `Consolidated Metadata_` extension to the Zarr Spec. | ||||||
Consolidated metadata can reduce the time needed to load the metadata for an | ||||||
entire hierarchy, especially when the metadata is being served over a network. | ||||||
Consolidated metadata essentially stores all the metadata for a hierarchy in the | ||||||
metadata of the root Group. | ||||||
|
||||||
Usage | ||||||
----- | ||||||
|
||||||
If consolidated metadata is present in a Zarr Group's metadata then it is used | ||||||
by default. The initial read to open the group will need to communicate with | ||||||
the store (reading from a file for a :class:`zarr.store.LocalStore`, making a | ||||||
network request for a :class:`zarr.store.RemoteStore`). After that, any subsequent | ||||||
metadata reads get child Group or Array nodes will *not* require reads from the store. | ||||||
|
||||||
In Python, the consolidated metadata is available on the ``.consolidated_metadata`` | ||||||
attribute of the Group. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
||||||
.. code-block:: python | ||||||
|
||||||
>>> import zarr | ||||||
>>> store = zarr.store.MemoryStore({}, mode="w") | ||||||
>>> group = zarr.open_group(store=store) | ||||||
>>> group.create_array(shape=(1,), name="a") | ||||||
>>> group.create_array(shape=(2, 2), name="b") | ||||||
>>> group.create_array(shape=(3, 3, 3), name="c") | ||||||
>>> zarr.consolidate_metadata(store) | ||||||
|
||||||
If we open that group, the Group's metadata has a :class:`zarr.ConsolidatedMetadata` | ||||||
that can be used. | ||||||
|
||||||
.. code-block:: python | ||||||
|
||||||
>>> consolidated = zarr.open_group(store=store) | ||||||
>>> consolidated.metadata.consolidated_metadata.metadata | ||||||
{'b': ArrayV3Metadata(shape=(2, 2), fill_value=np.float64(0.0), ...), | ||||||
'a': ArrayV3Metadata(shape=(1,), fill_value=np.float64(0.0), ...), | ||||||
'c': ArrayV3Metadata(shape=(3, 3, 3), fill_value=np.float64(0.0), ...)} | ||||||
|
||||||
Operations on the group to get children automatically use the consolidated metadata. | ||||||
|
||||||
.. code-block:: python | ||||||
|
||||||
>>> consolidated["a"] # no read / HTTP request to the Store is required | ||||||
<Array memory://.../a shape=(1,) dtype=float64> | ||||||
|
||||||
With nested groups, the consolidated metadata is available on the children, recursively. | ||||||
|
||||||
... code-block:: python | ||||||
|
||||||
>>> child = group.create_group("child", attributes={"kind": "child"}) | ||||||
>>> grandchild = child.create_group("child", attributes={"kind": "grandchild"}) | ||||||
>>> consolidated = zarr.consolidate_metadata(store) | ||||||
|
||||||
>>> consolidated["child"].metadata.consolidated_metadata | ||||||
ConsolidatedMetadata(metadata={'child': GroupMetadata(attributes={'kind': 'grandchild'}, zarr_format=3, )}, ...) | ||||||
|
||||||
Synchronization and Concurrency | ||||||
------------------------------- | ||||||
|
||||||
Consolidated metadata is intended for read-heavy use cases on slowly changing | ||||||
hierarchies. For hierarchies where new nodes are constantly being added, | ||||||
removed, or modified, consolidated metadata may not be desirable. | ||||||
|
||||||
1. It will add some overhead to each update operation, since the metadata | ||||||
would need to be re-consolidated to keep it in sync with the store. | ||||||
2. Readers using consolidated metadata will regularly see a "past" version | ||||||
of the metadata, at the time they read the root node with its consolidated | ||||||
metadata. | ||||||
|
||||||
.. _Consolidated Metadata: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#consolidated-metadata | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These docs are great @TomAugspurger! 👏 |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,7 @@ Zarr-Python | |
|
||
getting_started | ||
tutorial | ||
consolidated_metadata | ||
api/index | ||
spec | ||
release | ||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
@@ -1,6 +1,7 @@ | ||||||||||||||||||||||
from __future__ import annotations | ||||||||||||||||||||||
|
||||||||||||||||||||||
import asyncio | ||||||||||||||||||||||
import dataclasses | ||||||||||||||||||||||
import warnings | ||||||||||||||||||||||
from typing import TYPE_CHECKING, Any, Literal, cast | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
@@ -9,11 +10,18 @@ | |||||||||||||||||||||
|
||||||||||||||||||||||
from zarr.abc.store import Store | ||||||||||||||||||||||
from zarr.core.array import Array, AsyncArray, get_array_metadata | ||||||||||||||||||||||
from zarr.core.common import JSON, AccessModeLiteral, ChunkCoords, MemoryOrder, ZarrFormat | ||||||||||||||||||||||
from zarr.core.buffer import NDArrayLike | ||||||||||||||||||||||
from zarr.core.chunk_key_encodings import ChunkKeyEncoding | ||||||||||||||||||||||
from zarr.core.common import ( | ||||||||||||||||||||||
JSON, | ||||||||||||||||||||||
AccessModeLiteral, | ||||||||||||||||||||||
ChunkCoords, | ||||||||||||||||||||||
MemoryOrder, | ||||||||||||||||||||||
ZarrFormat, | ||||||||||||||||||||||
) | ||||||||||||||||||||||
from zarr.core.config import config | ||||||||||||||||||||||
from zarr.core.group import AsyncGroup | ||||||||||||||||||||||
from zarr.core.metadata.v2 import ArrayV2Metadata | ||||||||||||||||||||||
from zarr.core.metadata.v3 import ArrayV3Metadata | ||||||||||||||||||||||
from zarr.core.group import AsyncGroup, ConsolidatedMetadata, GroupMetadata | ||||||||||||||||||||||
from zarr.core.metadata import ArrayV2Metadata, ArrayV3Metadata | ||||||||||||||||||||||
from zarr.storage import ( | ||||||||||||||||||||||
StoreLike, | ||||||||||||||||||||||
StorePath, | ||||||||||||||||||||||
|
@@ -132,8 +140,61 @@ def _default_zarr_version() -> ZarrFormat: | |||||||||||||||||||||
return cast(ZarrFormat, int(config.get("default_zarr_version", 3))) | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
async def consolidate_metadata(*args: Any, **kwargs: Any) -> AsyncGroup: | ||||||||||||||||||||||
raise NotImplementedError | ||||||||||||||||||||||
async def consolidate_metadata( | ||||||||||||||||||||||
store: StoreLike, | ||||||||||||||||||||||
path: str | None = None, | ||||||||||||||||||||||
zarr_format: ZarrFormat | None = None, | ||||||||||||||||||||||
) -> AsyncGroup: | ||||||||||||||||||||||
""" | ||||||||||||||||||||||
Consolidate the metadata of all nodes in a hierarchy. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Upon completion, the metadata of the root node in the Zarr hierarchy will be | ||||||||||||||||||||||
updated to include all the metadata of child nodes. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Parameters | ||||||||||||||||||||||
---------- | ||||||||||||||||||||||
store: StoreLike | ||||||||||||||||||||||
The store-like object whose metadata you wish to consolidate. | ||||||||||||||||||||||
path: str, optional | ||||||||||||||||||||||
A path to a group in the store to consolidate at. Only children | ||||||||||||||||||||||
below that group will be consolidated. | ||||||||||||||||||||||
|
||||||||||||||||||||||
By default, the root node is used so all the metadata in the | ||||||||||||||||||||||
store is consolidated. | ||||||||||||||||||||||
|
||||||||||||||||||||||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||
Returns | ||||||||||||||||||||||
------- | ||||||||||||||||||||||
group: AsyncGroup | ||||||||||||||||||||||
The group, with the ``consolidated_metadata`` field set to include | ||||||||||||||||||||||
the metadata of each child node. | ||||||||||||||||||||||
""" | ||||||||||||||||||||||
store_path = await make_store_path(store) | ||||||||||||||||||||||
|
||||||||||||||||||||||
if path is not None: | ||||||||||||||||||||||
store_path = store_path / path | ||||||||||||||||||||||
|
||||||||||||||||||||||
group = await AsyncGroup.open(store_path, zarr_format=zarr_format, use_consolidated=False) | ||||||||||||||||||||||
group.store_path.store._check_writable() | ||||||||||||||||||||||
|
||||||||||||||||||||||
members_metadata = {k: v.metadata async for k, v in group.members(max_depth=None)} | ||||||||||||||||||||||
|
||||||||||||||||||||||
# While consolidating, we want to be explicit about when child groups | ||||||||||||||||||||||
# are empty by inserting an empty dict for consolidated_metadata.metadata | ||||||||||||||||||||||
for k, v in members_metadata.items(): | ||||||||||||||||||||||
if isinstance(v, GroupMetadata) and v.consolidated_metadata is None: | ||||||||||||||||||||||
v = dataclasses.replace(v, consolidated_metadata=ConsolidatedMetadata(metadata={})) | ||||||||||||||||||||||
members_metadata[k] = v | ||||||||||||||||||||||
|
||||||||||||||||||||||
ConsolidatedMetadata._flat_to_nested(members_metadata) | ||||||||||||||||||||||
|
||||||||||||||||||||||
consolidated_metadata = ConsolidatedMetadata(metadata=members_metadata) | ||||||||||||||||||||||
metadata = dataclasses.replace(group.metadata, consolidated_metadata=consolidated_metadata) | ||||||||||||||||||||||
group = dataclasses.replace( | ||||||||||||||||||||||
group, | ||||||||||||||||||||||
metadata=metadata, | ||||||||||||||||||||||
) | ||||||||||||||||||||||
await group._save_metadata() | ||||||||||||||||||||||
TomAugspurger marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||||||
return group | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
async def copy(*args: Any, **kwargs: Any) -> tuple[int, int, int]: | ||||||||||||||||||||||
|
@@ -251,8 +312,11 @@ async def open( | |||||||||||||||||||||
return await open_group(store=store_path, zarr_format=zarr_format, **kwargs) | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
async def open_consolidated(*args: Any, **kwargs: Any) -> AsyncGroup: | ||||||||||||||||||||||
raise NotImplementedError | ||||||||||||||||||||||
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup: | ||||||||||||||||||||||
""" | ||||||||||||||||||||||
Alias for :func:`open_group` with ``use_consolidated=True``. | ||||||||||||||||||||||
""" | ||||||||||||||||||||||
return await open_group(*args, use_consolidated=use_consolidated, **kwargs) | ||||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Seems to me that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My goal here is to avoid a user accidentally passing |
||||||||||||||||||||||
|
||||||||||||||||||||||
|
||||||||||||||||||||||
async def save( | ||||||||||||||||||||||
|
@@ -542,6 +606,7 @@ async def open_group( | |||||||||||||||||||||
zarr_format: ZarrFormat | None = None, | ||||||||||||||||||||||
meta_array: Any | None = None, # not used | ||||||||||||||||||||||
attributes: dict[str, JSON] | None = None, | ||||||||||||||||||||||
use_consolidated: bool | str | None = None, | ||||||||||||||||||||||
) -> AsyncGroup: | ||||||||||||||||||||||
"""Open a group using file-mode-like semantics. | ||||||||||||||||||||||
|
||||||||||||||||||||||
|
@@ -580,6 +645,22 @@ async def open_group( | |||||||||||||||||||||
meta_array : array-like, optional | ||||||||||||||||||||||
An array instance to use for determining arrays to create and return | ||||||||||||||||||||||
to users. Use `numpy.empty(())` by default. | ||||||||||||||||||||||
use_consolidated: bool or str, default None | ||||||||||||||||||||||
Whether to use consolidated metadata. | ||||||||||||||||||||||
|
||||||||||||||||||||||
By default, consolidated metadata is used if it's present in the | ||||||||||||||||||||||
store (in the ``zarr.json`` for Zarr v3 and in the ``.zmetadata`` file | ||||||||||||||||||||||
for Zarr v2). | ||||||||||||||||||||||
|
||||||||||||||||||||||
To explicitly require consolidated metadata, set ``use_consolidated=True``, | ||||||||||||||||||||||
which will raise an exception if consolidated metadata is not found. | ||||||||||||||||||||||
|
||||||||||||||||||||||
To explicitly *not* use consolidated metadata, set ``use_consolidated=False``, | ||||||||||||||||||||||
which will fall back to using the regular, non consolidated metadata. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Zarr v2 allowed configuring the key storing the consolidated metadata | ||||||||||||||||||||||
(``.zmetadata`` by default). Specify the custom key as ``use_consolidated`` | ||||||||||||||||||||||
to load consolidated metadata from a non-default key. | ||||||||||||||||||||||
|
||||||||||||||||||||||
Returns | ||||||||||||||||||||||
------- | ||||||||||||||||||||||
|
@@ -606,7 +687,9 @@ async def open_group( | |||||||||||||||||||||
attributes = {} | ||||||||||||||||||||||
|
||||||||||||||||||||||
try: | ||||||||||||||||||||||
return await AsyncGroup.open(store_path, zarr_format=zarr_format) | ||||||||||||||||||||||
return await AsyncGroup.open( | ||||||||||||||||||||||
store_path, zarr_format=zarr_format, use_consolidated=use_consolidated | ||||||||||||||||||||||
) | ||||||||||||||||||||||
except (KeyError, FileNotFoundError): | ||||||||||||||||||||||
return await AsyncGroup.from_store( | ||||||||||||||||||||||
store_path, | ||||||||||||||||||||||
|
@@ -768,7 +851,9 @@ async def create( | |||||||||||||||||||||
) | ||||||||||||||||||||||
else: | ||||||||||||||||||||||
warnings.warn( | ||||||||||||||||||||||
"dimension_separator is not yet implemented", RuntimeWarning, stacklevel=2 | ||||||||||||||||||||||
"dimension_separator is not yet implemented", | ||||||||||||||||||||||
RuntimeWarning, | ||||||||||||||||||||||
stacklevel=2, | ||||||||||||||||||||||
) | ||||||||||||||||||||||
if write_empty_chunks: | ||||||||||||||||||||||
warnings.warn("write_empty_chunks is not yet implemented", RuntimeWarning, stacklevel=2) | ||||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -35,6 +35,7 @@ | |||
ShapeLike, | ||||
ZarrFormat, | ||||
concurrent_map, | ||||
parse_dtype, | ||||
parse_shapelike, | ||||
product, | ||||
) | ||||
|
@@ -157,6 +158,12 @@ async def get_array_metadata( | |||
# V3 arrays are comprised of a zarr.json object | ||||
assert zarr_json_bytes is not None | ||||
metadata_dict = json.loads(zarr_json_bytes.to_bytes()) | ||||
|
||||
if metadata_dict.get("node_type") != "array": | ||||
# This KeyError is load bearing for `open`. That currently tries | ||||
# to open the node as an `array` and then falls back to opening | ||||
# as a group. | ||||
raise KeyError | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. perhaps raise |
||||
return metadata_dict | ||||
|
||||
|
||||
|
@@ -226,7 +233,8 @@ async def create( | |||
if chunks is not None and chunk_shape is not None: | ||||
raise ValueError("Only one of chunk_shape or chunks can be provided.") | ||||
|
||||
dtype = np.dtype(dtype) | ||||
dtype = parse_dtype(dtype, zarr_format) | ||||
# dtype = np.dtype(dtype) | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||
if chunks: | ||||
_chunks = normalize_chunks(chunks, shape, dtype.itemsize) | ||||
else: | ||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,10 @@ | |
overload, | ||
) | ||
|
||
import numpy as np | ||
|
||
from zarr.core.strings import _STRING_DTYPE | ||
|
||
if TYPE_CHECKING: | ||
from collections.abc import Awaitable, Callable, Iterator | ||
|
||
|
@@ -24,6 +28,7 @@ | |
ZARRAY_JSON = ".zarray" | ||
ZGROUP_JSON = ".zgroup" | ||
ZATTRS_JSON = ".zattrs" | ||
ZMETADATA_V2_JSON = ".zmetadata" | ||
|
||
ByteRangeRequest = tuple[int | None, int | None] | ||
BytesLike = bytes | bytearray | memoryview | ||
|
@@ -162,3 +167,13 @@ def parse_order(data: Any) -> Literal["C", "F"]: | |
if data in ("C", "F"): | ||
return cast(Literal["C", "F"], data) | ||
raise ValueError(f"Expected one of ('C', 'F'), got {data} instead.") | ||
|
||
|
||
def parse_dtype(dtype: Any, zarr_format: ZarrFormat) -> np.dtype[Any]: | ||
if dtype is str or dtype == "str": | ||
if zarr_format == 2: | ||
# special case as object | ||
return np.dtype("object") | ||
else: | ||
return _STRING_DTYPE | ||
return np.dtype(dtype) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is likely from #2323 Noting because I think we should merge that first then resolve the conflict here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yep, this is me messing up git branches. I'll back it out here as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.