Merge pull request #268 from martindurant/consolidate_metadata
Consolidate zarr metadata into single key
alimanfoo authored Nov 14, 2018
2 parents cc9f4e1 + ccef26c commit d193a78
Showing 16 changed files with 422 additions and 49 deletions.
2 changes: 2 additions & 0 deletions docs/api/convenience.rst
@@ -10,3 +10,5 @@ Convenience functions (``zarr.convenience``)
.. autofunction:: copy_all
.. autofunction:: copy_store
.. autofunction:: tree
.. autofunction:: consolidate_metadata
.. autofunction:: open_consolidated
2 changes: 2 additions & 0 deletions docs/api/storage.rst
@@ -27,6 +27,8 @@ Storage (``zarr.storage``)
.. automethod:: invalidate_values
.. automethod:: invalidate_keys

.. autoclass:: ConsolidatedMetadataStore

.. autofunction:: init_array
.. autofunction:: init_group
.. autofunction:: contains_array
7 changes: 7 additions & 0 deletions docs/release.rst
@@ -9,6 +9,13 @@ Release notes
Enhancements
~~~~~~~~~~~~

* Add "consolidated" metadata as an experimental feature: use
:func:`zarr.convenience.consolidate_metadata` to copy all metadata from the various
metadata keys within a dataset hierarchy under a single key, and
:func:`zarr.convenience.open_consolidated` to use this single key. This can greatly
cut down the number of calls to the storage backend, and so remove a lot of overhead
for reading remote data. By :user:`Martin Durant <martindurant>`, :issue:`268`.

* Support has been added for structured arrays with sub-array shape and/or nested fields. By
:user:`Tarik Onalan <onalant>`, :issue:`111`, :issue:`296`.

54 changes: 47 additions & 7 deletions docs/tutorial.rst
@@ -778,9 +778,11 @@ chunk size, which will reduce the number of chunks and thus reduce the number of
round-trips required to retrieve data for an array (and thus reduce the impact of network
latency). Another option is to try to increase the compression ratio by changing
compression options or trying a different compressor (which will reduce the impact of
limited network bandwidth).

As of version 2.2, Zarr also provides the :class:`zarr.storage.LRUStoreCache`
which can be used to implement a local in-memory cache layer over a remote
store. E.g.::

>>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
>>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
@@ -797,13 +799,51 @@ layer over a remote store. E.g.::
b'Hello from the cloud!'
0.0009490990014455747
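The caching idea can be sketched with a plain mapping. ``ToyLRUCache`` below is a hypothetical illustration of the approach, not the actual :class:`zarr.storage.LRUStoreCache` implementation: repeated reads are served from memory, and only a cache miss touches the slow backing store.

```python
from collections import OrderedDict

class ToyLRUCache:
    """Illustrative read-through LRU cache over a mapping store."""

    def __init__(self, store, max_items=16):
        self.store = store          # the slow (e.g. remote) store
        self.cache = OrderedDict()  # key -> value, ordered by recency
        self.max_items = max_items
        self.misses = 0             # number of reads that hit the backend

    def __getitem__(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        value = self.store[key]              # slow path: backend read
        self.misses += 1
        self.cache[key] = value
        if len(self.cache) > self.max_items:
            self.cache.popitem(last=False)   # evict least recently used
        return value

remote = {'hello': b'Hello from the cloud!'}
cached = ToyLRUCache(remote)
cached['hello']        # first read goes to the backend
cached['hello']        # second read is served from memory
print(cached.misses)   # → 1
```

The real class additionally bounds the cache by total byte size and implements the full store interface; the sketch only shows why the second timed read above is so much faster than the first.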

If you are still experiencing poor performance with distributed/cloud storage,
please raise an issue on the GitHub issue tracker with any profiling data you
can provide, as there may be opportunities to optimise further either within
Zarr or within the mapping interface to the storage.

.. _tutorial_copy:

Consolidating metadata
~~~~~~~~~~~~~~~~~~~~~~

(This is an experimental feature.)

Since there is a significant overhead for every connection to a cloud object
store such as S3, the pattern described in the previous section may incur
significant latency while scanning the metadata of the dataset hierarchy, even
though each individual metadata object is small. For cases such as these, once
the data are static and can be regarded as read-only, at least for the
metadata/structure of the dataset hierarchy, the many metadata objects can be
consolidated into a single one via
:func:`zarr.convenience.consolidate_metadata`. Doing this can greatly increase
the speed of reading the dataset metadata, e.g.::

>>> zarr.consolidate_metadata(store) # doctest: +SKIP

This creates a special key with a copy of all of the metadata from all of the
metadata objects in the store.
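The consolidation logic itself is small. The following self-contained sketch mirrors what :func:`zarr.convenience.consolidate_metadata` does in this pull request, using a plain ``dict`` in place of a real store (the example keys and values are made up for illustration):

```python
import json

def is_zarr_key(key):
    # the metadata keys used by Zarr hierarchies
    return key.endswith(('.zarray', '.zgroup', '.zattrs'))

# a plain dict standing in for a store holding one group and one array
store = {
    '.zgroup': b'{"zarr_format": 2}',
    'foo/.zarray': b'{"zarr_format": 2, "shape": [10], "chunks": [10]}',
    'foo/0': b'\x00' * 40,  # chunk data is never consolidated
}

consolidated = {
    'zarr_consolidated_format': 1,
    'metadata': {key: json.loads(store[key].decode('ascii'))
                 for key in store if is_zarr_key(key)},
}
store['.zmetadata'] = json.dumps(consolidated).encode('ascii')

print(sorted(consolidated['metadata']))  # → ['.zgroup', 'foo/.zarray']
```

Note that only metadata objects are gathered; chunk data such as ``foo/0`` stays where it is and is still read directly from the backend.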

Later, to open a Zarr store with consolidated metadata, use
:func:`zarr.convenience.open_consolidated`, e.g.::

>>> root = zarr.open_consolidated(store) # doctest: +SKIP

This uses the special key to read all of the metadata in a single call to the
backend storage.
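Reading the consolidated key back is equally simple. The sketch below shows the idea behind the new :class:`zarr.storage.ConsolidatedMetadataStore`: a read-only mapping that answers every metadata lookup out of a single backend read (``ToyConsolidatedStore`` is a hypothetical stand-in, not the real class):

```python
import json
from collections.abc import Mapping

class ToyConsolidatedStore(Mapping):
    """Read-only metadata view backed by one consolidated key (sketch)."""

    def __init__(self, store, metadata_key='.zmetadata'):
        # a single backend read fetches every metadata object at once
        doc = json.loads(store[metadata_key].decode('ascii'))
        self._meta = doc['metadata']

    def __getitem__(self, key):
        return json.dumps(self._meta[key]).encode('ascii')

    def __iter__(self):
        return iter(self._meta)

    def __len__(self):
        return len(self._meta)

backend = {'.zmetadata': json.dumps({
    'zarr_consolidated_format': 1,
    'metadata': {'.zgroup': {'zarr_format': 2}},
}).encode('ascii')}

meta_store = ToyConsolidatedStore(backend)
print(json.loads(meta_store['.zgroup']))  # → {'zarr_format': 2}
```

In the actual implementation, :func:`zarr.convenience.open_consolidated` uses such a store for metadata while keeping the original store as the ``chunk_store``, so data reads and writes still go to the backend.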

Note that the hierarchy could still be opened in the normal way and altered,
causing the consolidated metadata to become out of sync with the real state of
the dataset hierarchy. In this case,
:func:`zarr.convenience.consolidate_metadata` would need to be called again.

To protect against consolidated metadata accidentally getting out of sync, the
root group returned by :func:`zarr.convenience.open_consolidated` is read-only
for the metadata, meaning that no new groups or arrays can be created, and
arrays cannot be resized. However, data values within arrays can still be updated.

Copying/migrating data
----------------------

1 change: 1 addition & 0 deletions requirements_dev.txt
@@ -46,6 +46,7 @@ python-dateutil==2.7.3
readme-renderer==22.0
requests==2.19.1
requests-toolbelt==0.8.0
setuptools-scm==3.1.0
s3fs==0.1.6
s3transfer==0.1.13
scandir==1.9.0
3 changes: 2 additions & 1 deletion zarr/__init__.py
@@ -12,6 +12,7 @@
from zarr.sync import ThreadSynchronizer, ProcessSynchronizer
from zarr.codecs import *
from zarr.convenience import (open, save, save_array, save_group, load, copy_store,
copy, copy_all, tree, consolidate_metadata,
open_consolidated)
from zarr.errors import CopyError, MetadataError, PermissionError
from zarr.version import version as __version__
4 changes: 2 additions & 2 deletions zarr/attrs.py
@@ -4,8 +4,8 @@
from collections import MutableMapping


from zarr.errors import PermissionError
from zarr.meta import parse_metadata


class Attributes(MutableMapping):
@@ -43,7 +43,7 @@ def _get_nosync(self):
except KeyError:
d = dict()
else:
d = parse_metadata(data)
return d

def asdict(self):
4 changes: 4 additions & 0 deletions zarr/compat.py
@@ -19,6 +19,8 @@ class PermissionError(Exception):
def OrderedDict_move_to_end(od, key):
od[key] = od.pop(key)

from collections import Mapping


else: # pragma: py2 no cover

@@ -29,3 +31,5 @@ def OrderedDict_move_to_end(od, key):

def OrderedDict_move_to_end(od, key):
od.move_to_end(key)

from collections.abc import Mapping
126 changes: 120 additions & 6 deletions zarr/convenience.py
@@ -15,28 +15,34 @@
from zarr.errors import err_path_not_found, CopyError
from zarr.util import normalize_storage_path, TreeViewer, buffer_size
from zarr.compat import PY2, text_type
from zarr.meta import ensure_str, json_dumps


# noinspection PyShadowingBuiltins
def open(store=None, mode='a', **kwargs):
"""Convenience function to open a group or array using file-mode-like semantics.
Parameters
----------
store : MutableMapping or string, optional
Store or path to directory in file system or name of zip file.
mode : {'r', 'r+', 'a', 'w', 'w-'}, optional
Persistence mode: 'r' means read only (must exist); 'r+' means
read/write (must exist); 'a' means read/write (create if doesn't
exist); 'w' means create (overwrite if exists); 'w-' means create
(fail if exists).
**kwargs
Additional parameters are passed through to :func:`zarr.creation.open_array` or
:func:`zarr.hierarchy.open_group`.
Returns
-------
z : :class:`zarr.core.Array` or :class:`zarr.hierarchy.Group`
Array or group, depending on what exists in the given store.
See Also
--------
zarr.creation.open_array, zarr.hierarchy.open_group
Examples
--------
@@ -68,7 +74,8 @@ def open(store, mode='a', **kwargs):

path = kwargs.get('path', None)
# handle polymorphic store arg
clobber = mode == 'w'
store = normalize_store_arg(store, clobber=clobber)
path = normalize_storage_path(path)

if mode in {'w', 'w-', 'x'}:
@@ -1069,3 +1076,110 @@ def copy_all(source, dest, shallow=False, without_attrs=False, log=None,
_log_copy_summary(log, dry_run, n_copied, n_skipped, n_bytes_copied)

return n_copied, n_skipped, n_bytes_copied


def consolidate_metadata(store, metadata_key='.zmetadata'):
"""
Consolidate all metadata for groups and arrays within the given store
into a single resource and put it under the given key.
This produces a single object in the backend store, containing all the
metadata read from all the zarr-related keys that can be found. After
metadata have been consolidated, use :func:`open_consolidated` to open
the root group in optimised, read-only mode, using the consolidated
metadata to reduce the number of read operations on the backend store.
Note that if the metadata in the store is changed after this
consolidation, then the metadata read by :func:`open_consolidated`
would be incorrect unless this function is called again.
.. note:: This is an experimental feature.
Parameters
----------
store : MutableMapping or string
Store or path to directory in file system or name of zip file.
metadata_key : str
Key to put the consolidated metadata under.
Returns
-------
g : :class:`zarr.hierarchy.Group`
Group instance, opened with the new consolidated metadata.
See Also
--------
open_consolidated
"""
import json

store = normalize_store_arg(store)

def is_zarr_key(key):
return (key.endswith('.zarray') or key.endswith('.zgroup') or
key.endswith('.zattrs'))

out = {
'zarr_consolidated_format': 1,
'metadata': {
key: json.loads(ensure_str(store[key]))
for key in store if is_zarr_key(key)
}
}
store[metadata_key] = json_dumps(out).encode()
return open_consolidated(store, metadata_key=metadata_key)


def open_consolidated(store, metadata_key='.zmetadata', mode='r+'):
"""Open group using metadata previously consolidated into a single key.
This is an optimised method for opening a Zarr group, where instead of
traversing the group/array hierarchy by accessing the metadata keys at
each level, a single key contains all of the metadata for everything.
For remote data sources, where the overhead of accessing a key is large
compared to the time to read data, this can yield a substantial speedup.
The group accessed must have already had its metadata consolidated into a
single key using the function :func:`consolidate_metadata`.
This optimised method only works in modes which do not change the
metadata, although the data may still be written/updated.
Parameters
----------
store : MutableMapping or string
Store or path to directory in file system or name of zip file.
metadata_key : str
Key to read the consolidated metadata from. The default (.zmetadata)
corresponds to the default used by :func:`consolidate_metadata`.
mode : {'r', 'r+'}, optional
Persistence mode: 'r' means read only (must exist); 'r+' means
read/write (must exist) although only writes to data are allowed,
changes to metadata, including creation of new arrays or groups,
are not allowed.
Returns
-------
g : :class:`zarr.hierarchy.Group`
Group instance, opened with the consolidated metadata.
See Also
--------
consolidate_metadata
"""

from .storage import ConsolidatedMetadataStore

# normalize parameters
store = normalize_store_arg(store)
if mode not in {'r', 'r+'}:
raise ValueError("invalid mode, expected either 'r' or 'r+'; found {!r}"
.format(mode))

# set up metadata store
meta_store = ConsolidatedMetadataStore(store, metadata_key=metadata_key)

# pass through
return open(store=meta_store, chunk_store=store, mode=mode)
3 changes: 3 additions & 0 deletions zarr/core.py
@@ -165,6 +165,9 @@ def _load_metadata_nosync(self):
if config is None:
self._compressor = None
else:
# temporary workaround for
# https://github.com/zarr-developers/numcodecs/issues/78
config = dict(config)
self._compressor = get_codec(config)

# setup filters
26 changes: 16 additions & 10 deletions zarr/creation.py
@@ -346,15 +346,15 @@ def array(data, **kwargs):
return z


def open_array(store=None, mode='a', shape=None, chunks=True, dtype=None,
compressor='default', fill_value=0, order='C', synchronizer=None,
filters=None, cache_metadata=True, cache_attrs=True, path=None,
object_codec=None, chunk_store=None, **kwargs):
"""Open an array using file-mode-like semantics.
Parameters
----------
store : MutableMapping or string, optional
Store or path to directory in file system or name of zip file.
mode : {'r', 'r+', 'a', 'w', 'w-'}, optional
Persistence mode: 'r' means read only (must exist); 'r+' means
@@ -391,6 +391,8 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
Array path within store.
object_codec : Codec, optional
A codec to encode object arrays, only needed if dtype=object.
chunk_store : MutableMapping or string, optional
Store or path to directory in file system or name of zip file.
Returns
-------
@@ -426,7 +428,10 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
# a : read/write if exists, create otherwise (default)

# handle polymorphic store arg
clobber = mode == 'w'
store = normalize_store_arg(store, clobber=clobber)
if chunk_store is not None:
chunk_store = normalize_store_arg(chunk_store, clobber=clobber)
path = normalize_storage_path(path)

# API compatibility with h5py
@@ -448,7 +453,7 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
init_array(store, shape=shape, chunks=chunks, dtype=dtype,
compressor=compressor, fill_value=fill_value,
order=order, filters=filters, overwrite=True, path=path,
object_codec=object_codec, chunk_store=chunk_store)

elif mode == 'a':
if contains_group(store, path=path):
@@ -457,7 +462,7 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
init_array(store, shape=shape, chunks=chunks, dtype=dtype,
compressor=compressor, fill_value=fill_value,
order=order, filters=filters, path=path,
object_codec=object_codec, chunk_store=chunk_store)

elif mode in ['w-', 'x']:
if contains_group(store, path=path):
@@ -468,14 +473,15 @@ def open_array(store, mode='a', shape=None, chunks=True, dtype=None, compressor=
init_array(store, shape=shape, chunks=chunks, dtype=dtype,
compressor=compressor, fill_value=fill_value,
order=order, filters=filters, path=path,
object_codec=object_codec, chunk_store=chunk_store)

# determine read only status
read_only = mode == 'r'

# instantiate array
z = Array(store, read_only=read_only, synchronizer=synchronizer,
cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path,
chunk_store=chunk_store)

return z
