Appending to zarr store #2706

Merged · 45 commits · Jun 29, 2019
Changes from 41 commits
f231393
Initial version of appending to zarr store
Jan 24, 2019
f14f3b7
Added docs
Jan 24, 2019
928440d
Resolve PEP8 incompliances
Jan 24, 2019
442e938
Added write and append test for mode 'a'
Jan 26, 2019
389ba43
Resolved conflicts with master
Jan 26, 2019
6097da2
Merge branch 'master' of https://github.com/pydata/xarray into append…
Jan 29, 2019
390a792
Merged repaired master
Jan 29, 2019
da9a962
Resolved pep8 issue
Jan 29, 2019
6756b8f
Put target store encoding in appended variable
davidbrochart Apr 3, 2019
95d5782
Merge with master
davidbrochart Apr 4, 2019
a750a92
Rewrite test with appending along time dimension
davidbrochart Apr 4, 2019
295084b
Add chunk_size parameter for rechunking appended coordinate
davidbrochart Apr 22, 2019
cc353e1
Merge remote-tracking branch 'upstream/master' into HEAD
davidbrochart Apr 22, 2019
e56a210
Merge remote-tracking branch 'upstream/master' into HEAD
davidbrochart May 21, 2019
519b398
Add chunk_dim test
davidbrochart May 21, 2019
c85aa98
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 4, 2019
3adfd49
Add type check and tests for it.
Jun 17, 2019
608813b
Add documentation
Jun 17, 2019
2078838
Add test for compute=False and commented it out
Jun 17, 2019
7a90ce8
Merge master
Jun 17, 2019
b8af5bd
Remove python 3.7 string formatting
Jun 17, 2019
5bee0dc
Fix PEP8 incompliance
Jun 17, 2019
7ed77ad
Merge branch 'append_zarr' of https://github.com/jendrikjoe/xarray in…
Jun 17, 2019
b4ff1c7
Add missing whitespaces
Jun 18, 2019
5b3f8ea
allowed for compute=False when appending to a zarr store
Jun 20, 2019
7564329
Fixed empty array data error
Jun 20, 2019
93be790
flake8 fixes
Jun 20, 2019
58c4b78
removed chunk_dim argument to to_zarr function
Jun 21, 2019
5316593
implemented requested changes
Jun 24, 2019
ad08c73
Update xarray/backends/api.py
shikharsg Jun 25, 2019
4d2122b
added contributors and example of using append to zarr
Jun 25, 2019
105ed39
Merge branch 'append_zarr' of https://github.com/jendrikjoe/xarray in…
Jun 25, 2019
af4a5a5
fixed docs fail
Jun 25, 2019
62d4f52
fixed docs
Jun 25, 2019
9558811
Merge branch 'master' into append_zarr
shikharsg Jun 25, 2019
9d70e02
removed unnecessary condition
Jun 25, 2019
a6ff494
attempt at clean string encoding and variable length strings
Jun 26, 2019
34b700f
implemented suggestions
Jun 26, 2019
3e54cb9
* append_dim does not need to be specified if creating a new array wi…
Jun 27, 2019
97ed25b
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 27, 2019
beb12e5
raise ValueError when append_dim is not a valid dimension
Jun 27, 2019
41a6ca3
flake8 fix
Jun 27, 2019
321aec1
removed unused comment
Jun 27, 2019
2b130ff
* raise error when appending with encoding provided for existing vari…
Jun 29, 2019
58de86d
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 29, 2019
23 changes: 23 additions & 0 deletions doc/io.rst
@@ -604,6 +604,29 @@ store is already present at that path, an error will be raised, preventing it
from being overwritten. To override this behavior and overwrite an existing
store, add ``mode='w'`` when invoking ``to_zarr``.

It is also possible to append to an existing store. For that, add ``mode='a'``
Contributor:

Could we possibly get an example of using append here?

and set ``append_dim`` to the name of the dimension along which to append.

.. ipython:: python
:suppress:

! rm -rf path/to/directory.zarr

.. ipython:: python

ds1 = xr.Dataset({'foo': (('x', 'y', 't'), np.random.rand(4, 5, 2))},
coords={'x': [10, 20, 30, 40],
'y': [1, 2, 3, 4, 5],
't': pd.date_range('2001-01-01', periods=2)})
ds1.to_zarr('path/to/directory.zarr')
ds2 = xr.Dataset({'foo': (('x', 'y', 't'), np.random.rand(4, 5, 2))},
coords={'x': [10, 20, 30, 40],
'y': [1, 2, 3, 4, 5],
't': pd.date_range('2001-01-03', periods=2)})
ds2.to_zarr('path/to/directory.zarr', mode='a', append_dim='t')

To store variable length strings use ``dtype=object``.

To read back a zarr dataset that has been created this way, we use the
:py:func:`~xarray.open_zarr` method:

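Conceptually, the append grows the store along ``t``. As a rough numpy-only sketch of the resulting shape (mirroring the example above without touching zarr):

```python
import numpy as np

# The two datasets above hold (4, 5, 2) arrays; appending with
# append_dim='t' is equivalent, shape-wise, to concatenating on
# the 't' axis (axis 2 here).
part1 = np.random.rand(4, 5, 2)   # ds1['foo']
part2 = np.random.rand(4, 5, 2)   # ds2['foo']
combined = np.concatenate([part1, part2], axis=2)
assert combined.shape == (4, 5, 4)
```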
5 changes: 5 additions & 0 deletions doc/whats-new.rst
@@ -224,6 +224,11 @@ Other enhancements
report showing what exactly differs between the two objects (dimensions /
coordinates / variables / attributes) (:issue:`1507`).
By `Benoit Bovy <https://github.com/benbovy>`_.
- Added append capability to the zarr store.
Contributor:

Need to credit all the contributors to this PR.

By `Jendrik Jördening <https://github.com/jendrikjoe>`_,
`David Brochart <https://github.com/davidbrochart>`_,
`Ryan Abernathey <https://github.com/rabernat>`_ and
`Shikhar Goenka <https://github.com/shikharsg>`_.
- Resampling of standard and non-standard calendars indexed by
:py:class:`~xarray.CFTimeIndex` is now possible. (:issue:`2191`).
By `Jwen Fai Low <https://github.com/jwenfai>`_ and
49 changes: 46 additions & 3 deletions xarray/backends/api.py
@@ -4,17 +4,21 @@
from io import BytesIO
from numbers import Number
from pathlib import Path
import re

import numpy as np
import pandas as pd

from .. import Dataset, DataArray, backends, conventions
from .. import Dataset, DataArray, backends, conventions, coding
from ..core import indexing
from .. import auto_combine
from ..core.combine import (combine_by_coords, _nested_combine,
_infer_concat_order_from_positions)
from ..core.utils import close_on_error, is_grib_path, is_remote_uri
from ..core.variable import Variable
from .common import ArrayWriter
from .locks import _get_scheduler
from ..coding.variables import safe_setitem, unpack_for_encoding

DATAARRAY_NAME = '__xarray_dataarray_name__'
DATAARRAY_VARIABLE = '__xarray_dataarray_variable__'
@@ -1024,8 +1028,42 @@ def save_mfdataset(datasets, paths, mode='w', format=None, groups=None,
for w, s in zip(writes, stores)])


def _validate_datatypes_for_zarr_append(dataset):
"""DataArray.name and Dataset keys must be a string or None"""
def check_dtype(var):
if (not np.issubdtype(var.dtype, np.number)
and not coding.strings.is_unicode_dtype(var.dtype)
and not var.dtype == object):
Member:

Suggested change:
-                and not var.dtype == object):
+                and not coding.strings.is_unicode_dtype(var.dtype)):

Reply:

this seems to work:

         if (not np.issubdtype(var.dtype, np.number)
                and not coding.strings.is_unicode_dtype(var.dtype)
                and not var.dtype == object):

while this does not:

         if (not np.issubdtype(var.dtype, np.number)
                and not np.issubdtype(var.dtype, np.string_)
                and not np.issubdtype(var.dtype, np.unicode_)
                and not coding.strings.is_unicode_dtype(var.dtype)):

Reply:

or were you suggesting this?

         if (not np.issubdtype(var.dtype, np.number)
                and not coding.strings.is_unicode_dtype(var.dtype)):

this does not work

Member:

OK, I understand why this doesn't work: we aren't using xarray's standard encoding pipeline when appending variables. This probably is fine for strings -- in the worst case, users will just get an error downstream from Zarr instead.
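The dtype gate being debated here can be sketched outside xarray. This is a numpy-only approximation: `check_dtype_ok` is a hypothetical stand-in for `_validate_datatypes_for_zarr_append`, and `coding.strings.is_unicode_dtype` is approximated with `np.str_`/`np.bytes_` subtype tests:

```python
import numpy as np

def check_dtype_ok(dtype):
    # Accept numbers, fixed-width byte/unicode strings, and object
    # dtype (variable-length strings); reject everything else.
    return (np.issubdtype(dtype, np.number)
            or np.issubdtype(dtype, np.str_)
            or np.issubdtype(dtype, np.bytes_)
            or dtype == object)

assert check_dtype_ok(np.dtype('float64'))
assert check_dtype_ok(np.dtype('<U5'))              # fixed-size unicode
assert check_dtype_ok(np.dtype(object))             # variable-length strings
assert not check_dtype_ok(np.dtype([('a', 'i4')]))  # structured dtype rejected
```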

# and not re.match('^bytes[1-9]+$', var.dtype.name)):
raise ValueError('Invalid dtype for data variable: {} '
'dtype must be a subtype of number, '
'a fixed sized string, a fixed size '
'unicode string or an object'.format(var))
for k in dataset.data_vars.values():
check_dtype(k)


def _validate_append_dim(ds_to_append, store, append_dim, **open_kwargs):
try:
ds = backends.zarr.open_zarr(store, **open_kwargs)
except ValueError: # store empty
return
if append_dim:
if append_dim not in ds.dims:
raise ValueError(
"{} not a valid dimension in the Dataset".format(append_dim)
)
for data_var in ds_to_append:
if data_var in ds:
if append_dim is None:
raise ValueError(
"variable '{}' already exists, but append_dim "
"was not set".format(data_var)
)


def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
encoding=None, compute=True, consolidated=False):
encoding=None, compute=True, consolidated=False, append_dim=None):
"""This function creates an appropriate datastore for writing a dataset to
a zarr store

@@ -1040,11 +1078,16 @@ def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
_validate_dataset_names(dataset)
_validate_attrs(dataset)

if mode == "a":
Member:

Can we raise an error if encoding was passed explicitly into to_zarr() and specifies encoding for any existing variables? I.e., ds.to_arr(..., mode='a', encoding={'existing_variable': ....}). I think this would always indicate a programming mistake, given that we throw away these variable encodings anyways.
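The guard the reviewer asks for could look roughly like this (`validate_append_encoding` is a hypothetical helper sketched for illustration, not code from this PR):

```python
def validate_append_encoding(existing_vars, encoding):
    # Refuse explicit encoding for variables already in the store:
    # their stored encoding wins, so user-supplied values would be
    # silently discarded -- almost certainly a programming mistake.
    bad = set(encoding or {}) & set(existing_vars)
    if bad:
        raise ValueError(
            "variable(s) {} already exist(s) in the store, but "
            "encoding was provided".format(sorted(bad)))

def raises_value_error(fn, *args):
    try:
        fn(*args)
    except ValueError:
        return True
    return False

assert not raises_value_error(validate_append_encoding, ['foo'], {})
assert raises_value_error(
    validate_append_encoding, ['foo'], {'foo': {'dtype': 'f4'}})
```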

_validate_datatypes_for_zarr_append(dataset)
_validate_append_dim(dataset, store, append_dim,
consolidated=consolidated, group=group)

zstore = backends.ZarrStore.open_group(store=store, mode=mode,
synchronizer=synchronizer,
group=group,
consolidate_on_close=consolidated)

zstore.append_dim = append_dim
writer = ArrayWriter()
# TODO: figure out how to properly handle unlimited_dims
dump_to_store(dataset, zstore, writer, encoding=encoding)
13 changes: 10 additions & 3 deletions xarray/backends/common.py
@@ -158,26 +158,33 @@ class ArrayWriter:
def __init__(self, lock=None):
self.sources = []
self.targets = []
self.regions = []
self.lock = lock

def add(self, source, target):
def add(self, source, target, region=None):
if isinstance(source, dask_array_type):
self.sources.append(source)
self.targets.append(target)
self.regions.append(region)
else:
target[...] = source
if region:
target[region] = source
else:
target[...] = source
Contributor:

👍 This seems like a perfect (and rather general) way to handle appends at the dask level.
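The shape/region bookkeeping behind that `writer.add(..., region=...)` call can be sketched on its own (`append_region` is a hypothetical helper; the PR computes this inline in `ZarrStore.set_variables`):

```python
def append_region(existing_shape, append_shape, append_axis):
    # Grow the target along append_axis, and build the tuple of
    # slices addressing where the appended block lands.
    new_shape = list(existing_shape)
    new_shape[append_axis] += append_shape[append_axis]
    region = [slice(None)] * len(new_shape)
    region[append_axis] = slice(existing_shape[append_axis], None)
    return tuple(new_shape), tuple(region)

# Appending a (4, 5, 2) block to a (4, 5, 2) store along axis 2:
new_shape, region = append_region((4, 5, 2), (4, 5, 2), append_axis=2)
assert new_shape == (4, 5, 4)
assert region == (slice(None), slice(None), slice(2, None))
```

The target array is first resized to `new_shape`, then the new data is written only into `region`, which is what lets `da.store(..., regions=...)` handle appends without rewriting existing chunks.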


def sync(self, compute=True):
if self.sources:
import dask.array as da
# TODO: consider wrapping targets with dask.delayed, if this makes
# for any discernable difference in perforance, e.g.,
# targets = [dask.delayed(t) for t in self.targets]

delayed_store = da.store(self.sources, self.targets,
lock=self.lock, compute=compute,
flush=True)
flush=True, regions=self.regions)
self.sources = []
self.targets = []
Member:

can we reset self.regions = [] for consistency?

self.regions = []
return delayed_store


152 changes: 120 additions & 32 deletions xarray/backends/zarr.py
@@ -8,7 +8,8 @@
from ..core import indexing
from ..core.pycompat import integer_types
from ..core.utils import FrozenOrderedDict, HiddenKeyDict
from .common import AbstractWritableDataStore, BackendArray
from .common import AbstractWritableDataStore, BackendArray, \
_encode_variable_name

# need some special secret attributes to tell us the dimensions
_DIMENSION_KEY = '_ARRAY_DIMENSIONS'
@@ -212,7 +213,7 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
# zarr allows unicode, but not variable-length strings, so it's both
# simpler and more compact to always encode as UTF-8 explicitly.
# TODO: allow toggling this explicitly via dtype in encoding.
coder = coding.strings.EncodedStringCoder(allows_unicode=False)
coder = coding.strings.EncodedStringCoder(allows_unicode=True)
var = coder.encode(var, name=name)
var = coding.strings.ensure_fixed_length_bytes(var)

@@ -257,6 +258,7 @@ def __init__(self, zarr_group, consolidate_on_close=False):
self._synchronizer = self.ds.synchronizer
self._group = self.ds.path
self._consolidate_on_close = consolidate_on_close
self.append_dim = None

def open_store_variable(self, name, zarr_array):
data = indexing.LazilyOuterIndexedArray(ZarrArrayWrapper(name, self))
@@ -313,40 +315,126 @@ def encode_variable(self, variable):
def encode_attribute(self, a):
return _encode_zarr_attr_value(a)

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):

attrs = variable.attrs.copy()
dims = variable.dims
dtype = variable.dtype
shape = variable.shape

fill_value = attrs.pop('_FillValue', None)
if variable.encoding == {'_FillValue': None} and fill_value is None:
variable.encoding = {}

encoding = _extract_zarr_variable_encoding(
variable, raise_on_invalid=check_encoding)

encoded_attrs = OrderedDict()
# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k, v in attrs.items():
encoded_attrs[k] = self.encode_attribute(v)

zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
zarr_array.attrs.put(encoded_attrs)

return zarr_array, variable.data

def store(self, variables, attributes, *args, **kwargs):
AbstractWritableDataStore.store(self, variables, attributes,
*args, **kwargs)
def store(self, variables, attributes, check_encoding_set=frozenset(),
writer=None, unlimited_dims=None):
"""
Top level method for putting data on this store, this method:
- encodes variables/attributes
- sets dimensions
- sets variables

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
attributes : dict-like
Dictionary of key/value (attribute name / attribute) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
writer : ArrayWriter
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions. The dimension to append along is taken from
``self.append_dim`` and is only needed in append mode.
"""

existing_variables = set([vn for vn in variables
if _encode_variable_name(vn) in self.ds])
new_variables = set(variables) - existing_variables
variables_without_encoding = OrderedDict([(vn, variables[vn])
for vn in new_variables])
variables_encoded, attributes = self.encode(
variables_without_encoding, attributes)

if len(existing_variables) > 0:
# there are variables to append
# their encoding must be the same as in the store
Member:

Are there any unit tests that verify that encoding is kept consistent? This would be nice to add, if not.

Probably a good example would be dataset saved with scale/offset encoding, where the new dataset to be appended does not have any encoding provided. We could verify that probably scaled values are read back from disk.
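The failure mode the reviewer has in mind comes down to CF-style scale/offset arithmetic (`decoded = scale_factor * stored + add_offset`). A numpy sketch of why appending with a mismatched encoding corrupts values:

```python
import numpy as np

scale_factor, add_offset = 0.01, 273.15
decoded = np.array([273.15, 274.0, 275.5])

# Writing with the store's encoding, then reading back, round-trips:
stored = np.round((decoded - add_offset) / scale_factor).astype('i4')
assert np.allclose(scale_factor * stored + add_offset, decoded)

# Appending raw values while the store still declares the encoding
# would decode to garbage on read:
assert not np.allclose(scale_factor * decoded + add_offset, decoded)
```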

ds = open_zarr(self.ds.store, chunks=None)
variables_with_encoding = OrderedDict()
for vn in existing_variables:
variables_with_encoding[vn] = variables[vn]
variables_with_encoding[vn].encoding = ds[vn].encoding
Member:

This modifies an argument that was passed into the function in-place, which in general should be avoided due to unexpected side effects in other parts of the code. It would be better to shallow copy the Variable before overriding encoding, e.g., with variables[vn].copy(deep=False).
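The side effect being flagged, and the suggested shallow-copy fix, in miniature (`Var` is a toy stand-in for `xarray.Variable`, not xarray's class):

```python
class Var:
    # Toy stand-in for xarray.Variable: 'data' may be large and
    # should stay shared; 'encoding' is cheap per-object metadata.
    def __init__(self, data, encoding=None):
        self.data = data
        self.encoding = dict(encoding or {})

    def copy(self, deep=True):
        data = list(self.data) if deep else self.data
        return Var(data, self.encoding)

v = Var([1, 2, 3], {'dtype': 'f8'})

# Shallow-copy before overriding encoding: the caller's object is
# untouched, while the (possibly huge) data is not duplicated.
w = v.copy(deep=False)
w.encoding = {'dtype': 'f4'}
assert v.encoding == {'dtype': 'f8'}
assert w.data is v.data
```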

variables_with_encoding, _ = self.encode(variables_with_encoding,
{})
variables_encoded.update(variables_with_encoding)

self.set_attributes(attributes)
self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
self.set_variables(variables_encoded, check_encoding_set, writer,
unlimited_dims=unlimited_dims)

def sync(self):
pass

def set_variables(self, variables, check_encoding_set, writer,
unlimited_dims=None):
"""
This provides a centralized method to set the variables on the data
store.

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
writer : ArrayWriter
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
"""

for vn, v in variables.items():
name = _encode_variable_name(vn)
check = vn in check_encoding_set
attrs = v.attrs.copy()
dims = v.dims
dtype = v.dtype
shape = v.shape

fill_value = attrs.pop('_FillValue', None)
if v.encoding == {'_FillValue': None} and fill_value is None:
v.encoding = {}
if name in self.ds:
zarr_array = self.ds[name]
"""
If variable is a dimension of an existing array
Member:

Please use # for comments instead of strings, which should be reserved for docstrings.

Reply:

sorry, yes. That comment is not actually needed(it's from code I already removed). So I'll remove that.

append_dim does not need to be specified
"""
if self.append_dim in dims:
# this is the DataArray that has append_dim as a
# dimension
append_axis = dims.index(self.append_dim)
new_shape = list(zarr_array.shape)
new_shape[append_axis] += v.shape[append_axis]
new_region = [slice(None)] * len(new_shape)
new_region[append_axis] = slice(
zarr_array.shape[append_axis],
None
)
zarr_array.resize(new_shape)
writer.add(v.data, zarr_array,
region=tuple(new_region))
else:
# new variable
encoding = _extract_zarr_variable_encoding(
v, raise_on_invalid=check)
encoded_attrs = OrderedDict()
# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k2, v2 in attrs.items():
encoded_attrs[k2] = self.encode_attribute(v2)
Contributor:

What if we pulled this attribute encoding out before the try block. Then we check encoded_attrs against zarr_array.attrs before appending.
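A sketch of the comparison the reviewer proposes (`attrs_consistent` is a hypothetical helper: encode the attributes first, then check them against the target array's attrs before appending):

```python
_DIMENSION_KEY = '_ARRAY_DIMENSIONS'

def attrs_consistent(new_attrs, existing_attrs):
    # An append whose encoded attributes disagree with the target
    # zarr array's attrs (including the hidden dimension key) is
    # suspect and could be rejected up front.
    return dict(new_attrs) == dict(existing_attrs)

assert attrs_consistent(
    {'units': 'K', _DIMENSION_KEY: ['t']},
    {'units': 'K', _DIMENSION_KEY: ['t']})
assert not attrs_consistent({'units': 'K'}, {'units': 'm'})
```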


if coding.strings.check_vlen_dtype(dtype) == str:
dtype = str
zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
zarr_array.attrs.put(encoded_attrs)
writer.add(v.data, zarr_array)

def close(self):
if self._consolidate_on_close:
import zarr
20 changes: 12 additions & 8 deletions xarray/core/dataset.py
@@ -1365,7 +1365,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
compute=compute)

def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
encoding=None, compute=True, consolidated=False):
encoding=None, compute=True, consolidated=False,
append_dim=None):
"""Write dataset contents to a zarr group.

.. note:: Experimental
@@ -1376,9 +1377,10 @@ def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
----------
store : MutableMapping or str, optional
Store or path to directory in file system.
mode : {'w', 'w-'}
mode : {'w', 'w-', 'a'}
Persistence mode: 'w' means create (overwrite if exists);
'w-' means create (fail if exists).
'w-' means create (fail if exists);
'a' means append (create if does not exist).
synchronizer : object, optional
Array synchronizer
group : str, optional
@@ -1393,21 +1395,23 @@
consolidated: bool, optional
If True, apply zarr's `consolidate_metadata` function to the store
after writing.
append_dim: str, optional
If mode='a', the dimension on which the data will be appended.

References
----------
https://zarr.readthedocs.io/
"""
if encoding is None:
encoding = {}
if mode not in ['w', 'w-']:
# TODO: figure out how to handle 'r+' and 'a'
raise ValueError("The only supported options for mode are 'w' "
"and 'w-'.")
if mode not in ['w', 'w-', 'a']:
# TODO: figure out how to handle 'r+'
raise ValueError("The only supported options for mode are 'w',"
"'w-' and 'a'.")
from ..backends.api import to_zarr
return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer,
group=group, encoding=encoding, compute=compute,
consolidated=consolidated)
consolidated=consolidated, append_dim=append_dim)

def __repr__(self):
return formatting.dataset_repr(self)