Appending to zarr store #2706

Merged 45 commits on Jun 29, 2019
Changes from 5 commits
Commits (45)
f231393
Initial version of appending to zarr store
Jan 24, 2019
f14f3b7
Added docs
Jan 24, 2019
928440d
Resolve PEP8 incompliances
Jan 24, 2019
442e938
Added write and append test for mode 'a'
Jan 26, 2019
389ba43
Resolved conflicts with master
Jan 26, 2019
6097da2
Merge branch 'master' of https://github.com/pydata/xarray into append…
Jan 29, 2019
390a792
Merged repaired master
Jan 29, 2019
da9a962
Resolved pep8 issue
Jan 29, 2019
6756b8f
Put target store encoding in appended variable
davidbrochart Apr 3, 2019
95d5782
Merge with master
davidbrochart Apr 4, 2019
a750a92
Rewrite test with appending along time dimension
davidbrochart Apr 4, 2019
295084b
Add chunk_size parameter for rechunking appended coordinate
davidbrochart Apr 22, 2019
cc353e1
Merge remote-tracking branch 'upstream/master' into HEAD
davidbrochart Apr 22, 2019
e56a210
Merge remote-tracking branch 'upstream/master' into HEAD
davidbrochart May 21, 2019
519b398
Add chunk_dim test
davidbrochart May 21, 2019
c85aa98
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 4, 2019
3adfd49
Add type check and tests for it.
Jun 17, 2019
608813b
Add documentation
Jun 17, 2019
2078838
Add test for compute=False and commented it out
Jun 17, 2019
7a90ce8
Merge master
Jun 17, 2019
b8af5bd
Remove python 3.7 string formatting
Jun 17, 2019
5bee0dc
Fix PEP8 incompliance
Jun 17, 2019
7ed77ad
Merge branch 'append_zarr' of https://github.com/jendrikjoe/xarray in…
Jun 17, 2019
b4ff1c7
Add missing whitespaces
Jun 18, 2019
5b3f8ea
allowed for compute=False when appending to a zarr store
Jun 20, 2019
7564329
Fixed empty array data error
Jun 20, 2019
93be790
flake8 fixes
Jun 20, 2019
58c4b78
removed chunk_dim argument to to_zarr function
Jun 21, 2019
5316593
implemented requested changes
Jun 24, 2019
ad08c73
Update xarray/backends/api.py
shikharsg Jun 25, 2019
4d2122b
added contributors and example of using append to zarr
Jun 25, 2019
105ed39
Merge branch 'append_zarr' of https://github.com/jendrikjoe/xarray in…
Jun 25, 2019
af4a5a5
fixed docs fail
Jun 25, 2019
62d4f52
fixed docs
Jun 25, 2019
9558811
Merge branch 'master' into append_zarr
shikharsg Jun 25, 2019
9d70e02
removed unnecessary condition
Jun 25, 2019
a6ff494
attempt at clean string encoding and variable length strings
Jun 26, 2019
34b700f
implemented suggestions
Jun 26, 2019
3e54cb9
* append_dim does not need to be specified if creating a new array wi…
Jun 27, 2019
97ed25b
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 27, 2019
beb12e5
raise ValueError when append_dim is not a valid dimension
Jun 27, 2019
41a6ca3
flake8 fix
Jun 27, 2019
321aec1
removed unused comment
Jun 27, 2019
2b130ff
* raise error when appending with encoding provided for existing vari…
Jun 29, 2019
58de86d
Merge remote-tracking branch 'upstream/master' into append_zarr
Jun 29, 2019
1 change: 1 addition & 0 deletions doc/whats-new.rst
@@ -43,6 +43,7 @@ Enhancements
report showing what exactly differs between the two objects (dimensions /
coordinates / variables / attributes) (:issue:`1507`).
By `Benoit Bovy <https://github.com/benbovy>`_.
- Added append capability to the zarr store.
Contributor: Need to credit all the contributors to this PR.


Bug fixes
~~~~~~~~~
4 changes: 2 additions & 2 deletions xarray/backends/api.py
@@ -888,7 +888,7 @@ def save_mfdataset(datasets, paths, mode='w', format=None, groups=None,


def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
encoding=None, compute=True, consolidated=False):
encoding=None, compute=True, consolidated=False, append_dim=None):
"""This function creates an appropriate datastore for writing a dataset to
a zarr store

@@ -907,7 +907,7 @@ def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
synchronizer=synchronizer,
group=group,
consolidate_on_close=consolidated)

zstore.append_dim = append_dim
writer = ArrayWriter()
# TODO: figure out how to properly handle unlimited_dims
dump_to_store(dataset, zstore, writer, encoding=encoding)
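For orientation, a minimal usage sketch of the mode='a' / append_dim path that this hunk threads through to the store (the store path and data are illustrative, not taken from the PR):

import numpy as np
import pandas as pd
import xarray as xr

# initial write creates the zarr group
ds = xr.Dataset({'temperature': (('time',), np.random.rand(3))},
                coords={'time': pd.date_range('2000-01-01', periods=3)})
ds.to_zarr('example.zarr', mode='w')

# later data is appended along the existing 'time' dimension
ds2 = xr.Dataset({'temperature': (('time',), np.random.rand(3))},
                 coords={'time': pd.date_range('2000-01-04', periods=3)})
ds2.to_zarr('example.zarr', mode='a', append_dim='time')
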
122 changes: 91 additions & 31 deletions xarray/backends/zarr.py
@@ -7,7 +7,8 @@
from ..core import indexing
from ..core.pycompat import integer_types
from ..core.utils import FrozenOrderedDict, HiddenKeyDict
from .common import AbstractWritableDataStore, BackendArray
from .common import AbstractWritableDataStore, BackendArray, \
_encode_variable_name

# need some special secret attributes to tell us the dimensions
_DIMENSION_KEY = '_ARRAY_DIMENSIONS'
@@ -312,40 +313,99 @@ def encode_variable(self, variable):
def encode_attribute(self, a):
return _encode_zarr_attr_value(a)

def prepare_variable(self, name, variable, check_encoding=False,
unlimited_dims=None):

attrs = variable.attrs.copy()
dims = variable.dims
dtype = variable.dtype
shape = variable.shape

fill_value = attrs.pop('_FillValue', None)
if variable.encoding == {'_FillValue': None} and fill_value is None:
variable.encoding = {}

encoding = _extract_zarr_variable_encoding(
variable, raise_on_invalid=check_encoding)

encoded_attrs = OrderedDict()
# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k, v in attrs.items():
encoded_attrs[k] = self.encode_attribute(v)

zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
zarr_array.attrs.put(encoded_attrs)

return zarr_array, variable.data

def store(self, variables, attributes, *args, **kwargs):
AbstractWritableDataStore.store(self, variables, attributes,
*args, **kwargs)
def store(self, variables, attributes, check_encoding_set=frozenset(),
writer=None, unlimited_dims=None):
"""
Top level method for putting data on this store, this method:
- encodes variables/attributes
- sets dimensions
- sets variables

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
attributes : dict-like
Dictionary of key/value (attribute name / attribute) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
writer : ArrayWriter
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
"""

variables, attributes = self.encode(variables, attributes)
Contributor: This is where the encoding from datetime64 to int64 with "days since ..." units happens.

If we wanted to make sure that the encoding of the new variables is compatible with the target store, we would have to peek at the target store encodings and explicitly put them in the new variable encoding.

Contributor (author): Will try doing that :) It will probably take a while, but I might be able to do it on Monday or Tuesday 👍

Contributor: This is not an easy problem. Advice from @shoyer and @jhamman would be valuable.

Member: I would even consider opening up the zarr store (into an xarray.Dataset) before doing any appending. Then it’s easy to decode all the metadata and ensure consistency of the appended data.

Contributor (author): I would try to avoid opening the whole zarr store for performance reasons and instead just try pulling the encodings from the array attributes. I think the only way to really solve this is to let all CF encoders use a specific encoding if one is passed. This would allow passing the encoding from https://github.com/pydata/xarray/blob/master/xarray/backends/zarr.py#L209 to https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L204 and getting the correctly encoded array for the append. What do you think?
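
A minimal sketch of the approach suggested above, assuming the target is opened lazily with xarray and its per-variable encodings are copied onto the variables being appended (the helper name is made up, not part of this PR):

import xarray as xr

def encodings_from_target(store, ds_to_append):
    # Peek at the existing zarr store and copy the per-variable encodings
    # so the appended data gets encoded consistently with what is on disk.
    existing = xr.open_zarr(store)
    return {name: dict(existing[name].encoding)
            for name in ds_to_append.variables
            if name in existing.variables}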


self.set_attributes(attributes)
self.set_dimensions(variables, unlimited_dims=unlimited_dims)
self.set_variables(variables, check_encoding_set, writer,
unlimited_dims=unlimited_dims)

def sync(self):
pass

def set_variables(self, variables, check_encoding_set, writer,
unlimited_dims=None, append_dim=None):
"""
This provides a centralized method to set the variables on the data
store.

Parameters
----------
variables : dict-like
Dictionary of key/value (variable name / xr.Variable) pairs
check_encoding_set : list-like
List of variables that should be checked for invalid encoding
values
writer : ArrayWriter
unlimited_dims : list-like
List of dimension names that should be treated as unlimited
dimensions.
append_dim: str
dimension on which the zarray will be appended
only needed in append mode
"""
for vn, v in variables.items():
name = _encode_variable_name(vn)
check = vn in check_encoding_set
attrs = v.attrs.copy()
dims = v.dims
dtype = v.dtype
shape = v.shape

fill_value = attrs.pop('_FillValue', None)
if v.encoding == {'_FillValue': None} and fill_value is None:
v.encoding = {}
append = False
try:
zarr_array = self.ds[name]
append = True
except KeyError:
encoding = _extract_zarr_variable_encoding(
v, raise_on_invalid=check)
encoded_attrs = OrderedDict()
# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k2, v2 in attrs.items():
encoded_attrs[k2] = self.encode_attribute(v2)
Contributor: What if we pulled this attribute encoding out before the try block? Then we could check encoded_attrs against zarr_array.attrs before appending.
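
A rough sketch of that suggestion, reusing the names from the diff above (hypothetical rearrangement, not part of this PR):

# inside ZarrStore.set_variables: build encoded_attrs before the try block
encoded_attrs = OrderedDict()
encoded_attrs[_DIMENSION_KEY] = dims
for k2, v2 in attrs.items():
    encoded_attrs[k2] = self.encode_attribute(v2)

try:
    zarr_array = self.ds[name]
    # compare the stored dimension names with the variable being appended
    if tuple(zarr_array.attrs.get(_DIMENSION_KEY, ())) != tuple(dims):
        raise ValueError("dimensions of %r do not match the target store"
                         % name)
    append = True
except KeyError:
    encoding = _extract_zarr_variable_encoding(v, raise_on_invalid=check)
    zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
                                fill_value=fill_value, **encoding)
    zarr_array.attrs.put(encoded_attrs)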


zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
fill_value=fill_value, **encoding)
zarr_array.attrs.put(encoded_attrs)
zarr_array[...] = v.data
if append:
if self.append_dim is None:
raise ValueError('The dimension on which the data is \
appended has to be named.')
Contributor (@rabernat, Jan 29, 2019): What if we just want to add a new variable to an existing zarr store? This PR could hypothetically support that case as well, but in that case, there is no append_dim to specify.

Contributor (author): As mentioned in the other comment, it should already work, but I will add another test for it 👍

if self.append_dim not in dims:
continue
axis = dims.index(self.append_dim)
zarr_array.append(v.data, axis=axis)

def close(self):
if self._consolidate_on_close:
import zarr
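For reference, the zarr-level primitive the new append path relies on; a standalone sketch using the zarr package directly (shapes and chunking are arbitrary):

import numpy as np
import zarr

# create a small chunked array and grow it along axis 0 in place
z = zarr.zeros((3, 4), chunks=(2, 2), dtype='f8')
z[...] = np.random.rand(3, 4)
z.append(np.random.rand(2, 4), axis=0)
assert z.shape == (5, 4)
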
16 changes: 10 additions & 6 deletions xarray/core/dataset.py
@@ -1243,7 +1243,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
compute=compute)

def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
encoding=None, compute=True, consolidated=False):
encoding=None, compute=True, consolidated=False,
append_dim=None):
"""Write dataset contents to a zarr group.

.. note:: Experimental
@@ -1254,9 +1255,10 @@ def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
----------
store : MutableMapping or str, optional
Store or path to directory in file system.
mode : {'w', 'w-'}
mode : {'w', 'w-', 'a'}
Persistence mode: 'w' means create (overwrite if exists);
'w-' means create (fail if exists).
'w-' means create (fail if exists);
'a' means append (create if does not exist).
synchronizer : object, optional
Array synchronizer
group : str, optional
@@ -1271,21 +1273,23 @@ def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,
consolidated: bool, optional
If True, apply zarr's `consolidate_metadata` function to the store
after writing.
append_dim: str
If mode='a', the dimension along which the data will be appended.

References
----------
https://zarr.readthedocs.io/
"""
if encoding is None:
encoding = {}
if mode not in ['w', 'w-']:
# TODO: figure out how to handle 'r+' and 'a'
if mode not in ['w', 'w-', 'a']:
# TODO: figure out how to handle 'r+'
raise ValueError("The only supported options for mode are 'w' "
"and 'w-'.")
Member: and 'a' now!

from ..backends.api import to_zarr
return to_zarr(self, store=store, mode=mode, synchronizer=synchronizer,
group=group, encoding=encoding, compute=compute,
consolidated=consolidated)
consolidated=consolidated, append_dim=append_dim)

def __repr__(self):
return formatting.dataset_repr(self)
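Matching the docstring above, a hedged sketch of the 'a' semantics (create if the group does not exist yet, append otherwise; the path is illustrative and ds/ds2 are the datasets from the earlier sketch):

# first call on a fresh path: nothing exists yet, so 'a' behaves like a create
ds.to_zarr('fresh.zarr', mode='a')
# second call: the group exists, so data is appended along the named dimension
ds2.to_zarr('fresh.zarr', mode='a', append_dim='time')
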
18 changes: 12 additions & 6 deletions xarray/tests/test_backends.py
@@ -35,7 +35,7 @@
requires_pathlib, requires_pseudonetcdf, requires_pydap, requires_pynio,
requires_rasterio, requires_scipy, requires_scipy_or_netCDF4,
requires_zarr)
from .test_dataset import create_test_data
from .test_dataset import create_test_data, create_append_test_data

try:
import netCDF4 as nc4
@@ -1482,11 +1482,17 @@ def test_write_persistence_modes(self):
with pytest.raises(ValueError):
self.save(original, store, mode='w-')

# check that we can't use other persistence modes
# TODO: reconsider whether other persistence modes should be supported
with pytest.raises(ValueError):
with self.roundtrip(original, save_kwargs={'mode': 'a'}) as actual:
pass
# check append mode for normal write
with self.roundtrip(original, save_kwargs={'mode': 'a'}) as actual:
assert_identical(original, actual)

# check append mode for append write
obj, obj2 = create_append_test_data()
with self.create_zarr_target() as store_target:
obj.to_zarr(store_target, mode='w')
obj2.to_zarr(store_target, mode='a', append_dim='dim1')
original = xr.concat([obj, obj2], dim='dim1')
assert_identical(original, xr.open_zarr(store_target))
Contributor: 👍 love this test 😄


def test_compressor_encoding(self):
original = create_test_data()
37 changes: 37 additions & 0 deletions xarray/tests/test_dataset.py
@@ -52,6 +52,43 @@ def create_test_data(seed=None):
return obj


def create_append_test_data(seed=None):
rs = np.random.RandomState(seed)
_vars = {'var1': ['dim1', 'dim2'],
'var2': ['dim1', 'dim2'],
'var3': ['dim3', 'dim1']}
_dims = {'dim1': 8, 'dim2': 9, 'dim3': 10}

obj = Dataset()
obj['time'] = ('time', pd.date_range('2000-01-01', periods=20))
obj['dim2'] = ('dim2', 0.5 * np.arange(_dims['dim2']))
obj['dim3'] = ('dim3', list('abcdefghij'))
for v, dims in sorted(_vars.items()):
data = rs.normal(size=tuple(_dims[d] for d in dims))
obj[v] = (dims, data, {'foo': 'variable'})
obj.coords['numbers'] = ('dim3', np.array([0, 1, 2, 0, 0, 1, 1, 2, 2, 3],
dtype='int64'))
obj.encoding = {'foo': 'bar'}
assert all(objp.data.flags.writeable for objp in obj.variables.values())
_vars = {'var1': ['dim1', 'dim2'],
'var2': ['dim1', 'dim2'],
'var3': ['dim3', 'dim1']}
_dims = {'dim1': 8, 'dim2': 9, 'dim3': 10}

obj2 = Dataset()
obj2['time'] = ('time', pd.date_range('2000-01-01', periods=20))
obj2['dim2'] = ('dim2', 0.5 * np.arange(_dims['dim2']))
obj2['dim3'] = ('dim3', list('abcdefghij'))
for v, dims in sorted(_vars.items()):
data = rs.normal(size=tuple(_dims[d] for d in dims))
obj2[v] = (dims, data, {'foo': 'variable'})
obj2.coords['numbers'] = ('dim3', np.array([0, 1, 2, 0, 0, 1, 1, 2, 2, 3],
dtype='int64'))
obj2.encoding = {'foo': 'bar'}
assert all(objp.data.flags.writeable for objp in obj2.variables.values())
return obj, obj2


def create_test_multiindex():
mindex = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
names=('level_1', 'level_2'))
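As an aside, create_append_test_data above builds obj and obj2 with two identical blocks; a deduplicated sketch of the same helper (not part of the PR; behaviour is preserved because both datasets draw from the same RandomState):

def create_append_test_data(seed=None):
    rs = np.random.RandomState(seed)
    _vars = {'var1': ['dim1', 'dim2'],
             'var2': ['dim1', 'dim2'],
             'var3': ['dim3', 'dim1']}
    _dims = {'dim1': 8, 'dim2': 9, 'dim3': 10}

    def _make():
        # build one dataset; repeated calls reuse the same RandomState,
        # so the two datasets get different data
        ds = Dataset()
        ds['time'] = ('time', pd.date_range('2000-01-01', periods=20))
        ds['dim2'] = ('dim2', 0.5 * np.arange(_dims['dim2']))
        ds['dim3'] = ('dim3', list('abcdefghij'))
        for v, dims in sorted(_vars.items()):
            data = rs.normal(size=tuple(_dims[d] for d in dims))
            ds[v] = (dims, data, {'foo': 'variable'})
        ds.coords['numbers'] = ('dim3',
                                np.array([0, 1, 2, 0, 0, 1, 1, 2, 2, 3],
                                         dtype='int64'))
        ds.encoding = {'foo': 'bar'}
        assert all(v_.data.flags.writeable for v_ in ds.variables.values())
        return ds

    return _make(), _make()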