
Appending to zarr store #2706

Merged · 45 commits · Jun 29, 2019
Conversation

@jendrikjoe (Contributor) commented Jan 24, 2019

This pull request makes it possible to append an xarray Dataset to an existing zarr store.

  • Closes Enable Append/concat to existing zarr datastore #2022
  • Tests will be added; first I wanted to get an opinion on whether this is what the community has in mind
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

To filter the data written to the array, the dimension over which the data will be appended has to be stated explicitly. If someone has an idea how to overcome this, I would be more than happy to incorporate the necessary changes into the PR.

Cheers,
Jendrik

@pep8speaks commented Jan 24, 2019

Hello @jendrikjoe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-29 22:49:41 UTC

@dcherian requested a review from rabernat on January 25, 2019
@rabernat (Contributor)

Hi @jendrikjoe -- thanks for submitting a PR to address one of the most important issues in xarray (IMHO)! I am very excited about your contribution and am looking forward to getting this feature merged.

I have many questions about how this works. I think the best way to move forward is to wait until we have a test for the append feature, which would involve the following steps:

  • Write a dataset to a zarr store
  • Open the store in append mode
  • Append data along a particular dimension

Seeing the code that accomplishes this will help clarify for me what is happening.

Thanks again for your contribution, and welcome to xarray!
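
For reference, the requested workflow maps onto the API this PR proposes roughly as follows (a minimal sketch; the store path and data are illustrative):

import xarray as xr

# 1. Write a dataset to a zarr store
ds = xr.Dataset({'temperature': ('time', [15.0, 16.0, 17.0])})
ds.to_zarr('example.zarr', mode='w')

# 2./3. Open the store in append mode and append along a dimension
ds_new = xr.Dataset({'temperature': ('time', [18.0, 19.0])})
ds_new.to_zarr('example.zarr', mode='a', append_dim='time')

# the store now holds the concatenation along 'time'
combined = xr.open_zarr('example.zarr')
assert combined.dims['time'] == 5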

@jendrikjoe (Contributor, Author)

Hi @rabernat,

happy to help! I love using xarray. I added the tests for the append mode.
One makes sure that it behaves like the 'w' mode if no data exist at the target path.
The other tests what you described. The append_dim argument is effectively the same as the dim argument of concat.
Hope that helps clarify my code :)

@rabernat (Contributor)

Ok, with the example, I can see a bit better how this works.

Here is my main concern: there doesn't appear to be any alignment checking between the target dataset and the new data. The only check that happens is whether a variable with the same name already exists in the target store; if so, append is used (rather than creating a new array). What if the coordinates differ? What if the attributes differ?

I'm not sure this is a deal-breaker. But we should be very clear about this in the docs.

if append:
    if self.append_dim is None:
        raise ValueError('The dimension on which the data is '
                         'appended has to be named.')
@rabernat (Contributor) commented Jan 29, 2019:

What if we just want to add a new variable to an existing zarr store? This PR could hypothetically support that case as well, but in that case, there is no append_dim to specify.

@jendrikjoe (author) replied:

As mentioned in the other comment, it should already work, but I will add another test for it 👍

@jendrikjoe (Contributor, Author) commented Jan 29, 2019

You are definitely right that there are no checks regarding alignment.
However, if any dimension other than the append_dim does not align, zarr will raise an error.
A differing coordinate could definitely be an issue. I did not think about that, as I am dumping reshaped dask.dataframe partitions with the append mode and am therefore never allowed to have a name twice anyway. It might indeed be interesting for other users, and a similar point applies to the attributes. I could try figuring that out as well, but it might take a while.
The place where the ValueError is raised should already allow adding other variables, as those are added in the KeyError exception above :)

@davidbrochart (Contributor)

Hi @jendrikjoe,

Thanks for your PR; I am very interested in it because this is something I was hacking on myself (see here). In my particular case, I want to append along a time dimension, but it looks like your PR doesn't support that yet. In the following example, ds2 should have a time dimension ranging from 2000-01-01 to 2000-01-06:

import xarray as xr
import pandas as pd

ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
ds1 = xr.Dataset({'temperature': (['time'],  [53, 54, 55])}, coords={'time': pd.date_range('2000-01-04', periods=3)})

ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')

ds2 = xr.open_zarr('temp')

But that is not the case:

ds2.time
<xarray.DataArray 'time' (time: 6)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-01T00:00:00.000000000',
       '2000-01-02T00:00:00.000000000', '2000-01-03T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-03

Maybe it's not intended to work with time dimensions yet?

@davidbrochart (Contributor)

To make it work, time dimensions would have to be treated separately because zarr doesn't encode absolute time values but deltas relative to a reference (see https://github.com/davidbrochart/pangeo_upload/blob/master/py/trmm2pangeo.py#L108).

@jendrikjoe (Contributor, Author)

Hey @davidbrochart,
thanks for all your input, and for the research on how zarr stores the data as well.
I would actually argue that calculating the correct relative times should be handled by the zarr append function.
The exception would of course be if xarray also stores the data as deltas relative to a reference?
In that case I would try collecting the minimum and offsetting the input by it.
@rabernat, can you provide input on that?

@davidbrochart (Contributor)

zarr stores the reference in the .zattrs file:

{
    "_ARRAY_DIMENSIONS": [
        "time"
    ],
    "calendar": "proleptic_gregorian",
    "units": "days since 2000-01-01 00:00:00"
}

@jendrikjoe (Contributor, Author)

I will also check how xarray stores times, to see if we have to add the offset on the xarray side first or if this can be resolved with a PR to zarr :)

@rabernat (Contributor)

So the problem in @davidbrochart's example is that there are different encodings on the time variables in the two datasets.

When writing datetimes, xarray automatically picks an encoding (e.g. days since 2000-01-01 00:00:00) based on some heuristics. When serializing the dataset, this encoding is used to encode the datetime64[ns] dtype into a different dtype, and the encoding is placed in the attributes of the store. When you open the dataset, the encoding is automatically decoded according to CF conventions. This can be disabled by using decode_cf=False or decode_times=False when you open the dataset.

In this case, xarray's heuristics are picking different encodings for the two dates. You could make this example work by manually specifying encoding on the appended dataset to be the same as the original.

This example illustrates the need for some sort of compatibility checks between the target dataset and the appended dataset. For example, checking for attribute compatibility would have caught this error.
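
A sketch of that workaround for the example above, pinning the appended dataset's time encoding to what the first write chose (units and calendar taken from the .zattrs shown earlier; assumes setting .encoding before writing, which xarray honors during serialization):

# make ds1 serialize against the same reference date as ds0
ds1.time.encoding = {
    'units': 'days since 2000-01-01 00:00:00',
    'calendar': 'proleptic_gregorian',
    'dtype': 'int64',
}
ds1.to_zarr('temp', mode='a', append_dim='time')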

@shoyer (Member) commented Feb 1, 2019

We should definitely always make sure that we write data consistently (e.g., for dates), but checking for alignment of all coordinates could be expensive/slow. Potentially a keyword argument ignore_alignment=True would be a good way for users to opt out of checking index coordinates for consistency.

@davidbrochart (Contributor)

When we use this feature e.g. to store data that is produced every day, we might start with a data set that has a small size on the time dimension, and thus the chunks will be chosen according to this initial shape. When we append to this data set, will the chunks be kept as in the initial zarr archive? If so, we might end up with a lot of small chunks on the time dimension, where ideally we would have chosen only one chunk.

@rabernat (Contributor) commented Feb 1, 2019

We should definitely always make sure that we write data consistently (e.g., for dates), but checking for alignment of all coordinates could be expensive/slow.

This implies we should be checking for attribute compatibility before calling zarr.append.

# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k2, v2 in attrs.items():
    encoded_attrs[k2] = self.encode_attribute(v2)
Review comment (Contributor):

What if we pulled this attribute encoding out before the try block? Then we could check encoded_attrs against zarr_array.attrs before appending.

only needed in append mode
"""

variables, attributes = self.encode(variables, attributes)
Review comment (Contributor):

This is where the encoding from datetime64 to int64 with days since ... units happens.

If we wanted to make sure that the encoding of the new variables is compatible with the target store, we would have to peek at the target store encodings and explicitly put them in the new variable encoding.
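
A rough sketch of that idea (hypothetical names store_path and new_ds; it uses open_zarr to decode the store's metadata, which is also what is suggested a few comments below):

import xarray as xr

# read the encodings the target store already uses
existing = xr.open_zarr(store_path)

# copy them onto the matching variables of the dataset to append, so
# CF encoding produces values consistent with what is already on disk
for name in set(new_ds.variables) & set(existing.variables):
    var = new_ds.variables[name].copy(deep=False)  # don't mutate the original
    var.encoding = existing.variables[name].encoding
    new_ds[name] = var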

@jendrikjoe (author) replied:

Will try doing that :) Will probably take a while, but I might be able to do that on Monday or Tuesday 👍

Review comment (Contributor):

This is not an easy problem. Advice from @shoyer and @jhamman would be valuable.

Review comment (Member):

I would even consider opening up the zarr store (into an xarray.Dataset) before doing any appending. Then it’s easy to decode all the metadata and ensure consistency of the appended data.

@jendrikjoe (author) replied:

I would try to avoid opening the whole zarr store for performance reasons and instead just try pulling the encodings from the array attributes. I think the only way to really solve this is to add the possibility for all CF encoders to use a specific encoding if one is passed.
This would allow passing the encoding to https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L204 from https://github.com/pydata/xarray/blob/master/xarray/backends/zarr.py#L209 and getting a correctly encoded array for the append. What do you think?

@davidbrochart (Contributor)

Hi @jendrikjoe, do you plan to work on this PR again in the future? I think it would be a great contribution to xarray.

@davidbrochart (Contributor)

May I try and take this work over?

@rabernat (Contributor)

@davidbrochart I would personally be happy to see anyone work on this. I'm sure @jendrikjoe would not mind if we make it a team effort!

@shikharsg

adding a new variable currently errors if we don't provide the append_dim argument:

>>> import xarray as xr
>>> import pandas as pd
>>> ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> ds1 = xr.Dataset({'pressure': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> store = dict()
>>> ds0.to_zarr(store, mode='w')
<xarray.backends.zarr.ZarrStore object at 0x7fae926505c0>
>>> ds1.to_zarr(store, mode='a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/shikhar/code/xarray/xarray/core/dataset.py", line 1374, in to_zarr
    consolidated=consolidated, append_dim=append_dim)
  File "/home/shikhar/code/xarray/xarray/backends/api.py", line 1071, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/shikhar/code/xarray/xarray/backends/api.py", line 928, in dump_to_store
    unlimited_dims=unlimited_dims)
  File "/home/shikhar/code/xarray/xarray/backends/zarr.py", line 366, in store
    unlimited_dims=unlimited_dims)
  File "/home/shikhar/code/xarray/xarray/backends/zarr.py", line 406, in set_variables
    "was not set".format(name)
ValueError: variable 'time' already exists, but append_dim was not set

this works:

>>> import xarray as xr
>>> import pandas as pd
>>> ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> ds1 = xr.Dataset({'pressure': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> store = dict()
>>> ds0.to_zarr(store, mode='w')
<xarray.backends.zarr.ZarrStore object at 0x7fae926505c0>
>>> ds1.to_zarr(store, mode='a', append_dim='asdfasdf')
>>> xr.open_zarr(store)
<xarray.Dataset>
Dimensions:      (time: 3)
Coordinates:
  * time         (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
Data variables:
    pressure     (time) int64 dask.array<shape=(3,), chunksize=(3,)>
    temperature  (time) int64 dask.array<shape=(3,), chunksize=(3,)>

will push a fix for this in a bit

Shikhar Goenka added 2 commits June 27, 2019 14:14
@shikharsg

I have now fixed the above error, and all comments have been addressed.

@rabernat @shoyer

@rabernat (Contributor)

adding a new variable currently errors if we don't provide the append_dim argument:

Is this scenario now covered by the tests? Sorry if the answer is obvious; it's hard for me to discern just by looking at the code.

@shoyer (Member) commented Jun 27, 2019

Just to be clear, we do always require writing append_dim if you want to append values along a dimension, right? And we raise an informative error if you write append_dim='not-a-valid-dimension'?

@shikharsg

adding a new variable currently errors if we don't provide the append_dim argument:

Is this scenario now covered by the tests? Sorry if the answer is obvious; it's hard for me to discern just by looking at the code.

@rabernat, the scenario I am talking about is adding a new DataArray to an existing Dataset (in which case we do not have to specify an append_dim argument). Yes, it is covered by tests; specifically, see the with clause here: https://github.com/pydata/xarray/pull/2706/files#diff-df47fcb9c2f1f7dfc0c6032d97072af2R1636

Just to be clear, we do always requiring writing append_dim if you want to append values along a dimension, right? And we raise an informative error if you write append_dim='not-a-valid-dimension'?

@shoyer We do always require append_dim when appending to an existing array, but I just realized that it does not raise an error when append_dim='not-valid'; it silently fails to append to the existing array. Let me write a test for that and push a fix.
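
Such a test could look roughly like this (a sketch, not necessarily the test that was pushed; the exact error message is up to the implementation):

import pytest
import xarray as xr

ds = xr.Dataset({'temperature': ('time', [50, 51, 52])})
store = dict()
ds.to_zarr(store, mode='w')

# appending along a dimension that matches nothing should raise,
# not silently skip the append
with pytest.raises(ValueError):
    ds.to_zarr(store, mode='a', append_dim='not-a-valid-dimension')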

@shikharsg

It's done. I fixed it by opening the zarr dataset beforehand using xr.open_zarr.

if name in self.ds:
    zarr_array = self.ds[name]
    """
    If variable is a dimension of an existing array
Review comment (Member):

Please use # for comments instead of strings, which should be reserved for docstrings.

@shikharsg replied:

Sorry, yes. That comment is not actually needed (it's from code I already removed), so I'll remove it.

variables_with_encoding = OrderedDict()
for vn in existing_variables:
    variables_with_encoding[vn] = variables[vn]
    variables_with_encoding[vn].encoding = ds[vn].encoding
Review comment (Member):

This modifies an argument that was passed into the function in-place, which in general should be avoided due to unexpected side effects in other parts of the code. It would be better to shallow copy the Variable before overriding encoding, e.g., with variables[vn].copy(deep=False).
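
A minimal sketch of the suggested change to the loop above:

from collections import OrderedDict

variables_with_encoding = OrderedDict()
for vn in existing_variables:
    # shallow copy so the caller's Variable is not mutated
    variables_with_encoding[vn] = variables[vn].copy(deep=False)
    variables_with_encoding[vn].encoding = ds[vn].encoding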


if len(existing_variables) > 0:
    # there are variables to append;
    # their encoding must be the same as in the store
Review comment (Member):

Are there any unit tests that verify that encoding is kept consistent? This would be nice to add, if not.

Probably a good example would be a dataset saved with scale/offset encoding, where the new dataset to be appended does not have any encoding provided. We could verify that properly scaled values are read back from disk.
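
Such a test might look like this (a sketch, not the test that was eventually added; it writes with explicit scale/offset encoding, appends a dataset without any encoding, and checks the round-trip):

import numpy as np
import xarray as xr

ds = xr.Dataset({'temp': ('time', np.array([15.0, 15.5]))})
ds_extra = xr.Dataset({'temp': ('time', np.array([16.0, 16.5]))})

store = dict()
ds.to_zarr(store, mode='w',
           encoding={'temp': {'scale_factor': 0.5, 'dtype': 'int16'}})
# the appended data must be written with the store's encoding,
# not re-encoded from scratch
ds_extra.to_zarr(store, mode='a', append_dim='time')

expected = xr.concat([ds, ds_extra], dim='time')
xr.testing.assert_allclose(expected, xr.open_zarr(store))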

@@ -1040,11 +1078,16 @@ def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
_validate_dataset_names(dataset)
_validate_attrs(dataset)

if mode == "a":
Review comment (Member):

Can we raise an error if encoding was passed explicitly into to_zarr() and specifies encoding for any existing variables? I.e., ds.to_zarr(..., mode='a', encoding={'existing_variable': ...}). I think this would always indicate a programming mistake, given that we throw away these variable encodings anyway.
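
A hypothetical sketch of such a guard inside to_zarr (names follow the surrounding diff; the exact message is illustrative):

if mode == 'a' and encoding:
    existing_vars = set(xr.open_zarr(store).variables)
    conflicts = [name for name in encoding if name in existing_vars]
    if conflicts:
        raise ValueError(
            'variables %r already exist in the target store; their '
            'encoding is read from the store and cannot be overridden'
            % conflicts)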

…able

* add test for encoding consistency when appending
* implemented: pydata#2706 (comment)
* refactored tests
@shikharsg

I have implemented all the suggested changes and refactored the append tests, as previously all of them were crammed into test_write_persistence_modes.

I'm not sure why the build fails. In all the failed checks, it is these two tests that are failing:

================================== FAILURES ===================================
_________________________ test_rolling_properties[1] __________________________

da = <xarray.DataArray (a: 3, time: 21, x: 4)>
array([[[0.561926, 0.243845, 0.601879, 0.733398],
        [0.500418, 0.84942...ordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-21
Dimensions without coordinates: a, x

    def test_rolling_properties(da):
        rolling_obj = da.rolling(time=4)
    
        assert rolling_obj.obj.get_axis_num('time') == 1
    
        # catching invalid args
        with pytest.raises(ValueError) as exception:
            da.rolling(time=7, x=2)
>       assert 'exactly one dim/window should' in str(exception)
E       AssertionError: assert 'exactly one dim/window should' in '<ExceptionInfo ValueError tblen=4>'
E        +  where '<ExceptionInfo ValueError tblen=4>' = str(<ExceptionInfo ValueError tblen=4>)

xarray\tests\test_dataarray.py:3715: AssertionError
_________________________ test_rolling_properties[1] __________________________

ds = <xarray.Dataset>
Dimensions:  (time: 10, x: 8, y: 2)
Coordinates:
  * x        (x) float64 0.0 0.1429 0.2857 0.4286 0....-1.152 -0.6704 ... -0.9796 -1.884 0.4049
    z2       (time, y) float64 -1.218 -0.9627 -1.398 ... -0.3552 0.1446 0.3392

    def test_rolling_properties(ds):
        # catching invalid args
        with pytest.raises(ValueError) as exception:
            ds.rolling(time=7, x=2)
>       assert 'exactly one dim/window should' in str(exception)
E       AssertionError: assert 'exactly one dim/window should' in '<ExceptionInfo ValueError tblen=4>'
E        +  where '<ExceptionInfo ValueError tblen=4>' = str(<ExceptionInfo ValueError tblen=4>)

xarray\tests\test_dataset.py:4845: AssertionError
============================== warnings summary ===============================

I have no idea why, as the same two tests pass on my local machine.

ds.to_zarr(store_target, mode='w')
ds_to_append.to_zarr(store_target, mode='a', append_dim='time')
original = xr.concat([ds, ds_to_append], dim='time')
assert_identical(original, xr.open_zarr(store_target))

@pytest.mark.xfail(reason="Zarr stores can not be appended to")
def test_append_overwrite_values(self):
@shikharsg commented Jun 29, 2019:

What is this test exactly supposed to do? Is it something like what's happening here: https://github.com/pydata/xarray/blob/master/xarray/tests/test_backends.py#L881?
If so, this kind of functionality is not implemented in this PR yet. You can only "append" to existing variables, not overwrite existing data; that would have to be done with the 'w' mode.

@shoyer @rabernat

Review comment (Member):

Yes, I think you have analyzed this correctly. This is fine.

@shoyer mentioned this pull request on Jun 29, 2019
@shoyer (Member) commented Jun 29, 2019

@shikharsg the test failure should be fixed on master (by #3059).

@shoyer dismissed rabernat’s stale review June 29, 2019 23:42

approved in a comment already

@shoyer merged commit 18f35da into pydata:master on Jun 29, 2019
@shoyer (Member) commented Jun 29, 2019

OK, thank you @shikharsg, @jendrikjoe and everyone else who worked on this!
