
Appending to zarr store #2706

Merged · 45 commits · Jun 29, 2019
Conversation

@jendrikjoe (Contributor) commented Jan 24, 2019

This pull request makes it possible to append an xarray Dataset to an existing zarr store.

  • Closes Enable Append/concat to existing zarr datastore #2022
  • Tests will be added; first I wanted to get an opinion on whether this is what the community has in mind
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

To filter the data written to the array, the dimension over which the data will be appended has to be stated explicitly. If someone has an idea how to overcome this, I would be more than happy to incorporate the necessary changes into the PR.

Cheers,
Jendrik

@pep8speaks commented Jan 24, 2019

Hello @jendrikjoe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-06-29 22:49:41 UTC

@dcherian requested a review from rabernat on January 25, 2019
@rabernat (Contributor)

Hi @jendrikjoe -- thanks for submitting a PR to address one of the most important issues in xarray (IMHO)! I am very excited about your contribution and am looking forward to getting this feature merged.

I have many questions about how this works. I think the best way to move forward is to wait until we have a test for the append feature, which would involve the following steps:

  • Write a dataset to a zarr store
  • Open the store in append mode
  • Append data along a particular dimension

Seeing the code that accomplishes this will help clarify for me what is happening.

Thanks again for your contribution, and welcome to xarray!
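
For reference, the requested workflow maps onto the API this PR proposes roughly as follows (a minimal sketch; the store path and data are illustrative):

import xarray as xr

# 1. Write a dataset to a zarr store
ds = xr.Dataset({'temperature': ('time', [15.0, 16.0, 17.0])})
ds.to_zarr('example.zarr', mode='w')

# 2./3. Open the store in append mode and append along a dimension
ds_new = xr.Dataset({'temperature': ('time', [18.0, 19.0])})
ds_new.to_zarr('example.zarr', mode='a', append_dim='time')

# the store now holds the concatenation along 'time'
combined = xr.open_zarr('example.zarr')
assert combined.dims['time'] == 5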

@jendrikjoe (Contributor, Author)

Hi @rabernat,

happy to help! I love using xarray. I added the tests for the append mode.
One makes sure that it behaves like the 'w' mode if no data exist at the target path.
The other tests what you described. The append_dim argument is effectively the same as the dim argument of concat.
Hope that helps clarify my code :)

@rabernat (Contributor)

Ok, with the example, I can see a bit better how this works.

Here is my main concern: there doesn't appear to be any alignment checking between the target dataset and the new data. The only check that happens is whether a variable with the same name already exists in the target store; if so, append is used (rather than creating a new array). What if the coordinates differ? What if the attributes differ?

I'm not sure this is a deal-breaker. But we should be very clear about this in the docs.

if append:
    if self.append_dim is None:
        raise ValueError('The dimension on which the data is '
                         'appended has to be named.')
@rabernat (Contributor) commented Jan 29, 2019:

What if we just want to add a new variable to an existing zarr store? This PR could hypothetically support that case as well, but in that case, there is no append_dim to specify.

@jendrikjoe (author) replied:

As mentioned in the other comment, it should already work, but I will add another test for it 👍

@jendrikjoe (Contributor, Author) commented Jan 29, 2019

You are definitely right that there are no checks regarding alignment.
However, if any dimension other than the append_dim does not align, zarr will raise an error.
A differing coordinate could definitely be an issue. I did not think about that, as I am dumping reshaped dask.dataframe partitions with the append mode and am therefore never allowed to have a name twice anyway. It might indeed be interesting for other users, and a similar point applies to the attributes. I could try figuring that out as well, but it might take a while.
The place where the ValueError is raised should already allow adding other variables, as those are added in the KeyError exception above :)

@davidbrochart (Contributor)

Hi @jendrikjoe,

Thanks for your PR; I am very interested in it because this is something I was hacking on myself (see here). In my particular case, I want to append along a time dimension, but it looks like your PR doesn't support that yet. In the following example, ds2 should have a time dimension ranging from 2000-01-01 to 2000-01-06:

import xarray as xr
import pandas as pd

ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
ds1 = xr.Dataset({'temperature': (['time'],  [53, 54, 55])}, coords={'time': pd.date_range('2000-01-04', periods=3)})

ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')

ds2 = xr.open_zarr('temp')

But that is not the case:

ds2.time
<xarray.DataArray 'time' (time: 6)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
       '2000-01-03T00:00:00.000000000', '2000-01-01T00:00:00.000000000',
       '2000-01-02T00:00:00.000000000', '2000-01-03T00:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-03

Maybe it's not intended to work with time dimensions yet?

@davidbrochart (Contributor)

To make it work, time dimensions would have to be treated separately because zarr doesn't encode absolute time values but deltas relative to a reference (see https://github.com/davidbrochart/pangeo_upload/blob/master/py/trmm2pangeo.py#L108).

@jendrikjoe (Contributor, Author)

Hey @davidbrochart,
thanks for all your input, and for the research on how zarr stores the data as well.
I would actually argue that calculating the correct relative times should be handled by the zarr append function.
The exception would of course be if xarray also stores the data as deltas relative to a reference?
In that case I would try collecting the minimum and offsetting the input by it.
@rabernat, can you provide input on that?

@davidbrochart (Contributor)

zarr stores the reference in the .zattrs file:

{
    "_ARRAY_DIMENSIONS": [
        "time"
    ],
    "calendar": "proleptic_gregorian",
    "units": "days since 2000-01-01 00:00:00"
}

@jendrikjoe (Contributor, Author)

I will also check how xarray stores times, to see if we have to add the offset on the xarray side first or if this can be resolved with a PR to zarr :)

@rabernat (Contributor)

So the problem in @davidbrochart's example is that there are different encodings on the time variables in the two datasets.

When writing datetimes, xarray automatically picks an encoding (e.g. days since 2000-01-01 00:00:00) based on some heuristics. When serializing the dataset, this encoding is used to encode the datetime64[ns] dtype into a different dtype, and the encoding is placed in the attributes of the store. When you open the dataset, the encoding is automatically decoded according to CF conventions. This can be disabled by using decode_cf=False or decode_times=False when you open the dataset.

In this case, xarray's heuristics are picking different encodings for the two dates. You could make this example work by manually specifying encoding on the appended dataset to be the same as the original.

This example illustrates the need for some sort of compatibility checks between the target dataset and the appended dataset. For example, checking for attribute compatibility would have caught this error.
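
A sketch of that workaround for the example above, pinning the appended dataset's time encoding to what the first write chose (units and calendar taken from the .zattrs shown earlier; assumes setting .encoding before writing, which xarray honors during serialization):

# make ds1 serialize against the same reference date as ds0
ds1.time.encoding = {
    'units': 'days since 2000-01-01 00:00:00',
    'calendar': 'proleptic_gregorian',
    'dtype': 'int64',
}
ds1.to_zarr('temp', mode='a', append_dim='time')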

@shoyer (Member) commented Feb 1, 2019

We should definitely always make sure that we write data consistently (e.g., for dates), but checking for alignment of all coordinates could be expensive/slow. Potentially a keyword argument ignore_alignment=True would be a good way for users to opt out of checking index coordinates for consistency.

@davidbrochart (Contributor)

When we use this feature e.g. to store data that is produced every day, we might start with a data set that has a small size on the time dimension, and thus the chunks will be chosen according to this initial shape. When we append to this data set, will the chunks be kept as in the initial zarr archive? If so, we might end up with a lot of small chunks on the time dimension, where ideally we would have chosen only one chunk.

@rabernat (Contributor) commented Feb 1, 2019

We should definitely always make sure that we write data consistently (e.g., for dates), but checking for alignment of all coordinates could be expensive/slow.

This implies we should be checking for attribute compatibility before calling zarr.append.

# the magic for storing the hidden dimension data
encoded_attrs[_DIMENSION_KEY] = dims
for k2, v2 in attrs.items():
    encoded_attrs[k2] = self.encode_attribute(v2)
Review comment (Contributor):

What if we pulled this attribute encoding out before the try block? Then we could check encoded_attrs against zarr_array.attrs before appending.

only needed in append mode
"""

variables, attributes = self.encode(variables, attributes)
Review comment (Contributor):

This is where the encoding from datetime64 to int64 with days since ... units happens.

If we wanted to make sure that the encoding of the new variables is compatible with the target store, we would have to peek at the target store encodings and explicitly put them in the new variable encoding.
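
A rough sketch of that idea (hypothetical names store_path and new_ds; it uses open_zarr to decode the store's metadata, which is also what is suggested a few comments below):

import xarray as xr

# read the encodings the target store already uses
existing = xr.open_zarr(store_path)

# copy them onto the matching variables of the dataset to append, so
# CF encoding produces values consistent with what is already on disk
for name in set(new_ds.variables) & set(existing.variables):
    var = new_ds.variables[name].copy(deep=False)  # don't mutate the original
    var.encoding = existing.variables[name].encoding
    new_ds[name] = var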

@jendrikjoe (author) replied:

Will try doing that :) Will probably take a while, but I might be able to do that on Monday or Tuesday 👍

Review comment (Contributor):

This is not an easy problem. Advice from @shoyer and @jhamman would be valuable.

Review comment (Member):

I would even consider opening up the zarr store (into an xarray.Dataset) before doing any appending. Then it’s easy to decode all the metadata and ensure consistency of the appended data.

@jendrikjoe (author) replied:

I would try to avoid opening the whole zarr store for performance reasons and instead just try pulling the encodings from the array attributes. I think the only way to really solve this is to add the possibility for all CF encoders to use a specific encoding if one is passed.
This would allow passing the encoding to https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L204 from https://github.com/pydata/xarray/blob/master/xarray/backends/zarr.py#L209 and getting a correctly encoded array for the append. What do you think?

@davidbrochart (Contributor)

Hi @jendrikjoe, do you plan to work on this PR again in the future? I think it would be a great contribution to xarray.

@davidbrochart (Contributor)

May I try and take this work over?

@rabernat (Contributor)

@davidbrochart I would personally be happy to see anyone work on this. I'm sure @jendrikjoe would not mind if we make it a team effort!

@shikharsg

adding a new variable currently errors if we don't provide the append_dim argument:

>>> import xarray as xr
>>> import pandas as pd
>>> ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> ds1 = xr.Dataset({'pressure': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> store = dict()
>>> ds0.to_zarr(store, mode='w')
<xarray.backends.zarr.ZarrStore object at 0x7fae926505c0>
>>> ds1.to_zarr(store, mode='a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/shikhar/code/xarray/xarray/core/dataset.py", line 1374, in to_zarr
    consolidated=consolidated, append_dim=append_dim)
  File "/home/shikhar/code/xarray/xarray/backends/api.py", line 1071, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/shikhar/code/xarray/xarray/backends/api.py", line 928, in dump_to_store
    unlimited_dims=unlimited_dims)
  File "/home/shikhar/code/xarray/xarray/backends/zarr.py", line 366, in store
    unlimited_dims=unlimited_dims)
  File "/home/shikhar/code/xarray/xarray/backends/zarr.py", line 406, in set_variables
    "was not set".format(name)
ValueError: variable 'time' already exists, but append_dim was not set

this works:

>>> import xarray as xr
>>> import pandas as pd
>>> ds0 = xr.Dataset({'temperature': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> ds1 = xr.Dataset({'pressure': (['time'],  [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> store = dict()
>>> ds0.to_zarr(store, mode='w')
<xarray.backends.zarr.ZarrStore object at 0x7fae926505c0>
>>> ds1.to_zarr(store, mode='a', append_dim='asdfasdf')
>>> xr.open_zarr(store)
<xarray.Dataset>
Dimensions:      (time: 3)
Coordinates:
  * time         (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
Data variables:
    pressure     (time) int64 dask.array<shape=(3,), chunksize=(3,)>
    temperature  (time) int64 dask.array<shape=(3,), chunksize=(3,)>

will push a fix for this in a bit

Shikhar Goenka added 2 commits June 27, 2019 14:14
@shikharsg

I have now fixed the above error, and all comments have been addressed.

@rabernat @shoyer

@rabernat (Contributor)

adding a new variable currently errors if we don't provide the append_dim argument:

Is this scenario now covered by the tests? Sorry if the answer is obvious; it's hard for me to discern just by looking at the code.

@shoyer (Member) commented Jun 27, 2019

Just to be clear, we do always require writing append_dim if you want to append values along a dimension, right? And we raise an informative error if you write append_dim='not-a-valid-dimension'?

@shikharsg

adding a new variable currently errors if we don't provide the append_dim argument:

Is this scenario now covered by the tests? Sorry if the answer is obvious; it's hard for me to discern just by looking at the code.

@rabernat, the scenario I am talking about is adding a new DataArray to an existing Dataset (in which case we do not have to specify an append_dim argument). Yes, it is covered by tests; specifically, see the with clause here: https://github.com/pydata/xarray/pull/2706/files#diff-df47fcb9c2f1f7dfc0c6032d97072af2R1636

Just to be clear, we do always requiring writing append_dim if you want to append values along a dimension, right? And we raise an informative error if you write append_dim='not-a-valid-dimension'?

@shoyer We do always require append_dim when appending to an existing array, but I just realized that it does not raise an error when append_dim='not-valid'; it silently fails to append to the existing array. Let me write a test for that and push a fix.
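
Such a test could look roughly like this (a sketch, not necessarily the test that was pushed; the exact error message is up to the implementation):

import pytest
import xarray as xr

ds = xr.Dataset({'temperature': ('time', [50, 51, 52])})
store = dict()
ds.to_zarr(store, mode='w')

# appending along a dimension that matches nothing should raise,
# not silently skip the append
with pytest.raises(ValueError):
    ds.to_zarr(store, mode='a', append_dim='not-a-valid-dimension')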

@shikharsg

It's done. I fixed it by opening the zarr dataset beforehand using xr.open_zarr.

if name in self.ds:
    zarr_array = self.ds[name]
    """
    If variable is a dimension of an existing array
Review comment (Member):

Please use # for comments instead of strings, which should be reserved for docstrings.

@shikharsg replied:

Sorry, yes. That comment is not actually needed (it's from code I already removed), so I'll remove it.

variables_with_encoding = OrderedDict()
for vn in existing_variables:
    variables_with_encoding[vn] = variables[vn]
    variables_with_encoding[vn].encoding = ds[vn].encoding
Review comment (Member):

This modifies an argument that was passed into the function in-place, which in general should be avoided due to unexpected side effects in other parts of the code. It would be better to shallow copy the Variable before overriding encoding, e.g., with variables[vn].copy(deep=False).
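
A minimal sketch of the suggested change to the loop above:

from collections import OrderedDict

variables_with_encoding = OrderedDict()
for vn in existing_variables:
    # shallow copy so the caller's Variable is not mutated
    variables_with_encoding[vn] = variables[vn].copy(deep=False)
    variables_with_encoding[vn].encoding = ds[vn].encoding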


if len(existing_variables) > 0:
    # there are variables to append;
    # their encoding must be the same as in the store
Review comment (Member):

Are there any unit tests that verify that encoding is kept consistent? This would be nice to add, if not.

Probably a good example would be a dataset saved with scale/offset encoding, where the new dataset to be appended does not have any encoding provided. We could verify that properly scaled values are read back from disk.
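
Such a test might look like this (a sketch, not the test that was eventually added; it writes with explicit scale/offset encoding, appends a dataset without any encoding, and checks the round-trip):

import numpy as np
import xarray as xr

ds = xr.Dataset({'temp': ('time', np.array([15.0, 15.5]))})
ds_extra = xr.Dataset({'temp': ('time', np.array([16.0, 16.5]))})

store = dict()
ds.to_zarr(store, mode='w',
           encoding={'temp': {'scale_factor': 0.5, 'dtype': 'int16'}})
# the appended data must be written with the store's encoding,
# not re-encoded from scratch
ds_extra.to_zarr(store, mode='a', append_dim='time')

expected = xr.concat([ds, ds_extra], dim='time')
xr.testing.assert_allclose(expected, xr.open_zarr(store))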

@@ -1040,11 +1078,16 @@ def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
_validate_dataset_names(dataset)
_validate_attrs(dataset)

if mode == "a":
Review comment (Member):

Can we raise an error if encoding was passed explicitly into to_zarr() and specifies encoding for any existing variables? I.e., ds.to_zarr(..., mode='a', encoding={'existing_variable': ...}). I think this would always indicate a programming mistake, given that we throw away these variable encodings anyway.
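
A hypothetical sketch of such a guard inside to_zarr (names follow the surrounding diff; the exact message is illustrative):

if mode == 'a' and encoding:
    existing_vars = set(xr.open_zarr(store).variables)
    conflicts = [name for name in encoding if name in existing_vars]
    if conflicts:
        raise ValueError(
            'variables %r already exist in the target store; their '
            'encoding is read from the store and cannot be overridden'
            % conflicts)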

…able

* add test for encoding consistency when appending
* implemented: pydata#2706 (comment)
* refactored tests
@shikharsg

I have implemented all the suggested changes and refactored the append tests, as previously all of them were crammed into test_write_persistence_modes.

I'm not sure why the build fails. In all the failed checks, it is these two tests that are failing:

================================== FAILURES ===================================
_________________________ test_rolling_properties[1] __________________________

da = <xarray.DataArray (a: 3, time: 21, x: 4)>
array([[[0.561926, 0.243845, 0.601879, 0.733398],
        [0.500418, 0.84942...ordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 ... 2000-01-21
Dimensions without coordinates: a, x

    def test_rolling_properties(da):
        rolling_obj = da.rolling(time=4)
    
        assert rolling_obj.obj.get_axis_num('time') == 1
    
        # catching invalid args
        with pytest.raises(ValueError) as exception:
            da.rolling(time=7, x=2)
>       assert 'exactly one dim/window should' in str(exception)
E       AssertionError: assert 'exactly one dim/window should' in '<ExceptionInfo ValueError tblen=4>'
E        +  where '<ExceptionInfo ValueError tblen=4>' = str(<ExceptionInfo ValueError tblen=4>)

xarray\tests\test_dataarray.py:3715: AssertionError
_________________________ test_rolling_properties[1] __________________________

ds = <xarray.Dataset>
Dimensions:  (time: 10, x: 8, y: 2)
Coordinates:
  * x        (x) float64 0.0 0.1429 0.2857 0.4286 0....-1.152 -0.6704 ... -0.9796 -1.884 0.4049
    z2       (time, y) float64 -1.218 -0.9627 -1.398 ... -0.3552 0.1446 0.3392

    def test_rolling_properties(ds):
        # catching invalid args
        with pytest.raises(ValueError) as exception:
            ds.rolling(time=7, x=2)
>       assert 'exactly one dim/window should' in str(exception)
E       AssertionError: assert 'exactly one dim/window should' in '<ExceptionInfo ValueError tblen=4>'
E        +  where '<ExceptionInfo ValueError tblen=4>' = str(<ExceptionInfo ValueError tblen=4>)

xarray\tests\test_dataset.py:4845: AssertionError
============================== warnings summary ===============================

I have no idea why, as the same two tests pass on my local machine.

ds.to_zarr(store_target, mode='w')
ds_to_append.to_zarr(store_target, mode='a', append_dim='time')
original = xr.concat([ds, ds_to_append], dim='time')
assert_identical(original, xr.open_zarr(store_target))

@pytest.mark.xfail(reason="Zarr stores can not be appended to")
def test_append_overwrite_values(self):
@shikharsg commented Jun 29, 2019:

What is this test exactly supposed to do? Is it something like what's happening here: https://github.com/pydata/xarray/blob/master/xarray/tests/test_backends.py#L881?
If so, this kind of functionality is not implemented in this PR yet. You can only "append" to existing variables, not overwrite existing data; that would have to be done with the 'w' mode.

@shoyer @rabernat

Review comment (Member):

Yes, I think you have analyzed this correctly. This is fine.

@shoyer mentioned this pull request on Jun 29, 2019
@shoyer (Member) commented Jun 29, 2019

@shikharsg the test failure should be fixed on master (by #3059).

@shoyer dismissed rabernat’s stale review June 29, 2019 23:42

approved in a comment already

@shoyer merged commit 18f35da into pydata:master on Jun 29, 2019
@shoyer (Member) commented Jun 29, 2019

OK, thank you @shikharsg, @jendrikjoe and everyone else who worked on this!
