Appending to zarr store #2706
Conversation
Hello @jendrikjoe! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-06-29 22:49:41 UTC
Hi @jendrikjoe -- thanks for submitting a PR to address one of the most important issues in xarray (IMHO)! I am very excited about your contribution and am looking forward to getting this feature merged. I have many questions about how this works. I think the best way to move forward is to wait until we have a test for the append feature which involves the following steps:
Seeing the code that accomplishes this will help clarify for me what is happening. Thanks again for your contribution, and welcome to xarray!
Hi @rabernat, happy to help! I love using xarray. I added the test for the append mode.
Ok, with the example, I can see a bit better how this works. Here is my main concern: there doesn't appear to be any alignment checking between the target dataset and the new data. The only check that happens is whether a variable with the same name already exists in the target store; if so, append is used (rather than creating a new array). What if the coordinates differ? What if the attributes differ? I'm not sure this is a deal-breaker. But we should be very clear about this in the docs.
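To make the concern concrete, here is a small hypothetical sketch (store path, values, and attributes are made up, and the silent-success behaviour is only what this comment describes, not something verified here):

import xarray as xr

# Same variable name, but conflicting attributes.
ds0 = xr.Dataset({'temperature': ('time', [50, 51], {'units': 'degC'})},
                 coords={'time': [0, 1]})
ds1 = xr.Dataset({'temperature': ('time', [60, 61], {'units': 'K'})},
                 coords={'time': [1, 2]})

ds0.to_zarr('demo.zarr', mode='w')
# With no alignment or attribute checks, this append would presumably go through
# silently, leaving a store whose metadata only reflects the first write.
ds1.to_zarr('demo.zarr', mode='a', append_dim='time')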
xarray/backends/zarr.py
Outdated
    if append:
        if self.append_dim is None:
            raise ValueError('The dimension on which the data is '
                             'appended has to be named.')
What if we just want to add a new variable to an existing zarr store? This PR could hypothetically support that case as well, but in that case, there is no append_dim to specify.
As mentioned in the other comment, it should already work, but I will add another test for it 👍
You are definitely right that there are no checks regarding the alignment.
Hi @jendrikjoe, Thanks for your PR, I am very interested in it because this is something I was hacking around (see here). In my particular case, I want to append along a time dimension, but it looks like your PR currently doesn't support it. In the following example:

import xarray as xr
import pandas as pd

ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
ds1 = xr.Dataset({'temperature': (['time'], [53, 54, 55])}, coords={'time': pd.date_range('2000-01-04', periods=3)})
ds0.to_zarr('temp')
ds1.to_zarr('temp', mode='a', append_dim='time')
ds2 = xr.open_zarr('temp')

I would expect ds2 to contain both datasets appended along the time dimension, but it's not the case.
Maybe it's not intended to work with time dimensions yet?
To make it work, time dimensions would have to be treated separately because zarr doesn't encode absolute time values but deltas relative to a reference (see https://github.com/davidbrochart/pangeo_upload/blob/master/py/trmm2pangeo.py#L108).
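As a rough illustration of that point (this uses an internal xarray helper purely for demonstration, and the exact inferred units shown in the comments are an assumption):

import numpy as np
from xarray.coding.times import encode_cf_datetime  # internal helper, used only to illustrate

# The timestamps become plain numbers plus a reference stored in the units string.
times = np.array(['2000-01-04', '2000-01-05', '2000-01-06'], dtype='datetime64[ns]')
num, units, calendar = encode_cf_datetime(times)
print(num)    # e.g. [0 1 2]
print(units)  # e.g. 'days since 2000-01-04'
# A dataset starting at a different date would get a different reference,
# which is exactly the mismatch discussed in the following comments.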
Hey @davidbrochart, zarr stores the reference in the array attributes.
I will check as well how xarray stores times, to see whether we have to add the offset to the xarray first or whether this can be resolved with a PR to zarr :)
So the problem in @davidbrochart's example is that there are different encodings on the time variables in the two datasets. When writing datetimes, xarray automatically picks an encoding (e.g. the units and reference date). In this case, xarray's heuristics are picking different encodings for the two dates. You could make this example work by manually specifying encoding on the appended dataset to be the same as the original. This example illustrates the need for some sort of compatibility checks between the target dataset and the appended dataset. For example, checking for attribute compatibility would have caught this error.
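A sketch of that workaround, reusing the 'temp' store and ds1 from the example above (the fallback calendar value is an assumption):

# Copy the time encoding already in the store onto the dataset being appended,
# so both writes agree on the units and reference date.
existing = xr.open_zarr('temp')
ds1['time'].encoding = {
    'units': existing['time'].encoding['units'],
    'calendar': existing['time'].encoding.get('calendar', 'proleptic_gregorian'),
}
ds1.to_zarr('temp', mode='a', append_dim='time')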
We should definitely always make sure that we write data consistently (e.g., for dates), but checking for alignment of all coordinates could be expensive/slow. Potentially a keyword argument could control whether those more expensive checks are performed.
When we use this feature e.g. to store data that is produced every day, we might start with a data set that has a small size on the time dimension, and thus the chunks will be chosen according to this initial shape. When we append to this data set, will the chunks be kept as in the initial zarr archive? If so, we might end up with a lot of small chunks on the time dimension, where ideally we would have chosen only one chunk.
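One possible way to sidestep that, sketched below under the assumption that the zarr backend honours a 'chunks' entry in encoding (the chunk size itself is arbitrary, and ds0 is reused from the earlier example), is to fix the target chunking when the store is first created:

# Choose the on-disk chunking up front so later appends extend existing chunks
# instead of inheriting tiny chunks from the small initial write.
ds0.to_zarr('daily.zarr', mode='w',
            encoding={'temperature': {'chunks': (1000,)}})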
This implies we should be checking for attribute compatibility before appending.
    # the magic for storing the hidden dimension data
    encoded_attrs[_DIMENSION_KEY] = dims
    for k2, v2 in attrs.items():
        encoded_attrs[k2] = self.encode_attribute(v2)
What if we pulled this attribute encoding out before the try block? Then we check encoded_attrs against zarr_array.attrs before appending.
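A minimal sketch of that idea (the error message and the exact comparison are assumptions; the names self, attrs and zarr_array follow the snippet above):

# Encode attributes up front, then refuse to append if they disagree with what
# is already stored on the target zarr array.
encoded_attrs = {k: self.encode_attribute(v) for k, v in attrs.items()}
if dict(zarr_array.attrs) != encoded_attrs:
    raise ValueError('cannot append: attributes of the existing zarr array '
                     'do not match the data being appended')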
xarray/backends/zarr.py
Outdated
        only needed in append mode
        """

        variables, attributes = self.encode(variables, attributes)
This is where the encoding from datetime64 to int64 with 'days since ...' units happens.
If we wanted to make sure that the encoding of the new variables is compatible with the target store, we would have to peek at the target store encodings and explicitly put them in the new variable encoding.
Will try doing that :) Will probably take a while, but I might be able to do that on Monday or Tuesday 👍
I would even consider opening up the zarr store (into an xarray.Dataset) before doing any appending. Then it’s easy to decode all the metadata and ensure consistency of the appended data.
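A rough sketch of that suggestion (the store path, dimension name and dataset name are assumptions):

# Open the target store lazily as a Dataset so its metadata is fully decoded,
# then reuse the existing encodings for any variables that will be appended.
existing = xr.open_zarr('target.zarr')
for name in set(ds_to_append.variables) & set(existing.variables):
    ds_to_append[name].encoding = existing[name].encoding
ds_to_append.to_zarr('target.zarr', mode='a', append_dim='time')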
I would try to avoid opening the whole zarr store for performance reasons and instead just try pulling the encodings from the array attributes. I think the only way to really solve this is to add the possibility for all CF encoders to use a specific encoding if one is passed.
This would allow passing the encoding from https://github.com/pydata/xarray/blob/master/xarray/backends/zarr.py#L209 to https://github.com/pydata/xarray/blob/master/xarray/conventions.py#L204 and getting the correctly encoded array for the append. What do you think?
Hi @jendrikjoe, do you plan to work on this PR again in the future? I think it would be a great contribution to xarray.
May I try and take this work over?
@davidbrochart I would personally be happy to see anyone work on this. I'm sure @jendrikjoe would not mind if we make it a team effort!
adding a new variable currently errors if we don't provide the append_dim argument:

>>> import xarray as xr
>>> import pandas as pd
>>> ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> ds1 = xr.Dataset({'pressure': (['time'], [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> store = dict()
>>> ds0.to_zarr(store, mode='w')
<xarray.backends.zarr.ZarrStore object at 0x7fae926505c0>
>>> ds1.to_zarr(store, mode='a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/shikhar/code/xarray/xarray/core/dataset.py", line 1374, in to_zarr
    consolidated=consolidated, append_dim=append_dim)
  File "/home/shikhar/code/xarray/xarray/backends/api.py", line 1071, in to_zarr
    dump_to_store(dataset, zstore, writer, encoding=encoding)
  File "/home/shikhar/code/xarray/xarray/backends/api.py", line 928, in dump_to_store
    unlimited_dims=unlimited_dims)
  File "/home/shikhar/code/xarray/xarray/backends/zarr.py", line 366, in store
    unlimited_dims=unlimited_dims)
  File "/home/shikhar/code/xarray/xarray/backends/zarr.py", line 406, in set_variables
    "was not set".format(name)
ValueError: variable 'time' already exists, but append_dim was not set

this works:

>>> import xarray as xr
>>> import pandas as pd
>>> ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> ds1 = xr.Dataset({'pressure': (['time'], [50, 51, 52])}, coords={'time': pd.date_range('2000-01-01', periods=3)})
>>> store = dict()
>>> ds0.to_zarr(store, mode='w')
<xarray.backends.zarr.ZarrStore object at 0x7fae926505c0>
>>> ds1.to_zarr(store, mode='a', append_dim='asdfasdf')
>>> xr.open_zarr(store)
<xarray.Dataset>
Dimensions:      (time: 3)
Coordinates:
  * time         (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
Data variables:
    pressure     (time) int64 dask.array<shape=(3,), chunksize=(3,)>
    temperature  (time) int64 dask.array<shape=(3,), chunksize=(3,)>

will push a fix for this in a bit
…th Dataset.to_zarr(store, mode='a') * cleand up to_zarr append mode tests
Is this scenario now covered by the tests? Sorry if the answer is obvious; it's hard for me to discern just by looking at the code.
Just to be clear, do we always require writing append_dim when appending?
@rabernat, the scenario I am talking about is adding a new variable to an existing store.
@shoyer We do always require append_dim.
it's done. I fixed it by opening the zarr dataset beforehand, as suggested above.
xarray/backends/zarr.py
Outdated
    if name in self.ds:
        zarr_array = self.ds[name]
        """
        If variable is a dimension of an existing array
Please use # for comments instead of strings, which should be reserved for docstrings.
sorry, yes. That comment is not actually needed (it's from code I already removed). So I'll remove that.
    variables_with_encoding = OrderedDict()
    for vn in existing_variables:
        variables_with_encoding[vn] = variables[vn]
        variables_with_encoding[vn].encoding = ds[vn].encoding
This modifies an argument that was passed into the function in-place, which in general should be avoided due to unexpected side effects in other parts of the code. It would be better to shallow copy the Variable before overriding encoding, e.g., with variables[vn].copy(deep=False).
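A minimal sketch of that suggestion, keeping the names from the snippet above (the surrounding function is assumed, not shown):

from collections import OrderedDict

variables_with_encoding = OrderedDict()
for vn in existing_variables:
    # Shallow-copy the Variable so the caller's object is not mutated in place,
    # then override its encoding with the encoding found in the target store.
    variables_with_encoding[vn] = variables[vn].copy(deep=False)
    variables_with_encoding[vn].encoding = ds[vn].encoding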
    if len(existing_variables) > 0:
        # there are variables to append
        # their encoding must be the same as in the store
Are there any unit tests that verify that encoding is kept consistent? This would be nice to add, if not.
Probably a good example would be a dataset saved with scale/offset encoding, where the new dataset to be appended does not have any encoding provided. We could verify that properly scaled values are read back from disk.
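A hypothetical sketch of such a test (the in-memory dict store and the specific encoding values are assumptions, not the PR's actual test):

import numpy as np
import xarray as xr
from xarray.testing import assert_allclose


def test_append_preserves_scale_offset_encoding():
    # Write an initial dataset with scale/offset encoding, then append a dataset
    # that carries no encoding at all; properly scaled values should round-trip.
    ds = xr.Dataset({'temperature': ('time', np.arange(3.0))})
    ds_to_append = xr.Dataset({'temperature': ('time', np.arange(3.0, 6.0))})

    store = {}
    ds.to_zarr(store, mode='w',
               encoding={'temperature': {'scale_factor': 0.1, 'add_offset': 10.0,
                                         'dtype': 'int16', '_FillValue': -9999}})
    ds_to_append.to_zarr(store, mode='a', append_dim='time')

    expected = xr.concat([ds, ds_to_append], dim='time')
    assert_allclose(expected, xr.open_zarr(store))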
@@ -1040,11 +1078,16 @@ def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
    _validate_dataset_names(dataset)
    _validate_attrs(dataset)

    if mode == "a":
Can we raise an error if encoding was passed explicitly into to_zarr() and specifies encoding for any existing variables? I.e., ds.to_zarr(..., mode='a', encoding={'existing_variable': ....}). I think this would always indicate a programming mistake, given that we throw away these variable encodings anyways.
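A minimal sketch of the requested check (the zstore name and the error text are assumptions; this is not the PR's actual code):

if mode == 'a':
    # Encoding supplied for variables that already exist in the store would be
    # silently discarded, so treat it as a programming mistake and fail loudly.
    existing = set(zstore.get_variables()) & set(encoding)
    if existing:
        raise ValueError(
            'cannot change the encoding of variables that already exist in the '
            'target store when appending: %r' % sorted(existing))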
…able * add test for encoding consistency when appending * implemented: pydata#2706 (comment) * refactored tests
I have implemented all the changes suggested and refactored the append tests, as all tests were previously crammed into a single test function.
I'm not sure why the build fails. In all the failed checks, it is these two tests that are failing:
I have no idea why, as the same two tests pass on my local machine.
        ds.to_zarr(store_target, mode='w')
        ds_to_append.to_zarr(store_target, mode='a', append_dim='time')
        original = xr.concat([ds, ds_to_append], dim='time')
        assert_identical(original, xr.open_zarr(store_target))

    @pytest.mark.xfail(reason="Zarr stores can not be appended to")
    def test_append_overwrite_values(self):
what is this test exactly supposed to do? Is it something like what's happening here: https://github.com/pydata/xarray/blob/master/xarray/tests/test_backends.py#L881?
If so, this kind of functionality is not implemented in this PR yet. You can only "append" to existing variables, not overwrite existing data. That would have to be done with the 'w' mode.
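For clarity, a hedged illustration of that distinction (store path and dataset names are made up):

# mode='a' only grows existing arrays along append_dim; it never rewrites values in place.
ds_more.to_zarr('store.zarr', mode='a', append_dim='time')
# Overwriting existing data means recreating the store from scratch with mode='w'.
ds_corrected.to_zarr('store.zarr', mode='w')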
Yes, I think you have analyzed this correctly. This is fine.
@shikharsg the test failure should be fixed on master (by #3059).
OK, thank you @shikharsg, @jendrikjoe and everyone else who worked on this!
This pull request allows appending an xarray Dataset to an existing zarr store.
whats-new.rst for all changes and api.rst for new API

To filter the data written to the array, the dimension over which the data will be appended has to be explicitly stated. If someone has an idea how to overcome this, I would be more than happy to incorporate the necessary changes into the PR.
Cheers,
Jendrik