
Zarr encoding attributes persist after slicing data, raising error on to_zarr #5219

Open
bolliger32 opened this issue Apr 27, 2021 · 9 comments
Labels
topic-backends topic-zarr Related to zarr storage library

Comments

@bolliger32
Contributor

What happened:
Opened a dataset using open_zarr, sliced the dataset, and then tried to resave to a zarr store using to_zarr.

What you expected to happen:
The dataset would save without needing to explicitly modify any encoding dictionary values.

Minimal Complete Verifiable Example:

ds = xr.Dataset({"data": (("dimA", ), [10, 20, 30, 40])}, coords={"dimA": [1, 2, 3, 4]})
ds = ds.chunk({"dimA": 2})
ds.to_zarr("test.zarr", consolidated=True, mode="w")

ds2 = xr.open_zarr("test.zarr", consolidated=True).sel(dimA=[1,3]).persist()
ds2.to_zarr("test2.zarr", consolidated=True, mode="w")

This raises:

NotImplementedError: Specified zarr chunks encoding['chunks']=(2,) for variable named 'data' would overlap multiple dask chunks ((1, 1),). This is not implemented in xarray yet. Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.

Anything else we need to know?:

Not sure if there is a good way around this (or perhaps this is even desired behavior?), but figured I would flag it as it seemed unexpected and took us a second to diagnose. Once you've loaded the data from a zarr store, I feel like the default behavior should probably be to forget the encodings used to save that zarr, treating the in-memory dataset object just like any other in-memory dataset object that could have been loaded from any source. But maybe I'm in the minority or missing some nuance about why you'd want the encoding to hang around.
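
For illustration, a quick check (a sketch, assuming the example above has just been run) of the stale chunk encoding that the sliced dataset still carries:

# dask chunks after the selection
print(ds2["data"].chunks)              # ((1, 1),)
# ...but the encoding still records the chunks of the source store
print(ds2["data"].encoding["chunks"])  # (2,)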

Environment:

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.89+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.17.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.7.1
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.2
cfgrib: 0.9.9.0
iris: 3.0.1
bottleneck: 1.3.2
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: 3.4.1
cartopy: 0.19.0
seaborn: 0.11.1
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: 3.5.4
@mathause
Collaborator

Thanks for the clear error report. On master you should be able to do ds2.to_zarr("test2.zarr", consolidated=True, mode="w", safe_chunks=False) - see #5065
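
Spelled out against the example in the issue, the suggested call would look like this (a sketch; per #5065, safe_chunks=False skips the check that dask chunks align with the zarr chunk encoding, so it should only be used when no two writers can touch the same zarr chunk):

# bypass the chunk-alignment check from the error above; not safe for parallel region writes
ds2.to_zarr("test2.zarr", consolidated=True, mode="w", safe_chunks=False)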

@bolliger32
Contributor Author

bolliger32 commented Apr 27, 2021

Thanks for the pointer @mathause, that is super helpful. And thanks for #5065 @rabernat. If I'm understanding the PR correctly (looks like it evolved a lot!), in most cases matching the example above we probably would NOT want to use safe_chunks=False, correct? B/c if we're writing in parallel, this could lead to data corruption. Instead, we'd want to manually delete the chunks item from each variable's encoding attribute after loading/persisting the data into memory. That way, to_zarr would use the dask chunks as the zarr chunks, rather than relying on whatever chunks were used in the "original" zarr store (the source of the in-memory Dataset).

Does that sound right? If I'm reading through the PR comments correctly, this was one of the controversial parts that didn't end up in the merged PR.

@rabernat
Contributor

we probably would NOT want to use safe_chunks=False, correct?

correct

The problem in this issue is that the dataset is carrying around its original chunks in .encoding and then xarray tries to use these values to set the chunk encoding on the second write op. The solution is to manually delete the chunk encoding from all your data variables. Something like

for var in ds:
    del ds[var].encoding['chunks']

Originally part of #5056 was a change that would have xarray automatically do this deletion after some operations (such as calling .chunk()); however, we could not reach a consensus on the best way to implement that change. Your example is interesting because it is a slightly different scenario -- calling sel() instead of chunk() -- but the root cause appears to be the same: encoding['chunks'] is being kept around too conservatively.
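
A slightly more defensive variant of that loop (a sketch, not from the comment above) that also covers coordinate variables and tolerates variables without a chunks entry:

for var in ds.variables.values():
    # pop() is a no-op for variables that never had chunk encoding
    var.encoding.pop("chunks", None)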

@bolliger32
Contributor Author

Yup, this all makes sense, thanks for the explanation @rabernat. It does seem like it would be good to drop encoding["chunks"] at some point, but I can see how the timing is tricky. I'm assuming it's necessary metadata to keep around when the zarr has been "opened" but the data has not yet been read, b/c it is used by xarray to read the zarr?

Anyways, we'll continue with the manual deletion for now but I'm inclined to keep this issue open as I do think it would be helpful to eventually figure out how to automatically do this.

@shoyer
Member

shoyer commented May 3, 2021

Somewhat inevitably, I finally hit this issue this week, too :)

I am increasingly coming to the conclusion that there are no "safe" manipulations in xarray that should preserve encoding. Perhaps we should just drop encoding after every operation in Xarray.

@shoyer
Member

shoyer commented May 11, 2021

It occurs to me that another possible fix would be to ignore chunks from pre-existing encoding attributes in to_zarr(). Instead we could require explicitly supplying chunks via the encoding parameter in the to_zarr() call.

Probably better to remove encoding more aggressively overall, but this might be a good intermediate step...
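
For comparison, explicitly passing the target chunks through to_zarr's existing encoding argument already sidesteps the stale values, since per-variable encoding supplied to to_zarr replaces what is stored on the variable (a sketch against the example from this issue):

# the explicit encoding overrides ds2's inherited encoding["chunks"] = (2,)
ds2.to_zarr(
    "test2.zarr",
    consolidated=True,
    mode="w",
    encoding={"data": {"chunks": (1,)}},
)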

@rabernat
Contributor

Instead we could require explicitly supplying chunks via the encoding parameter in the to_zarr() call.

This could also break existing workflows though. For example, pangeo-forge is using the encoding.chunks attribute to specify target dataset chunks.

@vedal

vedal commented Nov 13, 2022

Is this still an issue after merging PR #5065?

@rizziemma

@vedal Hi, I'm still facing this issue using xarray 2022.11.0 and Python 3.10. Is this PR included in the latest xarray version?

@dcherian added the topic-zarr label on Jun 30, 2024