
Zarr encoding attributes persist after slicing data, raising error on to_zarr #5219

Open
bolliger32 opened this issue Apr 27, 2021 · 9 comments
Labels
topic-backends topic-zarr Related to zarr storage library

Comments

@bolliger32
Contributor

What happened:
Opened a dataset using open_zarr, sliced the dataset, and then tried to resave to a zarr store using to_zarr.

What you expected to happen:
The dataset would save without needing to explicitly modify any encoding dictionary values.

Minimal Complete Verifiable Example:

ds = xr.Dataset({"data": (("dimA", ), [10, 20, 30, 40])}, coords={"dimA": [1, 2, 3, 4]})
ds = ds.chunk({"dimA": 2})
ds.to_zarr("test.zarr", consolidated=True, mode="w")

ds2 = xr.open_zarr("test.zarr", consolidated=True).sel(dimA=[1,3]).persist()
ds2.to_zarr("test2.zarr", consolidated=True, mode="w")

This raises:

NotImplementedError: Specified zarr chunks encoding['chunks']=(2,) for variable named 'data' would overlap multiple dask chunks ((1, 1),). This is not implemented in xarray yet. Consider either rechunking using `chunk()` or instead deleting or modifying `encoding['chunks']`.

Anything else we need to know?:

Not sure if there is a good way around this (or perhaps this is even desired behavior?), but figured I would flag it as it seemed unexpected and took us a second to diagnose. Once you've loaded the data from a zarr store, I feel like the default behavior should probably be to forget the encodings used to save that zarr, treating the in-memory dataset object just like any other in-memory dataset object that could have been loaded from any source. But maybe I'm in the minority or missing some nuance about why you'd want the encoding to hang around.
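
For illustration, a quick check (a sketch, assuming the example above has just been run) of the stale chunk encoding that the sliced dataset still carries:

# dask chunks after the selection
print(ds2["data"].chunks)              # ((1, 1),)
# ...but the encoding still records the chunks of the source store
print(ds2["data"].encoding["chunks"])  # (2,)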

Environment:

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.89+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.17.0
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.11.0
h5py: 3.2.1
Nio: None
zarr: 2.7.1
cftime: 1.2.1
nc_time_axis: 1.2.0
PseudoNetCDF: None
rasterio: 1.2.2
cfgrib: 0.9.9.0
iris: 3.0.1
bottleneck: 1.3.2
dask: 2021.04.1
distributed: 2021.04.1
matplotlib: 3.4.1
cartopy: 0.19.0
seaborn: 0.11.1
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.0.1
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: 3.5.4
@mathause
Collaborator

Thanks for the clear error report. On master you should be able to do ds2.to_zarr("test2.zarr", consolidated=True, mode="w", safe_chunks=False) - see #5065
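
Spelled out against the example in the issue, the suggested call would look like this (a sketch; per #5065, safe_chunks=False skips the check that dask chunks align with the zarr chunk encoding, so it should only be used when no two writers can touch the same zarr chunk):

# bypass the chunk-alignment check from the error above; not safe for parallel region writes
ds2.to_zarr("test2.zarr", consolidated=True, mode="w", safe_chunks=False)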

@bolliger32
Contributor Author

bolliger32 commented Apr 27, 2021

Thanks for the pointer @mathause, that is super helpful. And thanks for #5065 @rabernat. If I'm understanding the PR correctly (looks like it evolved a lot!), in most cases matching the example above we probably would NOT want to use safe_chunks=False, correct? B/c if we're writing in parallel, this could lead to data corruption. Instead, we'd want to manually delete the chunks item from each variable's encoding attribute after loading/persisting the data into memory. That way, to_zarr would use the dask chunks as the zarr chunks, rather than relying on whatever chunks were used in the "original" zarr store (the source of the in-memory Dataset).

Does that sound right? If I'm reading through the PR comments correctly, this was one of the controversial parts that didn't end up in the merged PR.

@rabernat
Contributor

we probably would NOT want to use safe_chunks=False, correct?

correct

The problem in this issue is that the dataset is carrying around its original chunks in .encoding and then xarray tries to use these values to set the chunk encoding on the second write op. The solution is to manually delete the chunk encoding from all your data variables. Something like

for var in ds:
    del ds[var].encoding['chunks']

Originally part of #5056 was a change that would have xarray automatically do this deletion after some operations (such as calling .chunk()); however, we could not reach a consensus on the best way to implement that change. Your example is interesting because it is a slightly different scenario -- calling sel() instead of chunk() -- but the root cause appears to be the same: encoding['chunks'] is being kept around too conservatively.
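
A slightly more defensive variant of that loop (a sketch, not from the comment above) that also covers coordinate variables and tolerates variables without a chunks entry:

for var in ds.variables.values():
    # pop() is a no-op for variables that never had chunk encoding
    var.encoding.pop("chunks", None)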

@bolliger32
Contributor Author

Yup, this all makes sense, thanks for the explanation @rabernat. It does seem like it would be good to drop encoding["chunks"] at some point, but I can see how the timing is tricky. I'm assuming it's necessary metadata to keep around when the zarr has been "opened" but the data has not yet been read, b/c it is used by xarray to read the zarr?

Anyways, we'll continue with the manual deletion for now but I'm inclined to keep this issue open as I do think it would be helpful to eventually figure out how to automatically do this.

@shoyer
Member

shoyer commented May 3, 2021

Somewhat inevitably, I finally hit this issue this week, too :)

I am increasingly coming to the conclusion that there are no "safe" manipulations in xarray that should preserve encoding. Perhaps we should just drop encoding after every operation in Xarray.

@shoyer
Member

shoyer commented May 11, 2021

It occurs to me that another possible fix would be to ignore chunks from pre-existing encoding attributes in to_zarr(). Instead we could require explicitly supplying chunks via the encoding parameter in the to_zarr() call.

Probably better to remove encoding more aggressively overall, but this might be a good intermediate step...
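
For comparison, explicitly passing the target chunks through to_zarr's existing encoding argument already sidesteps the stale values, since per-variable encoding supplied to to_zarr replaces what is stored on the variable (a sketch against the example from this issue):

# the explicit encoding overrides ds2's inherited encoding["chunks"] = (2,)
ds2.to_zarr(
    "test2.zarr",
    consolidated=True,
    mode="w",
    encoding={"data": {"chunks": (1,)}},
)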

@rabernat
Contributor

Instead we could require explicitly supplying chunks via the encoding parameter in the to_zarr() call.

This could also break existing workflows though. For example, pangeo-forge is using the encoding.chunks attribute to specify target dataset chunks.

@vedal

vedal commented Nov 13, 2022

Is this still an issue after merging PR #5065?

@rizziemma

@vedal Hi, I'm still facing this issue using xarray 2022.11.0 and Python 3.10. Is this PR included in the latest xarray version?

@dcherian added the topic-zarr label on Jun 30, 2024