Writing to zarr fails with message "specified zarr chunks would overlap multiple dask chunks" #347
Hi Tonio, I am facing a similar issue with the same error message when executing the example Jupyter notebook: xarray does not adjust the encoding of variables when rechunking a dataset. This problem is already known. Maybe we could implement in xcube that the encoding is adjusted as well whenever a dataset is rechunked?
We already did: https://github.com/dcs4cop/xcube/blob/bc4cdb4aa5e88557d920d71b7dff4100015c3512/xcube/core/chunk.py#L10 (make sure you set the format when you use this, otherwise the encodings won't be updated). However, I still couldn't run your notebook cell successfully, but that seems to be due to another problem.
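For reference, a minimal usage sketch of that xcube helper; the parameter names (`chunk_sizes`, `format_name`), the chunk sizes, and the file names are assumptions for illustration, not taken from the thread.

```python
import xarray as xr
from xcube.core.chunk import chunk_dataset  # helper linked above

# Hypothetical input and chunk sizes; format_name="zarr" is assumed to be
# the switch that makes chunk_dataset update the per-variable "chunks"
# encodings so that a later to_zarr() call succeeds.
dataset = xr.open_dataset("input.nc")
rechunked = chunk_dataset(
    dataset,
    chunk_sizes={"time": 1, "lat": 512, "lon": 512},
    format_name="zarr",
)
rechunked.to_zarr("output.zarr")
```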
Thanks for your suggestion, I changed the notebook to use xcube's chunk_dataset instead of xarray's way of chunking a dataset. It now works like a dream :)
This can and should be fixed in xarray. But it's also very easy to just delete the encoding, e.g. `del new_dataset.c2rcc_flags.encoding['chunks']`, as a simple workaround.
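For completeness, a minimal sketch of that workaround applied to a whole dataset before writing; the file names and chunk sizes are hypothetical:

```python
import xarray as xr

# Hypothetical: a Zarr-backed dataset carries a "chunks" encoding from the
# source store; rechunking with dask does not update that encoding.
new_dataset = xr.open_zarr("input.zarr").chunk({"lat": 200})

for var in new_dataset.variables.values():
    # Drop the stale zarr chunk hint so to_zarr() derives chunks from
    # the current dask chunking instead of the old encoding.
    var.encoding.pop("chunks", None)

new_dataset.to_zarr("output.zarr")
```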
@rabernat thanks! The xcube function
We should make sure that any xr.Dataset returned from higher-level xcube functions always has a chunking compatible with Zarr, as this is our standard I/O format. Users should not be forced to rechunk just for the purpose of writing to Zarr; this is counter-intuitive. I suggest we provide a utility function that ensures "valid" chunking (including possible deletion of the "chunks" encoding property, as suggested by @rabernat). I guess there are good reasons why the encoding is not adjusted in xarray.
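As a rough illustration of what such a utility could look like, here is a sketch of a hypothetical helper; its name, signature, and the (conservative) compatibility rule are mine, not xcube's:

```python
import xarray as xr

def ensure_zarr_compatible_chunking(dataset: xr.Dataset) -> xr.Dataset:
    """Hypothetical helper: drop a stale "chunks" encoding wherever it no
    longer matches a variable's actual dask chunking, so that
    dataset.to_zarr() does not refuse to write."""
    for name, var in dataset.variables.items():
        enc_chunks = var.encoding.get("chunks")
        if enc_chunks is None or var.chunks is None:
            continue
        # var.chunks is a tuple of per-dimension chunk-size tuples.
        # Conservative rule: keep the encoding only if every dask chunk
        # equals the encoded zarr chunk size (the last one may be smaller).
        compatible = all(
            all(c == zc for c in dim_chunks[:-1]) and dim_chunks[-1] <= zc
            for zc, dim_chunks in zip(enc_chunks, var.chunks)
        )
        if not compatible:
            del var.encoding["chunks"]
    return dataset
```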
🙌 Could I convince you to submit this as a PR to xarray itself? 😁
I'm convinced. Hope to find a little time next week. Maybe there is a related issue already?
pydata/xarray#2300 is the main one.
Workaround for some cases that end up in #347
FYI I have started a PR to fix this upstream in Xarray. Your review there would be helpful. |
Sure @rabernat. Thanks!
Some data needs to be rechunked in order to work with it efficiently. This rechunking may result in the final chunk of a dimension being smaller than the previous ones. For example, a dimension of size 500 that was originally split into ten chunks of size 50 can be rechunked into chunks of 200, 200, and 100. This is perfectly valid and supported by xarray.
However, when the affected dimension/variable is latitude and the latitude is ascending, the current implementation of the normalization in xcube reverses the order of the chunks, so the smaller chunk ends up at the start. This is not supported by xarray and may result in an error like the one in the description, for example when writing to Zarr.
For the time being, I will work around this by ensuring that all chunks have the same size after rechunking, which will make this issue a bit harder to reproduce. I am opening this issue to document that the current way of dealing with ascending latitudes is not optimal. The best solution would probably be to remove the part of the normalization where latitudes are reverted and to support ascending latitudes in the data.
See also #251 and #327.
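A minimal sketch that reproduces the symptom under these conditions; the data, chunk sizes, and store name are hypothetical:

```python
import numpy as np
import xarray as xr

# 500 ascending latitudes, dask-chunked as (200, 200, 100).
ds = xr.Dataset(
    {"var": ("lat", np.random.rand(500))},
    coords={"lat": np.linspace(-50.0, 50.0, 500)},
).chunk({"lat": 200})

# Reverting the latitude order (as the normalization does) also reverses
# the dask chunk order to (100, 200, 200): the small chunk is now first.
ds_rev = ds.isel(lat=slice(None, None, -1))

# With a zarr chunk size of 200 (e.g. left over in .encoding, here forced
# explicitly), xarray refuses to write because the first zarr chunk would
# overlap two dask chunks.
ds_rev.to_zarr("example.zarr", encoding={"var": {"chunks": (200,)}})
```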