Appending with to_zarr raises ValueError if append_dim length of existing data is not an integer multiple of chunk size #9767
Comments
See [issue on GitHub](pydata/xarray#9767)
If you're absolutely confident it's safe you can set `safe_chunks=False`. We know that the code raises false positives and would welcome changes that make it better. I believe your offset comment was noted in the PR too, but it was a strict improvement over the status quo, and so it was merged.

Which PR was this?
Hi, I think in this case it is not a false positive. You can visualize the chunk regions of your dataset on the time dimension as follows:

- chunk 1: positions 1 to 14
- chunk 2: positions 15 to 28
- chunk 3: positions 29 to 42

Your first_ds occupies positions 1 to 31, which means it writes to chunks 1, 2, and 3 (only three elements land on chunk 3). When you append your second_ds, whose first chunk is of size 14, Xarray starts writing at chunk 3 (which currently holds 3 elements); writing 14 contiguous elements covers positions 32 to 45, which spans chunks 3 and 4. As you have more chunks on that array, two tasks can end up writing chunk 4 at the same time, and this can corrupt your data. My recommendation is to rechunk your second_ds to match the chunk grid of your first_ds, which you can achieve with the following code (this can generate more tasks than desired, so you could consider creating your datasets in a different way):

```python
import xarray as xr

second_ds = xr.concat([first_ds, second_ds], dim="time").chunk({"time": 14}).sel(
    time=slice(second_ds.coords["time"][0], None)
)
```
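The chunk arithmetic above can be sanity-checked with a small helper (a sketch using 0-based positions, whereas the comment above counts from 1):

```python
def chunks_touched(start, stop, chunk_size):
    """Return the chunk indices written by the half-open region [start, stop)."""
    return list(range(start // chunk_size, (stop - 1) // chunk_size + 1))

# first_ds covers positions 0..30 (31 elements) with chunk size 14 -> chunks 0, 1, 2
print(chunks_touched(0, 31, 14))   # [0, 1, 2]

# a naive 14-element append starting at position 31 spans chunks 2 AND 3,
# so the partially filled chunk can be written by two tasks at once
print(chunks_touched(31, 45, 14))  # [2, 3]
```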
Thanks @josephnowak, that workaround seems to have helped and the task increase isn't too bad. Would there be any way to sequence or mutex the chunk writes to avoid this issue?
It's nice to hear that it was useful. There is a parameter called synchronizer on the to_zarr method; it should help you, but I think it cannot be used in a Distributed environment (someone correct me if I'm wrong). For that case, I think you can create a class that implements the Zarr synchronization interface using Dask locks instead of thread or process locks.
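A minimal sketch of that idea, following Zarr v2's synchronizer protocol (an object whose `__getitem__` maps a chunk key to a lock, as in `zarr.ThreadSynchronizer`). The threading version below runs standalone; swapping in Dask locks for a cluster is an untested assumption noted in the comments:

```python
import threading
from collections import defaultdict

class PerChunkSynchronizer:
    """Zarr-style synchronizer: hand out one lock per chunk key.

    For a Dask Distributed cluster, returning
    dask.distributed.Lock(name=item) instead of a threading.Lock
    would be the analogous (hypothetical, untested) adaptation.
    """

    def __init__(self):
        self._guard = threading.Lock()             # protects the dict itself
        self._locks = defaultdict(threading.Lock)  # one lock per chunk key

    def __getitem__(self, item):
        with self._guard:                          # same key -> same lock
            return self._locks[item]
```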
Great analysis @josephnowak!
Thanks for all the help. Do you think it may be worthwhile to document that method of aligning the new data's chunks somewhere?
What would we add to the documentation? (Genuine question; very likely there are things to add!) In the meantime I will close this, as I don't think there's a bug here, even though the current behavior is not complete.
@max-sixty Sorry, I forgot to reply. I think it would be worthwhile to add Joseph's code snippet somewhere in the documentation.
Yes, very open to adding something to the docs. Likely that code snippet needs some generalization before we paste it in... (Again, the current state is not great; I'm not dismissing this as "everything is perfect", but the binding constraint is making easy-to-understand interfaces & docs...)
@max-sixty, do you think it would be useful to add a parameter to the to_zarr method that allows automatic chunk alignment between Dask and Zarr? It looks like this is a very common problem. As an additional idea, I think we could go beyond the chunk-alignment parameter and add coordinate alignment too. I think that would make the to_zarr method more user-friendly in some scenarios.
Yes, I definitely think we could have something to align the chunks. That could be a param in `to_zarr`. To @RKuttruff's point — to the extent we want to make incremental progress, adding the code of a function to the zarr docs page would be valuable, I think...
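A generalized form of Joseph's snippet might look like the sketch below. The helper name and the use of positional `isel` (rather than a coordinate-based `sel`) are my own choices for illustration, not an existing xarray API:

```python
import numpy as np
import xarray as xr

def align_chunks_for_append(existing_ds, new_ds, dim, chunk_size):
    """Rechunk `new_ds` so its dask chunk boundaries along `dim` line up
    with the chunk grid of the data already written to the Zarr store."""
    combined = xr.concat([existing_ds, new_ds], dim=dim).chunk({dim: chunk_size})
    # keep only the part that is actually new
    return combined.isel({dim: slice(existing_ds.sizes[dim], None)})

# 31 existing steps chunked by 14 leave a partial chunk of 3, so the
# 14 new steps come back rechunked as (11, 3) to complete the grid.
first = xr.Dataset({"v": ("time", np.arange(31))}).chunk({"time": 14})
second = xr.Dataset({"v": ("time", np.arange(31, 45))}).chunk({"time": 14})
aligned = align_chunks_for_append(first, second, "time", 14)
print(aligned.chunks["time"])  # (11, 3)
```

As Joseph notes above, routing the new data through `concat` can generate more tasks than desired, so this trades some graph overhead for write safety.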
Hi, I was about to lodge an issue on what I believe is a similar bug I've encountered since xarray 2024.10; I've had to revert to 2024.9 in my use case. I have a unittest that I'm happy to share which demonstrates this.

The issue happens after appending or overwriting to an existing zarr dataset, and would completely corrupt it. Finally, I thought I could get rid of the problem, but still got:

```
ValueError: Specified zarr chunks encoding['chunks']=(4, 60, 59) for variable named 'UCUR_quality_control' would overlap multiple dask chunks ((1,), (60,), (59,)) on the region (slice(0, 1, None), slice(None, None, None), slice(None, None, None)). Writing this array in parallel with dask could lead to corrupted data. Consider either rechunking using `chunk()`, deleting or modifying `encoding['chunks']`, or specify `safe_chunks=False`.
```

Note that none of the variables had an encoding dictionary, so the ValueError message is basically useless and leads to more confusion. Happy to lodge another issue, but I feel like this one should be reopened.
Can we get an MCVE? Feel free to open a new issue. Please keep in mind @josephnowak's analysis of this MCVE, to avoid making an example that has the same issue...
What happened?
I have code that produces zarr data as output with configurable chunking. Recent builds have been raising unexpected `ValueError`s about misaligned chunks, despite a) the chunk shaping being the same for both the new and existing data and b) calling `chunk()` and ensuring `encoding['chunks']` is unset on append, as suggested in the error message.

The error:
In the provided MCVE, this can be observed as provided. If the value(s) of `DAYS_PER_APPEND` or the first value of the `CHUNKING` tuple are edited to be integer multiples of each other, the error is not raised. If you further edit to add an offset to the `create()` call for the first dataset such that it will not be an integer multiple of the chunk shape (i.e., `create(DAYS_PER_APPEND + 1, start_dt, LATITUDE_RES)` with `CHUNKING = (14, 50, 50)`), the error will appear again, but NOT if this is done for the second dataset, leading me to conclude that the error is raised when the existing dataset is out of alignment with the chunk shape.

What did you expect to happen?
I expect appending with `to_zarr` to complete without error regardless of the length of the append dimension in the existing data, provided the chunking of both is the same.

Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment
/opt/conda/lib/python3.11/site-packages/_distutils_hack/init.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")
INSTALLED VERSIONS
commit: None
python: 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:27:36) [GCC 13.3.0]
python-bits: 64
OS: Linux
OS-release: 6.10.11-linuxkit
machine: x86_64
processor:
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2024.10.0
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.11.4
netCDF4: 1.7.1
pydap: None
h5netcdf: 1.4.0
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2024.10.0
distributed: 2024.10.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.5.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.0.0
pip: 23.3
conda: 23.11.0
pytest: None
mypy: None
IPython: None
sphinx: None