
Appending to existing zarr store writes mostly NaN from dask arrays, but not numpy arrays #7812

Open
grahamfindlay opened this issue May 3, 2023 · 1 comment
Labels
topic-documentation topic-zarr Related to zarr storage library

Comments

@grahamfindlay

What is your issue?

I am using xarray to consolidate ~24 pre-existing, moderately large netCDF files into a single zarr store. Each file contains a DataArray with dimensions (channel, time), and none of the values are NaN. Each file's time series picks up exactly where the previous one's leaves off, making this a perfect use case for out-of-memory file concatenation.

import xarray as xr
from tqdm import tqdm

for i, f in enumerate(tqdm(files)):
    da = xr.open_dataarray(f)  # Open the netCDF file
    da = da.chunk({'channel': da.channel.size, 'time': 'auto'})  # Chunk along the time dimension
    if i == 0:
        da.to_zarr(zarr_file, mode="w")  # First file creates the store
    else:
        da.to_zarr(zarr_file, append_dim='time')  # Subsequent files append along time
    da.close()

This always writes the first file correctly, and every other file appends without warning or error, but when I read the resulting zarr store back, ~25% of all timepoints (more likely, ~25% of time chunks) derived from files i > 0 are NaN.

Admittedly, the above code seems dangerous: there is no guarantee that da.chunk({'time': 'auto'}) will always return chunks of the same size, even though the files are nearly identical in size, and I don't know what the expected behavior is if the dask chunk sizes don't match the chunk sizes of the pre-existing zarr store. I checked the docs but didn't find the answer.

Even if the chunk sizes always do match, I am not sure what happens when appending to an existing store. If the last chunk in the store before appending is not a full chunk, will it be "filled in" when new data are appended? Presumably, but this seems like it could cause problems with parallel writing, since the source chunks from a dask array almost certainly won't line up with the chunks in the zarr store unless you have been careful to make it so.
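To make the chunk-straddling concern concrete, here is a pure-Python sketch. The helper `straddled_store_chunks` is hypothetical (not part of xarray or zarr); it just computes which store chunks each incoming dask chunk would overlap when appended to a store whose last chunk is partial:

```python
def straddled_store_chunks(store_len, store_chunk, new_chunk_sizes):
    """Hypothetical helper: for each incoming dask chunk, list the indices
    of the zarr store chunks it overlaps when appended at offset store_len."""
    spans = []
    start = store_len
    for size in new_chunk_sizes:
        stop = start + size
        spans.append(range(start // store_chunk, (stop - 1) // store_chunk + 1))
        start = stop
    return spans

# The store holds 10 timepoints with a zarr chunk size of 4, so its last
# chunk is partial; appending dask chunks of size 4 straddles boundaries:
print(straddled_store_chunks(10, 4, [4, 4]))  # [range(2, 4), range(3, 5)]
```

In this example both appended chunks overlap store chunk 3, so two parallel writers would each have to rewrite it. When the existing length is an exact multiple of the store's chunk size, every writer touches a disjoint set of store chunks instead.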

In any case, the following change seems to solve the issue, and the zarr store no longer contains NaN values.

for i, f in enumerate(tqdm(files)):
    da = xr.open_dataarray(f)  # Open the netCDF file
    if i == 0:
        # Only the first file is chunked with dask; it defines the store's chunking
        da = da.chunk({'channel': da.channel.size, 'time': 'auto'})
        da.to_zarr(zarr_file, mode="w")
    else:
        da.to_zarr(zarr_file, append_dim='time')
    da.close()
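If every file still needs dask chunking (e.g. for memory reasons), another option, untested here, would be to pin one explicit chunk size up front and check that each append starts on a store-chunk boundary. A minimal sketch with a hypothetical `appends_align` helper:

```python
def appends_align(file_lengths, chunk):
    """Hypothetical check: True if every append after the first file starts
    on a store-chunk boundary (only the final file may leave a partial chunk)."""
    total = 0
    for i, length in enumerate(file_lengths):
        if i > 0 and total % chunk != 0:
            return False
        total += length
    return True

print(appends_align([8, 8, 5], 4))   # True: each append starts on a boundary
print(appends_align([10, 8, 8], 4))  # False: the second append starts mid-chunk
```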

I didn't file this as a bug report, because I was doing something that was a bad idea, but it does seem like to_zarr should have stopped me from doing it in the first place.

@grahamfindlay grahamfindlay added the needs triage Issue that has not been reviewed by xarray team member label May 3, 2023
@max-sixty max-sixty added needs info Issue reporter has not yet provided key information and removed needs triage Issue that has not been reviewed by xarray team member labels Oct 14, 2023
@max-sixty
Collaborator

Yes, if the chunks aren't aligned, we probably should raise an error there.

Is it possible to construct an example with explicit chunk sizes to reliably demonstrate a bug?

@dcherian dcherian added topic-documentation topic-zarr Related to zarr storage library and removed needs info Issue reporter has not yet provided key information labels Nov 15, 2023