
Enable Append/concat to existing zarr datastore #2022

Closed
jgerardsimcock opened this issue Mar 28, 2018 · 7 comments · Fixed by #2706

Comments

@jgerardsimcock

Following discussion from pangeo-data/pangeo#19

How would we go about implementing a concat or append function for zarr data stores? I am imagining something like xr.concat here. It's not clear to me how this would work when using open_mfdataset.

```python
zarray_1 = xr.open_zarr(store=gcsmap)     # existing zarr store (e.g. on GCS)
ds2 = xr.open_dataset(path_to_netcdf)     # new data to append
xr.concat([zarray_1, ds2], dim="time")    # e.g. concatenated along time
```

Problem description

If you are using a cloud storage facility like GCS, ds.to_zarr can fail before the upload completes. This is a problem for multi-TB datasets, since the entire process has to be restarted with no way to resume where you left off.

Expected Output

A new zarr dataset with the additional dataset appended along the appropriate dimension.

@shoyer (Member) commented Mar 29, 2018

This would probably make sense to think about alongside support for appending along an existing dimension in a netCDF file (#1672).

I can see a few potential ways to write the syntax. Probably supplying a range of indices along a dimension to write to would make the most sense, e.g., to_zarr(..., destination={'time': slice(1000, 2000)}) to indicate writing to positions 1000-2000 along the time dimension.
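
To make the idea concrete, a rough sketch of how such a call might read; the `destination` keyword is hypothetical (taken from the comment above, not an existing argument), and the file and store names are illustrative:

```python
import xarray as xr

# Hypothetical: write this dataset into positions 1000-2000 of the existing
# "time" dimension of a zarr store that was created earlier.
ds_part = xr.open_dataset("part_of_the_data.nc")   # illustrative input
ds_part.to_zarr(
    "existing_store.zarr",
    mode="a",
    destination={"time": slice(1000, 2000)},       # proposed keyword, not in xarray
)
```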

@NickMortimer

My use case for this is appending Argo float data to an existing zarr store. At the moment I have 800+ netCDF files that need transforming before they can be added or read by xarray as ordinary *.nc files. I read the first file, transform it, and write it to a zarr store using .to_zarr. Then I read the remaining files and append each variable to the store using zarr's own append function.

This is probably not a good way to go, but it's all I could figure out at the moment.

@shoyer I think it would be useful to have a straight append mode: `to_zarr(..., mode='a+')`
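
For reference, a minimal sketch of the zarr-level workaround described above, assuming the files share dimensions and the record dimension is axis 0 (file names are illustrative); `zarr.Array.append` is zarr's own method for growing an array along one axis:

```python
import glob
import xarray as xr
import zarr

files = sorted(glob.glob("argo_*.nc"))          # illustrative file pattern

# Write the first (transformed) file with xarray to create the store.
xr.open_dataset(files[0]).to_zarr("argo.zarr", mode="w")

# Append the remaining files variable by variable with zarr's append().
root = zarr.open_group("argo.zarr", mode="a")
for path in files[1:]:
    ds = xr.open_dataset(path)
    for name in ds.data_vars:
        # Grows the underlying zarr array along axis 0; coordinates,
        # attributes and encoding are not updated by this raw append.
        root[name].append(ds[name].values, axis=0)
```

This bypasses xarray entirely after the first file, so coordinate variables and metadata have to be kept in sync by hand, which is part of why a built-in append mode would be preferable.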

@rabernat (Contributor)

We may have people interested in working on this soon.

I think we have some details to sort out regarding the API for appending. The most generic case looks something like this:

```python
ds1 = xr.open_dataset('file1.nc')
# file2.nc already exists
ds1.to_netcdf('file2.nc', mode='a+')
```

We need to figure out what should happen under different circumstances. Some cases are:

  • We are just adding or completely overwriting variables. This works currently (from the docs: "If mode=’a’, existing variables will be overwritten"). But I'm not sure what happens if there is a conflict between coordinates among the new and old variables.
  • ds1 has some of the same variables as ds2, possibly with overlapping coordinates. In this case, we want to do some kind of append. If there is no overlap between coordinates, then it's straightforward: put the extra values from ds1 into file2.nc. If there is overlap, then there are two options:
    • overwrite all of the overlapping portion with ds1, or
    • keep the existing values from ds2.
  • With netCDF, there is an additional limitation that the underlying library will only let you extend along one dimension (the UNLIMITED one). Other backends like zarr will let you extend along many dimensions.

It seems like much of the logic for overlapping dimensions should be able to be handled via align. The hard part will be figuring out how to tell the store to write to the appropriate regions of its arrays.
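
To illustrate the simplest of the cases above (the append dimension in ds1 does not overlap what is already in file2.nc), here is a sketch of how the non-overlapping portion could be isolated; the write-to-region step is the part that still needs backend support, so it is shown as a plain concat-and-rewrite:

```python
import xarray as xr

ds2 = xr.open_dataset("file2.nc")      # data already on disk
ds1 = xr.open_dataset("file1.nc")      # data to append

# Keep only the labels along "time" that file2.nc does not already have.
new_times = ds1.indexes["time"].difference(ds2.indexes["time"])
ds1_new = ds1.sel(time=new_times)

# Today the only option is a full rewrite; the proposal above is to let the
# backend write ds1_new directly into the right region of the existing file.
xr.concat([ds2, ds1_new], dim="time").to_netcdf("file2_new.nc")
```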

@shoyer (Member) commented Oct 17, 2018

> We are just adding or completely overwriting variables. This works currently (from the docs: "If mode=’a’, existing variables will be overwritten"). But I'm not sure what happens if there is a conflict between coordinates among the new and old variables.

I'm pretty sure the coordinates will just get overwritten, too, at least as long as the coordinate arrays have the same shape. If they have different shapes, you probably will get an error. We certainly don't do any checks for alignment currently.

> ds1 has some of the same variables as ds2, possibly with overlapping coordinates. In this case, we want to do some kind of append. If there is no overlap between coordinates, then it's straightforward: put the extra values from ds1 into file2.nc.

This is the only case I would try to solve in the initial implementation. It's probably 20% of the work (adding a keyword argument like extend='time') and covers 80% of the use cases.

If we need alignment, I'm sure we could make that work in a follow-up. Certainly it would be less error prone to use.
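
A sketch of how the proposed keyword could read at the call site; `extend` here is the hypothetical spelling from the comment above, not an existing argument, and the names are illustrative:

```python
import xarray as xr

ds_new = xr.open_dataset("next_batch.nc")   # illustrative input

# Hypothetical: append ds_new along "time", assuming its time values do not
# overlap those already in the store ('extend' is not a real to_zarr argument).
ds_new.to_zarr("existing_store.zarr", mode="a", extend="time")
```

For what it's worth, the append support that eventually landed in xarray exposes this on `to_zarr` as the `append_dim` argument rather than `extend`.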

@leroygr commented Nov 27, 2018

> My use case for this is appending Argo float data to an existing zarr store. At the moment I have 800+ netCDF files that need transforming before they can be added or read by xarray as ordinary *.nc files. I read the first file, transform it, and write it to a zarr store using .to_zarr. Then I read the remaining files and append each variable to the store using zarr's own append function.
>
> This is probably not a good way to go, but it's all I could figure out at the moment.

@NickMortimer would you have a snippet for appending xarray objects to an existing zarr dataset?

It would indeed be really nice to get this built into xarray, but that is just a matter of patience I guess :)

Thanks!
Greg

@rabernat (Contributor)

> It would indeed be really nice to get this built into xarray, but that is just a matter of patience I guess :)

Patience...or action. Anyone is welcome and encouraged to submit a pull request on this topic. Xarray is a volunteer effort.

@leroygr
Copy link

leroygr commented Nov 27, 2018

> It would indeed be really nice to get this built into xarray, but that is just a matter of patience I guess :)

> Patience...or action. Anyone is welcome and encouraged to submit a pull request on this topic. Xarray is a volunteer effort.

Obviously. I'm just new to Zarr, so it's a bit early for me to contribute to Xarray on that topic.
