
Enable Append/concat to existing zarr datastore #2022

Closed
jgerardsimcock opened this issue Mar 28, 2018 · 7 comments · Fixed by #2706

Comments

@jgerardsimcock

Following discussion from pangeo-data/pangeo#19

How would we go about implementing a concat or append function for zarr data stores? I am imagining something like xr.concat here. It's not clear to me how this would work when using open_mfdataset.

```python
zarray_1 = xr.open_zarr(store=gcsmap)     # existing zarr store (e.g. on GCS)
ds2 = xr.open_dataset(path_to_netcdf)     # new data to append
xr.concat([zarray_1, ds2], dim="time")    # e.g. concatenated along time
```

Problem description

If you are using a cloud storage facility like GCS, ds.to_zarr can fail before the upload completes. This is a problem for multi-TB datasets, since the entire process has to be restarted with no way to resume where you left off.

Expected Output

A new zarr dataset with the additional dataset appended along the appropriate dimension.

@shoyer (Member) commented Mar 29, 2018

This would probably make sense to think about alongside support for appending along an existing dimension in a netCDF file (#1672).

I can see a few potential ways to write the syntax. Probably supplying a range of indices along a dimension to write to would make the most sense, e.g., to_zarr(..., destination={'time': slice(1000, 2000)}) to indicate writing to positions 1000-2000 along the time dimension.
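
To make the idea concrete, a rough sketch of how such a call might read; the `destination` keyword is hypothetical (taken from the comment above, not an existing argument), and the file and store names are illustrative:

```python
import xarray as xr

# Hypothetical: write this dataset into positions 1000-2000 of the existing
# "time" dimension of a zarr store that was created earlier.
ds_part = xr.open_dataset("part_of_the_data.nc")   # illustrative input
ds_part.to_zarr(
    "existing_store.zarr",
    mode="a",
    destination={"time": slice(1000, 2000)},       # proposed keyword, not in xarray
)
```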

@NickMortimer

My use case for this is appending Argo float data to an existing zarr store. At the moment I have 800+ netCDF files that need transforming before they can be added or read by xarray as ordinary *.nc files. I read the first file, transform it, and write it to a zarr store using .to_zarr. Then I read the remaining files and append each variable to the store using zarr's own append function.

This is probably not a good way to go, but it's all I could figure out at the moment.

@shoyer I think it would be useful to have a straight append mode: `to_zarr(..., mode='a+')`
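
For reference, a minimal sketch of the zarr-level workaround described above, assuming the files share dimensions and the record dimension is axis 0 (file names are illustrative); `zarr.Array.append` is zarr's own method for growing an array along one axis:

```python
import glob
import xarray as xr
import zarr

files = sorted(glob.glob("argo_*.nc"))          # illustrative file pattern

# Write the first (transformed) file with xarray to create the store.
xr.open_dataset(files[0]).to_zarr("argo.zarr", mode="w")

# Append the remaining files variable by variable with zarr's append().
root = zarr.open_group("argo.zarr", mode="a")
for path in files[1:]:
    ds = xr.open_dataset(path)
    for name in ds.data_vars:
        # Grows the underlying zarr array along axis 0; coordinates,
        # attributes and encoding are not updated by this raw append.
        root[name].append(ds[name].values, axis=0)
```

This bypasses xarray entirely after the first file, so coordinate variables and metadata have to be kept in sync by hand, which is part of why a built-in append mode would be preferable.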

@rabernat (Contributor)

We may have people interested in working on this soon.

I think we have some details to sort out regarding the API for appending. The most generic case looks something like this:

```python
ds1 = xr.open_dataset('file1.nc')
# file2.nc already exists
ds1.to_netcdf('file2.nc', mode='a+')
```

We need to figure out what should happen under different circumstances. Some cases are:

  • We are just adding or completely overwriting variables. This works currently (from the docs: "If mode=’a’, existing variables will be overwritten"). But I'm not sure what happens if there is a conflict between coordinates among the new and old variables.
  • ds1 has some of the same variables as ds2, possibly with overlapping coordinates. In this case, we want to do some kind of append. If there is no overlap between coordinates, then it's straightforward: put the extra values from ds1 into file2.nc. If there is overlap, then there are two options:
    • overwrite all of the overlapping portion with ds1, or
    • keep the existing values from ds2.
  • With netCDF, there is an additional limitation that the underlying library will only let you extend along one dimension (the UNLIMITED one). Other backends like zarr will let you extend along many dimensions.

It seems like much of the logic for overlapping dimensions should be able to be handled via align. The hard part will be figuring out how to tell the store to write to the appropriate regions of its arrays.
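
To illustrate the simplest of the cases above (the append dimension in ds1 does not overlap what is already in file2.nc), here is a sketch of how the non-overlapping portion could be isolated; the write-to-region step is the part that still needs backend support, so it is shown as a plain concat-and-rewrite:

```python
import xarray as xr

ds2 = xr.open_dataset("file2.nc")      # data already on disk
ds1 = xr.open_dataset("file1.nc")      # data to append

# Keep only the labels along "time" that file2.nc does not already have.
new_times = ds1.indexes["time"].difference(ds2.indexes["time"])
ds1_new = ds1.sel(time=new_times)

# Today the only option is a full rewrite; the proposal above is to let the
# backend write ds1_new directly into the right region of the existing file.
xr.concat([ds2, ds1_new], dim="time").to_netcdf("file2_new.nc")
```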

@shoyer (Member) commented Oct 17, 2018

> We are just adding or completely overwriting variables. This works currently (from the docs: "If mode=’a’, existing variables will be overwritten"). But I'm not sure what happens if there is a conflict between coordinates among the new and old variables.

I'm pretty sure the coordinates will just get overwritten, too, at least as long as the coordinate arrays have the same shape. If they have different shapes, you probably will get an error. We certainly don't do any checks for alignment currently.

> ds1 has some of the same variables as ds2, possibly with overlapping coordinates. In this case, we want to do some kind of append. If there is no overlap between coordinates, then it's straightforward: put the extra values from ds1 into file2.nc.

This is the only case I would try to solve in the initial implementation. It's probably 20% of the work (adding a keyword argument like extend='time') and covers 80% of the use cases.

If we need alignment, I'm sure we could make that work in a follow-up. Certainly it would be less error prone to use.
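
A sketch of how the proposed keyword could read at the call site; `extend` here is the hypothetical spelling from the comment above, not an existing argument, and the names are illustrative:

```python
import xarray as xr

ds_new = xr.open_dataset("next_batch.nc")   # illustrative input

# Hypothetical: append ds_new along "time", assuming its time values do not
# overlap those already in the store ('extend' is not a real to_zarr argument).
ds_new.to_zarr("existing_store.zarr", mode="a", extend="time")
```

For what it's worth, the append support that eventually landed in xarray exposes this on `to_zarr` as the `append_dim` argument rather than `extend`.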

@leroygr commented Nov 27, 2018

> My use case for this is appending Argo float data to an existing zarr store. At the moment I have 800+ netCDF files that need transforming before they can be added or read by xarray as ordinary *.nc files. I read the first file, transform it, and write it to a zarr store using .to_zarr. Then I read the remaining files and append each variable to the store using zarr's own append function.
>
> This is probably not a good way to go, but it's all I could figure out at the moment.

@NickMortimer would you have a snippet for appending xarray objects to an existing zarr dataset?

It would indeed be really nice to get this built into xarray, but that is just a matter of patience I guess :)

Thanks!
Greg

@rabernat (Contributor)

> It would indeed be really nice to get this built into xarray, but that is just a matter of patience I guess :)

Patience...or action. Anyone is welcome and encouraged to submit a pull request on this topic. Xarray is a volunteer effort.

@leroygr
Copy link

leroygr commented Nov 27, 2018

> It would indeed be really nice to get this built into xarray, but that is just a matter of patience I guess :)

> Patience...or action. Anyone is welcome and encouraged to submit a pull request on this topic. Xarray is a volunteer effort.

Obviously. I'm just new to Zarr, so it's a bit early for me to contribute to Xarray on that topic.
