Append along an unlimited dimension to an existing netCDF file #1672

Open
shoyer opened this issue Oct 30, 2017 · 9 comments

@shoyer
Member

shoyer commented Oct 30, 2017

This would be a nice feature to have for some use cases, e.g., for writing simulation time-steps:
https://stackoverflow.com/questions/46951981/create-and-write-xarray-dataarray-to-netcdf-in-chunks

It should be relatively straightforward to add, too, building on existing support for writing files with unlimited dimensions. The user-facing API would probably be a new keyword argument to to_netcdf(), e.g., extend='time' to indicate the dimension to extend.
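
A rough sketch of what that might look like (note: the extend keyword is only a proposal and does not exist in xarray today; unlimited_dims already works for the initial write):

import numpy as np
import pandas as pd
import xarray as xr

# Initial write: mark "time" as unlimited so the file can grow along it later.
ds = xr.Dataset(
    {"temperature": (("time", "x"), np.zeros((1, 3)))},
    coords={"time": pd.date_range("2017-10-30", periods=1), "x": [0, 1, 2]},
)
ds.to_netcdf("simulation.nc", unlimited_dims=["time"])

# Each new simulation time step would then be appended with the proposed keyword:
# next_step.to_netcdf("simulation.nc", mode="a", extend="time")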

@Hoeze

Hoeze commented May 24, 2018

Any updates on this?

@jhamman
Member

jhamman commented May 24, 2018

None that I'm aware of. I think this issue is still in the "help wanted" stage.

@mullenkamp

I would love to have this capability. As @shoyer mentioned, being able to add time steps of any sort to existing netCDF files would be really beneficial. The only real alternative is to save a separate netCDF file for each additional time step...even if there are tons of time steps and each file is only a couple hundred KB (which is my situation with NASA data).

I'll look into it if I get some time...

@thomas-fred

This would be extremely helpful for our modelling of time-varying renewable energy.

@hmaarrfk
Contributor

hmaarrfk commented Sep 1, 2020

I think I got a basic prototype working.

That said, I think a real challenge lies in supporting the numerous backends and lazy arrays.

For example, using the netCDF4 library I was only able to add data in rather peculiar ways, which may trigger expensive computations multiple times.

Is this a use case that we must optimize for now?

@hmaarrfk
Contributor

hmaarrfk commented Sep 2, 2020

Here's a small prototype; maybe it can help boost development.

import netCDF4
import xarray as xr


def _expand_variable(nc_variable, data, expanding_dim, nc_shape, added_size):
    # For time deltas, we must ensure that we use the same encoding as
    # what was previously stored.
    # We likely need to do this as well for variables that had custom
    # encodings too.
    if hasattr(nc_variable, 'calendar'):
        data.encoding = {
            'units': nc_variable.units,
            'calendar': nc_variable.calendar,
        }
    data_encoded = xr.conventions.encode_cf_variable(data)  # name=... could also be passed
    left_slices = data.dims.index(expanding_dim)
    right_slices = data.ndim - left_slices - 1
    # Write only into the newly grown region along the expanding dimension.
    nc_slice = (
        (slice(None),) * left_slices
        + (slice(nc_shape, nc_shape + added_size),)
        + (slice(None),) * right_slices
    )
    nc_variable[nc_slice] = data_encoded.data


def append_to_netcdf(filename, ds_to_append, unlimited_dims):
    if isinstance(unlimited_dims, str):
        unlimited_dims = [unlimited_dims]

    if len(unlimited_dims) != 1:
        # TODO: change this so it can support multiple expanding dims
        raise ValueError(
            "We only support one unlimited dim for now, "
            f"got {len(unlimited_dims)}.")

    unlimited_dims = list(set(unlimited_dims))
    expanding_dim = unlimited_dims[0]

    with netCDF4.Dataset(filename, mode='a') as nc:
        nc_coord = nc[expanding_dim]
        nc_shape = len(nc_coord)

        added_size = len(ds_to_append[expanding_dim])
        variables, attrs = xr.conventions.encode_dataset_coordinates(ds_to_append)

        for name, data in variables.items():
            if expanding_dim not in data.dims:
                # Nothing to do; data is assumed to be identical.
                continue

            nc_variable = nc[name]
            _expand_variable(nc_variable, data, expanding_dim, nc_shape, added_size)


from xarray.tests.test_dataset import create_append_test_data
from xarray.testing import assert_equal

ds, ds_to_append, ds_with_new_var = create_append_test_data()

filename = 'test_dataset.nc'
ds.to_netcdf(filename, mode='w', unlimited_dims=['time'])
append_to_netcdf(filename, ds_to_append, unlimited_dims='time')

loaded = xr.load_dataset(filename)
assert_equal(xr.concat([ds, ds_to_append], dim="time"), loaded)
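
For the simulation time-stepping use case from the original post, a hedged usage sketch of the prototype above (compute_step() is a hypothetical user function that returns one time slice as a Dataset):

# Write the first step, then append each subsequent one along "time".
first = compute_step(0)  # hypothetical user function
first.to_netcdf("run.nc", mode="w", unlimited_dims=["time"])
for i in range(1, n_steps):
    append_to_netcdf("run.nc", compute_step(i), unlimited_dims="time")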

@espiritocz

Hi, I consider this extremely useful!

Is your prototype already part of some library (or should we expect it in xarray)?

Many thanks for the code.

@hmaarrfk
Contributor

hmaarrfk commented Nov 29, 2020

It isn't really part of any library, and I don't have plans to make it into a public one. I think the discussion is really about the xarray API and which functions to implement first.

Then somebody can take the code and integrate it into whatever API is decided upon.

@ChrisBarker-NOAA

Any movement on this? I'd love to have this -- kinda critical for some of my work.

@hmaarrfk seems to have made a start, and it doesn't look too hairy :-)
