Treat accessor dataarrays as members of parent dataset #2517

Closed
ghost opened this issue Oct 26, 2018 · 5 comments

@ghost

ghost commented Oct 26, 2018

Code Sample

import xarray as xr
import pandas as pd

# What I'm doing with comparison, I'd like to do with actual

comparison = xr.Dataset({'data': (['time'], [100, 30, 10, 3, 1]),
                         'altitude': (['time'], [5, 10, 15, 20, 25])},
                        coords={'time': pd.date_range('2014-09-06', periods=5, freq='1s')})

# With altitude as a data var, I can do the following:
comparison.swap_dims({'time': 'altitude'}).interp(altitude=12.0).data
# And
for (time, g) in comparison.groupby('time'):
    print(time)
    print(g.altitude.values)

@xr.register_dataset_accessor('acc')
class Accessor(object):
    def __init__(self, xarray_ds):
        self._ds = xarray_ds
        self._altitude = None

    @property
    def altitude(self):
        """ An expensive calculation that results in data that not everyone needs. """
        if self._altitude is None:
            self._altitude = xr.DataArray([5, 10, 15, 20, 25],
                                          coords=[('time', self._ds.time)])
        return self._altitude

actual = xr.Dataset({'data': (['time'], [100, 30, 10, 3, 1])},
                    coords={'time': pd.date_range('2014-09-06', periods=5, freq='1s')})

# This doesn't work:
actual.swap_dims({'time': 'altitude'}).interp(altitude=12.0).data
# Neither does this:
for (time, g) in actual.groupby('time'):
    print(time)
    print(g.acc.altitude.values)

Problem description

I've been using accessors to extend xarray with some custom computation. The altitude in the above dataset is not used every time the data is loaded, but when it is, it is an expensive computation to make (which is why I put it in as an accessor; if it isn't needed, it isn't computed).

The problem is that once it has been computed, I'd like to be able to use it as if it were a regular data_var of the dataset: for example, to interp on the newly computed column, or to use it in a groupby.

Please advise if I'm going about this in the wrong way and how I should think about this problem instead.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.16-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

xarray: 0.10.8
pandas: 0.23.1
numpy: 1.14.5
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.5
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 9.0.3
conda: None
pytest: 3.6.1
IPython: None
sphinx: None

@TomNicholas
Member

If you want to return your newly calculated altitude and also have it be a full data_var in your dataset, one way would be to alter the original dataset in place, something like:

import xarray as xr
import pandas as pd
import xarray.testing as xrt


@xr.register_dataset_accessor('acc')
class Accessor(object):
    def __init__(self, xarray_ds):
        self._ds = xarray_ds
        self._altitude = None

    @property
    def altitude(self):
        """ An expensive calculation that results in data that not everyone needs. """
        if self._altitude is None:
            self._altitude = xr.DataArray([5, 10, 15, 20, 25],
                                          coords=[('time', self._ds.time)])

            # Here we add the calculated altitude to the dataset as a new data variable
            self._ds['altitude'] = self._altitude

        # Return just the altitude dataarray
        return self._altitude


expected = xr.Dataset({'data': (['time'], [100, 30, 10, 3, 1]),
                         'altitude': (['time'], [5, 10, 15, 20, 25])},
                        coords={'time': pd.date_range('2014-09-06', periods=5, freq='1s')})


actual = xr.Dataset({'data': (['time'], [100, 30, 10, 3, 1])},
                    coords={'time': pd.date_range('2014-09-06', periods=5, freq='1s')})


# Return newly-calculated altitude, but also store it in the actual dataset for later
altitude = actual.acc.altitude

# Check that worked
xrt.assert_equal(actual, expected)
xrt.assert_equal(actual['altitude'], actual.acc.altitude)

@ghost
Author

ghost commented Oct 31, 2018

The only problem I see with this is that actual.acc.altitude must be called before actual.altitude, otherwise it will result in the data_var being used before it is created.
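
A minimal sketch of that ordering issue, reusing the in-place accessor from the previous comment (the hard-coded values stand in for the expensive calculation, and the `missing_before` / `present_after` names are only for illustration):

```python
import pandas as pd
import xarray as xr


@xr.register_dataset_accessor('acc')
class Accessor:
    def __init__(self, xarray_ds):
        self._ds = xarray_ds
        self._altitude = None

    @property
    def altitude(self):
        """Stand-in for the expensive calculation; also stores the result on the dataset."""
        if self._altitude is None:
            self._altitude = xr.DataArray([5, 10, 15, 20, 25], dims=['time'],
                                          coords={'time': self._ds['time']})
            # Side effect: add the result to the parent dataset as a data variable
            self._ds['altitude'] = self._altitude
        return self._altitude


actual = xr.Dataset({'data': (['time'], [100, 30, 10, 3, 1])},
                    coords={'time': pd.date_range('2014-09-06', periods=5, freq='1s')})

# Before the accessor property is touched, the data_var does not exist yet,
# so actual.altitude or actual['altitude'] would fail at this point.
missing_before = 'altitude' not in actual.data_vars

# Touching the accessor computes the values and adds them to the dataset.
_ = actual.acc.altitude
present_after = 'altitude' in actual.data_vars
```

So any code that reads `actual.altitude` has to be sure the accessor property was evaluated first.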

@TomNicholas
Member

That's true, but unless you start subclassing dataset then isn't that always going to be the case?

If you have some quantity which you can only calculate with either a function or an accessor method on the dataset, wouldn't you need to alter the __getitem__ method on the dataset object (or some subclass of it) in order to get the behaviour you're describing?
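
Short of overriding `__getitem__`, one possible workaround is an explicit helper that attaches the variable on demand. This is only a sketch: `compute_altitude` and `with_altitude` are hypothetical names, not part of xarray, and the hard-coded values stand in for the expensive calculation.

```python
import pandas as pd
import xarray as xr


def compute_altitude(ds):
    """Hypothetical stand-in for the expensive calculation."""
    return xr.DataArray([5, 10, 15, 20, 25], dims=['time'],
                        coords={'time': ds['time']})


def with_altitude(ds):
    """Return ds itself if it already has 'altitude', else a new dataset that includes it."""
    if 'altitude' in ds.variables:
        return ds
    # Dataset.assign returns a new dataset and leaves the original untouched
    return ds.assign(altitude=compute_altitude(ds))


actual = xr.Dataset({'data': (['time'], [100, 30, 10, 3, 1])},
                    coords={'time': pd.date_range('2014-09-06', periods=5, freq='1s')})

full = with_altitude(actual)

# altitude now behaves like any other data_var: it can become a dimension...
by_altitude = full.swap_dims({'time': 'altitude'})
value_at_10 = by_altitude['data'].sel(altitude=10).item()

# ...and it is carried along into each group of a groupby
altitudes = [g['altitude'].item() for _, g in full.groupby('time')]
```

The caller still has to remember to go through the helper, so this does not remove the ordering requirement; it only makes it explicit and avoids mutating the original dataset.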

@shoyer
Member

shoyer commented Nov 2, 2018

I think the cleanest way to do this in the long term would be to combine some sort of "lazy array" object with caching, e.g., along the lines of what's described in #2298. I'm not sure what the best solution in the short-term is, though.

@ghost
Author

ghost commented Nov 5, 2018

I think #2298 is what I'm really waiting for, and it would solve the use cases I listed above. I'll have no trouble using the accessor methods for the time being.
Thanks, @shoyer

@ghost ghost closed this as completed Nov 5, 2018