Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommended way to extend xarray Datasets using accessors? #2473

Closed
TomNicholas opened this issue Oct 8, 2018 · 6 comments
Closed

Recommended way to extend xarray Datasets using accessors? #2473

TomNicholas opened this issue Oct 8, 2018 · 6 comments

Comments

@TomNicholas
Copy link
Member

Hi,

I'm now regularly using xarray (& dask) for organising and analysing the output of the simulation code I use (BOUT++) and it's very helpful, thank you!.

However my current approach is quite clunky at dealing the extra information and functionality that's specific to the simulation code I'm using, and I have questions about what the recommended way to extend the xarray Dataset class is. This seems like a general enough problem that I thought I would make an issue for it.

Desired

What I ideally want to do is extend the xarray.Dataset class to accommodate extra attributes and methods, while retaining as much xarray functionality as possible, but avoiding reimplementing any of the API. This might not be possible, but ideally I want to make a BoutDataset class which contains extra attributes to hold information about the run which doesn't naturally fit into the xarray data model, extra methods to perform analysis/plotting which only users of this code would require, but also be able to use xarray-specific methods and top-level functions:

bd = BoutDataset('/path/to/data')

ds = bd.data  # access the wrapped xarray dataset
extra_data = bd.extra_data  # access the BOUT-specific data

bd.isel(time=-1)  # use xarray dataset methods

bd2 = BoutDataset('/path/to/other/data')
concatenated_bd = xr.concat([bd, bd2])  # apply top-level xarray functions to the data

bd.plot_tokamak()  # methods implementing bout-specific functionality

Problems with my current approach

I have read the documentation about extending xarray, and the issue threads about subclassing Datasets (#706) and accessors (#1080), but I wanted to check that what I'm doing is the recommended approach.

Right now I'm trying to do something like

@xr.register_dataset_accessor('bout')
class BoutDataset:
    def __init__(self, path):
        self.data = collect_data(path)  # collect all my numerical data from output files
        self.extra_data = read_extra_data(path)  # collect extra data about the simulation 

    def plot_tokamak():
        plot_in_bout_specific_way(self.data, self.extra_data)

which works in the sense that I can do

bd = BoutDataset('/path/to/data')

ds = bd.bout.data  # access the wrapped xarray dataset
extra_data = bd.bout.extra_data  # access the BOUT-specific data
bd.bout.plot_tokamak()  # methods implementing bout-specific functionality

but not so well with

bd.isel(time=-1)  # AttributeError: 'BoutDataset' object has no attribute 'isel'
bd.bout.data.isel(time=-1)  # have to do this instead, but this returns an xr.Dataset not a BoutDataset

concatenated_bd = xr.concat([bd1, bd2])  # TypeError: can only concatenate xarray Dataset and DataArray objects, got <class 'BoutDataset'>
concatenated_ds = xr.concat([bd1.bout.data, bd2.bout.data])  # again have to do this instead, which again returns an xr.Dataset not a BoutDataset

If I have to reimplement the APl for methods like .isel() and top-level functions like concat(), then why should I not just subclass xr.Dataset?

There aren't very many top-level xarray functions so reimplementing them would be okay, but there are loads of Dataset methods. However I think I know how I want my BoutDataset class to behave when an xr.Dataset method is called on it: I want it to implement that method on the underlying dataset and return the full BoutDatset with extra data and attributes still attached.

Is it possible to do something like:
"if calling an xr.Dataset method on an instance of BoutDataset, call the corresponding method on the wrapped dataset and return a BoutDataset that has the extra BOUT-specific data propagated through"?

Thanks in advance, apologies if this is either impossible or relatively trivial, I just thought other xarray users might have the same questions.

@shoyer
Copy link
Member

shoyer commented Oct 8, 2018

Your use of an xarray accessor class looks a little funny to me. Accessors should only accept one argument, which is an xarray.Dataset. If you want to make your own objects, that's fine, but I don't think there's any point in making them an accessor.

What does your extra_data look like? Would it be possible to put it into xarray in the form of coordinates or attributes? If so, that would probably be the preferred path.

@TomNicholas
Copy link
Member Author

Accessors should only accept one argument, which is an xarray.Dataset.

Ah - I misunderstood how they work, if I do this:

@xr.register_dataset_accessor('bout')
class BoutDataset(object):
    def __init__(self, ds_object):
        self.data = ds_object
        
    def plot_tokamak():
        plot_in_bout_specific_way(self.data, self.extra_data)

then I can successfully use it like this:

ds = collect_data(path)

ds.bout.plot_tokamak()
print(ds.bout.data)

Thanks!

What does your extra_data look like? Would it be possible to put it into xarray in the form of coordinates or attributes?

My extra_data is a nested-dictionary-like object containing everything in the simulation input file. This I can store in the attributes dictionary (that's what I'm doing at the moment), but I'm still interested in the problem of attaching arbitrary objects because because losing the attributes on operations is annoying and I envisage either myself or other users of the code wanting to store more complicated structures in the future. (For example an xgcm grid object)

Also the philosophy of the BOUT framework encourages each group of users to add their own functionality depending on what exactly they are solving. Some kind of base BoutDataset class which could then be further subclassed to add more specific functionality would fit best with this system. Would it be a terrible idea to subclass xr.Dataset into a BoutDataset, then decorate every inherited method by looping over them so that my extra attached objects are dealt with appropriately (maybe like this)? Would that not in theory give me the whole API of xarray while allowing me to attach arbitrary objects and methods to my class, and allowing further subclassing if desired?

(Sorry if this is too far off the topic of xarray, feel free to close it if you think this is not going to be relevant to any of the rest of the community.)

@TomNicholas
Copy link
Member Author

Or alternatively I suppose I could just store my options as an attribute on the accessor:

ds = collect_data(path)
ds.bout.options = options_obj

# Now I have an object which stores all the data I want
print(ds.bout.options)

That seems to be what I should have tried first.

In that case would it be possible to have multiple accessors, one for the core bout functionality and others for extensions for even-more-specific users? i.e an accessor for a code called Storm which is based on BOUT++:

print(ds)  # xarray dataset

print(ds.bout.options)  # BOUT-specific extra information

print(ds.storm.special_storm_plot())  # storm-user-group-specific method

Can an accessor inherit from another accessor class?

@TomNicholas
Copy link
Member Author

TomNicholas commented Oct 9, 2018

Sorry for the triple comment, but I could also do this:

If I do this:

@xr.register_dataset_accessor('bout')
class BoutAccessor(object):
    def __init__(self, ds_object):
        self._data = ds_object
        
    def set_options(self, options):
        self.options = options

@register_dataset_accessor('storm')
class StormAccessor(BoutAccessor):
    def __init__(self, ds_object):
        super().__init__(ds_object)

    def special_storm_plot()
        special_storm_plot(self._data, self.options)

then I can have BOUT-specific attributes and methods stored via the bout accessor, all of which are inherited by the more-specific storm accessor

ds = collect_data(path)
options_obj = get_options(path)

ds.storm.set_options(options_obj)  # access inherited methods from the bout accessor to store arbitrary extra data

ds.storm.special_storm_plot()  # use data from the bout accessor when calling more-specific methods form the storm accessor

However then my attached objects don't survive the calling of xarray methods:

ds.storm.extra_info = 'options'
new_ds = ds.isel(t=-1)
print(new_ds.storm.extra_info)  # AttributeError: 'StormAccessor' object has no attribute 'extra_info'

@TomNicholas
Copy link
Member Author

TomNicholas commented Oct 12, 2018

I realised I'm massively overcomplicating this, I think all I need is the existing accessor functionality, and for attrs to always be preserved in order to get the behaviour I want. Therefore once #2482 is merged then I should be able to close this.

@TomNicholas
Copy link
Member Author

#2482 has now been merged which means that I can get the behaviour I want by using globally-permanent attrs to store arbitrary objects, and using accessors to attach methods to datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants