Recommended way to extend xarray Datasets using accessors? #2473

TomNicholas · 2018-10-08T12:19:21Z

Hi,

I'm now regularly using xarray (& dask) for organising and analysing the output of the simulation code I use (BOUT++) and it's very helpful, thank you!.

However my current approach is quite clunky at dealing the extra information and functionality that's specific to the simulation code I'm using, and I have questions about what the recommended way to extend the xarray Dataset class is. This seems like a general enough problem that I thought I would make an issue for it.

Desired

What I ideally want to do is extend the xarray.Dataset class to accommodate extra attributes and methods, while retaining as much xarray functionality as possible, but avoiding reimplementing any of the API. This might not be possible, but ideally I want to make a BoutDataset class which contains extra attributes to hold information about the run which doesn't naturally fit into the xarray data model, extra methods to perform analysis/plotting which only users of this code would require, but also be able to use xarray-specific methods and top-level functions:

bd = BoutDataset('/path/to/data')

ds = bd.data  # access the wrapped xarray dataset
extra_data = bd.extra_data  # access the BOUT-specific data

bd.isel(time=-1)  # use xarray dataset methods

bd2 = BoutDataset('/path/to/other/data')
concatenated_bd = xr.concat([bd, bd2])  # apply top-level xarray functions to the data

bd.plot_tokamak()  # methods implementing bout-specific functionality

Problems with my current approach

I have read the documentation about extending xarray, and the issue threads about subclassing Datasets (#706) and accessors (#1080), but I wanted to check that what I'm doing is the recommended approach.

Right now I'm trying to do something like

@xr.register_dataset_accessor('bout')
class BoutDataset:
    def __init__(self, path):
        self.data = collect_data(path)  # collect all my numerical data from output files
        self.extra_data = read_extra_data(path)  # collect extra data about the simulation 

    def plot_tokamak():
        plot_in_bout_specific_way(self.data, self.extra_data)

which works in the sense that I can do

bd = BoutDataset('/path/to/data')

ds = bd.bout.data  # access the wrapped xarray dataset
extra_data = bd.bout.extra_data  # access the BOUT-specific data
bd.bout.plot_tokamak()  # methods implementing bout-specific functionality

but not so well with

bd.isel(time=-1)  # AttributeError: 'BoutDataset' object has no attribute 'isel'
bd.bout.data.isel(time=-1)  # have to do this instead, but this returns an xr.Dataset not a BoutDataset

concatenated_bd = xr.concat([bd1, bd2])  # TypeError: can only concatenate xarray Dataset and DataArray objects, got <class 'BoutDataset'>
concatenated_ds = xr.concat([bd1.bout.data, bd2.bout.data])  # again have to do this instead, which again returns an xr.Dataset not a BoutDataset

If I have to reimplement the APl for methods like .isel() and top-level functions like concat(), then why should I not just subclass xr.Dataset?

There aren't very many top-level xarray functions so reimplementing them would be okay, but there are loads of Dataset methods. However I think I know how I want my BoutDataset class to behave when an xr.Dataset method is called on it: I want it to implement that method on the underlying dataset and return the full BoutDatset with extra data and attributes still attached.

Is it possible to do something like:
"if calling an xr.Dataset method on an instance of BoutDataset, call the corresponding method on the wrapped dataset and return a BoutDataset that has the extra BOUT-specific data propagated through"?

Thanks in advance, apologies if this is either impossible or relatively trivial, I just thought other xarray users might have the same questions.

The text was updated successfully, but these errors were encountered:

shoyer · 2018-10-08T18:06:23Z

Your use of an xarray accessor class looks a little funny to me. Accessors should only accept one argument, which is an xarray.Dataset. If you want to make your own objects, that's fine, but I don't think there's any point in making them an accessor.

What does your extra_data look like? Would it be possible to put it into xarray in the form of coordinates or attributes? If so, that would probably be the preferred path.

TomNicholas · 2018-10-09T11:10:46Z

Accessors should only accept one argument, which is an xarray.Dataset.

Ah - I misunderstood how they work, if I do this:

@xr.register_dataset_accessor('bout')
class BoutDataset(object):
    def __init__(self, ds_object):
        self.data = ds_object
        
    def plot_tokamak():
        plot_in_bout_specific_way(self.data, self.extra_data)

then I can successfully use it like this:

ds = collect_data(path)

ds.bout.plot_tokamak()
print(ds.bout.data)

Thanks!

What does your extra_data look like? Would it be possible to put it into xarray in the form of coordinates or attributes?

My extra_data is a nested-dictionary-like object containing everything in the simulation input file. This I can store in the attributes dictionary (that's what I'm doing at the moment), but I'm still interested in the problem of attaching arbitrary objects because because losing the attributes on operations is annoying and I envisage either myself or other users of the code wanting to store more complicated structures in the future. (For example an xgcm grid object)

Also the philosophy of the BOUT framework encourages each group of users to add their own functionality depending on what exactly they are solving. Some kind of base BoutDataset class which could then be further subclassed to add more specific functionality would fit best with this system. Would it be a terrible idea to subclass xr.Dataset into a BoutDataset, then decorate every inherited method by looping over them so that my extra attached objects are dealt with appropriately (maybe like this)? Would that not in theory give me the whole API of xarray while allowing me to attach arbitrary objects and methods to my class, and allowing further subclassing if desired?

(Sorry if this is too far off the topic of xarray, feel free to close it if you think this is not going to be relevant to any of the rest of the community.)

TomNicholas · 2018-10-09T12:21:53Z

Or alternatively I suppose I could just store my options as an attribute on the accessor:

ds = collect_data(path)
ds.bout.options = options_obj

# Now I have an object which stores all the data I want
print(ds.bout.options)

That seems to be what I should have tried first.

In that case would it be possible to have multiple accessors, one for the core bout functionality and others for extensions for even-more-specific users? i.e an accessor for a code called Storm which is based on BOUT++:

print(ds)  # xarray dataset

print(ds.bout.options)  # BOUT-specific extra information

print(ds.storm.special_storm_plot())  # storm-user-group-specific method

Can an accessor inherit from another accessor class?

TomNicholas · 2018-10-09T12:52:30Z

Sorry for the triple comment, but I could also do this:

If I do this:

@xr.register_dataset_accessor('bout')
class BoutAccessor(object):
    def __init__(self, ds_object):
        self._data = ds_object
        
    def set_options(self, options):
        self.options = options

@register_dataset_accessor('storm')
class StormAccessor(BoutAccessor):
    def __init__(self, ds_object):
        super().__init__(ds_object)

    def special_storm_plot()
        special_storm_plot(self._data, self.options)

then I can have BOUT-specific attributes and methods stored via the bout accessor, all of which are inherited by the more-specific storm accessor

ds = collect_data(path)
options_obj = get_options(path)

ds.storm.set_options(options_obj)  # access inherited methods from the bout accessor to store arbitrary extra data

ds.storm.special_storm_plot()  # use data from the bout accessor when calling more-specific methods form the storm accessor

However then my attached objects don't survive the calling of xarray methods:

ds.storm.extra_info = 'options'
new_ds = ds.isel(t=-1)
print(new_ds.storm.extra_info)  # AttributeError: 'StormAccessor' object has no attribute 'extra_info'

TomNicholas · 2018-10-12T19:05:24Z

I realised I'm massively overcomplicating this, I think all I need is the existing accessor functionality, and for attrs to always be preserved in order to get the behaviour I want. Therefore once #2482 is merged then I should be able to close this.

TomNicholas · 2018-10-31T09:58:05Z

#2482 has now been merged which means that I can get the behaviour I want by using globally-permanent attrs to store arbitrary objects, and using accessors to attach methods to datasets.

TomNicholas mentioned this issue Oct 8, 2018

Unsure of best way to access xr.Dataset methods TomNicholas/xBOUT#2

Closed

TomNicholas mentioned this issue Oct 12, 2018

Global option to always keep/discard attrs on operations #2482

Merged

4 tasks

TomNicholas closed this as completed Oct 31, 2018

margauxmouchene mentioned this issue Jun 27, 2019

Subsetting a DataRecord doesn't return a true DataRecord landlab/landlab#909

Closed

flokno mentioned this issue Feb 26, 2020

Properly attach hash to arrays flokno/xrcache#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recommended way to extend xarray Datasets using accessors? #2473

Recommended way to extend xarray Datasets using accessors? #2473

TomNicholas commented Oct 8, 2018

shoyer commented Oct 8, 2018

TomNicholas commented Oct 9, 2018

TomNicholas commented Oct 9, 2018

TomNicholas commented Oct 9, 2018 •

edited

Loading

TomNicholas commented Oct 12, 2018 •

edited

Loading

TomNicholas commented Oct 31, 2018

Recommended way to extend xarray Datasets using accessors? #2473

Recommended way to extend xarray Datasets using accessors? #2473

Comments

TomNicholas commented Oct 8, 2018

Desired

Problems with my current approach

shoyer commented Oct 8, 2018

TomNicholas commented Oct 9, 2018

TomNicholas commented Oct 9, 2018

TomNicholas commented Oct 9, 2018 • edited Loading

TomNicholas commented Oct 12, 2018 • edited Loading

TomNicholas commented Oct 31, 2018

TomNicholas commented Oct 9, 2018 •

edited

Loading

TomNicholas commented Oct 12, 2018 •

edited

Loading