Skip to content
This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Bug with arithmetic between datasets and datatrees (e.g. ds * dt) #146

Closed
TomNicholas opened this issue Aug 25, 2022 · 5 comments
Closed
Labels
bug Something isn't working

Comments

@TomNicholas
Copy link
Member

TomNicholas commented Aug 25, 2022

Arithmetic involving one DataTree and one Dataset is supposed to be node-wise, i.e. the operation involving the single Dataset is automatically applied to every node of the tree individually, returning a new tree, like this:

In [8]: ds = xr.Dataset({"a": 1})

In [9]: ds
Out[9]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 1

In [10]: dt = DataTree(data=ds)

In [11]: dt * ds
Out[11]: 
DataTree('None', parent=None)
    Dimensions:  ()
    Data variables:
        a        int64 1

However there is a bad bug:

In [12]: ds * dt
Out[12]: 
<xarray.Dataset>
Dimensions:  ()
Data variables:
    a        int64 1

This didn't return a tree, so arithmetic between Datasets and DataTrees currently doesn't respect commutativity :(

Interestingly this does work fine for python scalars

In [27]: dt * 2
Out[27]: 
DataTree('None', parent=None)
    Dimensions:  ()
    Data variables:
        a        int64 2

In [28]: 2 * dt
Out[28]: 
DataTree('None', parent=None)
    Dimensions:  ()
    Data variables:
        a        int64 2

and for numpy arrays

In [30]: dt * np.array(2)
Out[30]: 
DataTree('None', parent=None)
    Dimensions:  ()
    Data variables:
        a        int64 2

In [31]: np.array(2) * dt
Out[31]: 
DataTree('None', parent=None)
    Dimensions:  ()
    Data variables:
        a        int64 2

I do have tests for this arithmetic behaviour but it looks like I didn't think to check commutativity in any of those tests!

I haven't looked into this deeply yet but my hunch is that it's something to do with __mul__/__rmul__ on Dataset competing for priority with __mul__/__rmul__ on DataTree. If that is the case then it might only be possible to fix it upstream in xarray by changing the behaviour of Dataset to defer to DataTree...

@TomNicholas TomNicholas added the bug Something isn't working label Aug 25, 2022
@benjaminwoods
Copy link
Contributor

Alrighty, I had 1h spare to look into this.

I think I can get a fix going for this, at the Datatree end. Here's a quick precis:

  • __mul__ from Dataset will take preference
  • The operation is successful because Dataset.__mul__ is successfully able to operate on the two operands; the Dataset instance and the Datatree instance.
  • The chain of function/method calls is a bit gritty, but it looks like this is fixable at the Datatree end.

My gut instinct is that by targeting this bit of code, it's possible to trigger a NotImplemented error, which should propagate back through the call stack to the invocation of Dataset.__mul__ by the * operator, which should then trigger Datatree.__rmul__.

Quick class diagram:

classDiagram
    Dataset <|-- DatasetArithmetic
    DatasetArithmetic <|-- SupportsArithmetic
    DatasetArithmetic <|-- DatasetOpsMixin
    
    class Dataset {
        _binary_op(Any other, Operator f, bool reflexive=False, Optional[Any] join) -> Dataset:
    }

    class DatasetOpsMixin {
        _binary_op(Any other, Operator f, bool reflexive=False): Result[Operator, [Self, Other]]
        __mul__(Any other, Any other): Result[operator.mul]
    }

    class SupportsArithmetic {
        tuple _HANDLED_TYPES
        __array_ufunc__(Ufunc ufunc, str method, *inputs, **kwargs): Result[Ufunc, ArrayUfuncInputs[inputs, kwargs]]]
    }

    Datatree .. Dataset
Loading

@benjaminwoods
Copy link
Contributor

I'll give it a try now, but I think a simple bit of switch-y stuff with __getattr__ should do the job...

@TomNicholas
Copy link
Member Author

Oh cool, thanks for looking into this @benjaminwoods !

My gut instinct is that by targeting this bit of code, it's possible to trigger a NotImplemented error, which should propagate back through the call stack to the invocation of Dataset.mul by the * operator, which should then trigger Datatree.rmul.

So you're suggesting making DataTree.variable raise NotImplementedError?

at the Datatree end

Worth remembering that if this turns out to be too tricky to fix here we can just punt on it until datatree is integrated upstream in xarray, which is the ultimate goal.

@benjaminwoods
Copy link
Contributor

No worries!

So you're suggesting making DataTree.variable raise NotImplementedError?

Yep, basically. Unfortunately, I don't think it's currently doable, after a closer look. Here's a quick example, with the current call stack:

import xarray as xr
from datatree import DataTree
ds = xr.Dataset({"a": 1})
dt = DataTree(data=ds)
ds * dt

Stack when executing ds * dt:

> DatasetOpsMixin.__mul__(ds, dt)
> ds._binary_op(dt, operator.mul)
> ds._calculate_binary_op(operator.mul, dt)
> utils.is_dict_like(dt)
> isinstance(dt, Dataset)
> ds._calculate_binary_op.<__locals__>._apply_over_both(ds.data_vars, dt, ds.data_vars, dt)

xarray assumes that the Datatree is dictionary like, and therefore smashes ds and dt together as if they were ds and a dictionary. If one tries to remove .__getitem__ and .keys from DataTree, that will break quite a bit of stuff.

I think this needs fixing at the xarray end, really. Some ways to do this:

  1. Modify Dataset._binary_op to allow for a more extensible operator setup. The xarray code is internally considering that Dataset will always be the top level construct for xarray. Not particularly extensible, even from the xarray end.
  2. Tweak the metaclass for Dataset to allow customization of isinstance checks, via __instancecheck__. A very risky game of dice to play though. Artificial inheritance is a slippery slope to just straight-up broken OOP, but if done correctly, this would allow downstream libraries that extend xarray to register virtual subinstances/subclasses. (One could then make DataTree instances pass isinstance(dt, Dataset).)

I'm also not a fan of that second idea as this is essentially building a contract with internal code in an internal function. A true nightmare to attempt to maintain stability there, as an developer working on xarray would be wholly entitled to change code without an FYI, causing problems for DataTree downstream. It would need elevation to a public API for this to be at all reasonable for anyone to use downstream.

@owenlittlejohns
Copy link

Closed in favour of pydata/xarray#9365

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants