map over multiple subtrees #29
Description
I realised that part of the reason that arithmetic (#24) and ufuncs (#25) don't yet work is that the map_over_subtree decorator currently only maps over a single subtree.
This works fine for mapping unary functions such as .isel, because they only accept one tree-like argument (i.e. self for the .isel method). However, for any binary function such as add(dt1, dt2), pairs of respective nodes in each tree need to be operated on together, as result_ds = add(dt1[node].ds, dt2[node].ds), before the output tree is built up from the results.
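As a rough illustration of the binary case, the sketch below walks two trees in lockstep and builds the output tree from the per-node results. It uses a hypothetical toy Node class (not the real DataTree API), with plain Python values standing in for Datasets, and assumes both trees use the same child names.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Node:
    """Hypothetical stand-in for a tree node; `data` plays the role of node.ds."""
    name: str
    data: Any = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def map_binary(func: Callable[[Any, Any], Any], a: Node, b: Node) -> Node:
    """Apply func to the data of matching nodes in two trees with the same child names."""
    if a.children.keys() != b.children.keys():
        raise ValueError(f"children of {a.name!r} and {b.name!r} do not match")
    # Operate on the pair of nodes at this position, then recurse into the children.
    result = Node(a.name, func(a.data, b.data))
    for child_name in a.children:
        result.children[child_name] = map_binary(
            func, a.children[child_name], b.children[child_name]
        )
    return result


dt1 = Node("foo", 1, {"bar": Node("bar", 2)})
dt2 = Node("foo", 10, {"bar": Node("bar", 20)})
summed = map_binary(lambda x, y: x + y, dt1, dt2)
```

Here summed has the same structure as dt1 and dt2, with each node holding the result of func for that position.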
In the most general case we need to be able to map functions like
def func(*args, **kwargs):
    # do stuff involving multiple Dataset objects
    return output_trees
where any number of the args and kwargs could be DataTrees, and output_trees could be a list of any number of DataTrees.
To implement this the map_over_subtree decorator has to become a lot more general. It needs to
- Identify which of args and kwargs are DataTree objects,
- Check that all of those trees are isomorphic to one another (EDIT: this was implemented in Check isomorphism #31),
- Walk along the nodes of all N trees simultaneously,
- Pass the respective N nodes from that position in each tree to func, as Datasets, without losing their position in *args, **kwargs,
- Use the M output Datasets from func to rebuild M DataTree objects (which all have the same structure as the input trees), and return them.
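The steps above could be sketched roughly as follows. This is not the actual map_over_subtree implementation: Node is a hypothetical toy class standing in for DataTree, map_over_trees and add_and_scale are illustrative names, and the sketch assumes that func signals multiple outputs by returning a tuple with the same length at every node.

```python
from dataclasses import dataclass, field
from functools import wraps
from typing import Any, Dict


@dataclass
class Node:
    """Hypothetical stand-in for DataTree; `data` plays the role of a Dataset."""
    name: str
    data: Any = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def _same_structure(a, b):
    # Strict check: child names must match at every level.
    return a.children.keys() == b.children.keys() and all(
        _same_structure(a.children[k], b.children[k]) for k in a.children
    )


def map_over_trees(func):
    """Sketch of a decorator that maps func over N trees and returns M trees."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        # Step 1: identify which positional and keyword arguments are trees.
        trees = [a for a in args if isinstance(a, Node)]
        trees += [v for v in kwargs.values() if isinstance(v, Node)]
        if not trees:
            return func(*args, **kwargs)
        first = trees[0]
        # Step 2: check that all trees share one structure.
        if not all(_same_structure(first, t) for t in trees[1:]):
            raise ValueError("input trees are not isomorphic")

        def at(arg, path):
            # Follow `path` down a tree argument; pass non-tree arguments through.
            if not isinstance(arg, Node):
                return arg
            for name in path:
                arg = arg.children[name]
            return arg.data

        def build(template, path):
            # Steps 3-4: call func with node data substituted in place,
            # so every argument keeps its position in args/kwargs.
            out = func(*(at(a, path) for a in args),
                       **{k: at(v, path) for k, v in kwargs.items()})
            outs = out if isinstance(out, tuple) else (out,)
            # Step 5: one output node per returned value, named after the first tree.
            nodes = [Node(template.name, o) for o in outs]
            for name, child in template.children.items():
                for node, sub in zip(nodes, build(child, path + (name,))):
                    node.children[name] = sub
            return nodes

        results = build(first, ())
        return results[0] if len(results) == 1 else tuple(results)

    return wrapper


@map_over_trees
def add_and_scale(x, y, factor=1):
    return x + y, (x + y) * factor


dt1 = Node("foo", 1, {"bar": Node("bar", 2)})
dt2 = Node("foo", 10, {"bar": Node("bar", 20)})
total, scaled = add_and_scale(dt1, dt2, factor=3)
```

In this example N = 2 input trees produce M = 2 output trees, and the non-tree keyword argument factor is passed through unchanged at every node.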
We therefore have to decide what we mean by "isomorphic". The strictest definition would be that all node names are the same, so that
dt_1:
DataNode('foo')
| Data A
+---DataNode('bar')
+ Data B
could be mapped alongside
dt_2:
DataNode('foo')
| Data C
+---DataNode('bar')
+ Data D
but not alongside
dt_3:
DataNode('baz')
| Data C
+---DataNode('woz')
+ Data D
A more lenient definition would be that each node must have the same number of children as its counterpart in the other tree, with the children compared in order. (In other words the tree structure must be the same, but the node names need not be. This requires the children to be ordered to avoid ambiguities.) This definition would allow dt_3 to be mapped over alongside dt_1 or dt_2 (or both simultaneously for a func that accepts 3 Dataset arguments).
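To make the two definitions concrete, here is a minimal sketch of both checks in one function. Node is a hypothetical toy class with ordered children (not the real DataTree API), and the require_names_equal flag is an illustrative name for switching between the strict and lenient definitions. Note that in this sketch the strict mode also requires children to appear in the same order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical ordered tree node; not the real DataTree API."""
    name: str
    children: List["Node"] = field(default_factory=list)


def isomorphic(a: Node, b: Node, require_names_equal: bool = False) -> bool:
    """Compare two trees position by position, with children taken in order."""
    if require_names_equal and a.name != b.name:
        return False
    # Lenient definition: only the number of children at each position must match.
    if len(a.children) != len(b.children):
        return False
    return all(
        isomorphic(x, y, require_names_equal)
        for x, y in zip(a.children, b.children)
    )


dt_1 = Node("foo", [Node("bar")])
dt_2 = Node("foo", [Node("bar")])
dt_3 = Node("baz", [Node("woz")])

strict_1_2 = isomorphic(dt_1, dt_2, require_names_equal=True)
strict_1_3 = isomorphic(dt_1, dt_3, require_names_equal=True)
lenient_1_3 = isomorphic(dt_1, dt_3)
```

Under the strict definition only dt_1 and dt_2 match; under the lenient one dt_3 matches both.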