This repository was archived by the owner on Oct 24, 2024. It is now read-only.

How to treat name of root node? #81

Closed
@TomNicholas


In #76 I refactored the tree structure to use a path-like syntax. This includes referring to the root of a tree as "/", just as cd / refers to the root of a unix-like filesystem.

This makes accessing nodes and variables of nodes quite neat, because you can reference nodes via absolute or relative paths:

In [23]: from datatree.tests.test_datatree import create_test_datatree

In [24]: dt = create_test_datatree()

In [25]: dt['set2/a']
Out[25]: 
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x

In [26]: dt['/set2/a']
Out[26]: 
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x

In [27]: dt['./set2/a']
Out[27]: 
<xarray.DataArray 'a' (x: 2)>
array([2, 3])
Dimensions without coordinates: x
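
To see why all three spellings agree when the lookup starts at the root, here is a minimal sketch built on the standard library's pathlib.PurePosixPath. The helper parts_from_root is hypothetical and is not claimed to be how datatree resolves paths; it only illustrates that a leading "/" or "./" contributes nothing when you are already at the root.

```python
from pathlib import PurePosixPath


def parts_from_root(path: str) -> tuple[str, ...]:
    """Split a path-like key into node names, as seen from the root node.

    Hypothetical helper for illustration only (not part of datatree).
    """
    p = PurePosixPath(path)
    # PurePosixPath already collapses "./" segments, so the only thing left
    # to strip is the leading "/" anchor of an absolute path.
    return p.parts[1:] if p.is_absolute() else p.parts


for key in ("set2/a", "/set2/a", "./set2/a"):
    print(key, "->", parts_from_root(key))
```

From any node other than the root, the absolute and relative forms would of course differ; the sketch deliberately covers only the root case shown above.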

This refactor also made the name of a DataTree object optional, whereas before a name was required. (Nodes still have a .name attribute, it can just be None now.)

In [28]: dt.name

Normally this doesn't matter, because when assigned a .parent a node's .name property will just point to the key under which it is stored as a child. This echoes the way an unnamed DataArray can be stored in a Dataset.

In [29]: import xarray as xr

In [30]: ds = xr.Dataset()

In [31]: da = xr.DataArray(0)

In [32]: ds['foo'] = da

In [33]: ds['foo'].name
Out[33]: 'foo'
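
The same idea can be sketched with a toy tree node whose name is derived from the key it is stored under in its parent. The Node class below is hypothetical and much simpler than the real DataTree; it only assumes that a parent keeps its children in a dict.

```python
from __future__ import annotations


class Node:
    """Toy tree node (hypothetical, not the datatree class)."""

    def __init__(self, name: str | None = None):
        self._name = name  # only used while the node has no parent
        self.parent: Node | None = None
        self.children: dict[str, Node] = {}

    def __setitem__(self, key: str, child: Node) -> None:
        # Attaching a child records the parent; the key becomes its name.
        child.parent = self
        self.children[key] = child

    @property
    def name(self) -> str | None:
        if self.parent is None:
            return self._name
        # Look up the key under which this node is stored in its parent.
        for key, child in self.parent.children.items():
            if child is self:
                return key
        return self._name


root = Node()
child = Node()
root["foo"] = child
print(child.name)  # the key it was stored under
print(root.name)   # the root was never given a name
```

In this sketch a child always has a name, while the root only has one if it was given explicitly.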

However, this means that the root node of a tree is no longer required to have a name in general.


This is good because

  • As a user you normally don't care about the name of the root when manipulating the tree, only the names of the nodes,

  • It makes the __init__ signature simpler as name is no longer a required arg,

  • It most closely echoes how filepaths work (the filesystem root "/" doesn't have another name),

  • Roundtripping from Zarr/netCDF files still seems to work (see test_io.py),

  • Roundtripping from dictionaries still works if the root node is unnamed

    In [35]: d = {node.path: node.ds for node in dt.subtree}
    
    In [36]: roundtrip = DataTree.from_dict(d)
    
    In [37]: roundtrip
    Out[37]: 
    DataTree('None', parent=None)
    │   Dimensions:  (y: 3, x: 2)
    │   Dimensions without coordinates: y, x
    │   Data variables:
    │       a        (y) int64 6 7 8
    │       set0     (x) int64 9 10
    ├── DataTree('set1')
    │   │   Dimensions:  ()
    │   │   Data variables:
    │   │       a        int64 0
    │   │       b        int64 1
    │   ├── DataTree('set1')
    │   └── DataTree('set2')
    ├── DataTree('set2')
    │   │   Dimensions:  (x: 2)
    │   │   Dimensions without coordinates: x
    │   │   Data variables:
    │   │       a        (x) int64 2 3
    │   │       b        (x) float64 0.1 0.2
    │   └── DataTree('set1')
    └── DataTree('set3')
    
    In [38]: dt.equals(roundtrip)
    Out[38]: True

But it's bad because

  • Roundtripping from dictionaries doesn't work anymore if the root node is named

    In [39]: dt2 = dt
    
    In [40]: dt2.name = "root"
    
    In [41]: d2 = {node.path: node.ds for node in dt2.subtree}
    
    In [42]: roundtrip2 = DataTree.from_dict(d2)
    
    In [43]: roundtrip2
    Out[43]: 
    DataTree('None', parent=None)
    │   Dimensions:  (y: 3, x: 2)
    │   Dimensions without coordinates: y, x
    │   Data variables:
    │       a        (y) int64 6 7 8
    │       set0     (x) int64 9 10
    ├── DataTree('set1')
    │   │   Dimensions:  ()
    │   │   Data variables:
    │   │       a        int64 0
    │   │       b        int64 1
    │   ├── DataTree('set1')
    │   └── DataTree('set2')
    ├── DataTree('set2')
    │   │   Dimensions:  (x: 2)
    │   │   Dimensions without coordinates: x
    │   │   Data variables:
    │   │       a        (x) int64 2 3
    │   │       b        (x) float64 0.1 0.2
    │   └── DataTree('set1')
    └── DataTree('set3')
    
    In [44]: dt2.equals(roundtrip2)
    Out[44]: False

  • The signature of DataTree.from_dict becomes a bit weird, because the only way to name the root node is to pass a separate name argument, i.e.

    In [45]: dt3 = DataTree.from_dict(d, name='root')
    
    In [46]: dt3
    Out[46]: 
    DataTree('root', parent=None)
    ├── DataTree('set1')
    │   │   Dimensions:  ()
    │   │   Data variables:
    │   │       a        int64 0
    │   │       b        int64 1
    │   ├── DataTree('set1')
    │   └── DataTree('set2')
    ├── DataTree('set2')
    │   │   Dimensions:  (x: 2)
    │   │   Dimensions without coordinates: x
    │   │   Data variables:
    │   │       a        (x) int64 2 3
    │   │       b        (x) float64 0.1 0.2
    │   └── DataTree('set1')
    └── DataTree('set3')
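
The root of the asymmetry is that a dictionary keyed by path has nowhere to store the root's name, since the root's key is always "/". A minimal sketch with a toy class (Node, paths and from_paths are hypothetical names, not the datatree API) shows the name being lost unless it is passed separately:

```python
from __future__ import annotations


class Node:
    """Toy stand-in for a tree node (hypothetical, not the datatree class)."""

    def __init__(self, name: str | None = None):
        self.name = name
        self.children: dict[str, Node] = {}

    def paths(self, prefix: str = "/") -> list[str]:
        """Return the path of this node and of every descendant."""
        found = [prefix]
        for key, child in self.children.items():
            found += child.paths(prefix.rstrip("/") + "/" + key)
        return found

    @classmethod
    def from_paths(cls, paths: list[str], name: str | None = None) -> Node:
        """Rebuild a tree from paths; the root's name must be given separately."""
        root = cls(name)
        for path in paths:
            node = root
            for part in path.strip("/").split("/"):
                if part:  # the root path "/" yields no parts
                    node = node.children.setdefault(part, cls(part))
        return root


tree = Node("root")
tree.children["set1"] = Node("set1")
tree.children["set1"].children["set2"] = Node("set2")

paths = tree.paths()
print(paths)                                      # the root appears only as "/"
print(Node.from_paths(paths).name)                # root name is lost
print(Node.from_paths(paths, name="root").name)   # restored via the extra argument
```

Any fix would have to either keep the extra argument or carry the root's name somewhere outside the path keys.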

What do we think about this behaviour? Does this seem like a good design, or annoyingly finicky?

@jhamman I notice that in the code you wrote for the io you put a note about not being able to specify a root group for the tree. Is that related to this question? Do you have any other thoughts on this?
