Per-node DataTree chunking #9634

sjperkins · 2024-10-16T07:20:31Z

Is your feature request related to a problem?

In xarray-ms, which is specific to the radio astronomy domain, we construct a DataTree representing partitions of a legacy data format, where each partition contains regular data cubes. As currently implemented, the custom backend supports a partition_chunks kwarg in the BackendEntrypoint.open_datatree method, so that it is possible to specify a different chunking schema per partition:

https://xarray-ms.readthedocs.io/en/latest/tutorial.html#per-partition-chunking

The chunking specification above is specific to a radio astronomy legacy format, but it may be more generally useful to be able to specify per-DataTree node chunking.

Describe the solution you'd like

Currently, BackendEntrypoint.open_datatree passes its chunks kwarg to each Dataset constructor in the DataTree. This is quite coarse-grained, as it applies the same chunking schema to all Datasets in the DataTree.

I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. DataTree Node). For example:

import xarray

xdt = xarray.open_datatree(..., chunks={
  "/path/to/node1": {"time": 20, "frequency": 16},
  "/path/to/a/node2": {"time": 10, "frequency": 4},
})

Then, when constructing Datasets in the DataTree, the chunking schema appropriate to the node can be applied.
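As a minimal sketch of the lookup this implies, the helper below selects the schema for a single node by exact path match. The name chunks_for_node and its return-None-when-absent behaviour are assumptions made for illustration; nothing like it exists in xarray.

```python
from typing import Mapping, Optional

ChunkSpec = Mapping[str, int]


def chunks_for_node(
    path: str, chunks: Mapping[str, ChunkSpec]
) -> Optional[ChunkSpec]:
    """Return the schema registered for exactly this node path, or None.

    Hypothetical helper, not part of xarray. A trailing slash on the
    requested path is ignored; the root node is spelled "/".
    """
    normalised = path.rstrip("/") or "/"
    return chunks.get(normalised)


example_chunks = {
    "/path/to/node1": {"time": 20, "frequency": 16},
    "/path/to/a/node2": {"time": 10, "frequency": 4},
}

node1_chunks = chunks_for_node("/path/to/node1", example_chunks)
unlisted_chunks = chunks_for_node("/path/to/other", example_chunks)
```

What a node without an entry should fall back to (no chunking, or some default) is left open here, as it is in the proposal.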

An entry in the above dictionary does not necessarily need to only apply to a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit:

xd = xarray.open_datatree(..., chunks={
  # Apply to node1 and any node below
  "/path/to/node1/...": {"time": 20, "frequency": 16}
})
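One way the explicit subtree form could be resolved is sketched below. The "/..." suffix comes from the example above; the precedence rules (an exact path beats a subtree entry, and the deepest matching subtree wins) and the name resolve_chunks are assumptions, not anything xarray implements.

```python
from typing import Mapping, Optional

ChunkSpec = Mapping[str, int]
SUBTREE_SUFFIX = "/..."  # marker from the proposal; hypothetical syntax


def resolve_chunks(
    path: str, chunks: Mapping[str, ChunkSpec]
) -> Optional[ChunkSpec]:
    """Pick the schema for a node, honouring "<root>/..." subtree entries."""
    # An entry naming the node exactly takes precedence.
    if path in chunks:
        return chunks[path]

    best: Optional[ChunkSpec] = None
    best_len = -1
    for key, spec in chunks.items():
        if not key.endswith(SUBTREE_SUFFIX):
            continue
        root = key[: -len(SUBTREE_SUFFIX)]
        # Match the subtree root itself and anything strictly below it.
        # Comparing against root + "/" stops "/a/node1" matching "/a/node10".
        if path == root or path.startswith(root + "/"):
            if len(root) > best_len:
                best, best_len = spec, len(root)
    return best


subtree_chunks = {
    "/path/to/node1/...": {"time": 20, "frequency": 16},
    "/path/to/node1/deep": {"time": 5, "frequency": 2},
}
```

With these rules, "/path/to/node1" and "/path/to/node1/child" both receive the subtree schema, "/path/to/node1/deep" receives its own, and "/path/to/node10" receives nothing.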

Describe alternatives you've considered

We've implemented a custom partition_chunks kwarg in the BackendEntrypoint.open_datatree method for our legacy data format.

Additional context

No response


TomNicholas · 2024-10-16T19:59:35Z

Really cool to see you using xarray for radio astronomy data! I didn't know we had users in that field.

> I propose that the chunks kwarg in BackendEntrypoint.open_datatree support a chunking dictionary per path (i.e. DataTree Node)

Good idea! We would be happy to take a PR if you want to generalize this.

> An entry in the above dictionary does not necessarily need to only apply to a single node. It could also apply the chunking schema to each subtree below the node. But it may be better to make this more explicit

I think we should avoid the temptation to make this overly clever, at least initially, because the chunks kwarg type is already heavily overloaded. Per-node and per-variable chunking would be sufficiently expressive for all use cases. The only other subtlety that the chunk dict validation code would need to watch out for is duplicated coordinates.

shoyer · 2024-10-21T22:05:19Z

Yes, this makes a lot of sense to me. Quite often dimension sizes will differ per node, so it does not make sense to use a single shared set of chunks.
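To make that concrete, here is a small arithmetic sketch using made-up dimension sizes for two nodes. The sizes and the helper name n_chunks are illustrative only.

```python
import math
from typing import Dict, Mapping


def n_chunks(
    dim_sizes: Mapping[str, int], chunks: Mapping[str, int]
) -> Dict[str, int]:
    """Count chunks per dimension; the last chunk may be partial."""
    return {
        dim: math.ceil(dim_sizes[dim] / size)
        for dim, size in chunks.items()
        if dim in dim_sizes
    }


# Made-up sizes: one large node and one small node in the same tree.
node1_sizes = {"time": 1000, "frequency": 64}
node2_sizes = {"time": 30, "frequency": 8}

# A single schema shared by every node, as open_datatree's chunks kwarg implies.
shared = {"time": 20, "frequency": 16}

node1_counts = n_chunks(node1_sizes, shared)  # {"time": 50, "frequency": 4}
node2_counts = n_chunks(node2_sizes, shared)  # {"time": 2, "frequency": 1}
```

The schema that splits the large node into 200 chunks leaves the small node with a frequency chunk size larger than the dimension itself, so a value tuned for one node carries little meaning for the other.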

sjperkins · 2024-10-22T06:27:20Z

Yes, in principle I'd like to submit a PR. Apologies for not replying sooner; I need to devote more time to thinking about the change.

In particular, open_datatree (and open_groups_as_dict) defer to the backend's corresponding implementation:

xarray/xarray/backends/api.py

Lines 859 to 864 in 863184d

    
    if engine is None:
        engine = plugins.guess_engine(filename_or_obj)
    backend = plugins.get_backend(engine)
    return backend.open_datatree(filename_or_obj, **kwargs)

xarray/xarray/backends/api.py

Lines 896 to 901 in 863184d

    
    if engine is None:
        engine = plugins.guess_engine(filename_or_obj)
    backend = plugins.get_backend(engine)
    return backend.open_groups_as_dict(filename_or_obj, **kwargs)

which seems to imply that it's the backend's responsibility to interpret the chunks dictionary and pass it through to the backend's or xarray's open_dataset method. There doesn't immediately seem to be a good way to do this by intercepting chunks before the API calls and dispatching the appropriate chunking strategy/schema to each dataset.

Perhaps the full chunking schema/strategy could be passed to the open_dataset method, along with the tree node path so that open_dataset can make the decision? But that seems ugly.

Neither of the above seems appealing. I'll try to find some more time to think about this.

keewis · 2024-10-22T09:40:52Z

I may have missed something, but I don't think open_datatree supports dask / chunking at all right now: the backend code does not handle or receive chunks, which I believe is by design. open_dataset calls _dataset_from_backend_dataset after the call to backend.open_dataset to do that, so I think open_datatree should do something similar.

The missing _datatree_from_backend_datatree would then also be the natural place for handling the per-group chunk arguments.
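A rough sketch of such a post-processing step follows, assuming the backend has already returned a mapping of group path to dataset (as open_groups_as_dict does) and that each dataset exposes a chunk method that accepts a mapping of dimension name to chunk size, as xarray's Dataset.chunk does. The function name chunk_groups and the decision to raise on unknown paths are assumptions, not xarray's implementation.

```python
from typing import Any, Dict, Mapping


def chunk_groups(
    groups: Mapping[str, Any],
    chunks: Mapping[str, Mapping[str, int]],
) -> Dict[str, Any]:
    """Apply per-group chunks to datasets the backend has already opened.

    Hypothetical helper. Groups without an entry in `chunks` are
    returned untouched; entries naming a non-existent group are rejected
    so that typos in paths do not pass silently.
    """
    unknown = set(chunks) - set(groups)
    if unknown:
        raise ValueError(f"chunks given for unknown groups: {sorted(unknown)}")

    return {
        path: ds.chunk(chunks[path]) if path in chunks else ds
        for path, ds in groups.items()
    }
```

Because the chunking happens after the backend call, backends would not need to receive or interpret the chunks argument at all, which matches how open_dataset is described above.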

sjperkins added the enhancement label Oct 16, 2024

headtr1ck added the topic-DataTree Related to the implementation of a DataTree class label Oct 16, 2024

TomNicholas mentioned this issue Oct 22, 2024

support chunks in open_groups and open_datatree #9660

Merged

2 tasks

sjperkins mentioned this issue Oct 28, 2024

Pin xarray to 2024.9.0 ratt-ru/xarray-ms#42

Merged

3 tasks
