Allow DataArray to hold cell boundaries as coordinate variables #1475

JiaweiZhuang · 2017-07-11T20:58:44Z

Cell boundaries can be either N+1 sized arrays as suggested by MITgcm/xmitgcm#15, or (N,2) sized arrays as suggested by the CF convention. However, a DataArray cannot hold either kind of coordinate variable because both introduce a new dimension.
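For readers unfamiliar with the two layouts, here is a minimal NumPy sketch of how they relate in 1D. The helper names `edges_to_bounds` and `bounds_to_edges` are made up for illustration and are not part of xarray or any other library; the reverse conversion assumes the cells are contiguous.

```python
import numpy as np


def edges_to_bounds(edges):
    """Convert N+1 contiguous cell edges to CF-style (N, 2) bounds."""
    edges = np.asarray(edges)
    return np.stack([edges[:-1], edges[1:]], axis=1)


def bounds_to_edges(bounds):
    """Collapse (N, 2) bounds to N+1 edges; only valid for contiguous cells."""
    bounds = np.asarray(bounds)
    # Each cell's upper bound must equal the next cell's lower bound.
    if not np.array_equal(bounds[1:, 0], bounds[:-1, 1]):
        raise ValueError("cells are not contiguous; cannot collapse to N+1 edges")
    return np.concatenate([bounds[:, 0], bounds[-1:, 1]])


lat_edges = np.array([-90.0, -30.0, 30.0, 90.0])  # N+1 = 4 edges
lat_bounds = edges_to_bounds(lat_edges)           # shape (3, 2)
round_trip = bounds_to_edges(lat_bounds)          # back to 4 edges
```

Either layout adds a dimension (`N+1` edges, or a trailing vertex axis of size 2) that the data variable itself does not have.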

If you try to assign a new coordinate to a DataArray by dr.assign_coords(), you will get
ValueError: cannot add coordinates with new dimensions to a DataArray

On the other hand, if your Dataset contains cell boundary variables (for example, #667), the bounds will be dropped when you extract a single variable into a DataArray.
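A small reproduction of both behaviours, as a sketch. The variable and dimension names (`temp`, `lat_b`, `lat_bnds`) are invented for this example, and the exact wording of the error message may differ between xarray versions, so the code only records it rather than matching on it.

```python
import numpy as np
import xarray as xr

edges = np.array([-90.0, -30.0, 30.0, 90.0])

ds = xr.Dataset(
    data_vars={"temp": ("lat", np.zeros(3))},
    coords={
        "lat": [-60.0, 0.0, 60.0],
        # Non-index coordinate living on its own N+1 sized dimension.
        "lat_b": ("lat_bnds", edges),
    },
)

# Extracting one variable keeps only coordinates whose dimensions are a
# subset of the variable's dimensions, so the bounds are left behind.
temp = ds["temp"]
bounds_survived = "lat_b" in temp.coords

# Attaching the bounds afterwards fails because "lat_bnds" is not a
# dimension of the DataArray.
try:
    temp.assign_coords(lat_b=("lat_bnds", edges))
    error_message = None
except ValueError as err:
    error_message = str(err)
```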

Having cell bounds available in a DataArray is important for a couple of applications:

- Pass cell bounds to DataArray's plotting methods (N + 1 sized grids for X and Y MITgcm/xmitgcm#15). I am aware of the discussion about inferring boundaries (Don't infer x/y coordinates interval breaks for cartopy plot axes #781). However, for the Cube-Sphere grid or the Lat-Lon-Cap grid (reference), which have tiles covering the poles, I have to explicitly pass cell bounds to the original plt.pcolormesh() to get a good-looking plot. (see this comment for details)
- For conservative (i.e. area-weighted) regridding (mentioned in API for multi-dimensional resampling/regridding #486). Cell centers are enough for bilinear interpolation or other simple resampling, but for any finite-volume mesh, knowing the boundaries is crucial if you want to conserve the total amount of mass or flux.
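To illustrate why the boundaries matter for conservation, here is a minimal 1D sketch of area-weighted regridding written with plain NumPy. The function name `conservative_regrid_1d` is hypothetical, and the sketch assumes sorted edges, per-cell mean values (densities), and a destination grid fully covered by the source grid. It is not the algorithm used by any particular regridding package.

```python
import numpy as np


def conservative_regrid_1d(values, src_edges, dst_edges):
    """Overlap-weighted average of source cell means onto destination cells."""
    values = np.asarray(values, dtype=float)
    src_edges = np.asarray(src_edges, dtype=float)
    dst_edges = np.asarray(dst_edges, dtype=float)
    # overlap[i, j] = length of the intersection of dst cell i and src cell j
    lower = np.maximum(dst_edges[:-1, None], src_edges[None, :-1])
    upper = np.minimum(dst_edges[1:, None], src_edges[None, 1:])
    overlap = np.clip(upper - lower, 0.0, None)
    # Integrate over each destination cell, then divide by its width.
    return overlap @ values / np.diff(dst_edges)


src_edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
src_values = np.array([1.0, 2.0, 3.0, 4.0])
dst_edges = np.array([0.0, 1.5, 4.0])

dst_values = conservative_regrid_1d(src_values, src_edges, dst_edges)

# The integral (sum of value * cell width) is the same on both grids.
src_total = np.sum(src_values * np.diff(src_edges))
dst_total = np.sum(dst_values * np.diff(dst_edges))
```

The overlap matrix cannot be built from cell centers alone, which is the point made above.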

Plotting or regridding will work fine if you pass cell bounds as an additional argument to a wrapper function. However, having a single DataArray object containing boundary information seems like a more elegant solution. Is it possible to let DataArray accept N+1 sized coordinate variables, and be able to inherit them from the parent Dataset? If that's too drastic, is it possible to write an accessor to extend DataArray's capability? Say, a "bound" accessor for a new attribute ds.bnd['lat_b'], which can be kept when a DataArray gets extracted (ds['data_var'].bnd['lat_b'])? Does this make sense?

fmaussion · 2017-07-11T21:38:01Z

See also #1079 and #1079 (comment)

JiaweiZhuang · 2017-07-11T23:58:20Z

> See also #1079 and #1079 (comment)

Thanks! The idea of NDIntervalIndex mentioned at pandas-dev/pandas#7640 comment seems powerful but too complicated to implement? Could there be a simpler way to hook the boundary attribute to DataArray?

shoyer · 2017-07-12T17:44:28Z

I don't think we need a full NDIntervalIndex unless we also want indexing, which is nice but not essential for just storing data. We do need a way to represent interval data in 1D arrays, though.

Probably the simplest option is to use structured dtypes, which should already work with the existing version of xarray, e.g.,

import numpy as np
import xarray

interval_dtype = np.dtype([('start', float), ('stop', float)])
coords = {'x': 0.5 + np.arange(3), 'x_bounds': ('x', np.array([(0, 1), (1, 2), (2, 3)], dtype=interval_dtype))}
da = xarray.DataArray(range(3), coords=coords, dims='x')

>>> da
<xarray.DataArray (x: 3)>
array([0, 1, 2])
Coordinates:
  * x         (x) float64 0.5 1.5 2.5
    x_bounds  (x) [('start', '<f8'), ('stop', '<f8')] (0.0, 1.0) (1.0, 2.0) ...

>>> da.x_bounds
<xarray.DataArray 'x_bounds' (x: 3)>
array([(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)], 
      dtype=[('start', '<f8'), ('stop', '<f8')])
Coordinates:
  * x         (x) float64 0.5 1.5 2.5
    x_bounds  (x) [('start', '<f8'), ('stop', '<f8')] (0.0, 1.0) (1.0, 2.0) ...

>>> da.x_bounds.data['start'], da.x_bounds.data['stop']
(array([ 0.,  1.,  2.]), array([ 1.,  2.,  3.]))

We could probably do a few things to make these easier to use:

- Support indexing like da.x_bounds['start'] to return da.x_bounds.data['start'] wrapped in an xarray.DataArray.
- Automatically create them as part of netCDF IO.

Conceptually, this is pretty similar to a MultiIndex (see #1426 for discussion).
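The first suggestion could look roughly like the sketch below. `bounds_field` is a hypothetical helper written for this illustration, not an existing xarray function. It simply pulls one field out of the structured array and re-wraps it with the same dimension coordinates.

```python
import numpy as np
import xarray as xr

interval_dtype = np.dtype([("start", float), ("stop", float)])
da = xr.DataArray(
    np.arange(3),
    dims="x",
    coords={
        "x": 0.5 + np.arange(3),
        "x_bounds": ("x", np.array([(0, 1), (1, 2), (2, 3)], dtype=interval_dtype)),
    },
)


def bounds_field(bounds, field):
    """Hypothetical helper: one field of a structured coordinate as a DataArray."""
    return xr.DataArray(
        bounds.data[field],
        dims=bounds.dims,
        # Carry over only the dimension coordinates (here just "x").
        coords={d: bounds.coords[d].values for d in bounds.dims if d in bounds.coords},
        name=f"{bounds.name}_{field}",
    )


start = bounds_field(da["x_bounds"], "start")
stop = bounds_field(da["x_bounds"], "stop")
widths = stop - start  # cell widths, aligned on "x"
```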

JiaweiZhuang · 2017-07-12T18:50:02Z

> Probably the simplest option is to use structured dtypes, which should already work with the existing version of xarray, e.g.,

Thanks, that's a nice trick! Supporting da.x_bounds['start'] will definitely be helpful!

However, I am still concerned about 2D boundaries. Using the structured data type, 2D bounds will be an array of size (Nx,Ny,4) instead of (Nx+1,Ny+1). Although this matches the CF convention, it takes 4x memory and needs to be converted back to (Nx+1,Ny+1) for pcolormesh(). Not a big problem though. I will be happy to go this way if (Nx+1,Ny+1)-sized bounds cannot be implemented.
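The conversion between the two 2D layouts can be sketched with NumPy as follows. The function names and the vertex ordering are assumptions made for this example (CF asks for a consistent ordering of vertices but the ordering chosen here is only illustrative), and the reverse conversion assumes a contiguous quadrilateral mesh produced by the forward one.

```python
import numpy as np


def corners_to_cf_bounds(corners):
    """Expand an (Nx+1, Ny+1) corner array into (Nx, Ny, 4) cell vertices."""
    corners = np.asarray(corners)
    return np.stack(
        [corners[:-1, :-1], corners[:-1, 1:], corners[1:, 1:], corners[1:, :-1]],
        axis=-1,
    )


def cf_bounds_to_corners(bounds):
    """Collapse (Nx, Ny, 4) vertices back to (Nx+1, Ny+1) corners.

    Assumes the vertex ordering used by corners_to_cf_bounds and that
    neighbouring cells share their corners.
    """
    bounds = np.asarray(bounds)
    nx, ny, _ = bounds.shape
    corners = np.empty((nx + 1, ny + 1), dtype=bounds.dtype)
    corners[:-1, :-1] = bounds[..., 0]    # first vertex of every cell
    corners[:-1, -1] = bounds[:, -1, 1]   # last column of corners
    corners[-1, 1:] = bounds[-1, :, 2]    # last row of corners
    corners[-1, 0] = bounds[-1, 0, 3]     # remaining corner
    return corners


corners = np.arange(12.0).reshape(3, 4)   # (Nx+1, Ny+1) = (3, 4)
bounds = corners_to_cf_bounds(corners)    # (Nx, Ny, 4) = (2, 3, 4)
restored = cf_bounds_to_corners(bounds)
```

For large grids the vertex layout stores 4·Nx·Ny values against (Nx+1)·(Ny+1), which is where the roughly 4x memory cost comes from.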

rabernat · 2017-07-12T21:12:09Z

These are precisely the sort of issues we are trying to solve with xgcm. I am about to make a big new release. Using the xgcm concept of an Axis object (not yet in the online docs until the new release), it should be pretty easy to add this sort of plotting support in an arbitrary number of dimensions.

rabernat · 2018-08-30T02:49:12Z

cc @adcroft, who expressed interest in this topic.

rabernat · 2019-01-24T15:50:24Z

I'm just pinging this issue again to keep it fresh.

I am becoming more and more convinced that we need to allow for cell bounds in xarray's data model. Contrary to my comments above, I no longer think this is a problem to be solved with xgcm or some outside package.

CF conventions, which we partially support in other parts of xarray, have a clearly defined concept of cell geometry. When present, such coordinates could be decoded and used for indexing and plotting.

Currently we distinguish between "dimension coordinates," which are converted to indexes, and "non-dimension coordinates." What if we added a new type of coordinate called "cell coordinates"? We could accommodate either (N+1) sized coordinates for quad-mesh geometries or (N,M) sized coordinates for unstructured meshes.

What is a concrete first step we could take towards this goal? Try to work out a design document?

shoyer · 2019-01-26T23:14:18Z

> Currently we distinguish between "dimension coordinates," which are converted to indexes, and "non-dimension coordinates."

The long term plan in #1603 ("Explicit indexes") is to eliminate this distinction -- we'll simply have variables, which can be in the form of data variables or coordinates, and indexes, for look-up along any coordinate.

> What if we added a new type of coordinate called "cell coordinates"? We could accommodate either (N+1) sized coordinates for quad-mesh geometries or (N,M) sized coordinates for unstructured meshes.

I understand (N+1) sized coordinates for quad-mesh geometries, where N is the number of physical dimensions.

I'm not sure I understand (N,M) sized coordinates for unstructured meshes -- what is M here? The total number of cells? Some arbitrary constant indicating the maximum number of sides for a single cell?

I do.

Logically I see two approaches here:

1. Putting cell bounds into structured dtypes, and adding sugar to make these easier to use (as discussed in Allow DataArray to hold cell boundaries as coordinate variables #1475 (comment)).
2. Putting cell bounds directly into xarray's data model in some form, so we can deviate from our current rule that "coordinates dimensions must be a subset of DataArray dimensions."

(1) feels like the safe approach (from xarray's perspective). Maybe structured dtypes are too annoying to use on a routine basis, but there are also other use cases for them that would benefit from some attention. I worry that solutions in the style of (2) would bake domain-specific logic deep into xarray's data model and make the whole library more complex, though I do appreciate that cell bounds are a pretty ubiquitous concept for modeling physical phenomena.

One way of solving (2) would be to allow something like "isolated" or "non-aligned" dimensions, which aren't shared across a Dataset/DataArray and are allowed to deviate on a per-variable basis. Dataset.dims would be a dynamic (rather than computed) part of xarray's data model, and dimensions not found in dims would not be required to be aligned/consistent between variables. This is intriguing but is also a much bigger change:

- By default (i.e., dims=None), dims would get filled in from all the variables in a Dataset. But the aligned dimensions in dims could also be set explicitly.
- If a dimension isn't found in dims, you can't index or align along it and it's allowed to vary between variables.
- DataArray objects would also need some way to distinguish between "aligned" and "non-aligned" dimensions. It's less clear what this would be.
- Only aligned dimensions on coordinates of a DataArray are required to be found on the DataArray variable.

rabernat · 2019-01-27T10:49:07Z

> I'm not sure I understand (N,M) sized coordinates for unstructured meshes -- what is M here? The total number of cells? Some arbitrary constant indicating the maximum number of sides for a single cell?

N is the number of cells. M is the number of points required to specify the cell vertices, e.g. 4 for 2D quadmesh, 3 for 2D trimesh, 8 for 3D quadmesh, etc.
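As a concrete picture of the (N, M) layout, here is a tiny sketch for a 2D triangular mesh (N = 2 cells, M = 3 vertices each). The array names and the `polygon_areas` helper are made up for this example; the area is computed with the standard shoelace formula.

```python
import numpy as np

# Two triangular cells tiling the unit square: one row per cell,
# one column per vertex.
x_vertices = np.array([[0.0, 1.0, 1.0],
                       [0.0, 1.0, 0.0]])
y_vertices = np.array([[0.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])


def polygon_areas(xv, yv):
    """Shoelace formula applied row-wise to (N, M) vertex coordinate arrays."""
    x_next = np.roll(xv, -1, axis=1)
    y_next = np.roll(yv, -1, axis=1)
    return 0.5 * np.abs(np.sum(xv * y_next - x_next * yv, axis=1))


areas = polygon_areas(x_vertices, y_vertices)  # one area per cell
```

The same function works unchanged for M = 4 (2D quadmesh), since only the number of columns changes.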

Regarding your options 1 or 2, I guess I'm agnostic as to how it is implemented. I recognize 2 introduces lots of complications. What matters is how it will interact with the indexes, i.e. can we easily select data based on cell bounds?

I will have to take some time to think about what you wrote, as it is hard for my brain... 🙃

shoyer · 2019-01-27T20:30:49Z

> What matters is how it will interact with the indexes, i.e. can we easily select data based on cell bounds?

Either way, we will need to write our own index classes for this (but this is totally doable). This will either be something xarray specific or possibly based on pandas.Index.

pandas.IntervalIndex is similar, but is much more complex because it handles overlapping cells. We would prefer a CellIndex that does not allow for overlap.
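When cells cannot overlap, the lookup itself reduces to a binary search over the N+1 sorted edges. The sketch below shows that core step only. `find_cell` is a hypothetical function name, the intervals are assumed to be closed on the left, and nothing here is an actual xarray or pandas index class.

```python
import numpy as np


def find_cell(edges, points):
    """Index of the cell [edges[i], edges[i+1]) containing each point, or -1."""
    edges = np.asarray(edges)
    points = np.asarray(points)
    # side="right" places a point equal to an edge into the cell that starts there.
    idx = np.searchsorted(edges, points, side="right") - 1
    outside = (points < edges[0]) | (points >= edges[-1])
    return np.where(outside, -1, idx)


edges = np.array([0.0, 1.0, 2.0, 3.0])
cells = find_cell(edges, [0.5, 1.0, 2.999, 3.0, -0.1])
# cells -> [0, 1, 2, -1, -1]
```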

lukelbd · 2022-07-20T22:46:36Z

Not sure where this stands but another advantage might be the ability to call xr.open_dataarray on netcdf files containing individual variables plus coordinate bounds (data from CMIP5/6 are commonly stored this way).

rogvidarge · 2022-11-26T22:18:59Z

Has there been any progress on this?

SimonHeybrock · 2023-01-04T07:20:36Z

Recently I experimented with an (incomplete) duck-array prototype, wrapping an array of length N+1 in a duck array of length N (such that you can use it as a coordinate for a DataArray of length/shape N). It mostly worked (even though there may be some issues when you want to use it as an xarray index).

See https://github.com/scipp/scippx/blob/main/src/scippx/bin_edge_array.py (there is a bunch of unrelated stuff in the repo, you can mostly ignore that).
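To make the idea concrete for readers who do not want to open the repository, here is a toy sketch of "N+1 edges presented as length N". It is not the scippx implementation and is not a real duck array (it implements none of the NumPy array protocols); the class name `BinEdges` and its behaviour are assumptions for illustration only.

```python
import numpy as np


class BinEdges:
    """Toy wrapper: stores N+1 edges but reports length N (one entry per cell)."""

    def __init__(self, edges):
        self.edges = np.asarray(edges)
        if self.edges.ndim != 1 or self.edges.size < 2:
            raise ValueError("need a 1D array with at least two edges")

    @property
    def shape(self):
        return (self.edges.size - 1,)

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self))
            if step != 1 or stop <= start:
                raise IndexError("only non-empty contiguous slices are supported")
            # Selecting cells start..stop-1 needs edges start..stop.
            return BinEdges(self.edges[start:stop + 1])
        i = range(len(self))[key]  # normalizes negative indices, checks range
        return (self.edges[i], self.edges[i + 1])


b = BinEdges([0.0, 1.0, 2.0, 3.0])
sub = b[1:3]   # two cells, three edges
last = b[-1]   # (lower, upper) of the last cell
```

Slicing cells while keeping the extra trailing edge is the part that ordinary coordinate arrays cannot express.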

benbovy · 2023-08-24T13:21:05Z

xref a possible solution explained here: #8005 (comment)

Basically, it is very similar to @shoyer's #1475 (comment). In addition, the bounds coordinate would be indexed (and would share its Xarray index with the point-value coordinate).

The bounds coordinate would wrap a pd.IntervalArray (or pd.IntervalIndex?), but we could also have our own, simpler implementation (no bounds overlap) as suggested in #1475 (comment).

tomvothecoder · 2025-03-12T00:35:28Z

Hi, my team and I on the xCDAT project explored storing bounds on DataArray objects in 2021. We tried DataArray accessors, but found them unreliable due to state loss when creating new objects (e.g., copying) (#3268). Instead, we designed our APIs around Dataset accessors, since Dataset objects can store bounds as data variables.

We're revisiting this in PR #737, looking for ways to support bounds in DataArrays without modifying the Xarray API. This remains a complex issue with no clear solution. We'd love to collaborate with the Xarray community to explore potential approaches within Xarray (or a simple-ish way to do this in xCDAT).

Just reviving this conversation for anybody who is interested.

CC: @pochedls

benbovy · 2025-03-12T08:42:43Z

@tomvothecoder you might want to have a look at #8005. In summary, we discuss two possible approaches for dealing with bounds in DataArray:

1. decode the (CF) bounds coordinate into a 1-dimensional coordinate wrapping a pandas.IntervalIndex or pandas.IntervalArray (enabled by Support extension array indexes #9671) such that it can be propagated with the current DataArray model
2. update DataArray coordinate propagation rules to include all indexed coordinates associated with DataArray's dimensions (e.g., if coordinates time(time) and time_bnds(time,nv) are both associated with an Xarray interval index, they will be propagated with the index even though nv is not a dimension of the DataArray).

The second approach is more general and in principle could support >1 dimension cases like lat-lon 4-sided cells.
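For the first approach, the decoding step on the pandas side might look like the sketch below. It only shows how an (N, 2) bounds array maps onto a `pandas.IntervalIndex`. The array name `time_bnds`, the numeric values, and the choice of left-closed intervals are assumptions for this example, and how xarray would wrap the result is not shown.

```python
import numpy as np
import pandas as pd

# CF-style (N, 2) bounds for three contiguous cells.
time_bnds = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]])

intervals = pd.IntervalIndex.from_arrays(
    time_bnds[:, 0], time_bnds[:, 1], closed="left"
)

centers = intervals.mid                           # point-value coordinate
cell = intervals.get_loc(1.5)                     # position of the cell containing 1.5
cells = intervals.get_indexer([0.25, 2.0, 5.0])   # -1 marks "no cell"
```

Because the index is one-dimensional along `time`, it fits the existing DataArray model without an extra `nv` dimension.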

benbovy · 2025-03-12T12:21:55Z

#10116 is implementing the second approach.

tomvothecoder · 2025-03-17T16:46:34Z

Hi @benbovy, thank you for the updates. I am happy to see you and others are addressing the enhancement to include bounds in DataArrays! I'll take a look at both approaches. Let me know if you need help testing.

JiaweiZhuang mentioned this issue Nov 9, 2017: Regridding API design JiaweiZhuang/xESMF#9 (Closed)
JiaweiZhuang mentioned this issue Jan 19, 2018: integration with xgcm.Grid JiaweiZhuang/xESMF#13 (Open)
NicWayand mentioned this issue Feb 26, 2018: Conservative regridding of DataArray, N+1 dim issue JiaweiZhuang/xESMF#14 (Closed)
JiaweiZhuang mentioned this issue Mar 16, 2018: Using ESMF unstructured grids JiaweiZhuang/xESMF#18 (Open)
aidanheerdegen mentioned this issue Aug 7, 2018: Time bounds returned after an operation with resample-method #2231 (Open)
JiaweiZhuang mentioned this issue Aug 18, 2018: let's enumerate all the ways to represent a "grid" in python pangeo-data/pangeo#356 (Closed)
DWesl mentioned this issue Mar 26, 2019: Read grid mapping and bounds as coords #2844 (Merged, 3 tasks)
rabernat mentioned this issue Mar 26, 2019: Cannot store data after group_by #2847 (Open)
rabernat mentioned this issue Apr 11, 2019: Fit bounding box to coarser resolution #2793 (Open)
TomNicholas mentioned this issue Apr 11, 2019: Option to retain boundary/guard cells boutproject/xBOUT#19 (Closed)
spencerkclark mentioned this issue May 17, 2019: Don't set encoding attributes on bounds variables. #2965 (Merged, 3 tasks)
dcherian mentioned this issue Oct 28, 2019: Move general functionality upstream NCAR/esmlab#157 (Open)
spencerkclark mentioned this issue Dec 22, 2019: interp with long cftime coordinates raises an error #3641 (Closed)
rabernat mentioned this issue Jan 12, 2020: Decode CF bounds to coords #3689 (Closed)
dcherian mentioned this issue Jun 12, 2020: Cell Boundary aware operations xarray-contrib/cf-xarray#10 (Open)
dschwoerer mentioned this issue Sep 14, 2020: Multi-mesh support #4420 (Closed)
dcherian mentioned this issue Mar 3, 2021: Flexible indexes refactoring notes #4979 (Merged)
tomvothecoder mentioned this issue Aug 6, 2021: Add wrappers for opening datasets and data variables xCDAT/xcdat#81 (Merged, 9 tasks)
tomvothecoder mentioned this issue Aug 31, 2021: DataArray accessor bounds attributes don't persist for all xarray functions that return new DataArrays xCDAT/xcdat#99 (Closed)
ethanrd mentioned this issue Feb 28, 2023: contiguous time axis #7525 (Closed)
dcherian mentioned this issue Jul 19, 2023: Design for IntervalIndex #8005 (Open)
pochedls mentioned this issue Feb 12, 2025: Prototype of bounded dataarray functionality xCDAT/xcdat#737 (Open, 9 tasks)
benbovy mentioned this issue Mar 12, 2025: DataArray: propagate index coordinates with non-array dimensions #10116 (Closed, 6 tasks)
