
Efficient workaround to group by multiple dimensions #2438

Closed
mschrimpf opened this issue Sep 25, 2018 · 3 comments
Grouping by multiple dimensions is not yet supported (#324):

d = DataAssembly([[1, 2, 3], [4, 5, 6]],
                 coords={'a': ('multi_dim', ['a', 'b']), 'c': ('multi_dim', ['c', 'c']), 'b': ['x', 'y', 'z']},
                 dims=['multi_dim', 'b'])
d.groupby(['a', 'b'])  # TypeError: `group` must be an xarray.DataArray or the name of an xarray variable or dimension
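One workaround is to round-trip through pandas, which supports grouping by multiple keys natively. A minimal sketch, using a plain `xr.DataArray` in place of `DataAssembly` (the `'value'` column name is an arbitrary choice):

```python
import xarray as xr

d = xr.DataArray([[1, 2, 3], [4, 5, 6]],
                 coords={'a': ('multi_dim', ['a', 'b']),
                         'c': ('multi_dim', ['c', 'c']),
                         'b': ['x', 'y', 'z']},
                 dims=['multi_dim', 'b'])

# Flatten to a tidy DataFrame (dims become the index, non-dimension coords
# become columns), group by both keys in pandas, then convert back to xarray:
df = d.to_dataframe('value').reset_index()
grouped = df.groupby(['a', 'b'])['value'].mean()
result = grouped.to_xarray()  # DataArray with dims ('a', 'b')
```

For this toy example each (a, b) group has a single element, so `result` simply rearranges the input values onto the `a` × `b` grid.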

An inefficient solution is to run the for loops manually:

import itertools
import numpy as np
import xarray as xr

a_vals, b_vals = np.unique(d['a'].values), np.unique(d['b'].values)
result = xr.DataArray(np.zeros([len(a_vals), len(b_vals)]),
                      coords={'a': a_vals, 'b': b_vals}, dims=['a', 'b'])
# loop variables renamed so they don't shadow the unique-value arrays:
for a, b in itertools.product(a_vals, b_vals):
    cells = d.sel(a=a, b=b)
    result.loc[{'a': a, 'b': b}] = cells.mean()
# result = DataArray (a: 2, b: 3)> array([[1., 2., 3.], [4., 5., 6.]])
#            Coordinates:
#              * a        (a) <U1 'a' 'b'
#              * b        (b) <U1 'x' 'y' 'z'

This is, however, horribly slow for larger arrays.
Is there a more efficient / straightforward workaround?

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.2.0
pip: 10.0.1
conda: None
pytest: 3.7.4
IPython: 6.5.0
sphinx: None

Related: #324, https://stackoverflow.com/questions/52453426/grouping-by-multiple-dimensions

@mschrimpf mschrimpf changed the title Workaround question: grouping by multiple dimensions Efficient workaround to group by multiple dimensions Sep 25, 2018

shoyer (Member) commented Sep 25, 2018

This is however horribly slow for larger arrays.

The existing (single-variable) groupby code essentially runs this same loop internally. We could potentially speed things up by leveraging a tool like numbagg, but nobody has gotten around to that yet.


mschrimpf commented Sep 25, 2018

Thanks @shoyer,
Your comment helped me realize that at least part of the "horribly slow" probably stems from the DataArray having a MultiIndex. The code sample above takes 5-6 seconds for 1000 b values; when stacking the DataArray beforehand with d = d.stack(adim=['a'], bdim=['b']), it takes around 14 seconds.
Both of these are unfortunately very slow compared to indexing in e.g. numpy or pandas.

mschrimpf (Author) commented

I posted a manual solution in the Stack Overflow thread; maybe it helps someone.
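For reference, a fast pure-numpy grouping along these lines (not necessarily identical to the Stack Overflow answer; the helper name and signature here are hypothetical) can factorize both keys with np.unique(..., return_inverse=True) and bin values with np.bincount, avoiding a Python loop over groups:

```python
import numpy as np
import xarray as xr

def groupby2_mean(values, row_keys, col_keys):
    """Mean of a 2-D array over each (row_key, col_key) combination.

    ``row_keys`` labels rows, ``col_keys`` labels columns; duplicate
    labels are averaged together via integer binning.
    """
    ru, ri = np.unique(row_keys, return_inverse=True)  # unique labels + codes
    cu, ci = np.unique(col_keys, return_inverse=True)
    # flat group id for every element of the 2-D array:
    flat = (ri[:, None] * len(cu) + ci[None, :]).ravel()
    n = len(ru) * len(cu)
    sums = np.bincount(flat, weights=values.ravel(), minlength=n)
    counts = np.bincount(flat, minlength=n)
    means = (sums / counts).reshape(len(ru), len(cu))
    return xr.DataArray(means, coords={'a': ru, 'b': cu}, dims=['a', 'b'])

out = groupby2_mean(np.array([[1., 2., 3.], [4., 5., 6.]]),
                    np.array(['a', 'b']), np.array(['x', 'y', 'z']))
```

With the example data from this issue, every (a, b) cell holds one element, so the means reproduce the input values; with duplicate labels the binning averages all matching elements.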
