
Efficient workaround to group by multiple dimensions #2438

Closed
mschrimpf opened this issue Sep 25, 2018 · 3 comments
Grouping by multiple dimensions is not yet supported (#324):

d = DataAssembly([[1, 2, 3], [4, 5, 6]],
                 coords={'a': ('multi_dim', ['a', 'b']), 'c': ('multi_dim', ['c', 'c']), 'b': ['x', 'y', 'z']},
                 dims=['multi_dim', 'b'])
d.groupby(['a', 'b'])  # TypeError: `group` must be an xarray.DataArray or the name of an xarray variable or dimension
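One workaround is to round-trip through pandas, which supports grouping by multiple keys natively. A minimal sketch, using a plain `xr.DataArray` in place of `DataAssembly` (the `'value'` column name is an arbitrary choice):

```python
import xarray as xr

d = xr.DataArray([[1, 2, 3], [4, 5, 6]],
                 coords={'a': ('multi_dim', ['a', 'b']),
                         'c': ('multi_dim', ['c', 'c']),
                         'b': ['x', 'y', 'z']},
                 dims=['multi_dim', 'b'])

# Flatten to a tidy DataFrame (dims become the index, non-dimension coords
# become columns), group by both keys in pandas, then convert back to xarray:
df = d.to_dataframe('value').reset_index()
grouped = df.groupby(['a', 'b'])['value'].mean()
result = grouped.to_xarray()  # DataArray with dims ('a', 'b')
```

For this toy example each (a, b) group has a single element, so `result` simply rearranges the input values onto the `a` × `b` grid.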

An inefficient solution is to run the for loops manually:

import itertools
import numpy as np
import xarray as xr

a_vals, b_vals = np.unique(d['a'].values), np.unique(d['b'].values)
result = xr.DataArray(np.zeros([len(a_vals), len(b_vals)]),
                      coords={'a': a_vals, 'b': b_vals}, dims=['a', 'b'])
# loop variables renamed so they don't shadow the unique-value arrays:
for a, b in itertools.product(a_vals, b_vals):
    cells = d.sel(a=a, b=b)
    result.loc[{'a': a, 'b': b}] = cells.mean()
# result = DataArray (a: 2, b: 3)> array([[1., 2., 3.], [4., 5., 6.]])
#            Coordinates:
#              * a        (a) <U1 'a' 'b'
#              * b        (b) <U1 'x' 'y' 'z'

This is, however, horribly slow for larger arrays.
Is there a more efficient / straightforward workaround?

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-17134-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

xarray: 0.10.8
pandas: 0.23.4
numpy: 1.15.1
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: None
distributed: None
matplotlib: 2.2.3
cartopy: None
seaborn: None
setuptools: 40.2.0
pip: 10.0.1
conda: None
pytest: 3.7.4
IPython: 6.5.0
sphinx: None

Related: #324, https://stackoverflow.com/questions/52453426/grouping-by-multiple-dimensions

@mschrimpf mschrimpf changed the title Workaround question: grouping by multiple dimensions Efficient workaround to group by multiple dimensions Sep 25, 2018

shoyer (Member) commented Sep 25, 2018

This is however horribly slow for larger arrays.

The existing (single-variable) groupby code essentially runs this same loop internally. We could potentially speed things up by leveraging a tool like numbagg, but nobody has gotten around to that yet.


mschrimpf commented Sep 25, 2018

Thanks @shoyer,
Your comment helped me realize that at least part of the "horribly slow" probably stems from the DataArray having a MultiIndex. The code sample above takes 5-6 seconds for 1000 b values; when stacking the DataArray beforehand with d = d.stack(adim=['a'], bdim=['b']), it takes around 14 seconds.
Both of these are unfortunately very slow compared to indexing in e.g. numpy or pandas.

mschrimpf (Author) commented

I posted a manual solution in the Stack Overflow thread; maybe it helps someone.
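For reference, a fast pure-numpy grouping along these lines (not necessarily identical to the Stack Overflow answer; the helper name and signature here are hypothetical) can factorize both keys with np.unique(..., return_inverse=True) and bin values with np.bincount, avoiding a Python loop over groups:

```python
import numpy as np
import xarray as xr

def groupby2_mean(values, row_keys, col_keys):
    """Mean of a 2-D array over each (row_key, col_key) combination.

    ``row_keys`` labels rows, ``col_keys`` labels columns; duplicate
    labels are averaged together via integer binning.
    """
    ru, ri = np.unique(row_keys, return_inverse=True)  # unique labels + codes
    cu, ci = np.unique(col_keys, return_inverse=True)
    # flat group id for every element of the 2-D array:
    flat = (ri[:, None] * len(cu) + ci[None, :]).ravel()
    n = len(ru) * len(cu)
    sums = np.bincount(flat, weights=values.ravel(), minlength=n)
    counts = np.bincount(flat, minlength=n)
    means = (sums / counts).reshape(len(ru), len(cu))
    return xr.DataArray(means, coords={'a': ru, 'b': cu}, dims=['a', 'b'])

out = groupby2_mean(np.array([[1., 2., 3.], [4., 5., 6.]]),
                    np.array(['a', 'b']), np.array(['x', 'y', 'z']))
```

With the example data from this issue, every (a, b) cell holds one element, so the means reproduce the input values; with duplicate labels the binning averages all matching elements.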
