Inconsistent behavior in groupby depending on the dimension order #5361

fujiisoup · 2021-05-21T23:11:37Z

groupby behaves inconsistently depending on the dimension order of a DataArray.
In some cases it even returns a corrupted object.

In [4]: data = xr.DataArray(
   ...:     np.random.randn(4, 2),
   ...:     dims=['x', 'z'],
   ...:     coords={'x': ['a', 'b', 'a', 'c'], 'y': ('x', [0, 1, 0, 2])}
   ...: )
   ...: 
   ...: data.groupby('x').mean()
Out[4]: 
<xarray.DataArray (x: 3, z: 2)>
array([[ 0.95447186, -1.14467028],
       [ 0.76294958,  0.3751244 ],
       [-0.41030223, -1.35344548]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
Dimensions without coordinates: z

Here groupby works fine (although it drops the non-dimensional coordinate y; related to #3745).

However, groupby does not give a correct result when the grouped dimension is the second one:

In [5]: data.T.groupby('x').mean()  # <--- change the dimension order, and do the same thing
Out[5]: 
<xarray.DataArray (z: 2, x: 3)>
array([[ 0.95447186,  0.76294958, -0.41030223],
       [-1.14467028,  0.3751244 , -1.35344548]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
    y        (x) int64 0 1 0 2  # <-- the size must be 3!!
Dimensions without coordinates: z

This bug was discussed in #2944 and reported as fixed, but it still occurs.
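
For reference, the expected numbers can be stated without xarray: grouping along x should give the same values whichever axis x occupies, only transposed. Below is a minimal NumPy sketch of that expectation. It is not xarray's implementation, and the helper name group_mean is made up for illustration.

```python
import numpy as np


def group_mean(values, labels, axis):
    """Mean of `values` over groups of equal `labels` along `axis`.

    Hypothetical helper for illustration only; not part of xarray.
    """
    labels = np.asarray(labels)
    # `inverse[i]` is the index of labels[i] within the sorted unique labels.
    uniq, inverse = np.unique(labels, return_inverse=True)
    # Bring the grouped axis to the front so the logic ignores dimension order.
    moved = np.moveaxis(values, axis, 0)
    out = np.stack([moved[inverse == i].mean(axis=0) for i in range(len(uniq))])
    # Put the (now shorter) grouped axis back where it was.
    return uniq, np.moveaxis(out, 0, axis)


rng = np.random.default_rng(0)
values = rng.standard_normal((4, 2))  # dims ('x', 'z')
labels = ['a', 'b', 'a', 'c']         # coordinate x

groups, mean_xz = group_mean(values, labels, axis=0)
_, mean_zx = group_mean(values.T, labels, axis=1)

# Same numbers either way; only the layout differs.
print(mean_xz.shape, mean_zx.shape)
print(np.allclose(mean_zx, mean_xz.T))
```

In this model any coordinate that still lies along x after the reduction must have length 3 (one entry per group), which is what Out[5] violates for y.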

Output of xr.show_versions()

INSTALLED VERSIONS

commit: 09d8a4a
python: 3.7.7 (default, Mar 23 2020, 22:36:06)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-72-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.16.1.dev30+g1d3dee08.d20200808
pandas: 1.1.3
numpy: 1.18.1
scipy: 1.5.2
netCDF4: 1.4.2
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.6.0
distributed: 2.7.0
matplotlib: 3.2.2
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 46.1.1.post20200323
pip: 20.0.2
conda: None
pytest: 5.2.1
IPython: 7.13.0
sphinx: None


max-sixty · 2021-05-21T23:24:27Z

That's surprising indeed.

I confirmed the bug is present in 0.17.0.

It also seems unrelated to whether the values of the non-grouped coordinate y are unique:

In [7]: data['y'].values = [0,1,2,3]

In [8]: data['y'].values
Out[8]: array([0, 1, 2, 3])

In [9]: data
Out[9]:
<xarray.DataArray (x: 4, z: 2)>
array([[ 0.13156972,  0.13986012],
       [ 1.61815504,  0.11421297],
       [ 0.15819393, -0.5183183 ],
       [ 0.30672251,  0.34373302]])
Coordinates:
  * x        (x) <U1 'a' 'b' 'a' 'c'
    y        (x) int64 0 1 2 3
Dimensions without coordinates: z

In [10]: data.T.groupby('x').mean()
Out[10]:
<xarray.DataArray (z: 2, x: 3)>
array([[ 0.14488182,  1.61815504,  0.30672251],
       [-0.18922909,  0.11421297,  0.34373302]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
    y        (x) int64 0 1 2 3   # <-- the size must be 3!!
Dimensions without coordinates: z

max-sixty · 2024-09-18T18:57:32Z

Update: this is still an issue, though it now raises an error rather than returning a corrupt object:

data.T.groupby('x').mean()

ValueError: cannot reindex or align along dimension 'x' because of conflicting dimension sizes: {3, 4} (note: an index is found along that dimension with size=3)

Quite surprising...
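
The {3, 4} in that message is the same inconsistency that showed up in the corrupt output earlier: after the reduction the x index has 3 entries, while y still has 4 along x. A minimal pure-Python sketch of such a size check follows. It is only a simplified model under that assumption, not xarray's alignment code, and the name conflicting_sizes is hypothetical.

```python
def conflicting_sizes(variables):
    """Return the dimensions that are given more than one size.

    `variables` maps a variable name to (dims, shape).
    Hypothetical helper for illustration only; not part of xarray.
    """
    sizes = {}
    for dims, shape in variables.values():
        for dim, n in zip(dims, shape):
            sizes.setdefault(dim, set()).add(n)
    return {dim: found for dim, found in sizes.items() if len(found) > 1}


# Shapes taken from the corrupt result of data.T.groupby('x').mean()
corrupt = {
    "data": (("z", "x"), (2, 3)),
    "x": (("x",), (3,)),
    "y": (("x",), (4,)),  # stale: still has the pre-reduction length
}

# Shapes of a consistent result where y has been dropped
consistent = {
    "data": (("z", "x"), (2, 3)),
    "x": (("x",), (3,)),
}

print(conflicting_sizes(corrupt))
print(conflicting_sizes(consistent))
```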

dcherian · 2024-09-18T19:48:15Z

works on main and latest release for me

import xarray as xr
import numpy as np

data = xr.DataArray(
    np.random.randn(4, 2),
    dims=['x', 'z'],
    coords={'x': ['a', 'b', 'a', 'c'], 'y': ('x', [0, 1, 0, 2])}
)
data.T.groupby('x').mean()  # drops y

max-sixty · 2024-09-18T21:15:25Z

Ah, with flox installed it works.

("Parents: tell your kids to use flox!")
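
Since the outcome depends on whether flox is installed, it may help to state that alongside a reproduction. A small sketch using only the standard library follows; the function name flox_available is made up here.

```python
import importlib.util


def flox_available() -> bool:
    """Return True if the optional `flox` package can be found for import."""
    return importlib.util.find_spec("flox") is not None


print("flox installed:", flox_available())

# To exercise the non-flox code path even when flox is installed, xarray
# has a global option for it:
#     with xr.set_options(use_flox=False):
#         data.T.groupby('x').mean()
```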

dcherian · 2024-09-18T21:27:07Z

lol, my bad.


fujiisoup mentioned this issue May 21, 2021

groupby does not correctly handle non-dimensional coordinate #2944

Closed

fujiisoup added the bug label May 21, 2021

dcherian added the topic-groupby label Mar 29, 2022

dcherian added a commit to dcherian/xarray that referenced this issue Sep 18, 2024

Make _replace more lenient.

6033bc9

Closes pydata#5361

dcherian mentioned this issue Sep 18, 2024

Make _replace more lenient. #9517

Merged

2 tasks

dcherian added a commit that referenced this issue Sep 19, 2024

Make _replace more lenient. (#9517)

3c74509

* Make _replace more lenient. Closes #5361 * review comments

dcherian closed this as completed in #9517 Sep 19, 2024

hollymandel pushed a commit to hollymandel/xarray that referenced this issue Sep 23, 2024

Make _replace more lenient. (pydata#9517)

a27ab2b

* Make _replace more lenient. Closes pydata#5361 * review comments
