Inconsistent behavior in groupby depending on the dimension order #5361

Closed
fujiisoup opened this issue May 21, 2021 · 5 comments · Fixed by #9517

@fujiisoup
Member

groupby behaves inconsistently depending on the dimension order of a DataArray.
In some cases it even returns a corrupted object.

In [4]: data = xr.DataArray(
   ...:     np.random.randn(4, 2),
   ...:     dims=['x', 'z'],
   ...:     coords={'x': ['a', 'b', 'a', 'c'], 'y': ('x', [0, 1, 0, 2])}
   ...: )
   ...: 
   ...: data.groupby('x').mean()
Out[4]: 
<xarray.DataArray (x: 3, z: 2)>
array([[ 0.95447186, -1.14467028],
       [ 0.76294958,  0.3751244 ],
       [-0.41030223, -1.35344548]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
Dimensions without coordinates: z

Here groupby works fine (although it drops the non-dimensional coordinate y, related to #3745).

However, groupby does not give a correct result when the grouped dimension is the second one:

In [5]: data.T.groupby('x').mean()  # <--- change the dimension order, and do the same thing
Out[5]: 
<xarray.DataArray (z: 2, x: 3)>
array([[ 0.95447186,  0.76294958, -0.41030223],
       [-1.14467028,  0.3751244 , -1.35344548]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
    y        (x) int64 0 1 0 2  # <-- the size must be 3!!
Dimensions without coordinates: z

This bug was discussed in #2944 and reported as solved, but I found it is still there.
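To make the expectation explicit: a grouped mean should not care where the grouped dimension sits, so transposing the input should only transpose the output. Below is a minimal NumPy-only sketch of that invariant. The helper name `group_mean` is hypothetical and written just for illustration; it is not how xarray implements groupby.

```python
import numpy as np


def group_mean(values, labels, axis):
    """Average `values` over entries that share a label along `axis`.

    Hypothetical helper for illustration only, not xarray's implementation.
    """
    uniq, inverse = np.unique(labels, return_inverse=True)
    # Move the grouped axis to the front so each group is a plain row selection.
    moved = np.moveaxis(values, axis, 0)
    out = np.stack([moved[inverse == i].mean(axis=0) for i in range(len(uniq))])
    # Put the (now shorter) grouped axis back where it was.
    return uniq, np.moveaxis(out, 0, axis)


rng = np.random.default_rng(0)
values = rng.standard_normal((4, 2))
labels = np.array(["a", "b", "a", "c"])

groups, by_rows = group_mean(values, labels, axis=0)  # grouped dim first
_, by_cols = group_mean(values.T, labels, axis=1)     # grouped dim second

# The two results should differ only by a transpose.
print(np.allclose(by_cols, by_rows.T))
```

Any coordinate that lives along the grouped dimension (like y above) has to end up with one entry per group, or be dropped; keeping it at its original length is what makes the object inconsistent.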

Output of xr.show_versions()

INSTALLED VERSIONS

commit: 09d8a4a
python: 3.7.7 (default, Mar 23 2020, 22:36:06)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-72-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.4
libnetcdf: 4.6.1

xarray: 0.16.1.dev30+g1d3dee08.d20200808
pandas: 1.1.3
numpy: 1.18.1
scipy: 1.5.2
netCDF4: 1.4.2
pydap: None
h5netcdf: 0.8.0
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.6.0
distributed: 2.7.0
matplotlib: 3.2.2
cartopy: None
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 46.1.1.post20200323
pip: 20.0.2
conda: None
pytest: 5.2.1
IPython: 7.13.0
sphinx: None

@max-sixty
Collaborator

max-sixty commented May 21, 2021

That's surprising indeed.

I confirmed the bug was present in 0.17.0.

It also seems unrelated to whether the values of the non-grouped coordinate y are unique:

In [7]: data['y'].values = [0,1,2,3]

In [8]: data['y'].values
Out[8]: array([0, 1, 2, 3])

In [9]: data
Out[9]:
<xarray.DataArray (x: 4, z: 2)>
array([[ 0.13156972,  0.13986012],
       [ 1.61815504,  0.11421297],
       [ 0.15819393, -0.5183183 ],
       [ 0.30672251,  0.34373302]])
Coordinates:
  * x        (x) <U1 'a' 'b' 'a' 'c'
    y        (x) int64 0 1 2 3
Dimensions without coordinates: z

In [10]: data.T.groupby('x').mean()
Out[10]:
<xarray.DataArray (z: 2, x: 3)>
array([[ 0.14488182,  1.61815504,  0.30672251],
       [-0.18922909,  0.11421297,  0.34373302]])
Coordinates:
  * x        (x) object 'a' 'b' 'c'
    y        (x) int64 0 1 2 3   # <-- the size must be 3!!
Dimensions without coordinates: z

@max-sixty
Collaborator

Update: this is still an issue, though it now raises an error rather than returning a corrupt object:

data.T.groupby('x').mean()

ValueError: cannot reindex or align along dimension 'x' because of conflicting dimension sizes: {3, 4} (note: an index is found along that dimension with size=3)

Quite surprising...

@dcherian
Contributor

dcherian commented Sep 18, 2024

works on main and latest release for me

import xarray as xr
import numpy as np

data = xr.DataArray(
    np.random.randn(4, 2),
    dims=['x', 'z'],
    coords={'x': ['a', 'b', 'a', 'c'], 'y': ('x', [0, 1, 0, 2])}
)
data.T.groupby('x').mean()  # drops y

@max-sixty
Collaborator

Ah — with flox installed it works

("Parents: tell your kids to use flox!")

@dcherian
Contributor

lol, my bad.

dcherian added a commit to dcherian/xarray that referenced this issue Sep 18, 2024
dcherian added a commit that referenced this issue Sep 19, 2024
* Make _replace more lenient.

Closes #5361

* review comments