Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

as_numpy changes MultiIndex #8001

Open
4 tasks done
cgahr opened this issue Jul 19, 2023 · 5 comments
Open
4 tasks done

as_numpy changes MultiIndex #8001

cgahr opened this issue Jul 19, 2023 · 5 comments

Comments

@cgahr
Copy link

cgahr commented Jul 19, 2023

What happened?

I have a DataArray with a MultiIndex. In some cases, I use dask, so I call .as_numpy() to compute the array and store it in memory. I would expect that this call does NOT change the MultiIndex.

This is the case for .persist(), however, it's not the case for .as_numpy().

In the following MWE the original coordinates are:

Coordinates:
  * z        (z) int64 0 1 2 3 4
  * r        (r) object MultiIndex
  * x        (r) int64 0 0 0 1 1 1
  * y        (r) int64 0 1 2 0 1 2

After .persist() we get the same result. After .as_numpy() we get

Coordinates:
  * z        (z) int64 0 1 2 3 4
  * r        (r) object (0, 0) (0, 1) (0, 2) (1, 0) (1, 1) (1, 2)
  * x        (r) int64 0 0 0 1 1 1
  * y        (r) int64 0 1 2 0 1 2

which is not the same and can lead to issues down the line.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import xarray as xr

da = xr.DataArray(1, dims=('x', 'y', 'z'), coords={'x': [0, 1], 'y': [0, 1, 2], 'z': [0, 1, 2, 3, 4]})
da = da.stack(r=('x', 'y'))


xr.testing.assert_equal(da['r'], da.persist()['r'])   # ok
xr.testing.assert_equal(da['r'], da.as_numpy()['r'])  # err

xr.testing.assert_equal(da, da.persist())   # ok
xr.testing.assert_equal(da, da.as_numpy())  # ok but should err

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS ------------------ commit: None python: 3.11.3 | packaged by conda-forge | (main, Apr 6 2023, 08:57:19) [GCC 11.3.0] python-bits: 64 OS: Linux OS-release: 4.12.14-122.162-default machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: ('en_US', 'UTF-8') libhdf5: 1.14.0 libnetcdf: None

xarray: 2023.6.0
pandas: 1.5.3
numpy: 1.25.0
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.2.0
h5py: 3.8.0
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.4.1
distributed: 2023.4.1
matplotlib: 3.7.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2023.4.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.7.2
pip: 23.1.2
conda: None
pytest: 7.4.0
mypy: 1.4.1
IPython: 8.13.2
sphinx: None

@cgahr cgahr added bug needs triage Issue that has not been reviewed by xarray team member labels Jul 19, 2023
@welcome
Copy link

welcome bot commented Jul 19, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@keewis
Copy link
Collaborator

keewis commented Jul 19, 2023

note that for computing dask arrays you should use da.compute() instead of da.as_numpy(), which doesn't have that issue.

That said, this seems to be a bug in the implementation of .as_numpy. I'm not sure how to fix that without special-casing PandasMultiIndex, but maybe the answer is to remove the multi-index coordinate (r) instead?

@benbovy
Copy link
Member

benbovy commented Jul 19, 2023

Thanks for the report @cgahr.

as_numpy() has not been refactored during the Xarray index refactor, which indeed may cause some issues especially with multi-indexes (in the current implementation each coordinate variable is treated independently of each other whereas we should treat multi-index coordinates together).

but maybe the answer is to remove the multi-index coordinate (r) instead?

The docstrings of DataArray.as_numpy() says "Coerces wrapped data and coordinates into numpy arrays, returning a DataArray", but this is actually not true for indexed coordinates (even for single dimension coordinates):

>>> da.as_numpy()._coords["z"]._data   # not a numpy array!
PandasIndexingAdapter(array=Int64Index([0, 1, 2, 3, 4], dtype='int64'), dtype=dtype('int64'))

If we want to actually coerce all the coordinates into numpy arrays, then we should remove their index (if any). When the same underlying data is shared between the coordinates and their index (as it is the case for Xarray's PandasIndex and PandasMultiIndex wrappers), changing the coordinate data would make it out-of-sync with the index.

Alternatively, we could fix the multi-index issue reported here and clarify in the documentation that as_numpy() does not coerce indexed coordinates.

Or we could add an option for choosing between these two behaviors.

@benbovy benbovy added topic-indexing and removed needs triage Issue that has not been reviewed by xarray team member labels Jul 19, 2023
@cgahr
Copy link
Author

cgahr commented Jul 20, 2023

note that for computing dask arrays you should use da.compute() instead of da.as_numpy(), which doesn't have that issue.

I wanted to convert the DataArrays underlying data type to numpy. And as far as I understood, the proper way to do this is using as_numpy().

Alternatively, we could fix the multi-index issue reported here and clarify in the documentation that as_numpy() does not coerce indexed coordinates.

For me, it was actually not clear at all that as_numpy() should also coerce the indices. In my mind, it simply converts da.data to a numpy array. And in this case, IMO it should not touch the indices.

@benbovy
Copy link
Member

benbovy commented Aug 8, 2023

And in this case, IMO it should not touch the indices.

I mostly agree, but I'm not sure considering that:

  • some unindexed coordinates maybe be chunked (e.g., wrap dask arrays)
  • although I'm not aware of any existing out-of-core Xarray index, this is something that could be supported in the (near) future, i.e., indexed coordinates may also be backed by some chunked (dask) array

The behavior that would make the most sense to me in terms of clarity and predictability would be to either convert all the coordinates or not touch them at all. Practically, however, perhaps best is to convert all "pure" array data (i.e., data variables + unindexed coordinates but not the indexed coordinates).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants