Confusing terminologies and some errors in the official documentation #6866

Closed
4 tasks
v-liuwei opened this issue Aug 2, 2022 · 4 comments

Comments

@v-liuwei

v-liuwei commented Aug 2, 2022

What happened?

Note that I'm using the stable version (2022.6.0).

First, I'm confused that both "dimension coordinate"/"non-dimension coordinate" and "index coordinate"/"non-index coordinate" appear in the documentation (search to see), but they seem to mean the same thing.

Second, I found that there are some errors in the documentation:

  • It says that "The index associated with dimension name x can be retrieved by arr.indexes[x]. By construction, len(arr.dims) == len(arr.indexes)", which is inconsistent with actual behavior. See example code below:

    In [0]: import xarray as xr, numpy as np
    In [1]: arr = xr.DataArray(np.zeros((2, 3)), dims=['x', 'y'], coords={'x': ['a', 'b']})
    In [2]: assert len(arr.dims) == len(arr.indexes), f"{len(arr.dims)=}, {len(arr.indexes)=}"
    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-202-f217d18e6979> in <module>
    ----> 1 assert len(arr.dims) == len(arr.indexes), f"{len(arr.dims)=}, {len(arr.indexes)=}"
    
    AssertionError: len(arr.dims)=2, len(arr.indexes)=1
    In [3]: arr.indexes
    Out[3]:
    Indexes:
    x: Index(['a', 'b'], dtype='object', name='x')

    It seems that arr.indexes only returns indexes of dimensions that have coordinates. However, it's possible to get the index of
    dimension y through get_index():

    In [4]: arr.get_index('y')
    Out[4]: RangeIndex(start=0, stop=3, step=1, name='y')
  • It says that: (see link)

    For convenience multi-index levels are directly accessible as “virtual” or “derived” coordinates (marked by - when printing a dataset or data array):

    In [77]: mda["band"]
    Out[77]: 
    <xarray.DataArray 'band' (spec: 4)>
    array(['R', 'R', 'V', 'V'], dtype=object)
    Coordinates:
      * spec     (spec) object MultiIndex
      * band     (spec) object 'R' 'R' 'V' 'V'
      * wn       (spec) float64 0.1 0.2 0.7 0.9
    
    In [78]: mda.wn
    Out[78]: 
    <xarray.DataArray 'wn' (spec: 4)>
    array([0.1, 0.2, 0.7, 0.9])
    Coordinates:
      * spec     (spec) object MultiIndex
      * band     (spec) object 'R' 'R' 'V' 'V'
      * wn       (spec) float64 0.1 0.2 0.7 0.9

    As you can see, even in the example code from the official documentation, all the "virtual" coordinates are marked with * instead of -, which in my experience is a little confusing when handling multi-index coordinates.

Have I missed something? Thanks in advance for the reply.

What did you expect to happen?

No response

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.10 (default, Sep 28 2021, 16:10:42)
[GCC 9.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.102.1-microsoft-standard-WSL2
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.6.0
pandas: 1.4.3
numpy: 1.23.1
scipy: 1.3.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: 3.1.2
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 45.2.0
pip: 22.2.1
conda: None
pytest: None
IPython: 7.13.0
sphinx: None

v-liuwei added the bug and needs triage labels on Aug 2, 2022
@benbovy
Member

benbovy commented Aug 2, 2022

Hi @v-liuwei, thanks for the report.

The issues that you are pointing out are part of #6293. There have been many internal changes (plus some subtle public-facing changes) regarding indexes in the last release, but there is still some work left to reflect them in the documentation.

First, I'm confused that both "dimension coordinate"/"non-dimension coordinate" and "index coordinate"/"non-index coordinate" appear in the documentation (search to see), but they seem to mean the same thing.

I agree, this has always been a source of confusion IMO. Xarray's data model has been updated in the last release such that these two concepts are now different and independent (i.e., it allows a non-dimension coordinate to have an index).

It seems that arr.indexes only returns indexes of dimensions that have coordinates. However, it's possible to get the index of dimension y through get_index()

get_index() creates a pandas index on the fly if one doesn't exist (and if that's possible). I'm wondering whether or not we should eventually deprecate it. I might be missing important use cases, though.
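A minimal sketch of that difference, reusing the array from the report. It assumes xarray 2022.6.0 or later; the variable names are only illustrative.

```python
import numpy as np
import xarray as xr

# Same array as in the report: only "x" is given a coordinate.
arr = xr.DataArray(np.zeros((2, 3)), dims=["x", "y"], coords={"x": ["a", "b"]})

# arr.indexes lists the indexes that are actually stored on the object.
stored_before = list(arr.indexes)

# get_index("y") falls back to a default integer index built on the fly.
y_index = arr.get_index("y")

# Calling get_index() does not add anything to arr.indexes.
stored_after = list(arr.indexes)

print(stored_before, list(y_index), stored_after)
```

So `len(arr.dims) == len(arr.indexes)` only holds when every dimension has an indexed coordinate.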

As you can see, even in the example code from the official documentation, all the "virtual" coordinates are marked with * instead of -, which in my experience is a little confusing when handling multi-index coordinates.

This is because multi-index levels now each have their own real coordinate (the documentation is not yet up to date). However, I agree that using the same symbol for multi-coordinate indexes may not be ideal, as it is hard to tell which coordinate is associated with which index. On the other hand, using two different symbols wouldn't be an elegant solution either if we later deprecate the multi-index dimension coordinate (i.e., spec in your example). Maybe this issue could be addressed in the indexes repr section to be added (#6795).
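As a rough illustration of levels being real coordinates, the sketch below builds a multi-index with `stack()` rather than the documentation's `mda` example. The data values are made up, and the behavior shown assumes xarray 2022.6.0 or later.

```python
import numpy as np
import xarray as xr

# Hypothetical data: two bands, two wavenumbers.
da = xr.DataArray(
    np.arange(4).reshape(2, 2),
    dims=["band", "wn"],
    coords={"band": ["R", "V"], "wn": [0.1, 0.2]},
)

# Stacking creates a "spec" dimension backed by a pandas MultiIndex.
mda = da.stack(spec=("band", "wn"))

# Each level shows up as its own coordinate and has an entry in .indexes.
coord_names = set(mda.coords)
index_names = set(mda.indexes)

# Levels can be read like any other coordinate and used for label selection.
band_values = mda["band"].values.tolist()
r_values = mda.sel(band="R").values.tolist()

print(coord_names, index_names, band_values, r_values)
```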

@v-liuwei
Author

v-liuwei commented Aug 2, 2022

Thanks for your explanations.

You said that "it allows a non-dimension coordinate to have an index", which confuses me even more. I want to confirm: should we always use index coordinates (or is it only possible to use them) to index a DataArray/Dataset in a label-based fashion?

@benbovy
Member

benbovy commented Aug 2, 2022

Yes, performing selection using coordinate labels (i.e., .sel()) is only possible for coordinates that have an index. It has always been the case and it will always be.
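A small sketch of this rule, assuming a made-up array where `x` is an indexed dimension coordinate and `label` is a coordinate without an index. The exact exception type has varied between xarray versions, so the example catches both `KeyError` and `ValueError`.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=["x", "y"],
    coords={
        "x": ["a", "b"],                  # dimension coordinate, indexed by default
        "label": ("y", ["p", "q", "r"]),  # non-dimension coordinate, no index
    },
)

# Label-based selection works on the indexed coordinate.
picked = arr.sel(x="b").values.tolist()

# It fails on a coordinate that has no index.
try:
    arr.sel(label="q")
    sel_failed = False
except (KeyError, ValueError):
    sel_failed = True

print(picked, sel_failed)
```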

Before v2022.6.0, only 1-dimensional coordinates with the name matching the dimension name could have a pandas index or multi-index. Hence the distinction between a "dimension coordinate" which most often implicitly wrapped a pandas index and a "non-dimension" coordinate for which label-based selection was impossible.

Starting from v2022.6.0, this constraint is relaxed. Although it is not yet fully operational, any coordinate or any group of coordinates (with arbitrary dimensions) may now have an index (either pandas-based or any xarray-compatible custom index) and may therefore be used for label-based selection (if the index supports it).
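For illustration only: later xarray releases expose `set_xindex()` for attaching an index to an existing coordinate. This method is not mentioned in the thread and is not available in v2022.6.0, so the sketch guards the call and should be read as an assumption about newer versions; the array and the `label` coordinate are made up.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=["x", "y"],
    coords={"x": ["a", "b"], "label": ("y", ["p", "q", "r"])},
)

if hasattr(arr, "set_xindex"):
    # Give the non-dimension coordinate "label" a default pandas-based index.
    indexed = arr.set_xindex("label")
    has_label_index = "label" in indexed.xindexes
    # With an index in place, "label" can be used for label-based selection.
    selected = indexed.sel(label="q").values.tolist()
else:
    # Older xarray without set_xindex(): nothing to demonstrate.
    has_label_index = None
    selected = None

print(has_label_index, selected)
```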

TomNicholas added the topic-indexing and topic-documentation labels and removed the bug and needs triage labels on Aug 5, 2022
@benbovy
Member

benbovy commented Aug 23, 2023

I'm closing this issue as the terminology section has been updated in #7368, which now clearly distinguishes between (non-)dimension coordinates and (non-)indexed coordinates. For the multi-index "virtual" coordinates in the repr, let's track it in #8071.
