-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Repeated coordinates leads to unintuitive (broken?) indexing behaviour #3731
Comments
Thanks for this @ivirshup , I'm surprised at this too. The problem seems to be that the DataArray you've managed to create breaks xarray's own data model! There should be one dim for each axis of the wrapped array, but import xarray as xr
import numpy as np
sample_idx = xr.IndexVariable("sample_id", ["a", "b", "c"])
da = xr.DataArray(np.eye(3), coords=(sample_idx, sample_idx)
print(da) gives a dataarray object which somehow has only one dim while wrapping a 2D array!
Obviously xarray should have thrown you an error before allowing you to create this. It's no wonder the indexing is weird after this point. I would have expected to get an array with two dims, which you can do by being more explicit: da2d = xr.DataArray(np.eye(3), dims=['dim0', 'dim1'], coords=(sample_idx, sample_idx))
print(da2d)
(the coordinates aren't named how you want yet which is also a problem but at least this has a number of dimensions consistent with the data its wrapping.) Indexing that object behaves more like you (and I) would expect: da.shape
# (3, 3)
da[1, :].shape
# (3,)
da.loc["a", :].shape
# (3,)
da.loc[:, "a"].shape
# (3,)
da[:, 1]
<xarray.DataArray (dim0: 3)>
array([0., 1., 0.])
Coordinates:
* dim0 (dim0) <U1 'a' 'b' 'c'
dim1 <U1 'b' It also doesn't fit xarray's data model to have two coordinates along different dimensions with the same name as one another. I suggest that you create two separate coords (i.e. (We should also fix |
Why not allow multiple dimensions with the same name? They can be disambiguated with positional indexing for when it matters. I think support for this would be useful for pairwise measures. Here's a fun example/ current buggy behaviour: import numpy as np
import xarray as xr
from string import ascii_letters
idx1 = xr.IndexVariable("dim1", [f"dim1-{i}" for i in ascii_letters[:10]])
idx2 = xr.IndexVariable("dim2", [f"dim2-{i}" for i in ascii_letters[:5]])
da1 = xr.DataArray(np.random.random_sample((10, 5)), coords=(idx1, idx2))
da2 = xr.DataArray(np.random.random_sample((5, 10)), coords=(idx2, idx1))
da1 @ da2
# <xarray.DataArray ()>
# array(13.06261098) |
I'm not sure it's that simple... What would you suggest the behaviour for |
This has also come up over in DimensionalData.jl, which I think is going for behavior I like. What I think would happen:
The selection is over all dimensions of that name.
The command is to reduce over dimensions of that name, the reduction should be performed over all dimensions with that name. |
ooh this is a fun one! came across this issue when we stumbled across a pendantic case writing tests (H/T @brews). I expected this to "fail loudly in the constructor" but it doesn't. note that currently AFAICT you cannot use positional slicing to achieve an intuitive result - the behavior seems more undefined/unpredictable # setup
import xarray as xr, pandas as pd, numpy as np
da = xr.DataArray(np.arange(8).reshape(2, 2, 2), coords=[[0, 1], [0, 1], ['a', 'b']], dims=["ni", "ni", "shh"]) xarray seems to not know it has a problem: In [4]: da
Out[4]:
<xarray.DataArray (ni: 2, shh: 2)>
array([[[0, 1],
[2, 3]],
[[4, 5],
[6, 7]]])
Coordinates:
* ni (ni) int64 0 1
* shh (shh) <U1 'a' 'b' slicing (somewhat intuitively? slices along both dims): In [5]: da.sel(ni=0)
Out[5]:
<xarray.DataArray (shh: 2)>
array([0, 1])
Coordinates:
ni int64 0
* shh (shh) <U1 'a' 'b' however, positional slicing (and any attempts I've made to handle the repeated dims differently) seems to have undefined behavior: In [6]: da[0, :, :] # positional slicing along first dim works as expected(?)
Out[6]:
<xarray.DataArray (ni: 2, shh: 2)>
array([[[0, 1],
[2, 3]],
[[4, 5],
[6, 7]]])
Coordinates:
* ni (ni) int64 0 1
* shh (shh) <U1 'a' 'b'
In [7]: da[:, 0, :] # positional slicing along second dim slices both dims
Out[7]:
<xarray.DataArray (shh: 2)>
array([0, 1])
Coordinates:
ni int64 0
* shh (shh) <U1 'a' 'b' |
FWIW I could definitely see use cases for allowing something like this... I have used cumbersome/ugly workarounds to work with variance-covariance matrices etc. So I'm not weighing in on the "this should raise an error" debate. I got briefly excited when I saw it didn't raise an error, until everything started unraveling 🙃 |
Are there any thoughts as to how to move forward with this? It seems like a first step might be fixing In the vein of
def __getitem__(self, key: Any) -> Self:
if isinstance(key, str):
return self._getitem_coord(key)
if self._has_repeated_dim_names:
return self.isel(indexers=zip(key[:len(self.dims)], self.dims))
return self.isel(indexers=self._item_key_to_dict(key))
At the end of the day, both options will probably involve completely re-writing a function because From the getter methods, we could then move on to "external" functions like P.S I have seen the discussion in #1378 about shared indices/splitting and I don't see this as mutually exclusive. I think the two could be quite compatible |
The very first step should be raising a loud error in the constructor, to avoid unintended and undefined behaviour afterwards. We should have done that a while ago!
If you narrow this to Beyond that I think this needs a lot more serious thought as to how hard it would be to support. I wouldn't want to start allowing it on a function-by-function basis until we later realise it doesn't generalize. (Despite your observations about I would have lots of questions, for example:
If someone wants to experiment with the above questions, it would make sense to start by doing it in the new (Another random thought - if we had a hypothesis-based test suite, we could test this idea via a one-line change to the strategy that generates |
I've added a check to raise on repeated dimension names in #8491. Now if you try @ivirshup 's original example you get a nice clear error: ---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 5
2 import numpy as np
4 sample_idx = xr.IndexVariable("sample_id", ["a", "b", "c"])
----> 5 da = xr.DataArray(np.eye(3), coords=(sample_idx, sample_idx))
File ~/Documents/Work/Code/xarray/xarray/core/dataarray.py:446, in DataArray.__init__(self, data, coords, dims, name, attrs, indexes, fastpath)
444 data = as_compatible_data(data)
445 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
--> 446 variable = Variable(dims, data, attrs, fastpath=True)
448 if not isinstance(coords, Coordinates):
449 coords = create_coords_with_default_indexes(coords)
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:365, in Variable.__init__(self, dims, data, attrs, encoding, fastpath)
338 def __init__(
339 self,
340 dims,
(...)
344 fastpath=False,
345 ):
346 """
347 Parameters
348 ----------
(...)
363 unrecognized encoding items.
364 """
--> 365 super().__init__(
366 dims=dims, data=as_compatible_data(data, fastpath=fastpath), attrs=attrs
367 )
369 self._encoding = None
370 if encoding is not None:
File ~/Documents/Work/Code/xarray/xarray/namedarray/core.py:252, in NamedArray.__init__(self, dims, data, attrs)
245 def __init__(
246 self,
247 dims: _DimsLike,
248 data: duckarray[Any, _DType_co],
249 attrs: _AttrsLike = None,
250 ):
251 self._data = data
--> 252 self._dims = self._parse_dimensions(dims)
253 self._attrs = dict(attrs) if attrs else None
File ~/Documents/Work/Code/xarray/xarray/namedarray/core.py:486, in NamedArray._parse_dimensions(self, dims)
484 if len(set(dims)) < len(dims):
485 repeated_dims = set([d for d in dims if dims.count(d) > 1])
--> 486 raise ValueError(
487 f"Repeated dimension names are forbidden, but dimensions {repeated_dims} appear more than once in dims={dims}"
488 )
489 return dims
ValueError: Repeated dimension names are forbidden, but dimensions {'sample_id'} appear more than once in dims=('sample_id', 'sample_id') I suggest we leave this issue open as the place to discuss supporting repeated dimension names. |
Final thought: we might also consider simply making it clearer which parts of xarray should and shouldn't work with repeated dimension names. Looking in the code at the moment repeated dimension names are not forbidden in the constructors, but are explicitly forbidden during broadcasting. This makes a fair amount of sense, but unfortunately this clarity of what is and isn't allowed doesn't currently extend into other parts of the library. Again |
Unless the dimensions are interchangable (e.g. a symmetric matrix), presumably it would be really difficult to support repeated dimension names? We would then need to guarantee order in operations, which is something we explicitly don't do. I might think we mark that out of scope for the medium term.... FWIW when I have symmetric matrices, I use |
I am personally not sure about netCDF but it does appear that many people are saying there is no restriction that says you can't. As for zarr, it is agnostic to the names of things in general (so I'm not sure what is being referred to here specifically @TomNicholas), but with this extension for v3, this issue was explicitly discussed and the repeated dimension names are allowed. So I would say, broadly speaking, the answer is "yes" to both. I will also say that this sort of rationale was the reason I thought a slow build-in would actually not be such a bad idea (instead of failing loudly). We aren't "breaking" anything and some stuff actually does work at the moment. Furthermore, xarray can currently read in the netCDF files (as was pointed out in the linked comment) even if the result is not very helpful (see here for another example of where this sort of "broken" behavior occurs).
I am not sure. I think we definitely could get these to work as expected, but I also think this gets back to the matter of internal API of I will definitely think on these more, and fiddle around with
Could you link to this? I am not sure what you mean.
Not sure! I will experiment with
Good question! I will say that I think the point of repeated dimensions is that they are "the same" so I can't say exactly how one could be chunked and the other not (but I am still a newbie with
I would be very much so in favor of this. Thanks for linking to the relevant work on this. |
I might take a stab at implementing #1378 (comment), which shouldn't be too hard while providing some handy functionality in the case of symmetric matrices. |
MCVE Code Sample
Expected Output
I had expected:
Problem Description
When coordinates are shared between dimensions (as would happen if a pairwise measurement is taken) indexing behaves strangely. It looks like indexing into the initial indices doesn't do anything, while indexing into the last index applies the selection across all dimensions.
Output of
xr.show_versions()
Update, adding link to related issue:
The text was updated successfully, but these errors were encountered: