Multi-index indexing #802
Changes from 16 commits
@@ -294,6 +294,60 @@ elements that are fully masked:

    arr2.where(arr2.y < 2, drop=True)

.. _multi-level indexing:

Multi-level indexing
--------------------

The ``loc`` and ``sel`` methods of ``Dataset`` and ``DataArray`` both accept
dictionaries for label-based indexing on multi-index dimensions:

.. ipython:: python

    idx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
                                     names=('one', 'two'))
    da_midx = xr.DataArray(np.random.rand(6, 3),
                           [('x', idx), ('y', range(3))])
    da_midx
    da_midx.sel(x={'one': 'a', 'two': 0})
    da_midx.loc[{'one': 'a'}, ...]

As shown in the last example above, xarray handles partial selection on a
pandas multi-index; it automatically renames the dimension and replaces the
coordinate when a single index is returned (level drop).
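The level drop described above comes from pandas: the diff below calls ``MultiIndex.get_loc_level``, which returns positional indexers together with whatever index survives once the selected level is dropped. A minimal standalone sketch (plain pandas, no xarray needed):

```python
import pandas as pd

# Same index shape as the ipython example above.
idx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
                                 names=('one', 'two'))

# Selecting 'a' on level 'one' yields both the positions of the matching
# rows and the surviving single-level index (level 'one' is dropped).
indexer, new_index = idx.get_loc_level('a', level='one')
print(indexer)               # positions of the 'a' rows: slice(0, 2, None)
print(new_index.tolist())    # remaining 'two' labels: [0, 1]
print(new_index.name)        # 'two'
```

This is the same (indexer, new_index) pair that the new ``convert_label_indexer`` in the diff threads back to the caller.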
Like pandas, it is also possible to slice a multi-indexed dimension by providing
a tuple of multiple indexers (i.e., slices, labels, lists of labels, or any
selector allowed by pandas). Note that for now xarray doesn't fully handle
partial selection in that case (no level drop is done):

.. ipython:: python

    da_midx.sel(x=(list('ab'), [0]))
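This tuple-of-indexers form maps onto pandas' ``MultiIndex.get_locs``, the call used for nested tuples in the indexing.py diff further down. A small standalone illustration:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
                                 names=('one', 'two'))

# One selector per level: labels 'a' or 'b' on level one, label 0 on
# level two. get_locs returns the integer positions of every match.
positions = idx.get_locs((['a', 'b'], [0]))
print(positions.tolist())       # [0, 2]
print(idx[positions].tolist())  # [('a', 0), ('b', 0)]
```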
Lists or slices of tuples can be used to select several combinations of
multi-index labels:

.. ipython:: python

    da_midx.sel(x=[('a', 0), ('b', 1)])

A single, flat tuple can be used to select a given combination of
multi-index labels:

.. ipython:: python

    da_midx.sel(x=('a', 0))

Unlike pandas, xarray can't make the distinction between index levels and
dimensions when using ``loc`` in some ambiguous cases. For example, for
``da_midx.loc[{'one': 'a', 'two': 0}]`` and ``da_midx.loc['a', 0]``, xarray
always interprets ``('one', 'two')`` and ``('a', 0)`` as the names and
labels of the 1st and 2nd dimensions, respectively. You must specify all
dimensions or use the ellipsis in the ``loc`` specifier, e.g., in the example
above, ``da_midx.loc[{'one': 'a', 'two': 0}, :]`` or
``da_midx.loc[('a', 0), ...]``.

Review comment: Instead of "can't make the distinction", let's say "does not guess".
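For contrast, pandas resolves a flat tuple against the levels of the row multi-index rather than against separate axes. A small sketch with hypothetical data illustrating the behavior the paragraph compares against:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
                                 names=('one', 'two'))
df = pd.DataFrame({'v': range(6)}, index=idx)

# pandas treats ('a', 0) as a single key into the row multi-index,
# not as labels for two different axes.
row = df.loc[('a', 0)]
print(int(row['v']))  # 0
```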
Multi-dimensional indexing
--------------------------
@@ -39,10 +39,14 @@ Enhancements

  attributes are retained in the resampled object. By
  `Jeremy McGibbon <https://github.com/mcgibbon>`_.

- DataArray and Dataset methods :py:meth:`sel` and :py:meth:`loc` now
  accept dictionaries or nested tuples for indexing on multi-index dimensions.
  By `Benoit Bovy <https://github.com/benbovy>`_.

Review comment: Can you also please add a note about the changed behavior (we now drop levels, which is consistent with pandas) in the "Breaking changes" section above? Also, add a reference here to the documentation section you added:

- New (experimental) decorators :py:func:`~xarray.register_dataset_accessor` and
  :py:func:`~xarray.register_dataarray_accessor` for registering custom xarray
  extensions without subclassing. They are described in the new documentation
  page on :ref:`internals`. By `Stephan Hoyer <https://github.com/shoyer>`
  page on :ref:`internals`. By `Stephan Hoyer <https://github.com/shoyer>`_.

- Round trip boolean datatypes. Previously, writing boolean datatypes to netCDF
  formats would raise an error since netCDF does not have a `bool` datatype.
@@ -86,24 +86,22 @@ def __init__(self, data_array):
        self.data_array = data_array

    def _remap_key(self, key):
        def lookup_positions(dim, labels):
            index = self.data_array.indexes[dim]
            return indexing.convert_label_indexer(index, labels)

        if utils.is_dict_like(key):
            return dict((dim, lookup_positions(dim, labels))
                        for dim, labels in iteritems(key))
        else:
        if not utils.is_dict_like(key):
            # expand the indexer so we can handle Ellipsis
            key = indexing.expanded_indexer(key, self.data_array.ndim)
            return tuple(lookup_positions(dim, labels) for dim, labels
                         in zip(self.data_array.dims, key))
            labels = indexing.expanded_indexer(key, self.data_array.ndim)
            key = dict(zip(self.data_array.dims, labels))
        return indexing.remap_label_indexers(self.data_array, key)

    def __getitem__(self, key):
        return self.data_array[self._remap_key(key)]
        pos_indexers, new_indexes = self._remap_key(key)
        ds = self.data_array[pos_indexers]._to_temp_dataset()
        return self.data_array._from_temp_dataset(
            ds._replace_indexes(new_indexes)
        )

Review comment: Can we avoid creating the temporary dataset here?

Reply: Yeah, I agree this is not very nice. I did this to avoid duplicating the

    def __setitem__(self, key, value):
        self.data_array[self._remap_key(key)] = value
        pos_indexers, new_indexes = self._remap_key(key)
        self.data_array[pos_indexers] = value


class _ThisArray(object):

@@ -599,8 +597,10 @@ def sel(self, method=None, tolerance=None, **indexers):
        Dataset.sel
        DataArray.isel
        """
        return self.isel(**indexing.remap_label_indexers(
            self, indexers, method=method, tolerance=tolerance))
        ds = self._to_temp_dataset().sel(
            method=method, tolerance=tolerance, **indexers
        )
        return self._from_temp_dataset(ds)

    def isel_points(self, dim='points', **indexers):
        """Return a new DataArray whose dataset is given by pointwise integer
@@ -419,6 +419,18 @@ def _replace_vars_and_dims(self, variables, coord_names=None,
        obj = self._construct_direct(variables, coord_names, dims, attrs)
        return obj

    def _replace_indexes(self, indexes):
        variables = OrderedDict()
        for k, v in iteritems(self._variables):
            if k in indexes.keys():
                idx = indexes[k]
                variables[k] = Coordinate(idx.name, idx)
            else:
                variables[k] = v
        obj = self._replace_vars_and_dims(variables)
        dim_names = {dim: idx.name for dim, idx in iteritems(indexes)}
        return obj.rename(dim_names)

Review comment: If I think we should create

Review comment: Can we make the rename only done if necessary? I think this can be kind of expensive. Putting things together:

    variables = self._variables.copy()
    for name, idx in indexes.items():
        variables[name] = Coordinate(name, idx)
    obj = self._replace_vars_and_dims(variables)
    # switch from dimension to level names, if necessary
    dim_names = {}
    for dim, idx in indexes.items():
        if idx.name != dim:
            dim_names[dim] = idx.name
    if dim_names:
        obj = obj.rename(dim_names)

Reply: Seems much nicer! What about

at the top of the function, given that in many use cases indexes will be empty? (I don't know if

Reply: Yes, that's even better!
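To make the proposal above concrete, here is a standalone sketch of the conditional-rename idea over plain dicts. The helper name and data shapes are hypothetical; this is not xarray's actual ``_replace_indexes``:

```python
import pandas as pd

def replace_indexes_sketch(variables, indexes):
    # Swap in the new index objects, then compute a rename mapping only
    # for dimensions whose surviving index level has a different name;
    # skipping the rename when nothing changed avoids the expensive call.
    if not indexes:
        return variables, {}
    variables = dict(variables)
    for name, idx in indexes.items():
        variables[name] = idx  # xarray would wrap idx in a Coordinate here
    dim_names = {dim: idx.name for dim, idx in indexes.items()
                 if idx.name != dim}
    return variables, dim_names

# After selecting 'a' on level 'one', dimension 'x' is left with the
# single-level index 'two', so 'x' must be renamed to 'two'.
surviving = pd.Index([0, 1], name='two')
new_vars, renames = replace_indexes_sketch({'x': None, 'y': None},
                                           {'x': surviving})
print(renames)  # {'x': 'two'}
```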
    def copy(self, deep=False):
        """Returns a copy of this dataset.

@@ -954,7 +966,9 @@ def sel(self, method=None, tolerance=None, **indexers):
            Requires pandas>=0.17.
        **indexers : {dim: indexer, ...}
            Keyword arguments with names matching dimensions and values given
            by scalars, slices or arrays of tick labels.
            by scalars, slices or arrays of tick labels. For dimensions with
            multi-index, the indexer may also be a dict-like object with keys
            matching index level names.

        Returns
        -------

@@ -972,8 +986,10 @@ def sel(self, method=None, tolerance=None, **indexers):
        Dataset.isel_points
        DataArray.sel
        """
        return self.isel(**indexing.remap_label_indexers(
            self, indexers, method=method, tolerance=tolerance))
        pos_indexers, new_indexes = indexing.remap_label_indexers(
            self, indexers, method=method, tolerance=tolerance
        )
        return self.isel(**pos_indexers)._replace_indexes(new_indexes)

Review comment: Does this handle the case where

Reply: Nevermind, that can't happen.

    def isel_points(self, dim='points', **indexers):
        """Returns a new dataset with each array indexed pointwise along the

@@ -1114,8 +1130,9 @@ def sel_points(self, dim='points', method=None, tolerance=None,
        Dataset.isel_points
        DataArray.sel_points
        """
        pos_indexers = indexing.remap_label_indexers(
            self, indexers, method=method, tolerance=tolerance)
        pos_indexers, new_indexes = indexing.remap_label_indexers(
            self, indexers, method=method, tolerance=tolerance
        )
        return self.isel_points(dim=dim, **pos_indexers)

Review comment: if we ignore

    def reindex_like(self, other, method=None, tolerance=None, copy=True):
@@ -4,7 +4,7 @@

from . import utils
from .pycompat import iteritems, range, dask_array_type, suppress
from .utils import is_full_slice
from .utils import is_full_slice, is_dict_like


def expanded_indexer(key, ndim):

@@ -135,11 +135,27 @@ def _asarray_tuplesafe(values):
    return result


def _is_nested_tuple(tup, index):
    """Check for a compatible nested tuple and multiindex (taken from
    pandas.core.indexing.is_nested_tuple).
    """
    if not isinstance(tup, tuple):
        return False

    # are we a nested tuple of: tuple, list, slice
    for i, k in enumerate(tup):
        if isinstance(k, (tuple, list, slice)):
            return isinstance(index, pd.MultiIndex)

    return False

Review comment: I'm still trying to wrap my head around exactly what this check does :).

Reply: So I'm ! I've just stolen this from pandas without much modification :).

Review comment: This is such a weird function the way it's currently written. Why not make this:

    def _is_nested_tuple(possible_tuple):
        return (isinstance(possible_tuple, tuple)
                and any(isinstance(value, (tuple, list, slice))
                        for value in possible_tuple))

The
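The simplified predicate suggested in the review (with the ``pd.MultiIndex`` check left to the caller) can be sketched as a runnable standalone function. This is the reviewer's proposal with its parentheses completed, not the code merged in the diff:

```python
def is_nested_tuple(possible_tuple):
    # A tuple indexer is "nested" when any element is itself a tuple,
    # list, or slice, e.g. (['a', 'b'], [0]) -- but not ('a', 0).
    return (isinstance(possible_tuple, tuple) and
            any(isinstance(value, (tuple, list, slice))
                for value in possible_tuple))

print(is_nested_tuple((['a', 'b'], [0])))   # True: contains a list
print(is_nested_tuple((slice(None), 'a')))  # True: contains a slice
print(is_nested_tuple(('a', 0)))            # False: flat labels only
print(is_nested_tuple(['a', 0]))            # False: not a tuple at all
```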
def convert_label_indexer(index, label, index_name='', method=None,
                          tolerance=None):
    """Given a pandas.Index and labels (e.g., from __getitem__) for one
    dimension, return an indexer suitable for indexing an ndarray along that
    dimension
    dimension. If label is a dict-like object and a pandas.MultiIndex is given,
    also return a new pandas.Index, otherwise return None.
    """
    # backwards compatibility for pandas<0.16 (method) or pandas<0.17
    # (tolerance)

@@ -152,6 +168,8 @@ def convert_label_indexer(index, label, index_name='', method=None,
                'the tolerance argument requires pandas v0.17 or newer')
        kwargs['tolerance'] = tolerance

    new_index = None

    if isinstance(label, slice):
        if method is not None or tolerance is not None:
            raise NotImplementedError(

@@ -166,6 +184,17 @@ def convert_label_indexer(index, label, index_name='', method=None,
        raise KeyError('cannot represent labeled-based slice indexer for '
                       'dimension %r with a slice over integer positions; '
                       'the index is unsorted or non-unique')

    elif is_dict_like(label):
        if not isinstance(index, pd.MultiIndex):
            raise ValueError('cannot use a dict-like object for selection on a '
                             'dimension that does not have a MultiIndex')
        indexer, new_index = index.get_loc_level(tuple(label.values()),
                                                 level=tuple(label.keys()))

    elif _is_nested_tuple(label, index):
        indexer = index.get_locs(label)

Review comment: I think we could reproduce what pandas does in terms of collapsing tuple levels if we call:

    # untested!
    elif isinstance(label, tuple) and isinstance(index, pd.MultiIndex):
        if _is_nested_tuple(label):
            indexer = index.get_locs(label)
        else:
            indexer, new_index = index.get_loc_level(label, level=range(len(label)))

Reply: Yes it works! However, using non-nested tuples here consists of selecting single elements and raises the question of how we handle returned scalar values. In that specific case we should drop the dimension but keep the 0-d (multi-level) coordinate so that

More generally, I think we definitely need to carefully address level drop in all cases.

Reply: Doing some tests, it seems like

    Good:
    Bad:
    Good:

Reply: So I guess we need to check the length of the tuple (probably also in the

    elif isinstance(label, tuple) and isinstance(index, pd.MultiIndex):
        if _is_nested_tuple(label):
            indexer = index.get_locs(label)
        elif len(label) == index.nlevels:
            indexer = index.get_loc(label)
        else:
            indexer, new_index = index.get_loc_level(label, level=range(len(label)))

Reply (EDIT: forget about this comment, it is complete nonsense :) ):

    def _maybe_drop_levels(index):
        drop_levels = [i for i, lab in enumerate(index.labels)
                       if not np.ptp(lab.values())]
        if len(drop_levels) < len(index.labels):
            return index.droplevel(drop_levels)
        else:
            return index

    def convert_label_indexer(...):
        # ...
        if isinstance(new_index, pd.MultiIndex):
            new_index = _maybe_drop_levels(new_index)
        return indexer, new_index

Reply: The advantage of doing something like my proposed logic (which I think is similar to what pandas does) is that whether a level is dropped depends only on the indexer type and the number of multi-index levels, as opposed to dropping levels in a way that depends also on the particular values in the indexer and index. Code that depends only on type information rather than values is generally easier to understand and less error prone.

Reply: also,

Reply: Yep, I used

Anyway, I get your logic. It is also much more efficient!

    else:
        label = _asarray_tuplesafe(label)
        if label.ndim == 0:

Review comment: this is where scalars end up -- probably need to add a clause here to handle MultiIndex

@@ -177,18 +206,36 @@ def convert_label_indexer(index, label, index_name='', method=None,
            if np.any(indexer < 0):
                raise KeyError('not all values found in index %r'
                               % index_name)
    return indexer
    return indexer, new_index
def remap_label_indexers(data_obj, indexers, method=None, tolerance=None):
    """Given an xarray data object and label based indexers, return a mapping
    of equivalent location based indexers.
    of equivalent location based indexers. Also return a mapping of pandas'
    single index objects returned from multi-index objects.
    """
    if method is not None and not isinstance(method, str):
        raise TypeError('``method`` must be a string')
    return dict((dim, convert_label_indexer(data_obj[dim].to_index(), label,
                                            dim, method, tolerance))
                for dim, label in iteritems(indexers))

    pos_indexers, new_indexes = {}, {}
    for dim, label in iteritems(indexers):
        index = data_obj[dim].to_index()

        if isinstance(index, pd.MultiIndex):
            # set default names for multi-index unnamed levels so that
            # we can safely rename dimension / coordinate later
            valid_level_names = [name or '{}_level_{}'.format(dim, i)
                                 for i, name in enumerate(index.names)]
            index = index.copy()
            index.names = valid_level_names

Review comment: This looks great! We might also consider moving this logic to around this line of

Reply: Shouldn't it be better to move this logic to

This is because I worry about implicit copy or in-place renaming. The problem would be to set default level names that are unique across dimensions, but maybe we can pass the variable name in the

Reply: Or maybe moving this here (

Reply: It does look like we already have some similar logic in

Reply: Compared to

    >>> idx = pd.MultiIndex.from_product([['a', 'b'], [1, 2], [-1, -2]])
    >>> y = xr.DataArray(np.random.rand(2 * 2 * 2), [('x', idx)])
    >>> y.x.to_index()
    MultiIndex(levels=[['a', 'b'], [1, 2], [-2, -1]],
               labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 0, 1, 0]],
               names=['x_level_0', 'x_level_1', 'x_level_2'])
    >>> idx.names = ('one', 'two', 'three')
    >>> y.x.to_index()
    MultiIndex(levels=[['a', 'b'], [1, 2], [-2, -1]],
               labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 0, 1, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 0, 1, 0]],
               names=['one', 'two', 'three'])

and we also have direct access to the dimension/coordinate name so that we can set default level names that are unique (as shown above).

Reply: True, but unless we allow directly accessing levels as variables,

        idxr, new_idx = convert_label_indexer(index, label,
                                              dim, method, tolerance)
        pos_indexers[dim] = idxr
        if new_idx is not None and not isinstance(new_idx, pd.MultiIndex):
            new_indexes[dim] = new_idx

Review comment: What should happen if

Reply: We definitely need to add a test for this situation (e.g., 3 level index -> 2 level index).

Reply: Yes multi-indexes are not updated, but maybe we should do so (see my comment below on level drop).

    return pos_indexers, new_indexes
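The fallback naming used in the loop above (``'<dim>_level_<i>'`` for unnamed levels) can be exercised on its own. A small sketch, using a hypothetical helper name:

```python
import pandas as pd

def default_level_names(dim, index):
    # Mirror the diff's fallback: keep existing level names, and give
    # unnamed levels a '<dim>_level_<i>' placeholder so the dimension /
    # coordinate can be renamed safely later.
    return [name or '{}_level_{}'.format(dim, i)
            for i, name in enumerate(index.names)]

unnamed = pd.MultiIndex.from_product([['a', 'b'], [0, 1]])
named = pd.MultiIndex.from_product([['a', 'b'], [0, 1]],
                                   names=('one', 'two'))
print(default_level_names('x', unnamed))  # ['x_level_0', 'x_level_1']
print(default_level_names('x', named))    # ['one', 'two']
```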
def slice_slice(old_slice, applied_slice, size):

Review comment (nit): I would use a new sentence instead of the semicolon.