Add set_xindex and drop_indexes methods #6971

Merged
merged 29 commits into from
Sep 28, 2022
Changes from 19 commits
Commits
29 commits
3f6f637
temporary API to set custom indexes
benbovy Jul 16, 2022
bf30d54
add the temporary index API to DataArray
keewis Jul 16, 2022
9de9c46
add options argument to Index.from_variables()
benbovy Jul 17, 2022
aa403a4
fix mypy
benbovy Jul 17, 2022
210a59a
remove temporary API warning
benbovy Aug 31, 2022
d8c3985
add the Index class in Xarray's root namespace
benbovy Aug 31, 2022
c4afabf
improve set_xindex docstrings and add to api.rst
benbovy Aug 31, 2022
fe723ce
remove temp comments
benbovy Aug 31, 2022
a48c853
special case for pandas multi-index dim coord
benbovy Aug 31, 2022
01de6bd
add tests for set_xindex
benbovy Aug 31, 2022
201bd05
error message tweaks
benbovy Aug 31, 2022
41c896f
set_xindex with 1 coord: avoid reodering coords
benbovy Aug 31, 2022
1ec5ca6
mypy fixes
benbovy Aug 31, 2022
a6caa7a
add Dataset and DataArray drop_indexes methods
benbovy Aug 31, 2022
bb07d5a
improve assert_no_index_corrupted error msg
benbovy Aug 31, 2022
ec2f8fc
drop_indexes: add tests
benbovy Aug 31, 2022
f9601b9
add drop_indexes to api.rst
benbovy Aug 31, 2022
1a555bc
improve docstrings of legacy methods
benbovy Aug 31, 2022
0b7d582
add what's new entry
benbovy Aug 31, 2022
3ab0bc9
try using correct typing w/o mypy complaining
benbovy Sep 1, 2022
9e75f95
make index_cls arg optional
benbovy Sep 7, 2022
00c2711
docstrings fixes and tweaks
benbovy Sep 23, 2022
cb67612
make Index.from_variables options arg keyword only
benbovy Sep 23, 2022
af67168
Merge branch 'main' into add-set-xindex-and-drop-indexes
benbovy Sep 23, 2022
2cd0aa8
improve set_xindex invalid coordinates error msg
benbovy Sep 23, 2022
61d6e28
add xarray.indexes namespace
benbovy Sep 27, 2022
ec08d73
Merge branch 'main' into add-set-xindex-and-drop-indexes
benbovy Sep 27, 2022
20dbf5a
Merge branch 'main' into add-set-xindex-and-drop-indexes
benbovy Sep 27, 2022
b598447
type tweaks
benbovy Sep 27, 2022
5 changes: 5 additions & 0 deletions doc/api.rst
@@ -107,6 +107,7 @@ Dataset contents
Dataset.swap_dims
Dataset.expand_dims
Dataset.drop_vars
Dataset.drop_indexes
Dataset.drop_duplicates
Dataset.drop_dims
Dataset.set_coords
@@ -146,6 +147,7 @@ Indexing
Dataset.reindex_like
Dataset.set_index
Dataset.reset_index
Dataset.set_xindex
Dataset.reorder_levels
Dataset.query

@@ -298,6 +300,7 @@ DataArray contents
DataArray.swap_dims
DataArray.expand_dims
DataArray.drop_vars
DataArray.drop_indexes
DataArray.drop_duplicates
DataArray.reset_coords
DataArray.copy
@@ -330,6 +333,7 @@ Indexing
DataArray.reindex_like
DataArray.set_index
DataArray.reset_index
DataArray.set_xindex
DataArray.reorder_levels
DataArray.query

@@ -1080,6 +1084,7 @@ Advanced API
Variable
IndexVariable
as_variable
Index
Context
register_dataset_accessor
register_dataarray_accessor
4 changes: 4 additions & 0 deletions doc/whats-new.rst
@@ -22,6 +22,10 @@ v2022.07.0 (unreleased)
New Features
~~~~~~~~~~~~

- Add :py:meth:`Dataset.set_xindex` and :py:meth:`Dataset.drop_indexes` and
  their DataArray counterparts for setting and dropping pandas or custom indexes
  given a set of arbitrary coordinates. (:pull:`6971`)
  By `Benoît Bovy <https://github.com/benbovy>`_ and `Justus Magin <https://github.com/keewis>`_.
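
As an illustration of the entry above, here is a minimal sketch of how the two new methods might be used together. It assumes a released xarray version that includes this PR and exposes `PandasIndex` under `xarray.indexes` (the namespace added by a later commit in this PR). The variable and coordinate names (`temperature`, `station`) are invented for the example.

```python
import xarray as xr
from xarray.indexes import PandasIndex  # namespace added by a later commit of this PR

# Hypothetical data: "station" is a non-dimension coordinate along "x",
# so it gets no index by default.
ds = xr.Dataset(
    {"temperature": ("x", [10.0, 20.0, 30.0])},
    coords={"station": ("x", ["a", "b", "c"])},
)
has_index_before = "station" in ds.xindexes

# Build a pandas-backed index from the existing coordinate.
indexed = ds.set_xindex("station", PandasIndex)
has_index_after = "station" in indexed.xindexes

# Label-based selection on "station" is now possible.
value = float(indexed.sel(station="b")["temperature"])

# Dropping the index keeps the coordinate but leaves it unindexed again.
dropped = indexed.drop_indexes("station")
```

Unlike `set_index`, the coordinate keeps its own name here and the `x` dimension is left untouched.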

Breaking changes
~~~~~~~~~~~~~~~~
2 changes: 2 additions & 0 deletions xarray/__init__.py
@@ -30,6 +30,7 @@
from .core.dataarray import DataArray
from .core.dataset import Dataset
from .core.extensions import register_dataarray_accessor, register_dataset_accessor
from .core.indexes import Index
from .core.merge import Context, MergeError, merge
from .core.options import get_options, set_options
from .core.parallel import map_blocks
@@ -99,6 +100,7 @@
"Coordinate",
"DataArray",
"Dataset",
"Index",
"IndexVariable",
"Variable",
# Exceptions
67 changes: 67 additions & 0 deletions xarray/core/dataarray.py
@@ -2169,6 +2169,11 @@ def set_index(
"""Set DataArray (multi-)indexes using one or more existing
coordinates.

This legacy method is limited to pandas (multi-)indexes and
1-dimensional "dimension" coordinates. See
:py:meth:`~DataArray.set_xindex` for setting a pandas or a custom
Xarray-compatible index from one or more arbitrary coordinates.

Parameters
----------
indexes : {dim: index, ...}
@@ -2213,6 +2218,7 @@ def set_index(
See Also
--------
DataArray.reset_index
DataArray.set_xindex
"""
ds = self._to_temp_dataset().set_index(indexes, append=append, **indexes_kwargs)
return self._from_temp_dataset(ds)
@@ -2226,6 +2232,12 @@ def reset_index(
) -> DataArray:
"""Reset the specified index(es) or multi-index level(s).

This legacy method is specific to pandas (multi-)indexes and
1-dimensional "dimension" coordinates. See the more generic
:py:meth:`~DataArray.drop_indexes` and :py:meth:`~DataArray.set_xindex`
methods to respectively drop and set pandas or custom indexes for
arbitrary coordinates.

Parameters
----------
dims_or_levels : Hashable or sequence of Hashable
@@ -2244,10 +2256,40 @@ def reset_index(
See Also
--------
DataArray.set_index
DataArray.set_xindex
DataArray.drop_indexes
"""
ds = self._to_temp_dataset().reset_index(dims_or_levels, drop=drop)
return self._from_temp_dataset(ds)

    def set_xindex(
        self,
        coord_names: Hashable | Sequence[Hashable],
        index_cls: type[Index],
        **options,
    ) -> DataArray:
        """Set a new, Xarray-compatible index from one or more existing
        coordinate(s).

        Parameters
        ----------
        coord_names : str or list
            Name(s) of the coordinate(s) used to build the index.
            If several names are given, their order matters.
        index_cls : subclass of :class:`~xarray.Index`
            The type of index to create.
        **options
            Options passed to the index constructor.

        Returns
        -------
        obj : DataArray
            Another dataarray, with this dataarray's data and with a new index.

        """
        ds = self._to_temp_dataset().set_xindex(coord_names, index_cls, **options)
        return self._from_temp_dataset(ds)

def reorder_levels(
self: T_DataArray,
dim_order: Mapping[Any, Sequence[int | Hashable]] | None = None,
@@ -2558,6 +2600,31 @@ def drop_vars(
ds = self._to_temp_dataset().drop_vars(names, errors=errors)
return self._from_temp_dataset(ds)

    def drop_indexes(
        self,
        coord_names: Hashable | Iterable[Hashable],
        *,
        errors: ErrorOptions = "raise",
    ) -> DataArray:
        """Drop the indexes assigned to the given coordinates.

        Parameters
        ----------
        coord_names : hashable or iterable of hashable
            Name(s) of the coordinate(s) for which to drop the index.
        errors : {"raise", "ignore"}, default: "raise"
            If 'raise', raises a ValueError if any of the coordinates
            passed have no index or are not in the dataset.
            If 'ignore', no error is raised.

        Returns
        -------
        dropped : DataArray
            A new dataarray with dropped indexes.
        """
        ds = self._to_temp_dataset().drop_indexes(coord_names, errors=errors)
        return self._from_temp_dataset(ds)

def drop(
self: T_DataArray,
labels: Mapping[Any, Any] | None = None,
166 changes: 164 additions & 2 deletions xarray/core/dataset.py
@@ -3942,6 +3942,11 @@ def set_index(
"""Set Dataset (multi-)indexes using one or more existing coordinates
or variables.

This legacy method is limited to pandas (multi-)indexes and
1-dimensional "dimension" coordinates. See
:py:meth:`~Dataset.set_xindex` for setting a pandas or a custom
Xarray-compatible index from one or more arbitrary coordinates.

Parameters
----------
indexes : {dim: index, ...}
@@ -3989,6 +3994,7 @@ def set_index(
See Also
--------
Dataset.reset_index
Dataset.set_xindex
Dataset.swap_dims
"""
dim_coords = either_dict_or_kwargs(indexes, indexes_kwargs, "set_index")
@@ -4031,7 +4037,7 @@ def set_index(
f"dimension mismatch: try setting an index for dimension {dim!r} with "
f"variable {var_name!r} that has dimensions {var.dims}"
)
- idx = PandasIndex.from_variables({dim: var})
+ idx = PandasIndex.from_variables({dim: var}, {})
idx_vars = idx.create_variables({var_name: var})
else:
if append:
@@ -4080,6 +4086,12 @@ def reset_index(
) -> T_Dataset:
"""Reset the specified index(es) or multi-index level(s).

This legacy method is specific to pandas (multi-)indexes and
1-dimensional "dimension" coordinates. See the more generic
:py:meth:`~Dataset.drop_indexes` and :py:meth:`~Dataset.set_xindex`
methods to respectively drop and set pandas or custom indexes for
arbitrary coordinates.

Parameters
----------
dims_or_levels : Hashable or Sequence of Hashable
@@ -4097,6 +4109,8 @@ def reset_index(
See Also
--------
Dataset.set_index
Dataset.set_xindex
Dataset.drop_indexes
"""
if isinstance(dims_or_levels, str) or not isinstance(dims_or_levels, Sequence):
dims_or_levels = [dims_or_levels]
@@ -4149,6 +4163,101 @@ def reset_index(

return self._replace(variables, coord_names=coord_names, indexes=indexes)

    def set_xindex(
        self,
        coord_names: Hashable | Sequence[Hashable],
        index_cls: type[Index],
        **options,
    ) -> Dataset:
Contributor

Suggested change
- ) -> Dataset:
+ ) -> T_Dataset:

Member Author

mypy is not happy with this:

xarray/tests/test_dataset.py:3307: error: Argument 1 to "set_xindex" of "Dataset" has incompatible type "List[str]"; expected "Hashable"  [arg-type]
xarray/tests/test_dataset.py:3307: note: Following member(s) of "List[str]" have conflicts:
xarray/tests/test_dataset.py:3307: note:     __hash__: expected "Callable[[], int]", got "None"
xarray/tests/test_dataset.py:3307: note: Protocol member Hashable.__hash__ expected instance variable, got class variable

#6971 (comment)

Contributor

strings are sequences apparently:

isinstance("str", typing.Sequence)
Out[63]: True

Trying out CoordNames = Union[str, Iterable[Hashable]] seems to be successful in #7048.
It would be nice if we aligned these tricky types, so try to use named variables for repeated arguments.
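
The point about strings being sequences can be checked with the standard library alone. The sketch below uses a hypothetical helper, `normalize_coord_names`, which is not part of xarray. It mirrors the `isinstance(..., str) or not isinstance(..., Sequence)` pattern that `reset_index` uses in this diff, and is meant only to show why the string case has to be handled before the `Sequence` check.

```python
from collections.abc import Hashable, Sequence


def normalize_coord_names(coord_names):
    """Return coordinate names as a list (hypothetical helper, not xarray API)."""
    # A str passes the Sequence check, so it is special-cased first;
    # otherwise "time" would be treated as the names "t", "i", "m", "e".
    if isinstance(coord_names, str) or not isinstance(coord_names, Sequence):
        return [coord_names]
    return list(coord_names)


str_is_sequence = isinstance("time", Sequence)
str_is_hashable = isinstance("time", Hashable)

single = normalize_coord_names("time")
several = normalize_coord_names(["lat", "lon"])
non_string_scalar = normalize_coord_names(0)
```

Because a `str` satisfies both `Hashable` and `Sequence[Hashable]`, a static checker cannot tell the two branches of `Hashable | Sequence[Hashable]` apart for strings, so the runtime check is what actually separates them.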

Collaborator

Technically a str is also an Iterable of Hashable :P
But the typing community is quite relaxed about violating that fact.
So as long as you don't need the two types to be "perpendicular" it should work.

Member Author

+1 for using a named variable like CoordNames. The tricky thing here is that the order is important. Do we use Sequence in Xarray in that case? I guess we would need to define two variables for each case where the order does / doesn't matter?

Also, I don't remember whether a single coordinate name should be str or Hashable. Should we treat it like a single dimension name or not?

I feel like this issue should be addressed more globally in Xarray than within the scope of this PR. Perhaps better to move on and merge this PR before the next release?

Collaborator

Usually we try to move to str | Iterable[Hashable] for "one or more dims", and Hashable for a single dim.

Member Author

> Usually we try to move to str | Iterable[Hashable] for "one or more dims", and Hashable for a single dim.

Probably not in all cases? For example, with DataArray.__init__(..., dims: str | Iterable[Hashable]) the type checker would allow passing a set. Recently I had to figure out what was going on with xr.DataArray(data=np.zeros((10, 5)), dims={'x', 'time'}), which mypy should actually catch with Sequence[Hashable]. Slightly off-topic: should we have two variables Dims and OrderedDims defined in xarray.core.types?

Same issue here for coordinate names. str | Sequence[Hashable] seems to work well, though.

Collaborator

Why should a set not be allowed?
Hasn't the order been preserved for quite some time already? I think all built-in Iterables have conserved order, and internally we convert to tuple anyway?

Member Author

I don't think the order is preserved for sets (unlike dicts). This is what I can get with CPython 3.9 / Xarray v2022.6.0:

print(xr.DataArray(data=np.zeros((2, 3)), dims={'x', 'time'}))
# <xarray.DataArray (time: 2, x: 3)>
# array([[0., 0., 0.],
#        [0., 0., 0.]])
# Dimensions without coordinates: time, x

tuple({'x', 'time'})
# ('time', 'x')

Collaborator

Whoops you are right, that was dicts.
Then indeed we need to distinguish between dims and ordered dims.
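
The conclusion of this thread, that ordered and unordered collections of names need different types, can be sketched as follows. The helper `as_ordered_names` is hypothetical and only illustrates the idea: annotating with `Sequence[Hashable]` lets a type checker reject sets, and the runtime check does the same for untyped callers.

```python
from collections.abc import Hashable, Iterable, Sequence
from typing import Union


def as_ordered_names(names: Union[str, Sequence[Hashable]]) -> tuple:
    """Hypothetical helper: accept only containers with a defined order."""
    if isinstance(names, str):
        return (names,)
    if not isinstance(names, Sequence):
        # Sets are Iterable but not Sequence, so they are rejected here
        # (and by a type checker, given the annotation above).
        raise TypeError(f"expected an ordered sequence, got {type(names).__name__}")
    return tuple(names)


names_as_set = {"x", "time"}
set_is_iterable = isinstance(names_as_set, Iterable)
set_is_sequence = isinstance(names_as_set, Sequence)

ordered = as_ordered_names(["x", "time"])

try:
    as_ordered_names(names_as_set)  # type: ignore[arg-type]
except TypeError:
    set_rejected = True
else:
    set_rejected = False
```

An `Iterable[Hashable]` annotation would accept the set silently, which is how the `dims={'x', 'time'}` case quoted above goes unnoticed.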

        """Set a new, Xarray-compatible index from one or more existing
        coordinate(s).

        Parameters
        ----------
        coord_names : str or list
            Name(s) of the coordinate(s) used to build the index.
            If several names are given, their order matters.
        index_cls : subclass of :class:`~xarray.Index`
            The type of index to create.
        **options
            Options passed to the index constructor.

        Returns
        -------
        obj : Dataset
            Another dataset, with this dataset's data and with a new index.

        """
        if not issubclass(index_cls, Index):
            raise TypeError(f"{index_cls} is not a subclass of xarray.Index")

        # the Sequence check is required for mypy
        if is_scalar(coord_names) or not isinstance(coord_names, Sequence):
            coord_names = [coord_names]

        invalid_coords = set(coord_names) - self._coord_names

        if invalid_coords:
            raise ValueError(f"those coordinates don't exist: {invalid_coords}")

        # we could be more clever here (e.g., drop-in index replacement if index
        # coordinates do not conflict), but let's not allow this for now
        indexed_coords = set(coord_names) & set(self._indexes)

        if indexed_coords:
Collaborator

Does this mean that you cannot use coords in more than one index? (I am not sure how important this is but could imagine a use case where lat & lon are used as 1D indexes and in a KDTree).

Member Author

Yes, that's right: allowing multiple indexes per coordinate would make many things much harder.

There are indeed some examples (like the one you mention) where it could be useful to have multiple indexes. But I think it could be solved by either switching between indexes (if building them is not too expensive) or via a custom "meta-index" that would encapsulate both kinds of indexes.

Collaborator

Fair enough - thanks for the clarification!
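
The "one index per coordinate" rule discussed in this thread, and the suggested workaround of switching between indexes, might look like the sketch below. It assumes a released xarray that includes this PR. `PandasIndex` stands in for both the old and the new index purely for illustration; in practice the second one would be some other `Index` subclass, such as the KDTree-backed index mentioned above, which is not shown here.

```python
import xarray as xr
from xarray.indexes import PandasIndex

# "x" is a dimension coordinate, so it already has a default pandas index.
ds = xr.Dataset(coords={"x": [10, 20, 30]})

# Setting a second index on an already indexed coordinate is refused.
try:
    ds.set_xindex("x", PandasIndex)
except ValueError as err:
    already_indexed_error = err
else:
    already_indexed_error = None

# Switching indexes: drop the existing one first, then set the new one.
switched = ds.drop_indexes("x").set_xindex("x", PandasIndex)
```

Whether switching is practical depends on how expensive the index is to build, as noted in the reply above.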

            raise ValueError(
                f"those coordinates already have an index: {indexed_coords}"
            )

        coord_vars = {name: self._variables[name] for name in coord_names}

        index = index_cls.from_variables(coord_vars, options)

        new_coord_vars = index.create_variables(coord_vars)

        # special case for setting a pandas multi-index from level coordinates
        # TODO: remove it once we deprecate pandas multi-index dimension (tuple
        # elements) coordinate
        if isinstance(index, PandasMultiIndex):
            coord_names = [index.dim] + list(coord_names)

        variables: dict[Hashable, Variable]
        indexes: dict[Hashable, Index]

        if len(coord_names) == 1:
            variables = self._variables.copy()
            indexes = self._indexes.copy()

            name = list(coord_names).pop()
            if name in new_coord_vars:
                variables[name] = new_coord_vars[name]
            indexes[name] = index
        else:
            # reorder variables and indexes so that coordinates having the same
            # index are next to each other
            variables = {}
            for name, var in self._variables.items():
                if name not in coord_names:
                    variables[name] = var

            indexes = {}
            for name, idx in self._indexes.items():
                if name not in coord_names:
                    indexes[name] = idx

            for name in coord_names:
                try:
                    variables[name] = new_coord_vars[name]
                except KeyError:
                    variables[name] = self._variables[name]
                indexes[name] = index

        return self._replace(
            variables=variables,
            coord_names=self._coord_names | set(coord_names),
            indexes=indexes,
        )
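
The `**options` accepted by `set_xindex` above are handed to `Index.from_variables`. The sketch below shows a deliberately trivial custom index, `IndexWithOptions` (a made-up class that supports nothing beyond construction), to illustrate that flow. It assumes a released xarray where `xarray.Index` is available. Giving `options` a default value is a choice made here so the method works whether the caller passes it positionally, as in this diff, or by keyword, as in the later commit "make Index.from_variables options arg keyword only".

```python
import xarray as xr


class IndexWithOptions(xr.Index):
    """Hypothetical index that only records the option it was built with."""

    def __init__(self, opt):
        self.opt = opt

    @classmethod
    def from_variables(cls, variables, options=None):
        # `options` holds the extra keyword arguments given to set_xindex().
        options = options or {}
        return cls(options["opt"])


ds = xr.Dataset(coords={"foo": ("x", ["a", "a", "b", "b"])})

# opt=1 is forwarded to IndexWithOptions.from_variables via `options`.
indexed = ds.set_xindex("foo", IndexWithOptions, opt=1)
opt_value = indexed.xindexes["foo"].opt
```

A real index would also implement methods such as `sel` to make label-based selection work; that is out of scope for this sketch.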

def reorder_levels(
self: T_Dataset,
dim_order: Mapping[Any, Sequence[int | Hashable]] | None = None,
@@ -4870,6 +4979,59 @@ def drop_vars(
variables, coord_names=coord_names, indexes=indexes
)

    def drop_indexes(
        self: T_Dataset,
        coord_names: Hashable | Iterable[Hashable],
        *,
        errors: ErrorOptions = "raise",
    ) -> T_Dataset:
        """Drop the indexes assigned to the given coordinates.

        Parameters
        ----------
        coord_names : hashable or iterable of hashable
            Name(s) of the coordinate(s) for which to drop the index.
        errors : {"raise", "ignore"}, default: "raise"
            If 'raise', raises a ValueError if any of the coordinates
            passed have no index or are not in the dataset.
            If 'ignore', no error is raised.

        Returns
        -------
        dropped : Dataset
            A new dataset with dropped indexes.

        """
        # the Iterable check is required for mypy
        if is_scalar(coord_names) or not isinstance(coord_names, Iterable):
            coord_names = {coord_names}
        else:
            coord_names = set(coord_names)

        if errors == "raise":
            invalid_coords = coord_names - self._coord_names
            if invalid_coords:
                raise ValueError(f"those coordinates don't exist: {invalid_coords}")

            unindexed_coords = set(coord_names) - set(self._indexes)
            if unindexed_coords:
                raise ValueError(
                    f"those coordinates do not have an index: {unindexed_coords}"
                )

        assert_no_index_corrupted(self.xindexes, coord_names, action="remove index(es)")

        variables = {}
        for name, var in self._variables.items():
            if name in coord_names:
                variables[name] = var.to_base_variable()
            else:
                variables[name] = var

        indexes = {k: v for k, v in self._indexes.items() if k not in coord_names}

        return self._replace(variables=variables, indexes=indexes)
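
The two `errors` modes of `drop_indexes` above can be exercised as in the following sketch, which assumes a released xarray that includes this PR. The helper `raises_value_error` is defined only for this example, and the check is on the exception type rather than the message, since the wording may differ between versions.

```python
import xarray as xr

ds = xr.Dataset(
    coords={
        "x": ("x", [0, 1, 2]),          # dimension coordinate: indexed by default
        "foo": ("x", ["a", "a", "b"]),  # non-dimension coordinate: no index
    }
)

# Dropping the default index keeps "x" as a plain, unindexed coordinate.
no_x_index = ds.drop_indexes("x")


def raises_value_error(names):
    """Return True if ds.drop_indexes(names) raises ValueError."""
    try:
        ds.drop_indexes(names)
    except ValueError:
        return True
    return False


unindexed_fails = raises_value_error("foo")        # coordinate without an index
missing_fails = raises_value_error("not_a_coord")  # not a coordinate at all

# With errors="ignore", both cases are skipped and nothing changes.
unchanged = ds.drop_indexes(["foo", "not_a_coord"], errors="ignore")
```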

def drop(
self: T_Dataset,
labels=None,
@@ -7778,7 +7940,7 @@ def pad(
# reset default index of dimension coordinates
if (name,) == var.dims:
dim_var = {name: variables[name]}
- index = PandasIndex.from_variables(dim_var)
+ index = PandasIndex.from_variables(dim_var, {})
index_vars = index.create_variables(dim_var)
indexes[name] = index
variables[name] = index_vars[name]