Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose "Coordinates" as part of Xarray's public API #7368

Merged
merged 76 commits into from
Jul 21, 2023

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Dec 8, 2022

This is a rework of #7214. It follows the suggestions made in #7214 (comment), #7214 (comment) and #7214 (comment):

  • No indexes argument is added to Dataset.__init__, and the indexes argument of DataArray.__init__ is kept private (i.e., valid only if fastpath=True)
  • When a Coordinates object is passed to a new Dataset or DataArray via the coords argument, both coordinate variables and indexes are copied/extracted and added to the new object
  • This PR also adds an IndexedCoordinates subclass Coordinates public constructors used to create Xarray coordinates and indexes from non-Xarray objects. For example, the Coordinates.from_pandas_multiindex() class method creates a new set of index and coordinates from an existing pd.MultiIndex.

EDIT: IndexCoordinates has been merged with Coordinates

EDIT2: it ended up as a pretty big refactor with the promotion of Coordinates has a 2nd-class Xarray container that supports alignment like Dataset and DataArray. It is still quite advanced API, useful for passing coordinate variables and indexes around. Internally, Coordinates objects are still "virtual" containers (i.e., proxies for coordinate variables and indexes stored in their corresponding DataArray or Dataset objects). For now, a "stand-alone" Coordinates object created from scratch wraps a Dataset with no data variables.

Some examples of usage:

import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))

coords = xr.Coordinates.from_pandas_multiindex(midx, "x")
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2

ds = xr.Dataset(coords=coords)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
# Data variables:
#     *empty*

ds_to_be_deprecated = xr.Dataset(coords={"x": midx})
ds_to_be_deprecated.identical(ds)
# True

da = xr.DataArray([1, 2, 3, 4], dims="x", coords=ds.coords)
# <xarray.DataArray (x: 4)>
# array([1, 2, 3, 4])
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2

TODO:

  • update assign_coords too so it has the same behavior if a Coordinates object is passed?
  • How to avoid building any default index? It seems silly to add or use the indexes argument just for that purpose? We could address that later. Solution: wrap the coordinates dict in a Coordinates objects, e.g., ds = xr.Dataset(coords=xr.Coordinates(coords_dict)).

@shoyer, @dcherian, anyone -- what do you think about the approach proposed here? I'd like to check that with you before going further with tests, docs, etc.

- easily create an empty Indexes collection
- check consistency between indexes and variables
TODO: check indexes shapes / dims for DataArray
Since its constructor can now be used publicly.

Copy input mappings and check the type of input indexes.
+ add `IndexedCoordinates.from_pandas_multiindex` helper.
Drop the `indexes` argument or keep it as private API.

When a `Coordinates` object is passed as `coords` argument, extract both
coordinate variables and indexes and add them to the new Dataset or
DataArray.
@benbovy
Copy link
Member Author

benbovy commented Dec 9, 2022

IndexedCoordinates and Indexes have a lot of overlap. At some point we might consider merging the two classes, like @shoyer suggests in #7214 (comment). The main difference is that one is a mapping of coordinates and the other is a mapping of indexes. IndexedCoordinates is mostly reusing Indexes and Dataset under the hood, it is only a facade.

Alternatively to an IndexedCoordinates subclass I was wondering if we could reuse the Coordinates base class? There's some benefit of providing a subclass:

  • besides specific constructors like .from_pandas_multiindex() it has a generic __init__ for advanced use cases. Not sure it is a good idea to add this constructor to the base class?
  • unlike Coordinates, IndexedCoordinates is immutable.

What if the Indexes class was a facade based on IndexedCoordinates instead of the other way around? It would probably make more sense but it would also be a bigger refactor. I've chosen the easy way :).

@benbovy
Copy link
Member Author

benbovy commented Dec 9, 2022

I added IndexedCoordinates.merge_coords so that it is easier to combine different coordinates to pass to a new Dataset / DataArray, e.g.,

midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("one", "two"))

coords = xr.IndexedCoordinates.from_pandas_multiindex(midx, "x")

coords = coords.merge_coords({"y": [0, 1, 2]})
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
#   * y        (y) int64 0 1 2

ds = xr.Dataset(coords=coords)
# <xarray.Dataset>
# Dimensions:  (x: 4)
# Coordinates:
#   * x        (x) object MultiIndex
#   * one      (x) object 'a' 'a' 'b' 'b'
#   * two      (x) int64 1 2 1 2
#   * y        (y) int64 0 1 2
# Data variables:
#     *empty*

IndexedCoordinates.merge_coords is very much like Coordinates.merge except that it returns a new Coordinates object instead of a Dataset.

Or should we just use merge? It would require that:

  • Coordinates.merge accepts Mapping[Any, Any] for its other argument. Only changing the type hint is enough here since the implementation already accepts any input passed to Dataset.
  • When a Dataset is passed as coords argument to a new Dataset and DataArray, both variables and indexes should be extracted. It is already the case for Dataset but I think it only works for PandasIndex and PandasMultiIndex (default indexes & backwards compatibility).

@shoyer
Copy link
Member

shoyer commented Dec 10, 2022

what do you think about the approach proposed here? I'd like to check that with you before going further with tests, docs, etc.

Generally this looks great to me!

  • How to avoid building any default index? It seems silly to add or use the indexes argument just for that purpose? We could address that later.

My suggestion would be:

  • coords passed as a dict: create default indexes
  • coords passed as IndexedCoordinates: do not create defaults

Alternatively to an IndexedCoordinates subclass I was wondering if we could reuse the Coordinates base class?

Yes, this makes more sense to me!

What if the Indexes class was a facade based on IndexedCoordinates instead of the other way around?

Yes, I also agree! This makes more sense.

@shoyer
Copy link
Member

shoyer commented Dec 10, 2022

Long term, do you think it would make sense to merge together Indexes, Coordinates and IndexedCoordinates? They are sort of all containers for the same thing.

@benbovy
Copy link
Member Author

benbovy commented Dec 10, 2022

Long term, do you think it would make sense to merge together Indexes, Coordinates and IndexedCoordinates? They are sort of all containers for the same thing.

Yes I think so.

I'm actually trying to merge IndexedCoordinates with Coordinates but I'm stuck: the latter is abstract and I don't really see how I could refactor it together with DatasetCoordinates and DataArrayCoordinates. Do you have any idea on how best to proceed?

Ideally, I'd see Coordinates be exposed in Xarray's main namespace with at least the two following constructors:

class Coordinates:

    def __init__(
        self,
        coords: Mapping[Any, Any] | None = None,
        indexes: Mapping[Any, Index] | None = None,
    ):
        # Similar to Dataset.__init__ but without the need
        # to merge coords and data vars...
        # Probably ok to allow more flexibility / less safety here?
        ...

    @classmethod
    from_pandas_multiindex(cls, index: pd.MultiIndex, dim: str):
        ...

@shoyer
Copy link
Member

shoyer commented Dec 11, 2022

I'm actually trying to merge IndexedCoordinates with Coordinates but I'm stuck: the latter is abstract and I don't really see how I could refactor it together with DatasetCoordinates and DataArrayCoordinates

Coordinates is abstract because in the (current) Xarray data model, it doesn't actually store any data -- coordinates are stored in private attributes of the original Dataset (._variables and ._coord_names) or DataArray (._coords). So Coordinates needs to serve as a proxy for the data.

In the long term, I think we should refactor Dataset/DataArray to actually store data (coordinate variables, indexes and dimension sizes) on Coordinates, but that's a bigger refactor.

For now, it's worth noting that the current Coordinates class isn't actually exposed in Xarray's public API, just the DatasetCoordinates and DataArrayCoordinates classes (and not even their constructors). So the intermdiate step I would try is:

  1. Rename the current Coordinates baseclass to AbstractCoordinates.
  2. Rename your IndexedCoordinates class to Coordinates. Expose it in the public API. Make sure it can handle DatasetCoordinates and DataArrayCoordinates in the constructor.
  3. Maybe: use some Python magic to make DatasetCorodinates/DataArrayCoordinates subclasses of the new Coordinates. Or maybe make them actual subclassses, overriding many of the methods (including the constructor).

@benbovy
Copy link
Member Author

benbovy commented Dec 12, 2022

Thanks @shoyer, I've been thinking about similar short/long term plans although so far I haven't figured out how to implement your point 3. I'll give it another try.

Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally forgot that you can use Protocols to solve this typing issue, nice!

Obviously need to fix the issues found by the tests :)

import sys

if sys.version_info >= (3, 11):
from typing import Self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Finally this is merged.

This means we should be able to remove lots of the code that annotated self with a typevar (in another PR ofc).

@@ -93,10 +94,63 @@
DTypeLikeSave: Any = None


class Alignable(Protocol):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@benbovy
Copy link
Member Author

benbovy commented Jul 17, 2023

Thanks for your feedback @headtr1ck, your previous comments have been helpful. Yeah I gave it another try using Protocol, I've been struggling with it until trying with Self. This should be good now, hopefully. Another advantage is that the Alignable protocol class makes it pretty clear what is the required interface for an Xarray object to support alignment, i.e., it provides some self-documentation of the internals.


@classmethod
def _construct_direct(
cls: type[T_Coordinates],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you use Self here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 👍 I'll go trough the coordinate classes and see if I can replace type variables by Self.


@classmethod
def from_pandas_multiindex(
cls: type[T_Coordinates], midx: pd.MultiIndex, dim: str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here?

return cast(T_Coordinates, results.coords)

def _reindex_callback(
self: T_Coordinates,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here (not sure if it works with cast though..., Can always use a type: ignore)

exclude_dims,
exclude_vars,
)
return cast(T_Coordinates, aligned.coords)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not checked this, but maybe mypy is right to complain?

If you start with a DataArrayCoordinates, do you not end up with a DatasetCoordinates?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you start with a DataArrayCoordinates, do you not end up with a DatasetCoordinates?

This case shouldn't happen because Dataset and DataArray have their own _reindex_callback() called internally in Aligner. The callback here is called when aligning a Coordinates object, so we need to cast DatasetCoordinates into a Coordinates.

I should probably add a comment here to explain this.

@dcherian
Copy link
Contributor

RTD failure is real:

home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/7368/doc/api-hidden.rst:9: WARNING: autosummary: failed to import Coordinates.merge_coords.
Possible hints:
* ModuleNotFoundError: No module named 'xarray.Coordinates'
* PycodeError: no source found for module 'builtins'
* AttributeError: type object 'Coordinates' has no attribute 'merge_coords'
* ModuleNotFoundError: No module named 'Coordinates'
* KeyError: 'Coordinates'
/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/7368/doc/api-hidden.rst:9: WARNING: autosummary: failed to import core.coordinates.DatasetCoordinates.merge_coords.
Possible hints:
* ModuleNotFoundError: No module named 'xarray.core.coordinates.DatasetCoordinates'; 'xarray.core.coordinates' is not a package
* PycodeError: no source found for module 'builtins'
* AttributeError: type object 'DatasetCoordinates' has no attribute 'merge_coords'
* KeyError: 'core'
* ModuleNotFoundError: No module named 'core'
/home/docs/checkouts/readthedocs.org/user_builds/xray/checkouts/7368/doc/api-hidden.rst:9: WARNING: autosummary: failed to import core.coordinates.DataArrayCoordinates.merge_coords.
Possible hints:
* PycodeError: no source found for module 'builtins'
* KeyError: 'core'
* AttributeError: type object 'DataArrayCoordinates' has no attribute 'merge_coords'
* ModuleNotFoundError: No module named 'xarray.core.coordinates.DataArrayCoordinates'; 'xarray.core.coordinates' is not a package
* ModuleNotFoundError: No module named 'core'

@dcherian
Copy link
Contributor

@benbovy does this actually close #6633? It seems like we need a way to pass a Coordinates object to open_dataset and down to the backend code. We can totally address this in a followup PR.

@benbovy
Copy link
Member Author

benbovy commented Jul 18, 2023

@dcherian yes this PR would enable #6633 by passing a Coordinates object to the Dataset constructor in the backend code. It could be done in another PR indeed.

  • This doesn't create a default index for "x" (since no index is passed explicitly to the Coordinates object):
import xarray as xr

ds = xr.Dataset(
    data_vars={"foo": ("x", [0.0, 1.0])},
    coords=xr.Coordinates({"x": ("x", [1, 2])}),
)
  • Note that if "x" is passed as a data variable it will still be promoted as a coordinate with a default index, even when a Coordinates object is passed explicitly:
xr.Dataset(
    data_vars={"foo": ("x", [1.0, 2.0]), "x": [1, 2]},
    coords=xr.Coordinates(),
)

This might be slightly confusing but I don't know if it is really worth handling this special case (i.e., dimension coordinate variables in data_vars + empty Coordinates). I think we should keep promoting data variables as coordinates since data_vars is often used as a unique positional argument of Dataset.__init__().

@benbovy
Copy link
Member Author

benbovy commented Jul 18, 2023

I updated the docstrings of Dataset.__init__ and DataArray.__init__ to clarify that we can pass a Coordinates object to bypass the creation of default indexes for dimension coordinates.

I took this opportunity to improve the glossary in the documentation and make the concepts of (non)-dimension / (non)-indexed coordinate more in line with the current state of Xarray since the index refactor. I'm sure there are still other inconsistent parts in the documentation, though. I'll review and address that in a follow-up PR.

Pyright displays an info message "Self is not valid in this context" but
most important this should avoid runtime errors with python < 3.11.
@TomNicholas
Copy link
Member

I just want to say this is another amazing piece of work @benbovy 👏 👏

@dcherian
Copy link
Contributor

@benbovy shall we merge?

@benbovy
Copy link
Member Author

benbovy commented Jul 21, 2023

@dcherian yes!

@dcherian dcherian merged commit 4441f99 into pydata:main Jul 21, 2023
dcherian added a commit to dcherian/xarray that referenced this pull request Jul 24, 2023
…lazy-array

* upstream/main: (153 commits)
  Add HDF5 Section to read/write docs page (pydata#8012)
  [pre-commit.ci] pre-commit autoupdate (pydata#8014)
  Update interpolate_na in dataset.py (pydata#7974)
  improved docstring of to_netcdf (issue pydata#7127) (pydata#7947)
  Expose "Coordinates" as part of Xarray's public API (pydata#7368)
  Core team member guide (pydata#7999)
  join together duplicate entries in the text `repr` (pydata#7225)
  Update copyright year in README (pydata#8007)
  Allow opening datasets with nD dimenson coordinate variables. (pydata#7989)
  Move whats-new entry
  [pre-commit.ci] pre-commit autoupdate (pydata#7997)
  Add documentation on custom indexes (pydata#6975)
  Use variable name in all exceptions raised in `as_variable` (pydata#7995)
  Bump pypa/gh-action-pypi-publish from 1.8.7 to 1.8.8 (pydata#7994)
  New whatsnew section
  Remove future release notes before this release
  Update whats-new.rst for new release (pydata#7993)
  Remove hue_style from plot1d docstring (pydata#7925)
  Add new what's new section (pydata#7986)
  Release summary for v2023.07.0 (pydata#7979)
  ...
@headtr1ck headtr1ck mentioned this pull request Aug 3, 2023
4 tasks
@benbovy benbovy deleted the indexes-arg-constructors-2 branch August 30, 2023 09:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pass indexes to the Dataset and DataArray constructors
6 participants