[wip] Add docs

pydata · Jun 13, 2024 · f00d9d2
1 parent 7e74ec6
commit f00d9d2
Showing 6 changed files with 194 additions and 46 deletions.
diff --git a/doc/api-hidden.rst b/doc/api-hidden.rst
@@ -693,3 +693,7 @@
 
    coding.times.CFTimedeltaCoder
    coding.times.CFDatetimeCoder
+
+   core.groupers.Grouper
+   core.groupers.Resampler
+   core.groupers.EncodedGroups
diff --git a/doc/api.rst b/doc/api.rst
@@ -801,6 +801,18 @@ DataArray
    DataArrayGroupBy.dims
    DataArrayGroupBy.groups
 
+Grouper Objects
+---------------
+
+.. currentmodule:: xarray.core
+
+.. autosummary::
+   :toctree: generated/
+
+   groupers.BinGrouper
+   groupers.UniqueGrouper
+   groupers.TimeResampler
+
 
 Rolling objects
 ===============
@@ -1026,17 +1038,20 @@ DataArray
 Accessors
 =========
 
-.. currentmodule:: xarray
+.. currentmodule:: xarray.core
 
 .. autosummary::
    :toctree: generated/
 
-   core.accessor_dt.DatetimeAccessor
-   core.accessor_dt.TimedeltaAccessor
-   core.accessor_str.StringAccessor
+   accessor_dt.DatetimeAccessor
+   accessor_dt.TimedeltaAccessor
+   accessor_str.StringAccessor
+
 
 Custom Indexes
 ==============
+.. currentmodule:: xarray
+
 .. autosummary::
    :toctree: generated/
 

diff --git a/doc/conf.py b/doc/conf.py
@@ -166,6 +166,7 @@
     "CategoricalIndex": "~pandas.CategoricalIndex",
     "TimedeltaIndex": "~pandas.TimedeltaIndex",
     "DatetimeIndex": "~pandas.DatetimeIndex",
+    "IntervalIndex": "~pandas.IntervalIndex",
     "Series": "~pandas.Series",
     "DataFrame": "~pandas.DataFrame",
     "Categorical": "~pandas.Categorical",

diff --git a/doc/user-guide/groupby.rst b/doc/user-guide/groupby.rst
@@ -1,3 +1,5 @@
+.. currentmodule:: xarray
+
 .. _groupby:
 
 GroupBy: Group and Bin Data
@@ -15,19 +17,20 @@ __ https://www.jstatsoft.org/v40/i01/paper
 - Apply some function to each group.
 - Combine your groups back into a single data object.
 
-Group by operations work on both :py:class:`~xarray.Dataset` and
-:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
+Group by operations work on both :py:class:`Dataset` and
+:py:class:`DataArray` objects. Most of the examples focus on grouping by
 a single one-dimensional variable, although support for grouping
 over a multi-dimensional variable has recently been implemented. Note that for
 one-dimensional data, it is usually faster to rely on pandas' implementation of
 the same pipeline.
 
 .. tip::
 
-   To substantially improve the performance of GroupBy operations, particularly
-   with dask `install the flox package <https://flox.readthedocs.io>`_. flox
+   `Install the flox package <https://flox.readthedocs.io>`_ to substantially improve the performance
+   of GroupBy operations, particularly with dask. flox
    `extends Xarray's in-built GroupBy capabilities <https://flox.readthedocs.io/en/latest/xarray.html>`_
-   by allowing grouping by multiple variables, and lazy grouping by dask arrays. If installed, Xarray will automatically use flox by default.
+   by allowing grouping by multiple variables, and lazy grouping by dask arrays.
+   If installed, Xarray will automatically use flox by default.
 
 Split
 ~~~~~
@@ -87,7 +90,7 @@ Binning
 Sometimes you don't want to use all the unique values to determine the groups
 but instead want to "bin" the data into coarser groups. You could always create
 a customized coordinate, but xarray facilitates this via the
-:py:meth:`~xarray.Dataset.groupby_bins` method.
+:py:meth:`Dataset.groupby_bins` method.
 
 .. ipython:: python
 
@@ -110,7 +113,7 @@ Apply
 ~~~~~
 
 To apply a function to each group, you can use the flexible
-:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
+:py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
 concatenated back together along the group axis:
 
 .. ipython:: python
@@ -121,8 +124,8 @@ concatenated back together along the group axis:
 
     arr.groupby("letters").map(standardize)
 
-GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and
-methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
+GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and
+methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
 aggregation function:
 
 .. ipython:: python
@@ -183,7 +186,7 @@ Iterating and Squeezing
 Previously, Xarray defaulted to squeezing out dimensions of size one when iterating over
 a GroupBy object. This behaviour is being removed.
 You can always squeeze explicitly later with the Dataset or DataArray
-:py:meth:`~xarray.DataArray.squeeze` methods.
+:py:meth:`DataArray.squeeze` methods.
 
 .. ipython:: python
 
@@ -217,7 +220,7 @@ __ https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dime
     da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)
 
 Because multidimensional groups have the ability to generate a very large
-number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
+number of bins, coarse-binning via :py:meth:`Dataset.groupby_bins`
 may be desirable:
 
 .. ipython:: python
@@ -232,3 +235,66 @@ applying your function, and then unstacking the result:
 
     stacked = da.stack(gridcell=["ny", "nx"])
     stacked.groupby("gridcell").sum(...).unstack("gridcell")
+
+.. _groupby.groupers:
+
+Extending GroupBy: Grouper Objects
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. currentmodule:: xarray.core.groupers
+
+.. warning::
+
+   This is an advanced, experimental API. We encourage you to try it out and let us know how it works for you.
+   See the `design document <https://github.com/pydata/xarray/blob/main/design_notes/grouper_objects.md>`_
+   for more background.
+
+The first step in executing a GroupBy analysis is to *identify* the groups and create an intermediate array in which each element is labeled
+with the unique integer code of the group it belongs to. Commonly this step is executed using :py:func:`pandas.factorize` for grouping by a categorical variable (e.g. ``['a', 'b', 'a', 'b']``)
+and :py:func:`pandas.cut`, :py:func:`numpy.digitize`, or :py:func:`numpy.searchsorted` for binning a numeric variable.
+
+Much of the difficulty in advanced GroupBy problems can be abstracted into a specialized "factorize" operation that identifies the necessary groups.
+:py:class:`Grouper` and :py:class:`Resampler` objects provide an extension point that allows Xarray's GroupBy machinery
+to use such specialized "factorization" operations.
+Eventually, they will also provide a natural way to extend GroupBy to grouping by multiple variables: ``ds.groupby(x=BinGrouper(...), t=Resampler(freq="M", ...)).mean()``.
+
+Xarray provides three Grouper objects today:
+
+1. :py:class:`UniqueGrouper` for categorical grouping
+2. :py:class:`BinGrouper` for binned grouping
+3. :py:class:`TimeResampler` for resampling along a datetime coordinate
+
+These objects mean that:
+
+- ``ds.groupby("categories")`` is identical to ``ds.groupby(categories=UniqueGrouper())``.
+- ``ds.groupby_bins("values", bins=5)`` is identical to ``ds.groupby(values=BinGrouper(bins=5))``.
+- ``ds.resample(time="H")`` is identical to ``ds.groupby(time=TimeResampler(freq="H"))``.
+
+For example, consider a seasonal grouping ``ds.groupby("time.season")``. This approach treats ``ds.time.dt.season`` as a categorical variable to group by and is naive
+to the many complexities of time grouping. Specialized ``SeasonGrouper`` and ``SeasonResampler`` objects would allow:
+
+- Supporting seasons that span a year-end.
+- Only including seasons with complete data coverage.
+- Grouping over seasons of unequal length.
+- Returning results with seasons in the appropriate chronological order.
+
+To define a custom grouper, subclass either the :py:class:`Grouper` or the :py:class:`Resampler` abstract base class
+and provide a customized ``factorize`` method. This method must accept a :py:class:`DataArray` to group by and return
+an instance of :py:class:`EncodedGroups`.
+
+.. ipython:: python
+
+    from xarray import Variable
+
+
+    class YearGrouper(xr.groupers.Grouper):
+        """
+        An example re-implementation of ``.groupby("time.year")``.
+        """
+
+        def factorize(self, group) -> xr.groupers.EncodedGroups:
+            assert np.issubdtype(group.dtype, np.datetime64)
+            year = group.dt.year
+            codes, uniques = pd.factorize(year)
+            unique_coord = Variable(dims="year", data=uniques)
+            return xr.groupers.EncodedGroups(codes=codes, unique_coord=unique_coord)
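
Note on the "factorize" step described in the new section: the following is a minimal sketch, using only pandas and NumPy, of what "identifying groups with integer codes" can look like. It is not Xarray's implementation; the sample arrays and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Categorical grouping: every distinct label is assigned one integer code.
labels = np.array(["a", "b", "a", "b"])
codes, uniques = pd.factorize(labels)

# Binned grouping: pd.cut places each value in an interval, and the
# resulting Categorical exposes the integer bin id of each value.
values = np.array([0.5, 1.5, 2.5, 3.5])
binned = pd.cut(values, bins=[0, 2, 4])
bin_codes = binned.codes

# Once codes exist, a reduction only needs the codes, not the labels.
# Here: a per-group sum of some illustrative data using the label codes.
data = np.array([1.0, 2.0, 3.0, 4.0])
sums = np.bincount(codes, weights=data)

print(codes, list(uniques))
print(bin_codes)
print(sums)
```

In this sketch the reduction never looks at the original labels, which is why a custom grouper may only need to supply the codes and the unique group values.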
diff --git a/xarray/__init__.py b/xarray/__init__.py
@@ -56,6 +56,7 @@
 # `mypy --strict` running in projects that import xarray.
 __all__ = (
     # Sub-packages
+    "groupers",
     "testing",
     "tutorial",
     # Top-level functions
@@ -95,8 +96,6 @@
     "unify_chunks",
     "where",
     "zeros_like",
-    # Submodules
-    "groupers",
     # Classes
     "CFTimeIndex",
     "Context",