diff --git a/doc/api-hidden.rst b/doc/api-hidden.rst index d9c89649358..b62fdba4bc6 100644 --- a/doc/api-hidden.rst +++ b/doc/api-hidden.rst @@ -693,3 +693,7 @@ coding.times.CFTimedeltaCoder coding.times.CFDatetimeCoder + + core.groupers.Grouper + core.groupers.Resampler + core.groupers.EncodedGroups diff --git a/doc/api.rst b/doc/api.rst index a8f8ea7dd1c..9f0109f2474 100644 --- a/doc/api.rst +++ b/doc/api.rst @@ -801,6 +801,18 @@ DataArray DataArrayGroupBy.dims DataArrayGroupBy.groups +Grouper Objects +--------------- + +.. currentmodule:: xarray.core + +.. autosummary:: + :toctree: generated/ + + groupers.BinGrouper + groupers.UniqueGrouper + groupers.TimeResampler + Rolling objects =============== @@ -1026,17 +1038,20 @@ DataArray Accessors ========= -.. currentmodule:: xarray +.. currentmodule:: xarray.core .. autosummary:: :toctree: generated/ - core.accessor_dt.DatetimeAccessor - core.accessor_dt.TimedeltaAccessor - core.accessor_str.StringAccessor + accessor_dt.DatetimeAccessor + accessor_dt.TimedeltaAccessor + accessor_str.StringAccessor + Custom Indexes ============== +.. currentmodule:: xarray + .. autosummary:: :toctree: generated/ diff --git a/doc/conf.py b/doc/conf.py index 80b24445f71..d8c19cc6c5f 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -166,6 +166,7 @@ "CategoricalIndex": "~pandas.CategoricalIndex", "TimedeltaIndex": "~pandas.TimedeltaIndex", "DatetimeIndex": "~pandas.DatetimeIndex", + "IntervalIndex": "~pandas.IntervalIndex", "Series": "~pandas.Series", "DataFrame": "~pandas.DataFrame", "Categorical": "~pandas.Categorical", diff --git a/doc/user-guide/groupby.rst b/doc/user-guide/groupby.rst index 1ad2d52fc00..5a8b954c9d2 100644 --- a/doc/user-guide/groupby.rst +++ b/doc/user-guide/groupby.rst @@ -1,3 +1,5 @@ +.. currentmodule:: xarray + .. _groupby: GroupBy: Group and Bin Data @@ -15,8 +17,8 @@ __ https://www.jstatsoft.org/v40/i01/paper - Apply some function to each group. - Combine your groups back into a single data object. -Group by operations work on both :py:class:`~xarray.Dataset` and -:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by +Group by operations work on both :py:class:`Dataset` and +:py:class:`DataArray` objects. Most of the examples focus on grouping by a single one-dimensional variable, although support for grouping over a multi-dimensional variable has recently been implemented. Note that for one-dimensional data, it is usually faster to rely on pandas' implementation of @@ -24,10 +26,11 @@ the same pipeline. .. tip:: - To substantially improve the performance of GroupBy operations, particularly - with dask `install the flox package `_. flox + `Install the flox package `_ to substantially improve the performance + of GroupBy operations, particularly with dask. flox `extends Xarray's in-built GroupBy capabilities `_ - by allowing grouping by multiple variables, and lazy grouping by dask arrays. If installed, Xarray will automatically use flox by default. + by allowing grouping by multiple variables, and lazy grouping by dask arrays. + If installed, Xarray will automatically use flox by default. Split ~~~~~ @@ -87,7 +90,7 @@ Binning Sometimes you don't want to use all the unique values to determine the groups but instead want to "bin" the data into coarser groups. You could always create a customized coordinate, but xarray facilitates this via the -:py:meth:`~xarray.Dataset.groupby_bins` method. +:py:meth:`Dataset.groupby_bins` method. .. ipython:: python @@ -110,7 +113,7 @@ Apply ~~~~~ To apply a function to each group, you can use the flexible -:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically +:py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically concatenated back together along the group axis: .. ipython:: python @@ -121,8 +124,8 @@ concatenated back together along the group axis: arr.groupby("letters").map(standardize) -GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and -methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an +GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and +methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an aggregation function: .. ipython:: python @@ -183,7 +186,7 @@ Iterating and Squeezing Previously, Xarray defaulted to squeezing out dimensions of size one when iterating over a GroupBy object. This behaviour is being removed. You can always squeeze explicitly later with the Dataset or DataArray -:py:meth:`~xarray.DataArray.squeeze` methods. +:py:meth:`DataArray.squeeze` methods. .. ipython:: python @@ -217,7 +220,7 @@ __ https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dime da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False) Because multidimensional groups have the ability to generate a very large -number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins` +number of bins, coarse-binning via :py:meth:`Dataset.groupby_bins` may be desirable: .. ipython:: python @@ -232,3 +235,65 @@ applying your function, and then unstacking the result: stacked = da.stack(gridcell=["ny", "nx"]) stacked.groupby("gridcell").sum(...).unstack("gridcell") + +.. _groupby.groupers: + +Grouper Objects +~~~~~~~~~~~~~~~ + +Both ``groupby_bins`` and ``resample`` are specializations of the core ``groupby`` operation for binning, +and time resampling. Many problems demand more complex GroupBy application: for example, grouping by multiple +variables with a combination of categorical grouping, binning, and resampling; or more specializations like +spatial resampling; or more complex time grouping like special handling of seasons, or the ability to specify +custom seasons. To handle these use-cases and more, Xarray is evolving to providing an +extension point using ``Grouper`` objects. + +.. tip:: + + See the `grouper design`_ doc for more detail on the motivation and design ideas behind + Grouper objects. + +.. _grouper design: https://github.com/pydata/xarray/blob/main/design_notes/grouper_objects.md + +For now Xarray provides three specialized Grouper objects: + +1. :py:class:`groupers.UniqueGrouper` for categorical grouping +2. :py:class:`groupers.BinGrouper` for binned grouping +3. :py:class:`groupers.TimeResampler` for resampling along a datetime coordinate + +These provide functionality identical to the existing ``groupby``, ``groupby_bins``, and ``resample`` methods. +That is, +.. codeblock:: python + + ds.groupby("x") + +is identical to +.. codeblock:: python + + from xarray.groupers import UniqueGrouper + + ds.groupby(x=UniqueGrouper()) + +; and +.. codeblock:: python + + ds.groupby_bins("x", bins=bins) + +is identical to +.. codeblock:: python + + from xarray.groupers import BinGrouper + + ds.groupby(x=BinGrouper(bins)) + +; and +.. codeblock:: python + + ds.resample(time="ME") + +is identical to +.. codeblock:: python + + from xarray.groupers import TimeResampler + + ds.resample(time=TimeResampler("ME")) diff --git a/xarray/__init__.py b/xarray/__init__.py index c8d99bab38c..10e09bbf734 100644 --- a/xarray/__init__.py +++ b/xarray/__init__.py @@ -56,6 +56,7 @@ # `mypy --strict` running in projects that import xarray. __all__ = ( # Sub-packages + "groupers", "testing", "tutorial", # Top-level functions @@ -95,8 +96,6 @@ "unify_chunks", "where", "zeros_like", - # Submodules - "groupers", # Classes "CFTimeIndex", "Context", diff --git a/xarray/core/groupers.py b/xarray/core/groupers.py index 05045f98ff8..baa3414e798 100644 --- a/xarray/core/groupers.py +++ b/xarray/core/groupers.py @@ -40,15 +40,20 @@ class EncodedGroups: """ Dataclass for storing intermediate values for GroupBy operation. - Returned by factorize method on Grouper objects. + Returned by the ``factorize`` method on Grouper objects. - Parameters + Attributes ---------- - codes: integer codes for each group - full_index: pandas Index for the group coordinate - group_indices: optional, List of indices of array elements belonging - to each group. Inferred if not provided. - unique_coord: Unique group values present in dataset. Inferred if not provided + codes: DataArray + Same shape as the DataArray to group by. Values consist of a unique integer code for each group. + full_index: pd.Index + Pandas Index for the group coordinate containing unique group labels. + This can differ from ``unique_coord`` in the case of resampling and binning, + where certain groups in the output need not be present in the input. + group_indices: list of int or slice or list of int, optional + List of indices of array elements belonging to each group. Inferred if not provided. + unique_coord: Variable, optional + Unique group values present in dataset. Inferred if not provided """ codes: DataArray @@ -68,33 +73,39 @@ def __post_init__(self): class Grouper(ABC): - """Base class for Grouper objects that allow specializing GroupBy instructions.""" + """Abstract base class for Grouper objects that allow specializing GroupBy instructions.""" @property def can_squeeze(self) -> bool: - """TODO: delete this when the `squeeze` kwarg is deprecated. Only `UniqueGrouper` - should override it.""" + """ + Do not use. + + .. deprecated:: 2023.03.0 + This is a deprecated method. It will be deleted this when the `squeeze` kwarg is deprecated. + Only ``UniqueGrouper`` should override it. + """ return False @abstractmethod - def factorize(self, group) -> EncodedGroups: + def factorize(self, group: T_Group) -> EncodedGroups: """ - Takes the group, and creates intermediates necessary for GroupBy. - These intermediates are - 1. codes - Same shape as `group` containing a unique integer code for each group. - 2. group_indices - Indexes that let us index out the members of each group. - 3. unique_coord - Unique groups present in the dataset. - 4. full_index - Unique groups in the output. This differs from `unique_coord` in the - case of resampling and binning, where certain groups in the output are not present in - the input. - - Returns an instance of EncodedGroups. + Creates intermediates necessary for GroupBy. + + Parameters + ---------- + group: DataArray + DataArray we are grouping by. + + Returns + ------- + EncodedGroups """ pass class Resampler(Grouper): - """Base class for Grouper objects that allow specializing resampling-type GroupBy instructions. + """ + Abstract base class for Grouper objects that allow specializing resampling-type GroupBy instructions. Currently only used for TimeResampler, but could be used for SpaceResampler in the future. """ @@ -116,6 +127,7 @@ def is_unique_and_monotonic(self) -> bool: @property def group_as_index(self) -> pd.Index: + """Caches the group DataArray as a pandas Index.""" if self._group_as_index is None: self._group_as_index = self.group.to_index() return self._group_as_index @@ -126,7 +138,7 @@ def can_squeeze(self) -> bool: is_dimension = self.group.dims == (self.group.name,) return is_dimension and self.is_unique_and_monotonic - def factorize(self, group1d) -> EncodedGroups: + def factorize(self, group1d: T_Group) -> EncodedGroups: self.group = group1d if self.can_squeeze: @@ -178,18 +190,25 @@ def _factorize_dummy(self) -> EncodedGroups: @dataclass class BinGrouper(Grouper): - """Grouper object for binning numeric data.""" + """ + Grouper object for binning numeric data. + + Attributes + ---------- + bins: int, sequence of scalars, or IntervalIndex + Speciication for bins either as integer, or as bin edges. + cut_kwargs: dict + Keyword arguments forwarded to :py:func:`pandas.cut`. + """ bins: int | Sequence | pd.IntervalIndex cut_kwargs: Mapping = field(default_factory=dict) - binned: Any = None - name: Any = None def __post_init__(self) -> None: if duck_array_ops.isnull(self.bins).all(): raise ValueError("All bin edges are NaN.") - def factorize(self, group) -> EncodedGroups: + def factorize(self, group: T_Group) -> EncodedGroups: from xarray.core.dataarray import DataArray data = group.data @@ -221,7 +240,51 @@ def factorize(self, group) -> EncodedGroups: @dataclass class TimeResampler(Resampler): - """Grouper object specialized to resampling the time coordinate.""" + """ + Grouper object specialized to resampling the time coordinate. + + Attributes + ---------- + freq : str + Frequency to resample to. See `Pandas frequency + aliases `_ + for a list of possible values. + closed : {"left", "right"}, optional + Side of each interval to treat as closed. + label : {"left", "right"}, optional + Side of each interval to use for labeling. + base : int, optional + For frequencies that evenly subdivide 1 day, the "origin" of the + aggregated intervals. For example, for "24H" frequency, base could + range from 0 through 23. + + .. deprecated:: 2023.03.0 + Following pandas, the ``base`` parameter is deprecated in favor + of the ``origin`` and ``offset`` parameters, and will be removed + in a future version of xarray. + + origin : {'epoch', 'start', 'start_day', 'end', 'end_day'}, pandas.Timestamp, datetime.datetime, numpy.datetime64, or cftime.datetime, default 'start_day' + The datetime on which to adjust the grouping. The timezone of origin + must match the timezone of the index. + + If a datetime is not used, these values are also supported: + - 'epoch': `origin` is 1970-01-01 + - 'start': `origin` is the first value of the timeseries + - 'start_day': `origin` is the first day at midnight of the timeseries + - 'end': `origin` is the last value of the timeseries + - 'end_day': `origin` is the ceiling midnight of the last day + offset : pd.Timedelta, datetime.timedelta, or str, default is None + An offset timedelta added to the origin. + loffset : timedelta or str, optional + Offset used to adjust the resampled time labels. Some pandas date + offset strings are supported. + + .. deprecated:: 2023.03.0 + Following pandas, the ``loffset`` parameter is deprecated in favor + of using time offset arithmetic, and will be removed in a future + version of xarray. + + """ freq: str closed: SideOptions | None = field(default=None) @@ -321,7 +384,7 @@ def first_items(self) -> tuple[pd.Series, np.ndarray]: _apply_loffset(self.loffset, first_items) return first_items, codes - def factorize(self, group) -> EncodedGroups: + def factorize(self, group: T_Group) -> EncodedGroups: self._init_properties(group) full_index, first_items, codes_ = self._get_index_and_items() sbins = first_items.values.astype(np.int64)