Grouper, Resampler as public api #8840

Merged: 32 commits, Jul 18, 2024
Changes from 25 commits
Commits
65d500c
Grouper, Resampler as public API
dcherian Jun 13, 2024
9f5b9fd
Add test
dcherian Apr 20, 2024
f338e45
Add docs
dcherian Apr 17, 2024
09fa705
Fix test
dcherian Jun 20, 2024
110517a
fix types.
dcherian Jun 21, 2024
01fbf50
bugfix
dcherian Jun 21, 2024
e250895
Better binning API
dcherian Jun 28, 2024
5572930
Merge branch 'main' into grouper-public-api
dcherian Jun 28, 2024
0dc1663
docs.fixes
dcherian Jun 28, 2024
94b3ffa
Apply suggestions from code review
dcherian Jul 1, 2024
cdee857
Fix typing
dcherian Jul 1, 2024
74268ef
clean up reprs
dcherian Jul 1, 2024
64f78cd
Allow passing dicts
dcherian Jul 1, 2024
58de38f
Merge branch 'main' into grouper-public-api
dcherian Jul 8, 2024
6b2ed08
Apply suggestions from code review
dcherian Jul 11, 2024
3bbcab2
Update xarray/core/common.py
dcherian Jul 11, 2024
69d62cb
Review comments
dcherian Jul 11, 2024
205c2a7
Fix docstring
dcherian Jul 11, 2024
1142663
Try to fix typing
dcherian Jul 11, 2024
c94d9c2
Merge branch 'main' into grouper-public-api
dcherian Jul 11, 2024
7838d57
Nicer error
dcherian Jul 11, 2024
6bfe03a
Merge branch 'main' into grouper-public-api
dcherian Jul 17, 2024
cb4faad
Try fixing types
dcherian Jul 17, 2024
792098c
fix
dcherian Jul 17, 2024
8e1ed58
Merge branch 'main' into grouper-public-api
dcherian Jul 17, 2024
8b40396
Apply suggestions from code review
dcherian Jul 18, 2024
a2b0f05
Review comments
dcherian Jul 18, 2024
d521849
Add whats-new note
dcherian Jul 18, 2024
bf23f3d
Fix
dcherian Jul 18, 2024
25c897d
Add more types
dcherian Jul 18, 2024
1cafa67
Fix link
dcherian Jul 18, 2024
78252b2
FIx docs
dcherian Jul 18, 2024
4 changes: 4 additions & 0 deletions doc/api-hidden.rst
@@ -693,3 +693,7 @@

coding.times.CFTimedeltaCoder
coding.times.CFDatetimeCoder

core.groupers.Grouper
core.groupers.Resampler
core.groupers.EncodedGroups
23 changes: 19 additions & 4 deletions doc/api.rst
@@ -803,6 +803,18 @@ DataArray
DataArrayGroupBy.dims
DataArrayGroupBy.groups

Grouper Objects
---------------

.. currentmodule:: xarray.core

.. autosummary::
:toctree: generated/

groupers.BinGrouper
groupers.UniqueGrouper
groupers.TimeResampler


Rolling objects
===============
@@ -1028,17 +1040,20 @@ DataArray
Accessors
=========

-.. currentmodule:: xarray
+.. currentmodule:: xarray.core

.. autosummary::
:toctree: generated/

-core.accessor_dt.DatetimeAccessor
-core.accessor_dt.TimedeltaAccessor
-core.accessor_str.StringAccessor
+accessor_dt.DatetimeAccessor
+accessor_dt.TimedeltaAccessor
+accessor_str.StringAccessor


Custom Indexes
==============
.. currentmodule:: xarray

.. autosummary::
:toctree: generated/

3 changes: 3 additions & 0 deletions doc/conf.py
@@ -158,6 +158,8 @@
"Variable": "~xarray.Variable",
"DatasetGroupBy": "~xarray.core.groupby.DatasetGroupBy",
"DataArrayGroupBy": "~xarray.core.groupby.DataArrayGroupBy",
"Grouper": "~xarray.core.groupers.Grouper",
"Resampler": "~xarray.core.groupers.Resampler",
# objects without namespace: numpy
"ndarray": "~numpy.ndarray",
"MaskedArray": "~numpy.ma.MaskedArray",
@@ -169,6 +171,7 @@
"CategoricalIndex": "~pandas.CategoricalIndex",
"TimedeltaIndex": "~pandas.TimedeltaIndex",
"DatetimeIndex": "~pandas.DatetimeIndex",
"IntervalIndex": "~pandas.IntervalIndex",
"Series": "~pandas.Series",
"DataFrame": "~pandas.DataFrame",
"Categorical": "~pandas.Categorical",
93 changes: 82 additions & 11 deletions doc/user-guide/groupby.rst
@@ -1,3 +1,5 @@
.. currentmodule:: xarray

.. _groupby:

GroupBy: Group and Bin Data
@@ -15,19 +17,20 @@ __ https://www.jstatsoft.org/v40/i01/paper
- Apply some function to each group.
- Combine your groups back into a single data object.

-Group by operations work on both :py:class:`~xarray.Dataset` and
-:py:class:`~xarray.DataArray` objects. Most of the examples focus on grouping by
+Group by operations work on both :py:class:`Dataset` and
+:py:class:`DataArray` objects. Most of the examples focus on grouping by
a single one-dimensional variable, although support for grouping
over a multi-dimensional variable has recently been implemented. Note that for
one-dimensional data, it is usually faster to rely on pandas' implementation of
the same pipeline.

.. tip::

-To substantially improve the performance of GroupBy operations, particularly
-with dask `install the flox package <https://flox.readthedocs.io>`_. flox
+`Install the flox package <https://flox.readthedocs.io>`_ to substantially improve the performance
+of GroupBy operations, particularly with dask. flox
`extends Xarray's in-built GroupBy capabilities <https://flox.readthedocs.io/en/latest/xarray.html>`_
-by allowing grouping by multiple variables, and lazy grouping by dask arrays. If installed, Xarray will automatically use flox by default.
+by allowing grouping by multiple variables, and lazy grouping by dask arrays.
+If installed, Xarray will automatically use flox by default.

Split
~~~~~
@@ -87,7 +90,7 @@ Binning
Sometimes you don't want to use all the unique values to determine the groups
but instead want to "bin" the data into coarser groups. You could always create
a customized coordinate, but xarray facilitates this via the
-:py:meth:`~xarray.Dataset.groupby_bins` method.
+:py:meth:`Dataset.groupby_bins` method.

.. ipython:: python

@@ -110,7 +113,7 @@ Apply
~~~~~

To apply a function to each group, you can use the flexible
-:py:meth:`~xarray.core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
+:py:meth:`core.groupby.DatasetGroupBy.map` method. The resulting objects are automatically
concatenated back together along the group axis:

.. ipython:: python
@@ -121,8 +124,8 @@ concatenated back together along the group axis:

arr.groupby("letters").map(standardize)

-GroupBy objects also have a :py:meth:`~xarray.core.groupby.DatasetGroupBy.reduce` method and
-methods like :py:meth:`~xarray.core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
+GroupBy objects also have a :py:meth:`core.groupby.DatasetGroupBy.reduce` method and
+methods like :py:meth:`core.groupby.DatasetGroupBy.mean` as shortcuts for applying an
aggregation function:

.. ipython:: python
@@ -183,7 +186,7 @@ Iterating and Squeezing
Previously, Xarray defaulted to squeezing out dimensions of size one when iterating over
a GroupBy object. This behaviour is being removed.
You can always squeeze explicitly later with the Dataset or DataArray
-:py:meth:`~xarray.DataArray.squeeze` methods.
+:py:meth:`DataArray.squeeze` methods.

.. ipython:: python

@@ -217,7 +220,7 @@ __ https://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_two_dime
da.groupby("lon").map(lambda x: x - x.mean(), shortcut=False)

Because multidimensional groups have the ability to generate a very large
-number of bins, coarse-binning via :py:meth:`~xarray.Dataset.groupby_bins`
+number of bins, coarse-binning via :py:meth:`Dataset.groupby_bins`
may be desirable:

.. ipython:: python
@@ -232,3 +235,71 @@ applying your function, and then unstacking the result:

stacked = da.stack(gridcell=["ny", "nx"])
stacked.groupby("gridcell").sum(...).unstack("gridcell")

.. _groupby.groupers:

Grouper Objects
~~~~~~~~~~~~~~~

Both ``groupby_bins`` and ``resample`` are specializations of the core ``groupby`` operation, for binning
and time resampling respectively. Many problems demand more complex GroupBy applications: for example, grouping by multiple
variables with a combination of categorical grouping, binning, and resampling; further specializations like
spatial resampling; or more complex time grouping, like special handling of seasons or the ability to specify
custom seasons. To handle these use cases and more, Xarray is evolving to provide an
extension point using ``Grouper`` objects.

.. tip::

See the `grouper design`_ doc for more detail on the motivation and design ideas behind
Grouper objects.

.. _grouper design: https://github.com/pydata/xarray/blob/main/design_notes/grouper_objects.md

For now Xarray provides three specialized Grouper objects:

1. :py:class:`groupers.UniqueGrouper` for categorical grouping
2. :py:class:`groupers.BinGrouper` for binned grouping
3. :py:class:`groupers.TimeResampler` for resampling along a datetime coordinate

These provide functionality identical to the existing ``groupby``, ``groupby_bins``, and ``resample`` methods.
That is,

.. code-block:: python

ds.groupby("x")

is identical to

.. code-block:: python

from xarray.groupers import UniqueGrouper

ds.groupby(x=UniqueGrouper())

; and

.. code-block:: python

ds.groupby_bins("x", bins=bins)

is identical to

.. code-block:: python

from xarray.groupers import BinGrouper

ds.groupby(x=BinGrouper(bins))

and

.. code-block:: python

ds.resample(time="ME")

is identical to

.. code-block:: python

from xarray.groupers import TimeResampler

ds.resample(time=TimeResampler("ME"))
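
The three equivalences documented above can be checked with a short script. This is a minimal sketch and not part of the PR: the dataset, the coordinate names `letters` and `y`, and the bin edges are made up for illustration, and it assumes an installed xarray release that includes this change. The grouper classes are reached through the `xr.groupers` attribute that the `xarray/__init__.py` change in this PR exposes. The frequency `"MS"` is used instead of `"ME"` only because older pandas versions accept it too.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative data (an assumption, not taken from the PR).
time = pd.date_range("2001-01-01", periods=90, freq="D")
ds = xr.Dataset(
    {"foo": ("time", np.arange(90.0))},
    coords={
        "time": time,
        "letters": ("time", np.array(list("abc") * 30)),
        "y": ("time", np.linspace(0.0, 1.0, 90)),
    },
)
bins = [0.0, 0.25, 0.5, 1.0]

# Categorical grouping: string name vs. UniqueGrouper.
by_name = ds.groupby("letters").mean()
by_grouper = ds.groupby(letters=xr.groupers.UniqueGrouper()).mean()

# Binned grouping: groupby_bins vs. BinGrouper.
binned_name = ds.groupby_bins("y", bins=bins).mean()
binned_grouper = ds.groupby(y=xr.groupers.BinGrouper(bins)).mean()

# Time resampling: frequency string vs. TimeResampler.
resampled_name = ds.resample(time="MS").mean()
resampled_grouper = ds.resample(time=xr.groupers.TimeResampler("MS")).mean()

# Each pair should match if the documented equivalence holds.
xr.testing.assert_identical(by_name, by_grouper)
xr.testing.assert_identical(binned_name, binned_grouper)
xr.testing.assert_identical(resampled_name, resampled_grouper)
```

If any of the three comparisons raised, that would point to a divergence between the method-based and Grouper-based code paths.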
2 changes: 2 additions & 0 deletions xarray/__init__.py
@@ -14,6 +14,7 @@
from xarray.coding.cftimeindex import CFTimeIndex
from xarray.coding.frequencies import infer_freq
from xarray.conventions import SerializationWarning, decode_cf
from xarray.core import groupers
from xarray.core.alignment import align, broadcast
from xarray.core.combine import combine_by_coords, combine_nested
from xarray.core.common import ALL_DIMS, full_like, ones_like, zeros_like
@@ -55,6 +56,7 @@
# `mypy --strict` running in projects that import xarray.
__all__ = (
# Sub-packages
"groupers",
"testing",
"tutorial",
# Top-level functions
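
With `groupers` imported in `xarray/__init__.py` and listed in `__all__`, the module should be reachable as an attribute of the top-level package. The snippet below is a small sketch of that access pattern, assuming an installed xarray release that includes this change; it only inspects the class names documented in `doc/api.rst` and `doc/api-hidden.rst` in this PR.

```python
import xarray as xr

# Concrete groupers listed under "Grouper Objects" in doc/api.rst.
concrete = ("UniqueGrouper", "BinGrouper", "TimeResampler")
available = {name: hasattr(xr.groupers, name) for name in concrete}
print(available)

# TimeResampler is expected to be a Resampler, which in turn is a Grouper.
is_resampler = issubclass(xr.groupers.TimeResampler, xr.groupers.Resampler)
is_grouper = issubclass(xr.groupers.Resampler, xr.groupers.Grouper)
print(is_resampler, is_grouper)
```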
31 changes: 19 additions & 12 deletions xarray/core/common.py
@@ -38,6 +38,7 @@

from xarray.core.dataarray import DataArray
from xarray.core.dataset import Dataset
from xarray.core.groupers import Resampler
from xarray.core.indexes import Index
from xarray.core.resample import Resample
from xarray.core.rolling_exp import RollingExp
@@ -876,7 +877,7 @@ def rolling_exp(
def _resample(
self,
resample_cls: type[T_Resample],
-indexer: Mapping[Any, str] | None,
+indexer: Mapping[Hashable, str | Resampler] | None,
skipna: bool | None,
closed: SideOptions | None,
label: SideOptions | None,
@@ -885,7 +886,7 @@ def _resample(
origin: str | DatetimeLike,
loffset: datetime.timedelta | str | None,
restore_coord_dims: bool | None,
-**indexer_kwargs: str,
+**indexer_kwargs: str | Resampler,
) -> T_Resample:
"""Returns a Resample object for performing resampling operations.

@@ -1068,7 +1069,7 @@ def _resample(

from xarray.core.dataarray import DataArray
from xarray.core.groupby import ResolvedGrouper
-from xarray.core.groupers import TimeResampler
+from xarray.core.groupers import Resampler, TimeResampler
from xarray.core.resample import RESAMPLE_DIM

# note: the second argument (now 'skipna') used to be 'dim'
@@ -1098,15 +1099,21 @@ def _resample(
name=RESAMPLE_DIM,
)

-grouper = TimeResampler(
-    freq=freq,
-    closed=closed,
-    label=label,
-    origin=origin,
-    offset=offset,
-    loffset=loffset,
-    base=base,
-)
+grouper: Resampler
+if isinstance(freq, str):
+    grouper = TimeResampler(
+        freq=freq,
+        closed=closed,
+        label=label,
+        origin=origin,
+        offset=offset,
+        loffset=loffset,
+        base=base,
+    )
+elif isinstance(freq, Resampler):
+    grouper = freq
+else:
+    raise ValueError("freq must be a str or a Resampler object")

rgrouper = ResolvedGrouper(grouper, group, self)
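
The dispatch added in this hunk can be illustrated in isolation. The sketch below uses hypothetical stand-in classes (they share names with the xarray classes but are not the real implementations) and a hypothetical helper, `resolve_resampler`, that mirrors the three branches shown above. One consequence worth noting, as far as this hunk shows: when a `Resampler` object is passed, it is used as-is, so keyword arguments such as `closed` or `label` are not applied to it.

```python
from dataclasses import dataclass
from typing import Optional


class Resampler:
    """Hypothetical stand-in for the Resampler base class."""


@dataclass
class TimeResampler(Resampler):
    """Hypothetical stand-in that only stores the arguments it receives."""

    freq: str
    closed: Optional[str] = None
    label: Optional[str] = None


def resolve_resampler(freq, closed=None, label=None):
    """Mirror the branches in the hunk above (illustrative helper, not xarray API)."""
    if isinstance(freq, str):
        # A frequency string is wrapped together with the keyword arguments.
        return TimeResampler(freq=freq, closed=closed, label=label)
    elif isinstance(freq, Resampler):
        # A ready-made Resampler is passed through unchanged.
        return freq
    raise ValueError("freq must be a str or a Resampler object")


from_string = resolve_resampler("MS", closed="left")
custom = TimeResampler("MS", closed="right")
passed_through = resolve_resampler(custom, closed="left")

try:
    resolve_resampler(5)
    rejected = False
except ValueError:
    rejected = True
```

Here `passed_through` is the same object as `custom` and keeps `closed="right"`, while `from_string` picks up `closed="left"` from the keyword argument.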
