Commit

Docstring and documentation improvement for the Dataset class (#8973)
* Updates the example in the doc-string for the Dataset class to be clearer.

The example in the doc-string of the `Dataset` class prior to this commit
uses an example array whose size is `2 x 2 x 3` with the first two dimensions
labeled `"x"` and `"y"` and the final dimension labeled `"time"`. This was
confusing due to the fact that `"x"` and `"y"` are just arbitrary names for
these axes and that no reason is given for the data to be organized in a `2x2x3`
array instead of a `2x2` matrix. This commit clarifies the example.

See issue #8970 for more information.

* Updates the documentation of the Dataset class to have clearer examples.

These changes to the documentation bring it into alignment with the
changes to the `Dataset` doc-string committed previously.

See issue #8970 for more information.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Adds dataset size reports to the output of the example in the Dataset docstring.

* Fixes the documentation errors in the previous commits.

* Fixes indentation errors in the docs for previous commits.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
3 people authored Apr 30, 2024
1 parent 08e43b9 commit 71372c1
Showing 2 changed files with 71 additions and 43 deletions.
53 changes: 33 additions & 20 deletions doc/user-guide/data-structures.rst
@@ -282,27 +282,40 @@ variables (``data_vars``), coordinates (``coords``) and attributes (``attrs``).

- ``attrs`` should be a dictionary.

Let's create some fake data for the example we show above:
Let's create some fake data for the example we show above. In this
example dataset, we will represent measurements of the temperature and
precipitation that were made under various conditions:

* the measurements were made on four different days;
* they were made at two separate locations, which we will represent using
their latitude and longitude; and
* they were made using instruments by three different manufacturers, which we
will refer to as `'manufac1'`, `'manufac2'`, and `'manufac3'`.

.. ipython:: python
temp = 15 + 8 * np.random.randn(2, 2, 3)
precip = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 3, 4)
precipitation = 10 * np.random.rand(2, 3, 4)
lon = [-99.83, -99.32]
lat = [42.25, 42.21]
instruments = ["manufac1", "manufac2", "manufac3"]
time = pd.date_range("2014-09-06", periods=4)
reference_time = pd.Timestamp("2014-09-05")
# for real use cases, it's good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
ds = xr.Dataset(
{
"temperature": (["x", "y", "time"], temp),
"precipitation": (["x", "y", "time"], precip),
"temperature": (["loc", "instrument", "time"], temperature),
"precipitation": (["loc", "instrument", "time"], precipitation),
},
coords={
"lon": (["x", "y"], lon),
"lat": (["x", "y"], lat),
"time": pd.date_range("2014-09-06", periods=3),
"reference_time": pd.Timestamp("2014-09-05"),
"lon": (["loc"], lon),
"lat": (["loc"], lat),
"instrument": instruments,
"time": time,
"reference_time": reference_time,
},
)
ds
@@ -387,12 +400,12 @@ example, to create this example dataset from scratch, we could have written:
.. ipython:: python
ds = xr.Dataset()
ds["temperature"] = (("x", "y", "time"), temp)
ds["temperature_double"] = (("x", "y", "time"), temp * 2)
ds["precipitation"] = (("x", "y", "time"), precip)
ds.coords["lat"] = (("x", "y"), lat)
ds.coords["lon"] = (("x", "y"), lon)
ds.coords["time"] = pd.date_range("2014-09-06", periods=3)
ds["temperature"] = (("loc", "instrument", "time"), temperature)
ds["temperature_double"] = (("loc", "instrument", "time"), temperature * 2)
ds["precipitation"] = (("loc", "instrument", "time"), precipitation)
ds.coords["lat"] = (("loc",), lat)
ds.coords["lon"] = (("loc",), lon)
ds.coords["time"] = pd.date_range("2014-09-06", periods=4)
ds.coords["reference_time"] = pd.Timestamp("2014-09-05")
To change the variables in a ``Dataset``, you can use all the standard dictionary
@@ -452,8 +465,8 @@ follow nested function calls:
# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
plt.plot((2 * ds.temperature.sel(x=0)).mean("y"))
(ds.temperature.sel(x=0).pipe(lambda x: 2 * x).mean("y").pipe(plt.plot))
plt.plot((2 * ds.temperature.sel(loc=0)).mean("instrument"))
(ds.temperature.sel(loc=0).pipe(lambda x: 2 * x).mean("instrument").pipe(plt.plot))
Both ``pipe`` and ``assign`` replicate the pandas methods of the same names
(:py:meth:`DataFrame.pipe <pandas.DataFrame.pipe>` and
@@ -479,7 +492,7 @@ dimension and non-dimension variables:

.. ipython:: python
ds.coords["day"] = ("time", [6, 7, 8])
ds.coords["day"] = ("time", [6, 7, 8, 9])
ds.swap_dims({"time": "day"})
.. _coordinates:
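As a quick check of the updated user-guide snippets, the sketch below rebuilds a dataset with the same dimension names and confirms two things: the nested and piped expressions agree, and `swap_dims` accepts the four-element `day` coordinate. It is an illustration, not part of the commit. It assumes `isel(loc=0)` in place of `sel(loc=0)` to make the positional lookup explicit (`loc` has no index), and it omits the `plt.plot` call so that matplotlib is not required. The names `nested`, `piped`, and `swapped` are introduced here only for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same shape and dimension names as the updated user-guide example.
np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 3, 4)

ds = xr.Dataset(
    {"temperature": (["loc", "instrument", "time"], temperature)},
    coords={
        "lon": (["loc"], [-99.83, -99.32]),
        "lat": (["loc"], [42.25, 42.21]),
        "instrument": ["manufac1", "manufac2", "manufac3"],
        "time": pd.date_range("2014-09-06", periods=4),
    },
)

# Nested call versus the left-to-right pipe form (plotting step omitted).
nested = (2 * ds.temperature.isel(loc=0)).mean("instrument")
piped = ds.temperature.isel(loc=0).pipe(lambda x: 2 * x).mean("instrument")

# The "day" coordinate needs one entry per time step, hence four values.
ds.coords["day"] = ("time", [6, 7, 8, 9])
swapped = ds.swap_dims({"time": "day"})
```

After the swap, `day` is the dimension coordinate and `time` remains available as a non-dimension coordinate along `day`.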
61 changes: 38 additions & 23 deletions xarray/core/dataset.py
@@ -590,60 +590,75 @@ class Dataset(
Examples
--------
Create data:
In this example dataset, we will represent measurements of the temperature
and precipitation that were made under various conditions:
* the measurements were made on four different days;
* they were made at two separate locations, which we will represent using
their latitude and longitude; and
* they were made using three instruments developed by three different
manufacturers, which we will refer to using the strings `'manufac1'`,
`'manufac2'`, and `'manufac3'`.
>>> np.random.seed(0)
>>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
>>> precipitation = 10 * np.random.rand(2, 2, 3)
>>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
>>> lat = [[42.25, 42.21], [42.63, 42.59]]
>>> time = pd.date_range("2014-09-06", periods=3)
>>> temperature = 15 + 8 * np.random.randn(2, 3, 4)
>>> precipitation = 10 * np.random.rand(2, 3, 4)
>>> lon = [-99.83, -99.32]
>>> lat = [42.25, 42.21]
>>> instruments = ["manufac1", "manufac2", "manufac3"]
>>> time = pd.date_range("2014-09-06", periods=4)
>>> reference_time = pd.Timestamp("2014-09-05")
Initialize a dataset with multiple dimensions:
Here, we initialize the dataset with multiple dimensions. We use the string
`"loc"` to represent the location dimension of the data, the string
`"instrument"` to represent the instrument manufacturer dimension, and the
string `"time"` for the time dimension.
>>> ds = xr.Dataset(
... data_vars=dict(
... temperature=(["x", "y", "time"], temperature),
... precipitation=(["x", "y", "time"], precipitation),
... temperature=(["loc", "instrument", "time"], temperature),
... precipitation=(["loc", "instrument", "time"], precipitation),
... ),
... coords=dict(
... lon=(["x", "y"], lon),
... lat=(["x", "y"], lat),
... lon=("loc", lon),
... lat=("loc", lat),
... instrument=instruments,
... time=time,
... reference_time=reference_time,
... ),
... attrs=dict(description="Weather related data."),
... )
>>> ds
<xarray.Dataset> Size: 288B
Dimensions: (x: 2, y: 2, time: 3)
<xarray.Dataset> Size: 552B
Dimensions: (loc: 2, instrument: 3, time: 4)
Coordinates:
lon (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
lat (x, y) float64 32B 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
lon (loc) float64 16B -99.83 -99.32
lat (loc) float64 16B 42.25 42.21
* instrument (instrument) <U8 96B 'manufac1' 'manufac2' 'manufac3'
* time (time) datetime64[ns] 32B 2014-09-06 ... 2014-09-09
reference_time datetime64[ns] 8B 2014-09-05
Dimensions without coordinates: x, y
Dimensions without coordinates: loc
Data variables:
temperature (x, y, time) float64 96B 29.11 18.2 22.83 ... 16.15 26.63
precipitation (x, y, time) float64 96B 5.68 9.256 0.7104 ... 4.615 7.805
temperature (loc, instrument, time) float64 192B 29.11 18.2 ... 9.063
precipitation (loc, instrument, time) float64 192B 4.562 5.684 ... 1.613
Attributes:
description: Weather related data.
Find out where the coldest temperature was and what values the
other variables had:
>>> ds.isel(ds.temperature.argmin(...))
<xarray.Dataset> Size: 48B
<xarray.Dataset> Size: 80B
Dimensions: ()
Coordinates:
lon float64 8B -99.32
lat float64 8B 42.21
time datetime64[ns] 8B 2014-09-08
instrument <U8 32B 'manufac3'
time datetime64[ns] 8B 2014-09-06
reference_time datetime64[ns] 8B 2014-09-05
Data variables:
temperature float64 8B 7.182
precipitation float64 8B 8.326
temperature float64 8B -5.424
precipitation float64 8B 9.884
Attributes:
description: Weather related data.
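For readers who want to try the updated docstring example outside a doctest, a minimal sketch follows. It mirrors the inputs shown in the diff above. The names `by_maker` and `coldest` are introduced here for illustration only, and the label-based selection by instrument is an added usage example rather than something the commit itself demonstrates.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Inputs copied from the updated docstring example.
np.random.seed(0)
temperature = 15 + 8 * np.random.randn(2, 3, 4)
precipitation = 10 * np.random.rand(2, 3, 4)

ds = xr.Dataset(
    data_vars=dict(
        temperature=(["loc", "instrument", "time"], temperature),
        precipitation=(["loc", "instrument", "time"], precipitation),
    ),
    coords=dict(
        lon=("loc", [-99.83, -99.32]),
        lat=("loc", [42.25, 42.21]),
        instrument=["manufac1", "manufac2", "manufac3"],
        time=pd.date_range("2014-09-06", periods=4),
        reference_time=pd.Timestamp("2014-09-05"),
    ),
    attrs=dict(description="Weather related data."),
)

# "instrument" has a coordinate, so it can be selected by label.
by_maker = ds.temperature.sel(instrument="manufac2")

# argmin(...) returns one index per dimension; isel applies them together.
coldest = ds.isel(ds.temperature.argmin(...))
```

Because the dimension names now describe what each axis holds, a selection such as `sel(instrument="manufac2")` reads as a statement about the data rather than about array positions.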