Docstring and documentation improvement for the Dataset class (#8973)

* Updates the example in the doc-string for the Dataset class to be clearer. The example in the doc-string of the `Dataset` class prior to this commit uses an example array whose size is `2 x 2 x 3` with the first two dimensions labeled `"x"` and `"y"` and the final dimension labeled `"time"`. This was confusing due to the fact that `"x"` and `"y"` are just arbitrary names for these axes and that no reason is given for the data to be organized in a `2x2x3` array instead of a `2x2` matrix. This commit clarifies the example. See issue #8970 for more information. * Updates the documentation of the Dataset class to have clearer examples. These changes to the documentation bring it into alignment with the changes to the `Dataset` doc-string committed previously. See issue #8970 for more information. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Adds dataset size reports to the output of the example in the Dataset docstring. * Fixes the documentation errors in the previous commits. * Fixes indentation errors in the docs for previous commits. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Maximilian Roos <5635139+max-sixty@users.noreply.github.com>
pydata · Apr 30, 2024 · 71372c1 · 71372c1
1 parent 08e43b9
commit 71372c1
Show file tree

Hide file tree

Showing 2 changed files with 71 additions and 43 deletions.
diff --git a/doc/user-guide/data-structures.rst b/doc/user-guide/data-structures.rst
@@ -282,27 +282,40 @@ variables (``data_vars``), coordinates (``coords``) and attributes (``attrs``).
 
 - ``attrs`` should be a dictionary.
 
-Let's create some fake data for the example we show above:
+Let's create some fake data for the example we show above. In this
+example dataset, we will represent measurements of the temperature and
+pressure that were made under various conditions:
+
+* the measurements were made on four different days;
+* they were made at two separate locations, which we will represent using
+  their latitude and longitude; and
+* they were made using instruments by three different manufacutrers, which we
+  will refer to as `'manufac1'`, `'manufac2'`, and `'manufac3'`.
 
 .. ipython:: python
 
-    temp = 15 + 8 * np.random.randn(2, 2, 3)
-    precip = 10 * np.random.rand(2, 2, 3)
-    lon = [[-99.83, -99.32], [-99.79, -99.23]]
-    lat = [[42.25, 42.21], [42.63, 42.59]]
+    np.random.seed(0)
+    temperature = 15 + 8 * np.random.randn(2, 3, 4)
+    precipitation = 10 * np.random.rand(2, 3, 4)
+    lon = [-99.83, -99.32]
+    lat = [42.25, 42.21]
+    instruments = ["manufac1", "manufac2", "manufac3"]
+    time = pd.date_range("2014-09-06", periods=4)
+    reference_time = pd.Timestamp("2014-09-05")
 
     # for real use cases, its good practice to supply array attributes such as
     # units, but we won't bother here for the sake of brevity
     ds = xr.Dataset(
         {
-            "temperature": (["x", "y", "time"], temp),
-            "precipitation": (["x", "y", "time"], precip),
+            "temperature": (["loc", "instrument", "time"], temperature),
+            "precipitation": (["loc", "instrument", "time"], precipitation),
         },
         coords={
-            "lon": (["x", "y"], lon),
-            "lat": (["x", "y"], lat),
-            "time": pd.date_range("2014-09-06", periods=3),
-            "reference_time": pd.Timestamp("2014-09-05"),
+            "lon": (["loc"], lon),
+            "lat": (["loc"], lat),
+            "instrument": instruments,
+            "time": time,
+            "reference_time": reference_time,
         },
     )
     ds
@@ -387,12 +400,12 @@ example, to create this example dataset from scratch, we could have written:
 .. ipython:: python
 
     ds = xr.Dataset()
-    ds["temperature"] = (("x", "y", "time"), temp)
-    ds["temperature_double"] = (("x", "y", "time"), temp * 2)
-    ds["precipitation"] = (("x", "y", "time"), precip)
-    ds.coords["lat"] = (("x", "y"), lat)
-    ds.coords["lon"] = (("x", "y"), lon)
-    ds.coords["time"] = pd.date_range("2014-09-06", periods=3)
+    ds["temperature"] = (("loc", "instrument", "time"), temperature)
+    ds["temperature_double"] = (("loc", "instrument", "time"), temperature * 2)
+    ds["precipitation"] = (("loc", "instrument", "time"), precipitation)
+    ds.coords["lat"] = (("loc",), lat)
+    ds.coords["lon"] = (("loc",), lon)
+    ds.coords["time"] = pd.date_range("2014-09-06", periods=4)
     ds.coords["reference_time"] = pd.Timestamp("2014-09-05")
 
 To change the variables in a ``Dataset``, you can use all the standard dictionary
@@ -452,8 +465,8 @@ follow nested function calls:
 
     # these lines are equivalent, but with pipe we can make the logic flow
     # entirely from left to right
-    plt.plot((2 * ds.temperature.sel(x=0)).mean("y"))
-    (ds.temperature.sel(x=0).pipe(lambda x: 2 * x).mean("y").pipe(plt.plot))
+    plt.plot((2 * ds.temperature.sel(loc=0)).mean("instrument"))
+    (ds.temperature.sel(loc=0).pipe(lambda x: 2 * x).mean("instrument").pipe(plt.plot))
 
 Both ``pipe`` and ``assign`` replicate the pandas methods of the same names
 (:py:meth:`DataFrame.pipe <pandas.DataFrame.pipe>` and
@@ -479,7 +492,7 @@ dimension and non-dimension variables:
 
 .. ipython:: python
 
-    ds.coords["day"] = ("time", [6, 7, 8])
+    ds.coords["day"] = ("time", [6, 7, 8, 9])
     ds.swap_dims({"time": "day"})
 
 .. _coordinates:

diff --git a/xarray/core/dataset.py b/xarray/core/dataset.py
@@ -590,60 +590,75 @@ class Dataset(
 
     Examples
     --------
-    Create data:
+    In this example dataset, we will represent measurements of the temperature
+    and pressure that were made under various conditions:
+
+    * the measurements were made on four different days;
+    * they were made at two separate locations, which we will represent using
+      their latitude and longitude; and
+    * they were made using three instrument developed by three different
+      manufacturers, which we will refer to using the strings `'manufac1'`,
+      `'manufac2'`, and `'manufac3'`.
 
     >>> np.random.seed(0)
-    >>> temperature = 15 + 8 * np.random.randn(2, 2, 3)
-    >>> precipitation = 10 * np.random.rand(2, 2, 3)
-    >>> lon = [[-99.83, -99.32], [-99.79, -99.23]]
-    >>> lat = [[42.25, 42.21], [42.63, 42.59]]
-    >>> time = pd.date_range("2014-09-06", periods=3)
+    >>> temperature = 15 + 8 * np.random.randn(2, 3, 4)
+    >>> precipitation = 10 * np.random.rand(2, 3, 4)
+    >>> lon = [-99.83, -99.32]
+    >>> lat = [42.25, 42.21]
+    >>> instruments = ["manufac1", "manufac2", "manufac3"]
+    >>> time = pd.date_range("2014-09-06", periods=4)
     >>> reference_time = pd.Timestamp("2014-09-05")
 
-    Initialize a dataset with multiple dimensions:
+    Here, we initialize the dataset with multiple dimensions. We use the string
+    `"loc"` to represent the location dimension of the data, the string
+    `"instrument"` to represent the instrument manufacturer dimension, and the
+    string `"time"` for the time dimension.
 
     >>> ds = xr.Dataset(
     ...     data_vars=dict(
-    ...         temperature=(["x", "y", "time"], temperature),
-    ...         precipitation=(["x", "y", "time"], precipitation),
+    ...         temperature=(["loc", "instrument", "time"], temperature),
+    ...         precipitation=(["loc", "instrument", "time"], precipitation),
     ...     ),
     ...     coords=dict(
-    ...         lon=(["x", "y"], lon),
-    ...         lat=(["x", "y"], lat),
+    ...         lon=("loc", lon),
+    ...         lat=("loc", lat),
+    ...         instrument=instruments,
     ...         time=time,
     ...         reference_time=reference_time,
     ...     ),
     ...     attrs=dict(description="Weather related data."),
     ... )
     >>> ds
-    <xarray.Dataset> Size: 288B
-    Dimensions:         (x: 2, y: 2, time: 3)
+    <xarray.Dataset> Size: 552B
+    Dimensions:         (loc: 2, instrument: 3, time: 4)
     Coordinates:
-        lon             (x, y) float64 32B -99.83 -99.32 -99.79 -99.23
-        lat             (x, y) float64 32B 42.25 42.21 42.63 42.59
-      * time            (time) datetime64[ns] 24B 2014-09-06 2014-09-07 2014-09-08
+        lon             (loc) float64 16B -99.83 -99.32
+        lat             (loc) float64 16B 42.25 42.21
+      * instrument      (instrument) <U8 96B 'manufac1' 'manufac2' 'manufac3'
+      * time            (time) datetime64[ns] 32B 2014-09-06 ... 2014-09-09
         reference_time  datetime64[ns] 8B 2014-09-05
-    Dimensions without coordinates: x, y
+    Dimensions without coordinates: loc
     Data variables:
-        temperature     (x, y, time) float64 96B 29.11 18.2 22.83 ... 16.15 26.63
-        precipitation   (x, y, time) float64 96B 5.68 9.256 0.7104 ... 4.615 7.805
+        temperature     (loc, instrument, time) float64 192B 29.11 18.2 ... 9.063
+        precipitation   (loc, instrument, time) float64 192B 4.562 5.684 ... 1.613
     Attributes:
         description:  Weather related data.
 
     Find out where the coldest temperature was and what values the
     other variables had:
 
     >>> ds.isel(ds.temperature.argmin(...))
-    <xarray.Dataset> Size: 48B
+    <xarray.Dataset> Size: 80B
     Dimensions:         ()
     Coordinates:
         lon             float64 8B -99.32
         lat             float64 8B 42.21
-        time            datetime64[ns] 8B 2014-09-08
+        instrument      <U8 32B 'manufac3'
+        time            datetime64[ns] 8B 2014-09-06
         reference_time  datetime64[ns] 8B 2014-09-05
     Data variables:
-        temperature     float64 8B 7.182
-        precipitation   float64 8B 8.326
+        temperature     float64 8B -5.424
+        precipitation   float64 8B 9.884
     Attributes:
         description:  Weather related data.