Merge branch 'master' into mobt_161_wxcode_tree_check
* master:
  Change pandas DataFrame.at to DataFrame.loc (metoppv#1655)
  MOBT-154: Reunification of wx decision trees (metoppv#1639)
  Consolidate scale parameter usage across EMOS and ECC (metoppv#1642)

# Conflicts:
#	improver/wxcode/utilities.py
MoseleyS committed Jan 31, 2022
2 parents 16760ac + 8eb0e74 commit 6196ac9
Showing 16 changed files with 362 additions and 157 deletions.
@@ -56,16 +56,16 @@ A normal (Gaussian) distribution is often represented using the syntax:
where :math:`\mu` is mean and :math:`\sigma^{2}` is the variance. The normal
distribution is a special case, where :math:`\mu` can be interpreted as both
the mean and the location parameter and :math:`\sigma^{2}` can be interpreted
-as both the variance and the scale parameter. For an alternative distribution,
-such as a truncated normal distribution that has been truncated to lie within
-0 and infinity, the distribution can be represented as:
+as both the variance and the square of the scale parameter. For an alternative
+distribution, such as a truncated normal distribution that has been truncated
+to lie within 0 and infinity, the distribution can be represented as:

.. math::
\mathcal{N^0}(\mu,\,\sigma^{2})
In this case, the :math:`\mu` is strictly interpreted as the location parameter
-and :math:`\sigma^{2}` is strictly interpreted as the scale parameter.
+and :math:`\sigma^{2}` is strictly interpreted as the square of the scale parameter.

===============================
What is the location parameter?
@@ -82,6 +82,29 @@ The scale parameter indicates the width in the distribution. If the scale
parameter is large, then the distribution will be broader. If the scale is
smaller, then the distribution will be narrower.

+****************************************************
+Implementation details
+****************************************************
+
+In this implementation, we will choose to define the distributions
+using the scale parameter (as this matches scipy's expectation),
+rather than the square of the scale parameter:
+
+.. math::
+    \mathcal{N}(\mu,\,\sigma)
+
+The full equation when estimating the EMOS coefficients using
+the ensemble mean is therefore:
+
+.. math::
+    \mathcal{N}(a + b\bar{X}, \sqrt{c + dS^{2}})
+
+This matches the equations noted in `Allen et al., 2021`_.
+
+.. _Allen et al., 2021: https://doi.org/10.1002/qj.3983
+
****************************************************
Estimating EMOS coefficients using the ensemble mean
****************************************************
@@ -91,7 +114,7 @@ If the predictor is the ensemble mean, coefficients are estimated as

.. math::
-    \mathcal{N}(a + \bar{X}, c + dS^{2})
+    \mathcal{N}(a + b\bar{X}, \sqrt{c + dS^{2}})
where N is a chosen distribution and values of a, b, c and d are solved in the
format of :math:`\alpha, \beta, \gamma` and :math:`\delta`, see the equations
@@ -121,7 +144,7 @@ If the predictor is the ensemble realizations, coefficients are estimated for

.. math::
-    \mathcal{N}(a + b_1X_1 + ... + b_mX_m, c + dS^{2})
+    \mathcal{N}(a + b_1X_1 + ... + b_mX_m, \sqrt{c + dS^{2}})
where N is a chosen distribution, the values of a, b, c and d relate
to alpha, beta, gamma and delta through the equations above with
@@ -140,14 +163,14 @@ The EMOS coefficients represent adjustments to the ensemble mean and ensemble
variance, in order to generate the location and scale parameters that, for the
chosen distribution, minimise the CRPS. The coefficients can therefore be used
to construct the location parameter, :math:`\mu`, and scale parameter,
-:math:`\sigma^{2}`, for the calibrated forecast from today's ensemble mean, or
+:math:`\sigma`, for the calibrated forecast from today's ensemble mean, or
ensemble realizations, and the ensemble variance.

.. math::
    \mu = a + b\bar{X}
-    \sigma^{2} = c + dS^{2}
+    \sigma = \sqrt{c + dS^{2}}
Note here that this procedure holds whether the distribution is normal, i.e.
where the application of the EMOS coefficients to the raw ensemble mean results
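For readers skimming the diff, the convention above can be reproduced in a few lines of scipy. This is a minimal sketch with invented coefficient values, not improver code, and it shows only the plain normal case:

```python
# Minimal sketch (illustrative values, not improver code): build the
# calibrated distribution N(a + b*X_bar, sqrt(c + d*S^2)).
import numpy as np
from scipy import stats

a, b, c, d = 0.1, 1.0, 0.04, 0.9  # hypothetical EMOS coefficients
ensemble_mean = np.array([280.0, 281.5])  # X_bar, e.g. temperature in K
ensemble_variance = np.array([1.2, 0.8])  # raw ensemble variance S^2

location = a + b * ensemble_mean  # mu, the location parameter
scale = np.sqrt(c + d * ensemble_variance)  # sigma: scipy expects this directly

calibrated = stats.norm(loc=location, scale=scale)
print(calibrated.ppf(0.5))  # the median recovers the location parameter
```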
14 changes: 10 additions & 4 deletions doc/source/extended_documentation/wxcode/build_a_decision_tree.rst
@@ -77,10 +77,16 @@ accessed with this key contains the essentials that make the node function.
condition_combination. Alternatively, if they are being manipulated within
the node (e.g. added together), they must be separated by the desired
operators (e.g. 'diagnostic1', '+', 'diagnostic2').
-- **diagnostic_thresholds** (List(List(float, str)): The diagnostic threshold
-  value and units being used in the test. A threshold [value, units] pair must
-  be provided for each diagnostic field with the same nested list structure; as
-  the basic unit is a list of value and unit, the overall nested structure is
+- **diagnostic_thresholds** (List(List(float, str, Optional(int))): The
+  diagnostic threshold value and units being used in the test. An optional
+  third value provides a period in seconds that is associated with the
+  threshold value. For example, a precipitation accumulation threshold might
+  be given for a 1-hour period (3600 seconds). If instead 3-hour symbols are
+  being produced using 3-hour precipitation accumulations then the threshold
+  value will be scaled up by a factor of 3. Only thresholds with an
+  associated period will be scaled in this way. A threshold [value, units] pair
+  must be provided for each diagnostic field with the same nested list structure;
+  as the basic unit is a list of value and unit, the overall nested structure is
one list deeper.
- **diagnostic_conditions** (as diagnostic_fields): The expected inequality
that has been used to construct the input probability field. This is checked
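To make the documented structure concrete, here is a hypothetical node using a period-qualified threshold; every key value below is invented for illustration rather than taken from a real decision tree:

```python
# A hypothetical decision-tree node (all names and values invented) showing a
# period-qualified threshold: [0.03, "mm", 3600] is defined for a 1-hour
# accumulation, so producing 3-hour symbols (target_period=10800) would scale
# the 0.03 mm value up by a factor of 3 to 0.09 mm.
example_node = {
    "if_true": "heavy_precipitation",
    "if_false": "light_precipitation",
    "probability_thresholds": [0.5],
    "threshold_condition": ">=",
    "condition_combination": "",
    "diagnostic_fields": [
        "probability_of_lwe_thickness_of_precipitation_amount_above_threshold"
    ],
    "diagnostic_thresholds": [[0.03, "mm", 3600]],
    "diagnostic_conditions": ["above"],
}
```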
9 changes: 5 additions & 4 deletions improver/calibration/ensemble_calibration.py
@@ -1581,9 +1581,10 @@ def _calculate_scale_parameter(self) -> ndarray:
)

# Calculating the scale parameter, based on the raw variance S^2,
-# where predicted variance = c + dS^2, where c = (gamma)^2 and
-# d = (delta)^2
-scale_parameter = (
+# where predicted scale parameter (or equivalently standard deviation
+# for a normal distribution) = sqrt(c + dS^2), where c = (gamma)^2 and
+# d = (delta)^2.
+scale_parameter = np.sqrt(
self.coefficients_cubelist.extract_cube("emos_coefficient_gamma").data
* self.coefficients_cubelist.extract_cube("emos_coefficient_gamma").data
+ self.coefficients_cubelist.extract_cube("emos_coefficient_delta").data
@@ -1622,7 +1623,7 @@ def _create_output_cubes(
)
scale_parameter_cube = create_new_diagnostic_cube(
"scale_parameter",
f"({template_cube.units})^2",
template_cube.units,
template_cube,
template_cube.attributes,
data=scale_parameter,
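Stripped of the iris cube handling, the change above amounts to the following numpy computation (a sketch with made-up coefficient values):

```python
# Sketch of the new calculation (made-up values, cube handling omitted):
# scale parameter = sqrt(c + d * S^2) with c = gamma^2 and d = delta^2,
# so the result is a standard deviation in the forecast units, not a variance.
import numpy as np

gamma, delta = 0.2, 0.95  # hypothetical EMOS coefficients
forecast_variance = np.array([1.2, 0.5, 0.0])  # raw ensemble variance S^2

scale_parameter = np.sqrt(gamma * gamma + delta * delta * forecast_variance)
print(scale_parameter)  # zero variance yields a zero scale parameter
```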
14 changes: 12 additions & 2 deletions improver/cli/wxcode.py
@@ -40,6 +40,7 @@ def process(
*cubes: cli.inputcube,
wxtree: cli.inputjson = None,
model_id_attr: str = None,
+target_period: int = None,
check_tree: bool = False,
):
""" Processes cube for Weather symbols.
@@ -54,6 +55,13 @@
Name of attribute recording source models that should be
inherited by the output cube. The source models are expected as
a space-separated string.
+target_period:
+    The period in seconds that the weather symbol being produced should
+    represent. This should correspond with any period diagnostics, e.g.
+    precipitation accumulation, being used as input. This is used to scale
+    any threshold values that are defined with an associated period in
+    the decision tree. It will only be used if the decision tree
+    provided has threshold values defined with an associated period.
check_tree (bool):
If set, the decision tree will be checked to see if it conforms to
the expected format and that all nodes can be reached; the only other
@@ -68,7 +76,7 @@
if check_tree:
from improver.wxcode.utilities import check_tree

-return check_tree(wxtree)
+return check_tree(wxtree, target_period=target_period)

from iris.cube import CubeList

@@ -77,4 +85,6 @@
if not cubes:
raise RuntimeError("Not enough input arguments. See help for more information.")

-return WeatherSymbols(wxtree, model_id_attr=model_id_attr)(CubeList(cubes))
+return WeatherSymbols(
+    wxtree, model_id_attr=model_id_attr, target_period=target_period
+)(CubeList(cubes))
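A sketch of driving the plugin from Python as the CLI body above now does; it assumes the plugin lives at improver.wxcode.weather_symbols, that the decision tree is stored in a JSON file, and that suitable probability cubes are already loaded — the file name and attribute value are illustrative, not prescribed:

```python
# Sketch of calling the plugin as the CLI now does (the file name and the
# model_id_attr value are illustrative; `cubes` is assumed loaded elsewhere).
import json

from iris.cube import CubeList

from improver.wxcode.weather_symbols import WeatherSymbols

with open("wx_tree.json") as f:
    wxtree = json.load(f)

plugin = WeatherSymbols(
    wxtree,
    model_id_attr="mosg__model_configuration",  # example attribute name
    target_period=10800,  # 3-hour symbols from 3-hour accumulations
)
result = plugin(CubeList(cubes))  # `cubes`: probability cubes loaded elsewhere
```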
27 changes: 13 additions & 14 deletions improver/ensemble_copula_coupling/ensemble_copula_coupling.py
@@ -871,10 +871,10 @@ def _location_and_scale_parameters_to_percentiles(
(len(percentiles_as_fractions), location_data.shape[0]), dtype=np.float32
)

-self._rescale_shape_parameters(location_data, np.sqrt(scale_data))
+self._rescale_shape_parameters(location_data, scale_data)

percentile_method = self.distribution(
-*self.shape_parameters, loc=location_data, scale=np.sqrt(scale_data)
+*self.shape_parameters, loc=location_data, scale=scale_data
)

# Loop over percentiles, and use the distribution as the
@@ -885,8 +885,9 @@
result[index, :] = percentile_method.ppf(percentile_list)
# If percent point function (PPF) returns NaNs, fill in
# mean instead of NaN values. NaN will only be generated if the
-# variance is zero. Therefore, if the variance is zero, the mean
-# value is used for all gridpoints with a NaN.
+# scale parameter (standard deviation) is zero. Therefore, if the
+# scale parameter (standard deviation) is zero, the mean value is
+# used for all gridpoints with a NaN.
if np.any(scale_data == 0):
nan_index = np.argwhere(np.isnan(result[index, :]))
result[index, nan_index] = location_data[nan_index]
@@ -1039,11 +1040,9 @@
) -> None:
"""
The location parameter, scale parameters, and threshold values come
-from three different cubes. They should all be in the same base unit,
-with the units of the scale parameter being the squared units of the
-location parameter and threshold values. This is a sanity check to
-ensure the units are as expected, converting units of the location
-parameter and scale parameter if possible.
+from three different cubes. This is a sanity check to ensure the units
+are as expected, converting units of the location parameter and
+scale parameter if possible.
Args:
location_parameter:
@@ -1060,11 +1059,11 @@

try:
location_parameter.convert_units(threshold_units)
-scale_parameter.convert_units(threshold_units ** 2)
+scale_parameter.convert_units(threshold_units)
except ValueError as err:
msg = (
"Error: {} This is likely because the mean "
"variance and template cube threshold units are "
"Error: {} This is likely because the location parameter, "
"scale parameter and template cube threshold units are "
"not equivalent/compatible.".format(err)
)
raise ValueError(msg)
@@ -1107,7 +1106,7 @@ def _location_and_scale_parameters_to_probabilities(
relative_to_threshold = probability_is_above_or_below(probability_cube_template)

self._rescale_shape_parameters(
-location_parameter.data.flatten(), np.sqrt(scale_parameter.data).flatten()
+location_parameter.data.flatten(), scale_parameter.data.flatten()
)

# Loop over thresholds, and use the specified distribution with the
@@ -1118,7 +1117,7 @@
distribution = self.distribution(
*self.shape_parameters,
loc=location_parameter.data.flatten(),
-scale=np.sqrt(scale_parameter.data.flatten()),
+scale=scale_parameter.data.flatten(),
)

probability_method = distribution.cdf
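The net effect of these changes is that scipy now receives the scale parameter unchanged. A standalone sketch with invented values, covering both conversions this module performs:

```python
# Standalone sketch (invented values): scipy's loc/scale now take the location
# and scale parameters directly, with no further square root applied.
import numpy as np
from scipy import stats

location = np.array([280.0, 281.5])  # location parameter, mu
scale = np.array([1.1, 0.9])  # scale parameter, sigma (not sigma^2)

distribution = stats.norm(loc=location, scale=scale)

# Percentiles via the percent point function (inverse CDF)...
percentiles = np.array([[0.25], [0.5], [0.75]])
print(distribution.ppf(percentiles))
# ...and probabilities relative to a threshold via the CDF.
print(distribution.cdf(281.0))
```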
49 changes: 42 additions & 7 deletions improver/wxcode/utilities.py
@@ -31,7 +31,7 @@
"""This module defines the utilities required for wxcode plugin """

from collections import OrderedDict
-from typing import Any, Dict, List
+from typing import Any, Dict, List, Optional

import iris
import numpy as np
@@ -103,22 +103,49 @@
}


-def update_tree_units(tree: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
+def update_tree_thresholds(
+    tree: Dict[str, Dict[str, Any]], target_period: Optional[int] = None
+) -> Dict[str, Dict[str, Any]]:
"""
Replaces value / unit pairs from tree definition with an Iris AuxCoord
-that encodes the same information.
+that encodes the same information. Also scales any threshold values that
+have an associated period (e.g. accumulation in 3600 seconds) by a factor
+to reflect the target period (e.g. a 3-hour, 10800 second, weather symbol).
Args:
tree:
Weather symbols decision tree.
+target_period:
+    The period in seconds that the weather symbol being produced should
+    represent. This should correspond with any period diagnostics, e.g.
+    precipitation accumulation, being used as input. This is used to scale
+    any threshold values that are defined with an associated period in
+    the decision tree.
Returns:
-The tree now containing AuxCoords instead of value / unit pairs.
+The tree now containing AuxCoords instead of value / unit pairs, with
+period diagnostic threshold values scaled appropriately.
+Raises:
+    ValueError: If thresholds are defined with an associated period and no
+        target_period is provided.
"""

    def _make_thresholds_with_units(items):
        if isinstance(items[0], list):
            return [_make_thresholds_with_units(item) for item in items]
-        values, units = items
+        try:
+            values, units, period = items
+        except ValueError:
+            values, units = items
+        else:
+            if not target_period:
+                raise ValueError(
+                    "The decision tree contains thresholds defined for a particular "
+                    "time period, e.g. the accumulation over 3600 seconds. To "
+                    "use such a decision tree a target_period must be provided "
+                    "such that the threshold values can be scaled appropriately."
+                )
+            values = values * target_period / period
        return iris.coords.AuxCoord(values, units=units)

for query in tree.values():
@@ -339,13 +366,21 @@ def _checker(lists):
return _checker(query)


-def check_tree(wxtree: Dict[str, Dict[str, Any]]) -> str:
+def check_tree(
+    wxtree: Dict[str, Dict[str, Any]], target_period: Optional[int] = None
+) -> str:
"""Perform some checks to ensure the provided decision tree is valid.
Args:
wxtree:
Weather symbols decision tree definition, provided as a
dictionary.
+target_period:
+    The period in seconds that the weather symbol being produced should
+    represent. This should correspond with any period diagnostics, e.g.
+    precipitation accumulation, being used as input. This is used to scale
+    any threshold values that are defined with an associated period in
+    the decision tree.
Returns:
A list of problems found in the decision tree, or if none are found, the
@@ -363,7 +398,7 @@ def check_tree(wxtree: Dict[str, Dict[str, Any]]) -> str:
all_targets = np.array(
[(n["if_true"], n["if_false"]) for n in wxtree.values()]
).flatten()
-wxtree = update_tree_units(wxtree)
+wxtree = update_tree_thresholds(wxtree, target_period)
valid_codes = list(WX_DICT.keys())

all_key_words = REQUIRED_KEY_WORDS + OPTIONAL_KEY_WORDS
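Finally, a usage sketch for validating a period-aware tree up front (the JSON file name is illustrative): per the diff above, check_tree forwards target_period to update_tree_thresholds, which raises a ValueError if period-qualified thresholds are present and no target period is supplied.

```python
# Usage sketch: validate a decision tree whose thresholds carry periods
# (the file name is illustrative, not a real fixture).
import json

from improver.wxcode.utilities import check_tree

with open("wx_tree.json") as f:
    wxtree = json.load(f)

# With a target_period, period-qualified thresholds are scaled before the
# checks run; the return value describes any problems found in the tree.
print(check_tree(wxtree, target_period=10800))
```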