quickstart and notebook refresh and updates (#179)
==================================================
jason-bentley authored Jan 8, 2021
1 parent b99a13b commit 9d65b16
Showing 13 changed files with 8,401 additions and 2,120 deletions.
154 changes: 124 additions & 30 deletions README.rst
Enhanced Machine Learning Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To demonstrate the model inspection capability of FACET, we first create a
pipeline to fit a learner. In this simple example using the Boston housing
data, we will train a Random Forest regressor using 10 repeated 5-fold CV
to predict median house price. With *sklearndf* we can create a *pandas*
DataFrame-compatible workflow. FACET adds further enhancements: it keeps
track of our feature matrix and target vector in a sample object (`Sample`)
and makes it easy to compare hyperparameter configurations, and even
multiple learners, with the `LearnerRanker`.

.. code-block:: Python

    # standard imports
    # (data loading, pipeline, and grid setup are elided in this view;
    # see the sketch below for one possible version)

    # create repeated k-fold CV iterator
    rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

    # rank your candidate models by performance (default is mean CV score - 2*SD)
    ranker = LearnerRanker(
        grids=rnd_forest_grid, cv=rkf_cv, n_jobs=-3
    ).fit(sample=boston_sample)
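
The elided setup can be reconstructed in outline. A minimal sketch, assuming
the Boston housing data and the `Sample`, `LearnerGrid`, and *sklearndf*
pipeline APIs from the FACET quickstart (treat the module paths, parameter
names, and grid values here as assumptions):

.. code-block:: Python

    # standard imports
    import pandas as pd
    from sklearn.datasets import load_boston
    from sklearn.model_selection import RepeatedKFold

    # sklearndf wrappers provide DataFrame-compatible learners and pipelines
    from sklearndf.pipeline import RegressorPipelineDF
    from sklearndf.regression import RandomForestRegressorDF

    # FACET imports (module paths assumed)
    from facet.data import Sample
    from facet.selection import LearnerGrid, LearnerRanker

    # load the Boston housing data into a DataFrame
    boston = load_boston()
    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    df["MEDV"] = boston.target

    # keep track of the feature matrix and target vector in a Sample object
    boston_sample = Sample(observations=df, target_name="MEDV")

    # create a pipeline for a random forest regressor
    rnd_forest_reg = RegressorPipelineDF(
        regressor=RandomForestRegressorDF(random_state=42)
    )

    # define the candidate hyperparameter configurations to rank
    rnd_forest_grid = [
        LearnerGrid(
            pipeline=rnd_forest_reg,
            learner_parameters={"min_samples_leaf": [8, 11, 15]},
        )
    ]

A summary of the fitted ranker (for example via `ranker.summary_report()`)
then yields the ranking shown below.
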
.. image:: sphinx/source/_static/ranker_summary.png
:width: 600

Based on this minimal workflow, a value of 8 for the minimum samples per leaf
performed best of the three candidate values. The approach extends readily to
multiple hyperparameters per learner, and to multiple learners, as sketched
below.
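
A sketch of that extension, assuming the same `LearnerGrid` and
`LearnerRanker` APIs as above (the linear regression pipeline and the
parameter choices are illustrative assumptions):

.. code-block:: Python

    # rank several learners and hyperparameter configurations in one go
    from sklearndf.regression import LinearRegressionDF

    multi_grid = [
        LearnerGrid(
            pipeline=rnd_forest_reg,
            learner_parameters={
                "min_samples_leaf": [8, 11, 15],
                "max_depth": [4, 6, 8],
            },
        ),
        LearnerGrid(
            pipeline=RegressorPipelineDF(regressor=LinearRegressionDF()),
            learner_parameters={"fit_intercept": [True, False]},
        ),
    ]

    ranker_multi = LearnerRanker(
        grids=multi_grid, cv=rkf_cv, n_jobs=-3
    ).fit(sample=boston_sample)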

Model Inspection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

FACET implements several model inspection methods for
`scikit-learn <https://scikit-learn.org/stable/index.html>`__ estimators.

FACET enhances model inspection by providing global metrics that complement
the local perspective of SHAP. The key global metrics for each pair of
features in a model are:

- **Synergy**

The degree to which the model combines information from one feature with
another to predict the target. For example, let's assume we are predicting
cardiovascular health using age and gender and the fitted model includes
a complex interaction between them. This means these two features are
synergistic for predicting cardiovascular health. Further, both features
are important to the model and removing either one would significantly
impact performance. Let's assume age brings more information to the joint
contribution than gender. This asymmetric contribution means the synergy for
(age, gender) is less than the synergy for (gender, age). To think about it another
way, imagine the prediction is a coordinate you are trying to reach.
From your starting point, age gets you much closer to this point than
gender, however, you need both to get there. Synergy reflects the fact
that gender gets more help from age (higher synergy from the perspective
of gender) than age does from gender (lower synergy from the perspective of
age) to reach the prediction. *This leads to an important point: synergy
is a naturally asymmetric property of the global information two interacting
features contribute to the model predictions.* Synergy is expressed as a
percentage ranging from 0% (full autonomy) to 100% (full synergy).


- **Redundancy**

The degree to which a feature in a model duplicates the information of a
second feature to predict the target. For example, let's assume we had
house size and number of bedrooms for predicting house price. These
features capture similar information as the more bedrooms the larger
the house and likely a higher price on average. The redundancy for
(number of bedrooms, house size) will be greater than the redundancy
for (house size, number of bedrooms). This is because house size
"knows" more of what number of bedrooms does for predicting house price
than vice-versa. Hence, there is greater redundancy from the perspective
of number of bedrooms. Another way to think about it is removing house
size will be more detrimental to model performance than removing number
of bedrooms, as house size can better compensate for the absence of
number of bedrooms. This also implies that house size would be a more
important feature than number of bedrooms in the model. *The important
point here is that like synergy, redundancy is a naturally asymmetric
property of the global information feature pairs have for predicting
an outcome.* Redundancy is expressed as a percentage ranging from 0%
(full uniqueness) to 100% (full redundancy).
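
Both definitions can be made concrete with a toy dataset. In the sketch below
(which assumes the `Sample` class used in the quickstart), the target contains
a pure interaction term, so `x1` and `x2` should appear synergistic, while
`x4` is a noisy copy of `x3` and should appear largely redundant with it:

.. code-block:: Python

    # synthetic data illustrating synergy and redundancy
    import numpy as np
    import pandas as pd
    from facet.data import Sample  # module path assumed

    rng = np.random.RandomState(42)
    df_toy = pd.DataFrame(
        rng.uniform(-1, 1, size=(500, 4)), columns=["x1", "x2", "x3", "x4"]
    )
    # x4 duplicates x3 up to a little noise -> high redundancy
    df_toy["x4"] = df_toy["x3"] + rng.normal(scale=0.05, size=500)
    # the target depends on x1 and x2 only through their product -> high synergy
    df_toy["y"] = df_toy["x1"] * df_toy["x2"] + df_toy["x3"]

    toy_sample = Sample(observations=df_toy, target_name="y")

Running the same ranker/inspector workflow on `toy_sample` should recover
these patterns in the synergy and redundancy matrices introduced below.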

.. code-block:: Python

    # fit the model inspector
    from facet.inspection import LearnerInspector
    inspector = LearnerInspector()
    inspector.fit(crossfit=ranker.best_model_crossfit_)

**Synergy**

.. code-block:: Python

    # visualise synergy as a matrix
    from pytools.viz.matrix import MatrixDrawer
    synergy_matrix = inspector.feature_synergy_matrix()
    MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Synergy Matrix")

.. image:: sphinx/source/_static/synergy_matrix.png
:width: 600

In the matrix, each row represents the "perspective from" feature in the pair.
Looking across the row for `LSTAT` there is relatively minimal synergy (≤14%)
with other features in the model. However, looking down the column for `LSTAT`
(i.e., the perspective of other features in a pair with `LSTAT`) we find that
many features (the rows) are synergistic (12% to 47%) with `LSTAT`. We can
conclude that:

- `LSTAT` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting median house price.
- The contribution of other features to predicting median house price is partly
enabled by the strong contribution from `LSTAT`.

High synergy features must be considered carefully when investigating business
impact, as they work together to predict the outcome. It would not make much
sense to consider `ZN` (proportion of residential land zoned for lots over
25,000 sq.ft) without `LSTAT` given the 47% synergy of `ZN` with `LSTAT` for
predicting median house price.
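
Since the synergy matrix is returned as a *pandas* DataFrame, individual
pairwise values can be read off directly. A small sketch, assuming the fitted
`inspector` from above (the values in the comments are the approximate figures
quoted in the text):

.. code-block:: Python

    # synergy is asymmetric: the row is the "perspective from" feature,
    # so swapping the lookup order generally gives a different value
    print(synergy_matrix.loc["ZN", "LSTAT"])   # perspective of ZN: ~47%
    print(synergy_matrix.loc["LSTAT", "ZN"])   # perspective of LSTAT: much lower

The same lookup pattern applies to the redundancy matrix shown next.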

**Redundancy**

.. code-block:: Python

    # visualise redundancy as a matrix
    redundancy_matrix = inspector.feature_redundancy_matrix()
    MatrixDrawer(style="matplot%").draw(redundancy_matrix, title="Redundancy Matrix")

.. image:: sphinx/source/_static/redundancy_matrix.png
:width: 600

For any feature pair (A, B), the first feature (A) is the row, and the second
feature (B) the column. For example, if we look at the feature pair (`LSTAT`, `RM`)
from the perspective of `LSTAT` (percentage of lower status of the population),
then we look up the row for `LSTAT` and the column for `RM` (average number of
rooms per dwelling) and find 39% redundancy. This means that 39% of the
information in `LSTAT` is duplicated with `RM` to predict median house price.
We can also see looking across the row for `LSTAT` that apart from the 39%
redundancy with `RM`, `LSTAT` has minimal redundancy (<5%) with any of the
other features included in the model.

**Clustering redundancy**

As detailed above, redundancy and synergy for a feature pair are from the
"perspective" of one of the features in the pair, and so yield two distinct
values. However, a symmetric version can also be computed that provides not
only a simplified perspective but also allows the use of (1 - metric) as a
feature distance. With this distance, hierarchical single-linkage clustering
is applied to create a dendrogram visualization. This helps to identify
groups of low-distance features which activate "in tandem" to predict the
outcome. Such information can then be used to either reduce clusters of
highly redundant features to a subset, or to highlight clusters of highly
synergistic features that should always be considered together.

Let's look at the example for redundancy.

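The dendrogram code itself is elided in this view. A minimal sketch, assuming
the `feature_redundancy_linkage` method and the `DendrogramDrawer` from
*pytools* (both names are assumptions based on the FACET 1.0 API):

.. code-block:: Python

    # visualise redundancy as a dendrogram
    from pytools.viz.dendrogram import DendrogramDrawer
    redundancy_linkage = inspector.feature_redundancy_linkage()
    DendrogramDrawer().draw(data=redundancy_linkage, title="Redundancy Dendrogram")
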
.. image:: sphinx/source/_static/redundancy_dendrogram.png
:width: 600

Based on the dendrogram we can see that the feature pairs (`LSTAT`, `RM`)
and (`CRIM`: per capita crime rate by town, `NOX`: nitric oxides concentration
in parts per 10 million) each represent a cluster in the dendrogram, and
that `LSTAT` and `RM` have high importance. As a next step we could
remove `RM` (and perhaps `NOX`) to further simplify the model and obtain a
set of independent features.

Please see the
`API reference <https://bcg-gamma.github.io/facet/apidoc/facet.html>`__
for more detail.

Model Simulation
~~~~~~~~~~~~~~~~

Using the best model from the ranker, FACET can simulate the impact of
changes to a single feature on predictions of the target, and can
quantify the uncertainty by using bootstrap confidence intervals.
.. code-block:: Python

    # create and fit a bootstrap CV crossfit for the best model; the
    # construction is elided in this view -- see the sketch below
    boot_crossfit = (
        ...
    ).fit(sample=boston_sample)

    SIM_FEAT = "LSTAT"
    simulator = UnivariateUpliftSimulator(crossfit=boot_crossfit, n_jobs=-3)

    # split the simulation range into equal sized partitions
    partitioner = ContinuousRangePartitioner()
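
A sketch of the elided pieces, assuming FACET's `LearnerCrossfit`,
`BootstrapCV`, `simulate_feature`, and `SimulationDrawer` (all module paths,
names, and signatures here are assumptions based on the FACET 1.0 API):

.. code-block:: Python

    # one possible construction of the bootstrap crossfit
    from facet.crossfit import LearnerCrossfit
    from facet.validation import BootstrapCV

    boot_crossfit = LearnerCrossfit(
        pipeline=ranker.best_model_,
        cv=BootstrapCV(n_splits=1000, random_state=42),
        n_jobs=-3,
    ).fit(sample=boston_sample)

    # run the simulation over the partitioned feature range and draw it
    from facet.simulation.viz import SimulationDrawer

    simulation = simulator.simulate_feature(
        feature_name=SIM_FEAT, partitioner=partitioner
    )
    SimulationDrawer().draw(data=simulation, title=SIM_FEAT)
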
.. image:: sphinx/source/_static/simulation_output.png

We would conclude from the figure that lower values of `LSTAT` are associated
with an increase in median house price, and that `LSTAT` values of 8% or less
result in a significant uplift in median house price.


Contributing
---------------------------