update README with latest quickstart changes
jason-bentley committed Jan 15, 2021
1 parent 27098a4 commit 30123e7
Showing 1 changed file, README.rst, with 69 additions and 57 deletions.
Enhanced Machine Learning Workflow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To demonstrate the model inspection capability of FACET, we first create a
pipeline to fit a learner. In this simple example we use the
`diabetes dataset <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt>`__
which contains age, sex, BMI and blood pressure along with 6 blood serum
measurements as features. A transformed version of this dataset is also available
on scikit-learn
`here <https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset>`__.
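If the `diabetes_quickstart.csv` file used below is not to hand, a comparable
DataFrame can be assembled from the scikit-learn copy. The following is only a
sketch: scikit-learn's version is mean-centred and scaled, so the feature values
(and the lower-case column names) differ from the raw data used in this quickstart.

.. code-block:: Python
# sketch: build a similar frame from scikit-learn's (transformed) diabetes copy
import pandas as pd
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names).assign(
Disease_progression=diabetes.target
)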


In this quickstart we will train a Random Forest regressor using 10 repeated
5-fold CV to predict disease progression after one year. With the use of
*sklearndf* we can create a *pandas* DataFrame compatible workflow. However,
FACET provides additional enhancements to keep track of our feature matrix
and target vector using a sample object (`Sample`) and easily compare
hyperparameter configurations and even multiple learners with the `LearnerRanker`.

.. code-block:: Python
# standard imports
import pandas as pd
from sklearn.model_selection import RepeatedKFold
# some helpful imports from sklearndf
from sklearndf.pipeline import RegressorPipelineDF
from sklearndf.regression import RandomForestRegressorDF
# relevant FACET imports
from facet.data import Sample
from facet.selection import LearnerRanker, LearnerGrid
# load the diabetes dataset
diabetes_df = pd.read_csv('diabetes_quickstart.csv')
# create FACET sample object
diabetes_sample = Sample(observations=diabetes_df, target_name="Disease_progression")
# create a (trivial) pipeline for a random forest regressor
rnd_forest_reg = RegressorPipelineDF(
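# note: the following lines are an illustrative sketch, not the verbatim
# quickstart code; hyperparameter values are illustrative
regressor=RandomForestRegressorDF(random_state=42)
)
# create the CV iterator: 10 repeats of 5-fold CV, as described above
rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
# define a grid of three candidate values for min_samples_leaf
rnd_forest_grid = LearnerGrid(
pipeline=rnd_forest_reg,
learner_parameters={"min_samples_leaf": [8, 11, 15]}
)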
# rank your candidate models by performance (default is mean CV score - 2*SD)
ranker = LearnerRanker(
grids=rnd_forest_grid, cv=rkf_cv, n_jobs=-3
).fit(sample=diabetes_sample)
# get summary report
ranker.summary_report()
.. image:: sphinx/source/_static/ranker_summary.png
:width: 600

We can see based on this minimal workflow that a value of 11 for minimum
samples in the leaf was the best performing of the three considered values.
This approach easily extends to multiple hyperparameters for the learner
and multiple learners.
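For illustration, here is a sketch of such an extended search. It assumes that
`LearnerGrid` accepts several hyperparameters at once and that `LearnerRanker`
accepts a list of grids, and it uses `GradientBoostingRegressorDF` from
*sklearndf* as a second candidate learner; the names and values shown are
illustrative rather than part of the original quickstart.

.. code-block:: Python
from sklearndf.regression import GradientBoostingRegressorDF
# a grid with two hyperparameters for the random forest (illustrative values)
rnd_forest_grid = LearnerGrid(
pipeline=rnd_forest_reg,
learner_parameters={
"min_samples_leaf": [8, 11, 15],
"max_depth": [4, 5, 6],
}
)
# a second candidate learner with its own grid
gbdt_grid = LearnerGrid(
pipeline=RegressorPipelineDF(regressor=GradientBoostingRegressorDF(random_state=42)),
learner_parameters={"min_samples_leaf": [8, 11, 15]}
)
# rank all candidate models together
ranker = LearnerRanker(
grids=[rnd_forest_grid, gbdt_grid], cv=rkf_cv, n_jobs=-3
).fit(sample=diabetes_sample)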

Model Inspection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
FACET's `LearnerInspector` provides two key global metrics for pairwise
interactions between features in a model: **synergy**, the degree to which two
features jointly contribute information that neither provides on its own, and
**redundancy**, the degree to which two features duplicate each other's
information for predicting the outcome.
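Before the matrices below can be drawn, the inspector has to be created and
fitted. A minimal sketch, assuming the `LearnerInspector` from `facet.inspection`
and that the fitted crossfit of the best model can be retrieved from the ranker
(the attribute name used below is an assumption):

.. code-block:: Python
# sketch: create the inspector and fit it on the best crossfit from the ranker
from facet.inspection import LearnerInspector
inspector = LearnerInspector(n_jobs=-3)
inspector.fit(crossfit=ranker.best_model_crossfit)  # attribute name assumed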
# visualise synergy as a matrix
from pytools.viz.matrix import MatrixDrawer
synergy_matrix = inspector.feature_synergy_matrix()
MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Synergy Matrix")
.. image:: sphinx/source/_static/synergy_matrix.png
:width: 600

For any feature pair (A, B), the first feature (A) is the row, and the second
feature (B) the column. For example, looking across the row for `LTG` (a blood
serum measurement, possibly the log of serum triglycerides) we see relatively
minimal synergy (≤14%) with the other features in the model. However, looking
down the column for `LTG` (i.e., the perspective of other features in a pair
with `LTG`) we find that many features (the rows) are synergistic (12% to 34%)
with `LTG`. We can conclude that:


- `LTG` is a strongly autonomous feature, displaying minimal synergy with other
features for predicting disease progression after one year.
- The contribution of other features to predicting disease progression after one
year is partly enabled by the strong contribution from `LTG`.


High synergy features must be considered carefully when investigating impact,
as they work together to predict the outcome. For example, it would not make
much sense to consider `TC` (total serum cholesterol) in isolation from `LTG`,
given the 34% synergy of `TC` with `LTG` for predicting disease progression
after one year.

**Redundancy**

The redundancy matrix is read in the same way as the synergy matrix above.
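It is produced analogously to the synergy matrix; a minimal sketch using the
same `inspector` and `MatrixDrawer` as above:

.. code-block:: Python
# visualise redundancy as a matrix (sketch, mirroring the synergy example)
redundancy_matrix = inspector.feature_redundancy_matrix()
MatrixDrawer(style="matplot%").draw(redundancy_matrix, title="Redundancy Matrix")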
.. image:: sphinx/source/_static/redundancy_matrix.png
:width: 600

For any feature pair (A, B), the first feature (A) is the row, and the second feature
(B) the column. For example, if we look at the feature pair (`LDL`, `TC`) from the
perspective of `LDL` (low-density lipoproteins), then we look up the row for `LDL`
and the column for `TC` and find 47% redundancy. This means that 47% of the
information in `LDL` is duplicated with `TC` for predicting disease progression
after one year. The redundancy is similar (50%) when viewed from the perspective
of `TC`, i.e., for the pair (`TC`, `LDL`).


If we look across the columns of the `LTG` row we can see that, apart from the
32% redundancy with `BMI`, `LTG` has minimal redundancy (<9%) with the other
features included in the model. Conversely, looking down the `LTG` column, a
number of features show moderate redundancy with `LTG`.

**Clustering redundancy**

Redundancy (and synergy) can also be clustered hierarchically, which makes it
easier to spot groups of mutually redundant features. Let's look at the example
for redundancy.
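A minimal sketch of how such a dendrogram can be produced, assuming the
`feature_redundancy_linkage` method of the inspector and the `DendrogramDrawer`
from *pytools*:

.. code-block:: Python
# cluster features by redundancy and visualise as a dendrogram (sketch)
from pytools.viz.dendrogram import DendrogramDrawer
redundancy_linkage = inspector.feature_redundancy_linkage()
DendrogramDrawer().draw(data=redundancy_linkage, title="Redundancy Dendrogram")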
.. image:: sphinx/source/_static/redundancy_dendrogram.png
:width: 600

Based on the dendrogram we can see that the feature pairs (`LDL`, `TC`)
and (`LTG`, `BMI`: body mass index) each represent a cluster in the
dendrogram, and that `LTG` and `BMI` have the highest importance.
As potential next actions we could remove `TC` and explore the impact of
removing one of `LTG` or `BMI` to further simplify the model and obtain a
reduced set of independent features.
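As a sketch of that next step, using only the `Sample` constructor from earlier
and assuming the raw CSV uses the column name `TC`, the sample could be rebuilt
without the redundant feature and the ranking re-run:

.. code-block:: Python
# sketch: rebuild the FACET sample without the redundant TC feature
reduced_sample = Sample(
observations=diabetes_df.drop(columns=["TC"]),
target_name="Disease_progression"
)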

Please see the
`API reference <https://bcg-gamma.github.io/facet/apidoc/facet.html>`__
for more detail.
Model Simulation
~~~~~~~~~~~~~~~~~~

Taking the `BMI` feature as an example, we do the following for the simulation:

- We use FACET's `ContinuousRangePartitioner` to split the range of observed values of
`BMI` into intervals of equal size. Each partition is represented by the central
value of that partition.
- For each partition, the simulator creates an artificial copy of the original sample
assuming the variable to be simulated has the same value across all observations -
which is the value representing the partition. Using the best `LearnerCrossfit`
acquired from the ranker, the simulator now re-predicts all targets using the models
trained for all folds and determines the average uplift of the target variable
resulting from this.
- The FACET `SimulationDrawer` allows us to visualise the result; both in a
matplotlib and a plain-text style.

Finally, because FACET can use bootstrap cross validation, we can create a crossfit
from our previous `LearnerRanker` best model to perform the simulation so we can
quantify the uncertainty by using bootstrap confidence intervals.
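Before the tail of the snippet below, the simulation imports, the bootstrap CV
iterator and the crossfit need to be created. A minimal sketch of that setup;
the module paths follow the *facet* 1.x layout, and the `n_splits` value and
the `best_model_` attribute name are assumptions:

.. code-block:: Python
# sketch: simulation imports and a bootstrap crossfit for the best model
from facet.crossfit import LearnerCrossfit
from facet.validation import BootstrapCV
from facet.simulation import UnivariateUpliftSimulator
from facet.simulation.partition import ContinuousRangePartitioner
from facet.simulation.viz import SimulationDrawer
# bootstrap CV iterator (number of splits is illustrative)
bscv = BootstrapCV(n_splits=1000, random_state=42)
boot_crossfit = LearnerCrossfit(
pipeline=ranker.best_model_,  # "best_model_" is an assumed attribute name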
cv=bscv,
n_jobs=-3,
verbose=False,
).fit(sample=diabetes_sample)
SIM_FEAT = "LSTAT"
SIM_FEAT = "BMI"
simulator = UnivariateUpliftSimulator(crossfit=boot_crossfit, n_jobs=-3)
# split the simulation range into equal sized partitions
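# sketch of the remaining steps (argument names are assumptions):
# run the simulation across the partitions and draw the result
partitioner = ContinuousRangePartitioner()
simulation = simulator.simulate_feature(feature_name=SIM_FEAT, partitioner=partitioner)
# visualise the simulation result, including bootstrap confidence intervals
SimulationDrawer().draw(data=simulation, title=SIM_FEAT)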
.. image:: sphinx/source/_static/simulation_output.png

We would conclude from the figure that higher values of `BMI` are associated with
an increase in disease progression after one year, and that for a `BMI` of 29
and above there is a significant increase in disease progression of at least
26 points.


Contributing
~~~~~~~~~~~~
