Docs/notebook refactoring (#60)
* fixing typos

* initial reformatting of notebook to template

* data for turbine tutorial

* typo fix

* removed sim classification example until updates are completed

* added context for example

Co-authored-by: Jason Bentley <Bentley.Jason@bcg.com>
jason-bentley authored Sep 11, 2020
1 parent 890fa2c commit 7492ee8
Showing 4 changed files with 12,264 additions and 1,186 deletions.
4 changes: 0 additions & 4 deletions sphinx/source/examples.rst
@@ -23,10 +23,6 @@ Predictive Maintenance Regression

Classification on a simulated dataset
------------------------------------------------
- .. toctree::
-    :maxdepth: 2
-
-    tutorial/Classification_simulation_example_Facet



29 changes: 15 additions & 14 deletions sphinx/source/tutorial/Classification_with_Facet.ipynb
@@ -23,7 +23,7 @@
"\n",
"1. Create a robust pipeline for learner selection using LearnerRanker and enabling the use of bootstrap cross-validation.\n",
"\n",
"2. Enhance our model inspection to understand drivers of predictions using local explanations of features via SHAP values by applying a novel methodology that decomposes SHAP values into measures of synergy, redundancy, and independence between each pair of features.\n",
"2. Enhance our model inspection to understand drivers of predictions using local explanations of features via [SHAP values](http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions) by applying a novel methodology that decomposes SHAP values into measures of synergy, redundancy, and independence between each pair of features.\n",
"\n",
"3. Quickly apply historical simulation to gain key insights into feature values that minimize or maximize the predicted outcome.\n",
"\n",
@@ -33,9 +33,12 @@
"\n",
"Prediabetes is a treatable condition that leads to many health complications and eventually type 2 diabetes. Identification of individuals at risk of prediabetes can improve early intervention and provide insights into those interventions that work best.\n",
"Using a cohort of healthy (n=2847) and prediabetic (n=1509) patients derived \n",
"from the [NHANES 2013-14 U.S. cross-sectional survey](https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013) we aim to create a classifier for prediabetes. For further details on data sources, definitions and the study cohort please see the Appendix ([7.1 Data source and study cohort](#Data-source-and-study-cohort)).\n",
"from the [NHANES 2013-14 U.S. cross-sectional survey](https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013) we aim to create a classifier for prediabetes. For further details on data sources, definitions and the study cohort please see the Appendix ([Data source and study cohort](#Data-source-and-study-cohort)).\n",
"\n",
"Utilizing FACET we will create a pipeline to find identify a well-performing classifier, and perform model inspection and simulation to gain understanding and insight into key factors predictive of prediabetes.\n",
"Utilizing FACET we will do the following:\n",
"\n",
"- create a pipeline to find identify a well-performing classifier\n",
"- perform model inspection and simulation to gain understanding and insight into key factors predictive of prediabetes.\n",
"\n",
"***\n",
"\n",
@@ -176,7 +179,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"First we need to load our prediabetes data and create a simple preprocessing pipeline. For those interested some initial EDA can be found in the Appendix ([7.2 Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA)))."
"First we need to load our prediabetes data and create a simple preprocessing pipeline. For those interested some initial EDA can be found in the Appendix ([Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA)))."
]
},
{
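For context, the loading step this cell describes might look as follows — a minimal sketch; the file name `prediabetes_cohort.csv` and the target column `prediabetes` are illustrative assumptions, not taken from the commit:

```python
import pandas as pd

# load the NHANES-derived study cohort (file name assumed for illustration)
prediab_df = pd.read_csv("prediabetes_cohort.csv")

# quick sanity checks: cohort size and class balance
print(prediab_df.shape)
print(prediab_df["prediabetes"].value_counts())
```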
@@ -206,7 +209,7 @@
"raw_mimetype": "text/markdown"
},
"source": [
"Using FACET we first create a sample object, which carries information used in other FACET functions."
"Using FACET we create a sample object, which carries information used in other FACET functions."
]
},
{
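A hedged sketch of the sample-object step above, assuming the facet 1.x API where `Sample` lives in `facet.data`, with `prediab_df` and the `prediabetes` target carried over from the loading sketch:

```python
from facet.data import Sample

# a Sample bundles the observations frame with its designated target column,
# so downstream FACET steps (ranking, inspection, simulation) stay consistent
sample = Sample(observations=prediab_df, target_name="prediabetes")

# features and target are then available as DataFrame/Series views
X, y = sample.features, sample.target
```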
@@ -229,7 +232,7 @@
"raw_mimetype": "text/markdown"
},
"source": [
"Next we create a minimum preprocessing pipeline which based on our EDA initial EDA ([7.2 Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA))) needs to address the following:\n",
"Next we create a minimum preprocessing pipeline which based on our initial EDA ([Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA))) needs to address the following:\n",
"\n",
"1. Simple imputation for missing values in both continuous and categorical features\n",
"2. One-hot encoding for categorical features"
@@ -301,7 +304,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Boruta identified 19 features (out of a potential 47) that we will retain for classification. Note that this feature selection process could be included in a general pre-processing pipeline, however due to the computation involved, we have utilized Boruta here as an initial one-off processing step to narrow down the features for our classifier development."
"Boruta identified 19 features (out of a potential 47) that we will retain for classification. Note that this feature selection process could be included in a general preprocessing pipeline, however due to the computation involved, we have utilized Boruta here as an initial one-off processing step to narrow down the features for our classifier development."
]
},
{
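For reference, the Boruta step described above might be wired up as follows — a sketch assuming sklearndf ships a `BorutaDF` wrapper in `sklearndf.transformation.extra`, with parameters following the underlying BorutaPy package:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation.extra import BorutaDF

# chain preprocessing with Boruta: features that are not confirmed as
# relevant against their shuffled "shadow" copies are dropped
feature_selection = PipelineDF(
    steps=[
        ("preprocess", preprocessing),
        (
            "boruta",
            BorutaDF(
                estimator=RandomForestClassifier(max_depth=5),
                n_estimators="auto",
                random_state=42,
            ),
        ),
    ]
)
feature_selection.fit(X=sample.features, y=sample.target)
```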
@@ -352,7 +355,7 @@
"source": [
"# 1. Random forest learner\n",
"rf_pipeline=ClassifierPipelineDF(\n",
" classifier=RandomForestClassifierDF(),\n",
" classifier=RandomForestClassifierDF(random_state=42),\n",
" preprocessing=preprocessing,\n",
")\n",
"rf_grid=LearnerGrid(\n",
@@ -363,7 +366,7 @@
"\n",
"# 2. Light gradient boosting learner\n",
"gb_pipeline=ClassifierPipelineDF(\n",
" classifier=LGBMClassifierDF(),\n",
" classifier=LGBMClassifierDF(random_state=42),\n",
" preprocessing=preprocessing,\n",
")\n",
"gb_grid=LearnerGrid(\n",
@@ -446,9 +449,7 @@
"inspector=LearnerInspector(\n",
" n_jobs=-3,\n",
" verbose=10,\n",
").fit(\n",
" crossfit=ranker.best_model_crossfit.resize(20)\n",
")"
").fit(crossfit=ranker.best_model_crossfit.resize(20))"
]
},
{
@@ -508,7 +509,7 @@
"source": [
"## Synergy and redundancy\n",
"\n",
"Synergy and redundancy and synergy are part of the key extensions FACET makes to using SHAP values to understand model predictions."
"Synergy and redundancy are part of the key extensions FACET makes to using SHAP values to understand model predictions."
]
},
{
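For reference, a hedged sketch of pulling the pairwise matrices from the fitted inspector, assuming the facet 1.x method names and the `MatrixDrawer` from the companion gamma-pytools package:

```python
from pytools.viz.matrix import MatrixDrawer

# pairwise synergy and redundancy, each as a feature-by-feature matrix
synergy = inspector.feature_synergy_matrix()
redundancy = inspector.feature_redundancy_matrix()

MatrixDrawer(style="matplot").draw(synergy, title="Feature synergies")
MatrixDrawer(style="matplot").draw(redundancy, title="Feature redundancies")
```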
@@ -559,7 +560,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The dendrogram represents the extent of clustering among the features. Taking the `Waist_circumference` and `Waist_to_hgt` features which have the highest redundancy, we can see these features cluster together earliest in the dendrogram. Ideally we want to see features only start to cluster as close to the righthand side of the dendrogram as possible. This implies all features in the model are contributing uniquely to our predictions."
"The dendrogram represents the extent of clustering among the features. Taking the `Waist_circumference` and `Waist_to_hgt` features which have the highest redundancy, we can see these features cluster together earliest in the dendrogram. Ideally we want to see features only start to cluster as close to the right-hand side of the dendrogram as possible. This implies all features in the model are contributing uniquely to our predictions."
]
},
{
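The dendrogram discussed above can be drawn along these lines — assuming the facet 1.x inspector exposes a redundancy linkage and gamma-pytools provides the drawer:

```python
from pytools.viz.dendrogram import DendrogramDrawer

# hierarchical clustering of features by redundancy; the sooner two
# features merge, the more redundant they are with each other
DendrogramDrawer().draw(
    data=inspector.feature_redundancy_linkage(),
    title="Feature redundancy dendrogram",
)
```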