Docs/notebook refactoring (#60)
* fixing typos

* initial reformatting of notebook to template

* data for turbine tutorial

* typo fix

* removed sim classification example until updates are completed

* added context for example

Co-authored-by: Jason Bentley <Bentley.Jason@bcg.com>
jason-bentley authored Sep 11, 2020
1 parent 890fa2c commit 7492ee8
Showing 4 changed files with 12,264 additions and 1,186 deletions.
4 changes: 0 additions & 4 deletions sphinx/source/examples.rst
@@ -23,10 +23,6 @@ Predictive Maintenance Regression

Classification on a simulated dataset
------------------------------------------------
- .. toctree::
-    :maxdepth: 2
-
-    tutorial/Classification_simulation_example_Facet



29 changes: 15 additions & 14 deletions sphinx/source/tutorial/Classification_with_Facet.ipynb
@@ -23,7 +23,7 @@
"\n",
"1. Create a robust pipeline for learner selection using LearnerRanker and enabling the use of bootstrap cross-validation.\n",
"\n",
"2. Enhance our model inspection to understand drivers of predictions using local explanations of features via SHAP values by applying a novel methodology that decomposes SHAP values into measures of synergy, redundancy, and independence between each pair of features.\n",
"2. Enhance our model inspection to understand drivers of predictions using local explanations of features via [SHAP values](http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions) by applying a novel methodology that decomposes SHAP values into measures of synergy, redundancy, and independence between each pair of features.\n",
"\n",
"3. Quickly apply historical simulation to gain key insights into feature values that minimize or maximize the predicted outcome.\n",
"\n",
@@ -33,9 +33,12 @@
"\n",
"Prediabetes is a treatable condition that leads to many health complications and eventually type 2 diabetes. Identification of individuals at risk of prediabetes can improve early intervention and provide insights into those interventions that work best.\n",
"Using a cohort of healthy (n=2847) and prediabetic (n=1509) patients derived \n",
"from the [NHANES 2013-14 U.S. cross-sectional survey](https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013) we aim to create a classifier for prediabetes. For further details on data sources, definitions and the study cohort please see the Appendix ([7.1 Data source and study cohort](#Data-source-and-study-cohort)).\n",
"from the [NHANES 2013-14 U.S. cross-sectional survey](https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Examination&CycleBeginYear=2013) we aim to create a classifier for prediabetes. For further details on data sources, definitions and the study cohort please see the Appendix ([Data source and study cohort](#Data-source-and-study-cohort)).\n",
"\n",
"Utilizing FACET we will create a pipeline to find identify a well-performing classifier, and perform model inspection and simulation to gain understanding and insight into key factors predictive of prediabetes.\n",
"Utilizing FACET we will do the following:\n",
"\n",
"- create a pipeline to find identify a well-performing classifier\n",
"- perform model inspection and simulation to gain understanding and insight into key factors predictive of prediabetes.\n",
"\n",
"***\n",
"\n",
@@ -176,7 +179,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"First we need to load our prediabetes data and create a simple preprocessing pipeline. For those interested some initial EDA can be found in the Appendix ([7.2 Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA)))."
"First we need to load our prediabetes data and create a simple preprocessing pipeline. For those interested some initial EDA can be found in the Appendix ([Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA)))."
]
},
{
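For context, the loading step this cell describes might look as follows — a minimal sketch; the file name `prediabetes_cohort.csv` and the target column `prediabetes` are illustrative assumptions, not taken from the commit:

```python
import pandas as pd

# load the NHANES-derived study cohort (file name assumed for illustration)
prediab_df = pd.read_csv("prediabetes_cohort.csv")

# quick sanity checks: cohort size and class balance
print(prediab_df.shape)
print(prediab_df["prediabetes"].value_counts())
```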
@@ -206,7 +209,7 @@
"raw_mimetype": "text/markdown"
},
"source": [
"Using FACET we first create a sample object, which carries information used in other FACET functions."
"Using FACET we create a sample object, which carries information used in other FACET functions."
]
},
{
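A hedged sketch of the sample-object step above, assuming the facet 1.x API where `Sample` lives in `facet.data`, with `prediab_df` and the `prediabetes` target carried over from the loading sketch:

```python
from facet.data import Sample

# a Sample bundles the observations frame with its designated target column,
# so downstream FACET steps (ranking, inspection, simulation) stay consistent
sample = Sample(observations=prediab_df, target_name="prediabetes")

# features and target are then available as DataFrame/Series views
X, y = sample.features, sample.target
```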
@@ -229,7 +232,7 @@
"raw_mimetype": "text/markdown"
},
"source": [
"Next we create a minimum preprocessing pipeline which based on our EDA initial EDA ([7.2 Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA))) needs to address the following:\n",
"Next we create a minimum preprocessing pipeline which based on our initial EDA ([Exploratory Data Analysis](#Exploratory-Data-Analysis-(EDA))) needs to address the following:\n",
"\n",
"1. Simple imputation for missing values in both continuous and categorical features\n",
"2. One-hot encoding for categorical features"
@@ -301,7 +304,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Boruta identified 19 features (out of a potential 47) that we will retain for classification. Note that this feature selection process could be included in a general pre-processing pipeline, however due to the computation involved, we have utilized Boruta here as an initial one-off processing step to narrow down the features for our classifier development."
"Boruta identified 19 features (out of a potential 47) that we will retain for classification. Note that this feature selection process could be included in a general preprocessing pipeline, however due to the computation involved, we have utilized Boruta here as an initial one-off processing step to narrow down the features for our classifier development."
]
},
{
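For reference, the Boruta step described above might be wired up as follows — a sketch assuming sklearndf ships a `BorutaDF` wrapper in `sklearndf.transformation.extra`, with parameters following the underlying BorutaPy package:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation.extra import BorutaDF

# chain preprocessing with Boruta: features that are not confirmed as
# relevant against their shuffled "shadow" copies are dropped
feature_selection = PipelineDF(
    steps=[
        ("preprocess", preprocessing),
        (
            "boruta",
            BorutaDF(
                estimator=RandomForestClassifier(max_depth=5),
                n_estimators="auto",
                random_state=42,
            ),
        ),
    ]
)
feature_selection.fit(X=sample.features, y=sample.target)
```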
@@ -352,7 +355,7 @@
"source": [
"# 1. Random forest learner\n",
"rf_pipeline=ClassifierPipelineDF(\n",
" classifier=RandomForestClassifierDF(),\n",
" classifier=RandomForestClassifierDF(random_state=42),\n",
" preprocessing=preprocessing,\n",
")\n",
"rf_grid=LearnerGrid(\n",
@@ -363,7 +366,7 @@
"\n",
"# 2. Light gradient boosting learner\n",
"gb_pipeline=ClassifierPipelineDF(\n",
" classifier=LGBMClassifierDF(),\n",
" classifier=LGBMClassifierDF(random_state=42),\n",
" preprocessing=preprocessing,\n",
")\n",
"gb_grid=LearnerGrid(\n",
@@ -446,9 +449,7 @@
"inspector=LearnerInspector(\n",
" n_jobs=-3,\n",
" verbose=10,\n",
").fit(\n",
" crossfit=ranker.best_model_crossfit.resize(20)\n",
")"
").fit(crossfit=ranker.best_model_crossfit.resize(20))"
]
},
{
@@ -508,7 +509,7 @@
"source": [
"## Synergy and redundancy\n",
"\n",
"Synergy and redundancy and synergy are part of the key extensions FACET makes to using SHAP values to understand model predictions."
"Synergy and redundancy are part of the key extensions FACET makes to using SHAP values to understand model predictions."
]
},
{
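For reference, a hedged sketch of pulling the pairwise matrices from the fitted inspector, assuming the facet 1.x method names and the `MatrixDrawer` from the companion gamma-pytools package:

```python
from pytools.viz.matrix import MatrixDrawer

# pairwise synergy and redundancy, each as a feature-by-feature matrix
synergy = inspector.feature_synergy_matrix()
redundancy = inspector.feature_redundancy_matrix()

MatrixDrawer(style="matplot").draw(synergy, title="Feature synergies")
MatrixDrawer(style="matplot").draw(redundancy, title="Feature redundancies")
```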
@@ -559,7 +560,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The dendrogram represents the extent of clustering among the features. Taking the `Waist_circumference` and `Waist_to_hgt` features which have the highest redundancy, we can see these features cluster together earliest in the dendrogram. Ideally we want to see features only start to cluster as close to the righthand side of the dendrogram as possible. This implies all features in the model are contributing uniquely to our predictions."
"The dendrogram represents the extent of clustering among the features. Taking the `Waist_circumference` and `Waist_to_hgt` features which have the highest redundancy, we can see these features cluster together earliest in the dendrogram. Ideally we want to see features only start to cluster as close to the right-hand side of the dendrogram as possible. This implies all features in the model are contributing uniquely to our predictions."
]
},
{
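The dendrogram discussed above can be drawn along these lines — assuming the facet 1.x inspector exposes a redundancy linkage and gamma-pytools provides the drawer:

```python
from pytools.viz.dendrogram import DendrogramDrawer

# hierarchical clustering of features by redundancy; the sooner two
# features merge, the more redundant they are with each other
DendrogramDrawer().draw(
    data=inspector.feature_redundancy_linkage(),
    title="Feature redundancy dendrogram",
)
```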