API: Rename ModelSelector to LearnerSelector (#348)
* API: Rename ModelSelector to LearnerSelector

* TEST: fix parameter space definitions in regressor_parameters()

* API: accept multiple parameter spaces as the parameter_space parameter of LearnerInspector

* TEST: adjust test for an updated exception message

* TEST: minor tweaks

* DOC: update classification tutorial (random search, LearnerSelector API)

* DOC: tweak classification tutorial

* BUILD: update package dependencies

* DOC: update classification tutorial (random search, LearnerSelector API)

* DOC: update tutorial notebooks in preparation for the FACET 2.0 release

* DOC: tweak a headline

* BUILD: update package dependencies

* API: improve column names & sequence of LearnerSelector.summary_report()

* DOC: documentation tweaks

* API: clarify 'candidate name' terminology

* DOC: tweak release notes

* DOC: add missing intersphinx mappings

* DOC: fix link to catboost package

* DOC: address sphinx error messages

* DOC: move images from _static/ to _images/

Co-authored-by: Jan Ittner <ittner.jan@bcg.com>
mtsokol and j-ittner authored Sep 19, 2022
1 parent dba2d2c commit 1bdc49b
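The rename from `ModelSelector` to `LearnerSelector` is mechanical, but the workflow behind the name is worth making concrete. The sketch below is a minimal, illustrative stand-in — not FACET's actual implementation — for what a learner selector does: fit each candidate learner across cross-validation folds and rank candidates by mean score. All class and function names here are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def k_fold_indices(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n) into k (train_indices, test_indices) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [
        ([j for f in folds[:i] + folds[i + 1:] for j in f], folds[i])
        for i in range(k)
    ]


class MeanRegressor:
    """Toy learner: always predicts the training mean."""

    def fit(self, X: Sequence, y: Sequence) -> "MeanRegressor":
        self.mean_ = mean(y)
        return self

    def predict(self, X: Sequence) -> List[float]:
        return [self.mean_ for _ in X]

    def score(self, X: Sequence, y: Sequence) -> float:
        # negative mean squared error: higher is better
        return -mean((p - t) ** 2 for p, t in zip(self.predict(X), y))


class ZeroRegressor(MeanRegressor):
    """Toy learner: always predicts zero."""

    def fit(self, X: Sequence, y: Sequence) -> "ZeroRegressor":
        self.mean_ = 0.0
        return self


def rank_candidates(
    candidates: Dict[str, type], X: Sequence, y: Sequence, k: int = 5
) -> List[Tuple[str, float]]:
    """Rank candidate learners by mean cross-validation score, best first."""
    results = {}
    for name, make_learner in candidates.items():
        fold_scores = [
            make_learner()
            .fit([X[i] for i in tr], [y[i] for i in tr])
            .score([X[i] for i in te], [y[i] for i in te])
            for tr, te in k_fold_indices(len(X), k)
        ]
        results[name] = mean(fold_scores)
    return sorted(results.items(), key=lambda kv: -kv[1])


# the mean predictor should beat the zero predictor on a constant target
X = list(range(20))
y = [5.0] * 20
ranking = rank_candidates({"mean": MeanRegressor, "zero": ZeroRegressor}, X, y)
```

The real `LearnerSelector` delegates the search itself to a scikit-learn CV searcher; this toy version only mirrors the fit-score-rank shape of the task.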
Showing 48 changed files with 1,833 additions and 2,096 deletions.
24 changes: 12 additions & 12 deletions README.rst
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
.. image:: sphinx/source/_static/Gamma_Facet_Logo_RGB_LB.svg
.. image:: sphinx/source/_images/Gamma_Facet_Logo_RGB_LB.svg

|
@@ -103,7 +103,7 @@ In this quickstart we will train a Random Forest regressor using 10 repeated
*sklearndf* we can create a *pandas* DataFrame compatible workflow. However,
FACET provides additional enhancements to keep track of our feature matrix
and target vector using a sample object (`Sample`) and easily compare
hyperparameter configurations and even multiple learners with the `ModelSelector`.
hyperparameter configurations and even multiple learners with the `LearnerSelector`.

.. code-block:: Python
@@ -117,7 +117,7 @@ hyperparameter configurations and even multiple learners with the `ModelSelector
# relevant FACET imports
from facet.data import Sample
from facet.selection import ModelSelector, ParameterSpace
from facet.selection import LearnerSelector, ParameterSpace
# declaring url with data
data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'
@@ -153,7 +153,7 @@ hyperparameter configurations and even multiple learners with the `ModelSelector
rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
# rank your candidate models by performance
selector = ModelSelector(
selector = LearnerSelector(
searcher_type=GridSearchCV,
parameter_space=rnd_forest_ps,
cv=rkf_cv,
@@ -164,7 +164,7 @@
# get summary report
selector.summary_report()
.. image:: sphinx/source/_static/ranker_summary.png
.. image:: sphinx/source/_images/ranker_summary.png
:width: 600

We can see based on this minimal workflow that a value of 11 for minimum
@@ -245,7 +245,7 @@ The key global metrics for each pair of features in a model are:
synergy_matrix = inspector.feature_synergy_matrix()
MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Synergy Matrix")
.. image:: sphinx/source/_static/synergy_matrix.png
.. image:: sphinx/source/_images/synergy_matrix.png
:width: 600

For any feature pair (A, B), the first feature (A) is the row, and the second
@@ -273,7 +273,7 @@ to 27% synergy of `LDL` with `LTG` for predicting progression after one year.
redundancy_matrix = inspector.feature_redundancy_matrix()
MatrixDrawer(style="matplot%").draw(redundancy_matrix, title="Redundancy Matrix")
.. image:: sphinx/source/_static/redundancy_matrix.png
.. image:: sphinx/source/_images/redundancy_matrix.png
:width: 600


@@ -312,7 +312,7 @@ Let's look at the example for redundancy.
redundancy = inspector.feature_redundancy_linkage()
DendrogramDrawer().draw(data=redundancy, title="Redundancy Dendrogram")
.. image:: sphinx/source/_static/redundancy_dendrogram.png
.. image:: sphinx/source/_images/redundancy_dendrogram.png
:width: 600

Based on the dendrogram we can see that the feature pairs (`LDL`, `TC`)
@@ -371,7 +371,7 @@ we do the following for the simulation:
# visualise results
SimulationDrawer().draw(data=simulation, title=SIM_FEAT)
.. image:: sphinx/source/_static/simulation_output.png
.. image:: sphinx/source/_images/simulation_output.png

We would conclude from the figure that higher values of `BMI` are associated with
an increase in disease progression after one year, and that for a `BMI` of 28
@@ -427,15 +427,15 @@ BCG GAMMA team. If you would like to know more you can find out about
or have a look at
`career opportunities <https://www.bcg.com/en-gb/beyond-consulting/bcg-gamma/careers>`_.

.. |pipe| image:: sphinx/source/_static/icons/pipe_icon.png
.. |pipe| image:: sphinx/source/_images/icons/pipe_icon.png
:width: 100px
:class: facet_icon

.. |inspect| image:: sphinx/source/_static/icons/inspect_icon.png
.. |inspect| image:: sphinx/source/_images/icons/inspect_icon.png
:width: 100px
:class: facet_icon

.. |sim| image:: sphinx/source/_static/icons/sim_icon.png
.. |sim| image:: sphinx/source/_images/icons/sim_icon.png
:width: 100px
:class: facet_icon

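For readers following the README hunks above without FACET installed: `LearnerSelector` wraps a scikit-learn CV searcher rather than replacing it, so the selection step can be approximated with scikit-learn alone. The sketch below is a rough analogue on synthetic data — the FACET-specific `Sample` and `ParameterSpace` wrappers, and the diabetes dataset itself, are deliberately omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# synthetic stand-in for the diabetes features/target used in the quickstart
rng = np.random.RandomState(42)
X = rng.rand(80, 4)
y = 3.0 * X[:, 0] + 0.1 * rng.rand(80)

# same CV scheme as the README (fewer repeats here, for speed)
rkf_cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)

# plain scikit-learn analogue of the LearnerSelector(searcher_type=GridSearchCV, ...) call
search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={"min_samples_leaf": [2, 4, 8]},
    cv=rkf_cv,
)
search.fit(X, y)
best_leaf = search.best_params_["min_samples_leaf"]
```

What `LearnerSelector.summary_report()` adds on top of this is the tabulated, ranked view over the searcher's `cv_results_`, including comparison across multiple learner types.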
83 changes: 46 additions & 37 deletions RELEASE_NOTES.rst
@@ -2,12 +2,15 @@ Release Notes
=============

.. |mypy| replace:: :external+mypy:doc:`mypy <index>`
.. |shap| replace:: :external+shap:doc:`shap <index>`
.. |nbsp| unicode:: 0xA0
:trim:

FACET 2.0
---------

FACET 2.0 brings numerous API enhancements and improvements, accelerates model
inspection by factor 50 in many practical settings, makes major improvements to
FACET |nbsp| 2.0 brings numerous API enhancements and improvements, accelerates model
inspection by factor |nbsp| 50 in many practical settings, makes major improvements to
visualizations, and is now fully type-checked by |mypy|.


@@ -28,28 +31,30 @@ visualizations, and is now fully type-checked by |mypy|.

- API: :class:`.LearnerInspector` no longer uses learner crossfits and instead inspects
models using a single pass of SHAP calculations, usually leading to performance gains
of up to a factor of 50
- API: return :class:`.LearnerInspector` matrix outputs as :class:`.Matrix` instances
of up to a factor of |nbsp| 50
- API: return :class:`.LearnerInspector` matrix outputs as :class:`~pytools.data.Matrix`
instances
- API: diagonals of feature synergy, redundancy, and association matrices are now
``nan`` instead of 1.0
- API: the leaf order of :class:`.LinkageTree` objects generated by
``nan`` instead of |nbsp| 1.0
- API: the leaf order of :class:`~pytools.data.LinkageTree` objects generated by
``feature_…_linkage`` methods of :class:`.LearnerInspector` is now the same as the
row and column order of :class:`.Matrix` objects returned by the corresponding
``feature_…_matrix`` methods of :class:`.LearnerInspector`, minimizing the distance
between adjacent leaves
The old sorting behaviour of FACET 1 can be restored using method
:meth:`.LinkageTree.sort_by_weight`
row and column order of :class:`~pytools.data.Matrix` objects returned by the
corresponding ``feature_…_matrix`` methods of :class:`.LearnerInspector`, minimizing
the distance between adjacent leaves.
The old sorting behaviour of FACET |nbsp| 1.x can be restored using method
:meth:`~pytools.data.LinkageTree.sort_by_weight`

``facet.selection``
^^^^^^^^^^^^^^^^^^^

- API: :class:`.ModelSelector` replaces FACET 1 class ``LearnerRanker``, and now
supports any CV searcher that supports `scikit-learn`'s CV search API, including
`scikit-learn`'s native searchers such as :class:`.GridSearchCV` or
:class:`.RandomizedSearchCV`
- API: new classes :class:`.ParameterSpace` and :class:`MultiParameterSpace` offer an
a more convenient and robust mechanism for declaring options or distributions for
hyperparameter tuning
- API: :class:`.LearnerSelector` replaces FACET |nbsp| 1.x class ``LearnerRanker``, and
now supports any CV searcher that supports `scikit-learn`'s CV search API, including
`scikit-learn`'s native searchers such as
:class:`~sklearn.model_selection.GridSearchCV` or
:class:`~sklearn.model_selection.RandomizedSearchCV`
- API: new classes :class:`.ParameterSpace` and :class:`.MultiEstimatorParameterSpace`
offer a more convenient and robust mechanism for declaring options or distributions
for hyperparameter tuning

``facet.simulation``
^^^^^^^^^^^^^^^^^^^^
@@ -63,18 +68,19 @@ visualizations, and is now fully type-checked by |mypy|.
``facet.validation``
^^^^^^^^^^^^^^^^^^^^

- API: remove class ``FullSampleValidator``
- API: removed class ``FullSampleValidator``

Other
^^^^^

- API: class ``LearnerCrossfit`` is no longer needed in FACET 2.0 and has been removed
- API: class ``LearnerCrossfit`` is no longer needed in FACET |nbsp| 2.0 and has been
removed


FACET 1.2
---------

FACET 1.2 adds support for *sklearndf* 1.2 and *scikit-learn* 0.24.
FACET |nbsp| 1.2 adds support for *sklearndf* |nbsp| 1.2 and *scikit-learn* |nbsp| 0.24.
It also introduces the ability to run simulations on a subsample of the data used to
fit the underlying crossfit.
One example where this can be useful is to use only a recent period of a time series as
@@ -84,21 +90,21 @@ the baseline of a simulation.
1.2.2
~~~~~

- catch up with FACET 1.1.2
- catch up with FACET |nbsp| 1.1.2


1.2.1
~~~~~

- FIX: fix a bug in :class:`.UnivariateProbabilitySimulator` that was introduced in
FACET 1.2.0
- catch up with FACET 1.1.1
FACET |nbsp| 1.2.0
- catch up with FACET |nbsp| 1.1.1


1.2.0
~~~~~

- BUILD: added support for *sklearndf* 1.2 and *scikit-learn* 0.24
- BUILD: added support for *sklearndf* |nbsp| 1.2 and *scikit-learn* |nbsp| 0.24
- API: new optional parameter ``subsample`` in method
:meth:`.BaseUnivariateSimulator.simulate_feature` can be used to specify a subsample
to be used in the simulation (but simulating using a crossfit based on the full
@@ -108,18 +114,20 @@ the baseline of a simulation.
FACET 1.1
---------

FACET 1.1 refines and enhances the association/synergy/redundancy calculations provided
by the :class:`.LearnerInspector`.
FACET |nbsp| 1.1 refines and enhances the association/synergy/redundancy calculations
provided by the :class:`.LearnerInspector`.


1.1.2
~~~~~

- DOC: use a downloadable dataset in the `getting started` notebook
- FIX: import :mod:`catboost` if present, else create a local module mockup
- FIX: import `catboost <https://catboost.ai/en/docs/>`_ if present, else create a local
module mockup
- FIX: correctly identify if ``sample_weights`` is undefined when re-fitting a model
on the full dataset in a :class:`.LearnerCrossfit`
- BUILD: relax package dependencies to support any `numpy` version 1.`x` from 1.16
on the full dataset in a ``LearnerCrossfit``
- BUILD: relax package dependencies to support any `numpy` version |nbsp| 1.`x` from
|nbsp| 1.16


1.1.1
@@ -143,9 +151,9 @@ by the :class:`.LearnerInspector`.
across matrices as an indication of confidence for each calculated value.
- API: Method :meth:`.LearnerInspector.shap_plot_data` now returns SHAP values for the
positive class of binary classifiers.
- API: Increase efficiency of :class:`.ModelSelector` parallelization by adopting the
- API: Increase efficiency of ``ModelSelector`` parallelization by adopting the
new :class:`pytools.parallelization.JobRunner` API provided by :mod:`pytools`
- BUILD: add support for :mod:`shap` 0.38 and 0.39
- BUILD: add support for :mod:`shap` |nbsp| 0.38 and |nbsp| 0.39


FACET 1.0
@@ -154,19 +162,20 @@ FACET 1.0
1.0.3
~~~~~

- FIX: restrict package requirements to *gamma-pytools* 1.0.* and *sklearndf* 1.0.x,
since FACET 1.0 is not compatible with *gamma-pytools* 1.1.*
- FIX: restrict package requirements to *gamma-pytools* |nbsp| 1.0.* and
*sklearndf* |nbsp| 1.0.x, since FACET |nbsp| 1.0 is not compatible with
*gamma-pytools* |nbsp| 1.1.*

1.0.2
~~~~~

This is a maintenance release focusing on enhancements to the CI/CD pipeline and bug
fixes.

- API: add support for :mod:`shap` 0.36 and 0.37 via a new :class:`.BaseExplainer`
stub class
- API: add support for |shap| |nbsp| 0.36 and |nbsp| 0.37 via a new
:class:`.BaseExplainer` stub class
- FIX: apply color scheme to the histogram section in :class:`.SimulationMatplotStyle`
- BUILD: add support for :mod:`numpy` 1.20
- BUILD: add support for :mod:`numpy` |nbsp| 1.20
- BUILD: updates and changes to the CI/CD pipeline


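The `|nbsp|` substitution that appears throughout the release-notes hunks above is a standard reStructuredText construct, not FACET-specific. It is defined once at the top of the file and then keeps a name and its version number from being broken across lines; the `:trim:` option removes the whitespace around the substitution so the spacing in the source does not leak into the output:

```rst
.. |nbsp| unicode:: 0xA0
   :trim:

FACET |nbsp| 2.0 renders as "FACET 2.0", joined by a non-breaking space.
```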
4 changes: 3 additions & 1 deletion environment.yml
@@ -12,7 +12,7 @@ dependencies:
- numpy ~= 1.22
- pandas ~= 1.4
- python ~= 3.9
- scikit-learn ~= 1.0.2
- scikit-learn ~= 1.1
- scipy ~= 1.8
- shap ~= 0.41
- sklearndf ~= 2.0
@@ -38,6 +38,8 @@ dependencies:
- sphinx-autodoc-typehints ~= 1.19
- pydata-sphinx-theme ~= 0.8.1
# notebooks
- ipywidgets ~= 8.0
- jupyterlab ~= 3.2
- openpyxl ~= 3.0
- seaborn ~= 0.11
- tableone ~= 0.7
8 changes: 4 additions & 4 deletions pyproject.toml
@@ -74,15 +74,15 @@ no-binary.min = ["matplotlib", "shap"]

[build.matrix.min]
# direct requirements of gamma-facet
gamma-pytools = "~=2.0.2"
gamma-pytools = "~=2.0.4"
matplotlib = "~=3.0.3"
numpy = "==1.21.6" # cannot use ~= due to conda bug
packaging = "~=20.9"
pandas = "~=1.0.5"
python = ">=3.7.12,<3.8a" # cannot use ~= due to conda bug
scipy = "~=1.4.1"
shap = "~=0.34.0"
sklearndf = "~=2.0.0"
sklearndf = "~=2.0.1"
# additional minimum requirements of sklearndf
boruta = "~=0.3.0"
lightgbm = "~=3.0.0"
@@ -105,11 +105,11 @@ pandas = "~=1.4"
python = ">=3.9,<4a" # cannot use ~= due to conda bug
scipy = "~=1.8"
shap = "~=0.41"
sklearndf = "~=2.0"
sklearndf = "~=2.1"
# additional maximum requirements of sklearndf
boruta = "~=0.3"
lightgbm = "~=3.3"
scikit-learn = "~=1.0.2"
scikit-learn = "~=1.1"
xgboost = "~=1.5"
# additional maximum requirements of gamma-pytools
joblib = "~=1.1"
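The dependency bumps above all use the compatible-release operator `~=` from PEP 440: `~=1.1` allows any `1.x` release from 1.1 on, while `~=2.0.1` stays within the `2.0.x` series. This can be checked directly with the `packaging` library (assumed installed; it ships with most Python toolchains):

```python
from packaging.specifiers import SpecifierSet

# "~=1.1" means ">=1.1, ==1.*": any 1.x release from 1.1 upward
sklearn_pin = SpecifierSet("~=1.1")

# "~=2.0.1" means ">=2.0.1, ==2.0.*": pinned to the 2.0.x series
sklearndf_min_pin = SpecifierSet("~=2.0.1")

ok_11 = "1.1.3" in sklearn_pin          # in the 1.x series, >= 1.1
ok_20 = "2.0" in sklearn_pin            # 2.x is excluded
ok_204 = "2.0.4" in sklearndf_min_pin   # still in 2.0.x
ok_210 = "2.1.0" in sklearndf_min_pin   # 2.1 leaves the 2.0.x series
```

This is why the commit can move `sklearndf = "~=2.0.0"` to `"~=2.0.1"` in the minimum matrix without widening the allowed major or minor version.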
14 changes: 7 additions & 7 deletions sphinx/auxiliary/Diabetes_getting_started_example.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<img src=\"../source/_static/Gamma_Facet_Logo_RGB_LB.svg\" width=\"500\" style=\"padding-bottom: 70px; padding-top: 70px; margin: auto; display: block\">"
"<img src=\"../source/_images/Gamma_Facet_Logo_RGB_LB.svg\" width=\"500\" style=\"padding-bottom: 70px; padding-top: 70px; margin: auto; display: block\">"
]
},
{
@@ -71,7 +71,7 @@
"To demonstrate the model inspection capability of FACET, we first create a pipeline to fit a learner. In this simple example we use the [diabetes dataset](https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data) which contains age, sex, BMI and blood pressure along with 6 blood serum measurements as features. This dataset was used in this\n",
"[publication](https://statweb.stanford.edu/~tibs/ftp/lars.pdf). A transformed version of this dataset is also available on scikit-learn [here](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).\n",
"\n",
"In this quickstart we will train a Random Forest regressor using 10 repeated 5-fold CV to predict disease progression after one year. With the use of *sklearndf* we can create a *pandas* DataFrame compatible workflow. However, FACET provides additional enhancements to keep track of our feature matrix and target vector using a sample object (`Sample`) and easily compare hyperparameter configurations and even multiple learners with the `ModelSelector`."
"In this quickstart we will train a Random Forest regressor using 10 repeated 5-fold CV to predict disease progression after one year. With the use of *sklearndf* we can create a *pandas* DataFrame compatible workflow. However, FACET provides additional enhancements to keep track of our feature matrix and target vector using a sample object (`Sample`) and easily compare hyperparameter configurations and even multiple learners with the `LearnerSelector`."
]
},
{
@@ -274,7 +274,7 @@
],
"source": [
"# rank your candidate models by performance\n",
"selector = ModelSelector(\n",
"selector = LearnerSelector(\n",
" searcher_type=GridSearchCV,\n",
" parameter_space=rnd_forest_ps, \n",
" cv=rkf_cv, \n",
@@ -399,7 +399,7 @@
"# save copy of plot to _static directory for documentation\n",
"MatrixDrawer(style=\"matplot%\").draw(synergy_matrix, title=\"Synergy Matrix\")\n",
"plt.savefig(\n",
" \"../source/_static/synergy_matrix.png\", bbox_inches=\"tight\", pad_inches=0\n",
" \"../source/_images/synergy_matrix.png\", bbox_inches=\"tight\", pad_inches=0\n",
")"
]
},
@@ -456,7 +456,7 @@
"# save copy of plot to _static directory for documentation\n",
"MatrixDrawer(style=\"matplot%\").draw(redundancy_matrix, title=\"Redundancy Matrix\")\n",
"plt.savefig(\n",
" \"../source/_static/redundancy_matrix.png\",\n",
" \"../source/_images/redundancy_matrix.png\",\n",
" bbox_inches=\"tight\",\n",
" pad_inches=0,\n",
")"
@@ -525,7 +525,7 @@
"\n",
"# save copy of plot to _static directories for documentation\n",
"plt.savefig(\n",
" \"../source/_static/redundancy_dendrogram.png\",\n",
" \"../source/_images/redundancy_dendrogram.png\",\n",
" bbox_inches=\"tight\",\n",
" pad_inches=0,\n",
")"
@@ -608,7 +608,7 @@
"\n",
"# save copy of plot to _static directory for documentation\n",
"plt.savefig(\n",
" \"../source/_static/simulation_output.png\",\n",
" \"../source/_images/simulation_output.png\",\n",
" bbox_inches=\"tight\",\n",
" pad_inches=0,\n",
")"