Add model selection documentation
Signed-off-by: Keith Battocchi <kebatt@microsoft.com>
kbattocchi committed Feb 6, 2024
1 parent a09ca51 commit 9f517fd
Showing 9 changed files with 384 additions and 707 deletions.
43 changes: 43 additions & 0 deletions doc/spec/model_selection.rst
@@ -0,0 +1,43 @@
.. _model_selection:

=================
Model Selection
=================

Estimators that derive from :class:`._OrthoLearner` fit first stage nuisance models on different folds of the data and then fit a final model.
In many cases it will make sense to perform model selection over a number of first-stage models, and the library facilitates this by allowing
a flexible specification of the first-stage models, as any of the following:

* An sklearn-compatible estimator

* If the estimator is a known class that performs its own hyperparameter selection via cross-validation (such as :class:`~sklearn.linear_model.LassoCV`),
then this will be done once and then the selected hyperparameters will be used when cross-fitting on each fold

* If a custom class is used, then it should support a `fit` method and either a `predict` method if the target is continuous or `predict_proba` if the target is discrete.

* One of the following strings; the exact set of models supported by each of these keywords may vary depending on the version of our package:

``"linear"``
Selects over linear models regularized by L1 or L2 norm

``"poly"``
Selects over regularized linear models with polynomial features of different degrees

``"forest"``
Selects over random forest models

``"gbf"``
Selects over gradient boosting models

``"nnet"``
Selects over neural network models

``"automl"``
Selects over all of the above (note that this will be potentially time consuming)

* A list of any of the above

* An implementation of :class:`.ModelSelector`, which is a class that supports a two-stage model selection and fitting process
(this is used internally by our library and is not generally intended to be used directly by end users).

Most subclasses also use the string ``"auto"`` as a special default value to automatically select a model from an appropriate smaller subset of models than would be generated by ``"automl"``.
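The "select once, then cross-fit" behaviour described above can be illustrated with a minimal, library-free sketch. This is not EconML's :class:`.ModelSelector` implementation: the toy candidates ``MeanModel`` and ``LineModel``, the helpers ``folds`` and ``cv_mse``, and the function ``select_then_crossfit`` are all hypothetical names introduced only for illustration, and the selection criterion (out-of-fold mean squared error) is an assumption.

```python
from statistics import mean


class MeanModel:
    """Hypothetical toy candidate: always predicts the training mean."""
    def fit(self, X, y):
        self.mu_ = mean(y)
        return self

    def predict(self, X):
        return [self.mu_ for _ in X]


class LineModel:
    """Hypothetical toy candidate: one-feature least-squares line."""
    def fit(self, X, y):
        mx, my = mean(X), mean(y)
        sxx = sum((x - mx) ** 2 for x in X)
        sxy = sum((x - mx) * (v - my) for x, v in zip(X, y))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.icpt_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.icpt_ + self.slope_ * x for x in X]


def folds(n, k):
    """Yield (train_idx, test_idx) pairs for k interleaved folds."""
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        yield train, test


def cv_mse(factory, X, y, k=2):
    """Out-of-fold mean squared error, fitting a fresh model per fold."""
    errs = []
    for train, test in folds(len(X), k):
        m = factory().fit([X[i] for i in train], [y[i] for i in train])
        preds = m.predict([X[i] for i in test])
        errs += [(p - y[i]) ** 2 for p, i in zip(preds, test)]
    return mean(errs)


def select_then_crossfit(candidates, X, y, k=2):
    # Stage 1: choose a single candidate once, using all of the data.
    best = min(candidates, key=lambda c: cv_mse(c, X, y, k))
    # Stage 2: cross-fit the chosen candidate, predicting each held-out fold.
    out = [None] * len(X)
    for train, test in folds(len(X), k):
        m = best().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(test, m.predict([X[i] for i in test])):
            out[i] = p
    return best, out


X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]  # generated as y = 2x + 1
best, oof = select_then_crossfit([MeanModel, LineModel], X, y)
residuals = [v - p for v, p in zip(y, oof)]  # out-of-fold nuisance residuals
```

Passing a list of candidates plays the role of the "list of any of the above" option; the point of the sketch is only that the choice among candidates is made once, while the chosen model is still refit on each training fold.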
1 change: 1 addition & 0 deletions doc/spec/spec.rst
@@ -12,6 +12,7 @@ EconML User Guide
estimation_iv
estimation_dynamic
inference
model_selection
interpretability
federated_learning
references
36 changes: 12 additions & 24 deletions econml/dml/causal_forest.py
@@ -268,35 +268,23 @@ class CausalForestDML(_BaseDML):
Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.
- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise
model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
Determines how to fit the treatment to the features. str in a sentence
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.
- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise
featurizer : :term:`transformer`, optional
Must support fit_transform and transform. Used to create composite features in the final CATE regression.
164 changes: 55 additions & 109 deletions econml/dml/dml.py
@@ -358,35 +358,23 @@ class takes as input the parameter `model_t`, which is an arbitrary scikit-learn
Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.
- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise
model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise
model_final: estimator
The estimator for fitting the response residuals to the treatment residuals. Must implement
@@ -626,35 +614,23 @@ class LinearDML(StatsModelsCateEstimatorMixin, DML):
Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise
featurizer : :term:`transformer`, optional
Must support fit_transform and transform. Used to create composite features in the final CATE regression.
@@ -873,35 +849,23 @@ class SparseLinearDML(DebiasedLassoCateEstimatorMixin, DML):
Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.
- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise
model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise
alpha: str or float, default 'auto'
CATE L1 regularization applied through the debiased lasso in the final model.
@@ -1172,32 +1136,23 @@ class KernelDML(DML):
Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise
fit_cate_intercept : bool, default True
Whether the linear CATE model should have a constant term.
@@ -1397,32 +1352,23 @@ class NonParamDML(_BaseDML):
Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.
- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise
model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.
- If an estimator, will use the model as is for fitting.
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models
User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise
model_final: estimator
The estimator for fitting the response residuals to the treatment residuals. Must implement