Improve model selection #848

Merged: 13 commits, Feb 14, 2024
43 changes: 43 additions & 0 deletions doc/spec/model_selection.rst
@@ -0,0 +1,43 @@
.. _model_selection:

=================
Model Selection
=================

Estimators that derive from :class:`._OrthoLearner` fit first-stage nuisance models on different folds of the data and then fit a final model.
In many cases it makes sense to perform model selection over several candidate first-stage models, and the library facilitates this by allowing
the first-stage models to be specified flexibly, as any of the following:

* An sklearn-compatible estimator

* If the estimator is a known class that performs its own hyperparameter selection via cross-validation (such as :class:`~sklearn.linear_model.LassoCV`),
then this will be done once and then the selected hyperparameters will be used when cross-fitting on each fold

* If a custom class is used, then it should support a ``fit`` method and either a ``predict`` method if the target is continuous or ``predict_proba`` if the target is discrete.

* One of the following strings; the exact set of models supported by each of these keywords may vary depending on the version of our package:

``"linear"``
Selects over linear models regularized by L1 or L2 norm

``"poly"``
Selects over regularized linear models with polynomial features of different degrees

``"forest"``
Selects over random forest models

``"gbf"``
Selects over gradient boosting models

``"nnet"``
Selects over neural network models

``"automl"``
Selects over all of the above (note that this will be potentially time consuming)

* A list of any of the above

* An implementation of :class:`.ModelSelector`, which is a class that supports a two-stage model selection and fitting process
(this is used internally by our library and is not generally intended to be used directly by end users).
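
The custom-class requirement in the list above is a duck-typing contract. The sketch below illustrates it with two toy models and a checking helper. All three names (``MeanRegressor``, ``PriorClassifier``, ``satisfies_first_stage_contract``) are hypothetical and are not part of the library; the library itself may require more of a model than this (for example, that it can be cloned).

```python
class MeanRegressor:
    """Toy regressor (hypothetical): predicts the training mean of y for every row."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class PriorClassifier:
    """Toy classifier (hypothetical): predicts the training class frequencies for every row."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.priors_ = [sum(1 for v in y if v == c) / len(y) for c in self.classes_]
        return self

    def predict_proba(self, X):
        return [list(self.priors_) for _ in X]


def satisfies_first_stage_contract(model, discrete_target):
    """Hypothetical helper: check for `fit` plus `predict` (continuous target)
    or `predict_proba` (discrete target), as described in the list above."""
    needed = "predict_proba" if discrete_target else "predict"
    return callable(getattr(model, "fit", None)) and callable(getattr(model, needed, None))


ok_regressor = satisfies_first_stage_contract(MeanRegressor(), discrete_target=False)
ok_classifier = satisfies_first_stage_contract(PriorClassifier(), discrete_target=True)
mismatch = satisfies_first_stage_contract(MeanRegressor(), discrete_target=True)
```

Here ``mismatch`` is false because a regressor without ``predict_proba`` does not meet the contract for a discrete target.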

Most subclasses also use the string ``"auto"`` as a special default value to automatically select a model from an appropriate smaller subset of models than would be generated by ``"automl"``.
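
The "select once, then cross-fit" behaviour described for self-tuning estimators can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation. It uses a one-dimensional ridge regression without intercept as a stand-in model, and the helper names (``fit_ridge``, ``make_folds``, ``cv_mse``, ``select_then_crossfit``) are hypothetical.

```python
from statistics import fmean


def fit_ridge(xs, ys, alpha):
    """Closed-form 1-D ridge slope without intercept: sum(x*y) / (sum(x*x) + alpha)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def make_folds(n, k):
    """Split indices 0..n-1 into k interleaved (train, test) index pairs."""
    return [
        ([i for i in range(n) if i % k != f], [i for i in range(n) if i % k == f])
        for f in range(k)
    ]


def cv_mse(xs, ys, alpha, folds):
    """Mean out-of-fold squared error for one candidate alpha."""
    errors = []
    for train, test in folds:
        beta = fit_ridge([xs[i] for i in train], [ys[i] for i in train], alpha)
        errors.extend((ys[i] - beta * xs[i]) ** 2 for i in test)
    return fmean(errors)


def select_then_crossfit(xs, ys, alphas, k=2):
    folds = make_folds(len(xs), k)
    # Stage 1 (selection): choose the hyperparameter once, using all folds.
    best_alpha = min(alphas, key=lambda a: cv_mse(xs, ys, a, folds))
    # Stage 2 (cross-fitting): refit on each training fold with the
    # hyperparameter held fixed, and collect out-of-fold residuals.
    residuals = [0.0] * len(xs)
    for train, test in folds:
        beta = fit_ridge([xs[i] for i in train], [ys[i] for i in train], best_alpha)
        for i in test:
            residuals[i] = ys[i] - beta * xs[i]
    return best_alpha, residuals


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # noiseless data, so no shrinkage is optimal
best_alpha, residuals = select_then_crossfit(xs, ys, alphas=[0.0, 1.0, 10.0])
```

Fixing the hyperparameter before cross-fitting means the expensive search runs once instead of once per fold. Each fold's model is still trained only on that fold's training rows.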
1 change: 1 addition & 0 deletions doc/spec/spec.rst
@@ -12,6 +12,7 @@ EconML User Guide
estimation_iv
estimation_dynamic
inference
model_selection
interpretability
federated_learning
references
246 changes: 157 additions & 89 deletions econml/_ortho_learner.py

Large diffs are not rendered by default.

977 changes: 493 additions & 484 deletions econml/dml/_rlearner.py

Large diffs are not rendered by default.

36 changes: 12 additions & 24 deletions econml/dml/causal_forest.py
@@ -268,35 +268,23 @@ class CausalForestDML(_BaseDML):

Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.

- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise

model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
Determines how to fit the treatment to the features. str in a sentence

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.

- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise

featurizer : :term:`transformer`, optional
Must support fit_transform and transform. Used to create composite features in the final CATE regression.
172 changes: 61 additions & 111 deletions econml/dml/dml.py
@@ -82,7 +82,7 @@ def __init__(self, model: SingleModelSelector, discrete_target):
self._model = clone(model, safe=False)
self._discrete_target = discrete_target

def train(self, is_selecting, X, W, Target, sample_weight=None, groups=None):
def train(self, is_selecting, folds, X, W, Target, sample_weight=None, groups=None):
if self._discrete_target:
# In this case, the Target is the one-hot-encoding of the treatment variable
# We need to go back to the label representation of the one-hot so as to call
@@ -92,7 +92,7 @@ def train(self, is_selecting, X, W, Target, sample_weight=None, groups=None):
"don't contain all treatments")
Target = inverse_onehot(Target)

self._model.train(is_selecting, _combine(X, W, Target.shape[0]), Target,
self._model.train(is_selecting, folds, _combine(X, W, Target.shape[0]), Target,
**filter_none_kwargs(groups=groups, sample_weight=sample_weight))
return self

@@ -104,6 +104,10 @@ def best_model(self):
def best_score(self):
return self._model.best_score

@property
def needs_fit(self):
return self._model.needs_fit


def _make_first_stage_selector(model, is_discrete, random_state):
if model == 'auto':
@@ -354,35 +358,23 @@ class takes as input the parameter `model_t`, which is an arbitrary scikit-learn

Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.

model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.

- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise

model_final: estimator
The estimator for fitting the response residuals to the treatment residuals. Must implement
@@ -622,35 +614,23 @@ class LinearDML(StatsModelsCateEstimatorMixin, DML):

Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.

- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise

model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise

featurizer : :term:`transformer`, optional
Must support fit_transform and transform. Used to create composite features in the final CATE regression.
@@ -869,35 +849,23 @@ class SparseLinearDML(DebiasedLassoCateEstimatorMixin, DML):

Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.

model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.

- 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise

alpha: str or float, default 'auto'
CATE L1 regularization applied through the debiased lasso in the final model.
@@ -1168,32 +1136,23 @@ class KernelDML(DML):

Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.

- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise

model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise

fit_cate_intercept : bool, default True
Whether the linear CATE model should have a constant term.
@@ -1393,32 +1352,23 @@ class NonParamDML(_BaseDML):

Parameters
----------
model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
Determines how to fit the treatment to the features.
model_y: estimator, default ``'auto'``
Determines how to fit the outcome to the features.

- If an estimator, will use the model as is for fitting.
- If str, will use model associated with the keyword.
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

- 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
- 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_outcome` is True
and a regressor otherwise

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_outcome=True.

model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
model_t: estimator, default ``'auto'``
Determines how to fit the treatment to the features.

- If an estimator, will use the model as is for fitting.

- If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
and then use the best estimator for fitting.
- If 'auto', model will select over linear and forest models
- If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

User-supplied estimators should support 'fit' and 'predict' methods,
and additionally 'predict_proba' if discrete_treatment=True.
- Otherwise, see :ref:`model_selection` for the range of supported options;
if a single model is specified it should be a classifier if `discrete_treatment` is True
and a regressor otherwise

model_final: estimator
The estimator for fitting the response residuals to the treatment residuals. Must implement