Improve model selection #848

Merged
merged 13 commits into from
Feb 14, 2024
43 changes: 43 additions & 0 deletions doc/spec/model_selection.rst
@@ -0,0 +1,43 @@
.. _model_selection:

=================
Model Selection
=================

Estimators that derive from :class:`._OrthoLearner` fit first-stage nuisance models on different folds of the data and then fit a final model.
In many cases it will make sense to perform model selection over a number of first-stage models, and the library facilitates this by allowing
a flexible specification of the first-stage models, as any of the following:

* An sklearn-compatible estimator

   * If the estimator is a known class that performs its own hyperparameter selection via cross-validation (such as :class:`~sklearn.linear_model.LassoCV`),
     then this will be done once and then the selected hyperparameters will be used when cross-fitting on each fold

   * If a custom class is used, then it should support a ``fit`` method and either a ``predict`` method if the target is continuous or ``predict_proba`` if the target is discrete.

* One of the following strings; the exact set of models supported by each of these keywords may vary depending on the version of our package:

   ``"linear"``
      Selects over linear models regularized by L1 or L2 norm

   ``"poly"``
      Selects over regularized linear models with polynomial features of different degrees

   ``"forest"``
      Selects over random forest models

   ``"gbf"``
      Selects over gradient boosting models

   ``"nnet"``
      Selects over neural network models

   ``"automl"``
      Selects over all of the above (note that this will be potentially time consuming)

* A list of any of the above

* An implementation of :class:`.ModelSelector`, which is a class that supports a two-stage model selection and fitting process
  (this is used internally by our library and is not generally intended to be used directly by end users).
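
The accepted forms listed above can be illustrated with a small standalone sketch. The helper ``flatten_spec``, the ``KEYWORDS`` set and the ``MeanRegressor`` class are hypothetical names introduced only for this illustration; they are not part of the library, and the library's actual handling of these specifications may differ. The sketch only mirrors the rules stated in the list: keyword strings, nested lists, and custom estimators that are duck-typed on ``fit`` plus ``predict`` or ``predict_proba``.

```python
# Hypothetical helper, not part of the library: it only mirrors the rules
# described in the list above.
KEYWORDS = {"linear", "poly", "forest", "gbf", "nnet", "automl"}


def flatten_spec(spec, discrete_target=False):
    """Expand a first-stage model specification into a flat list of candidates."""
    if isinstance(spec, str):
        if spec not in KEYWORDS:
            raise ValueError(f"unknown model keyword: {spec!r}")
        return [spec]
    if isinstance(spec, (list, tuple)):
        candidates = []
        for item in spec:
            candidates.extend(flatten_spec(item, discrete_target))
        return candidates
    # Anything else is treated as a custom estimator and duck-typed.
    required = ("fit", "predict_proba" if discrete_target else "predict")
    missing = [name for name in required if not callable(getattr(spec, name, None))]
    if missing:
        raise TypeError(f"custom model lacks required method(s): {missing}")
    return [spec]


class MeanRegressor:
    """Toy custom model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


# A list may mix keywords, nested lists and custom estimators.
candidates = flatten_spec(["linear", ["forest", MeanRegressor()]])
print(len(candidates))
```

Passing ``MeanRegressor()`` with ``discrete_target=True`` raises a ``TypeError`` in this sketch, because the toy class has no ``predict_proba`` method.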

Most subclasses also accept the string ``"auto"`` as a special default value, which automatically selects a model from a smaller, more appropriate subset of models than ``"automl"`` would generate.
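
The "select once, then cross-fit" behavior described for estimators such as :class:`~sklearn.linear_model.LassoCV` can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: a one-dimensional ridge regression through the origin stands in for the first-stage model, and the function names ``fit_ridge``, ``make_folds``, ``cv_error`` and ``select_then_crossfit`` are hypothetical.

```python
def fit_ridge(xs, ys, alpha):
    """Closed-form slope of a 1-D ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def make_folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cv_error(xs, ys, alpha, folds):
    """Mean squared out-of-fold error for a given penalty alpha."""
    total = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        beta = fit_ridge([xs[i] for i in train], [ys[i] for i in train], alpha)
        total += sum((ys[i] - beta * xs[i]) ** 2 for i in test)
    return total / len(xs)


def select_then_crossfit(xs, ys, alphas, k=2):
    folds = make_folds(len(xs), k)
    # Stage 1: choose the hyperparameter once, using all of the data.
    best_alpha = min(alphas, key=lambda a: cv_error(xs, ys, a, folds))
    # Stage 2: cross-fit with the hyperparameter held fixed, so that each
    # observation is predicted by a model trained on the other folds.
    preds = [None] * len(xs)
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        beta = fit_ridge([xs[i] for i in train], [ys[i] for i in train], best_alpha)
        for i in test:
            preds[i] = beta * xs[i]
    return best_alpha, preds


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # noise-free toy data
best_alpha, preds = select_then_crossfit(xs, ys, alphas=[0.0, 1.0, 10.0])
```

The point of the design is that the hyperparameter search is not repeated inside each fold: only the model parameters (here the slope) are re-estimated per fold, which keeps the cost of cross-fitting low.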
1 change: 1 addition & 0 deletions doc/spec/spec.rst
@@ -12,6 +12,7 @@ EconML User Guide
     estimation_iv
     estimation_dynamic
     inference
+    model_selection
     interpretability
     federated_learning
     references
246 changes: 157 additions & 89 deletions econml/_ortho_learner.py

Large diffs are not rendered by default.

977 changes: 493 additions & 484 deletions econml/dml/_rlearner.py

Large diffs are not rendered by default.

42 changes: 15 additions & 27 deletions econml/dml/causal_forest.py
@@ -268,35 +268,23 @@ class CausalForestDML(_BaseDML):

     Parameters
     ----------
-    model_y: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto'
-        Determines how to fit the treatment to the features.
-
-        - If an estimator, will use the model as is for fitting.
-        - If str, will use model associated with the keyword.
+    model_y: estimator, default ``'auto'``
+        Determines how to fit the outcome to the features.

-            - 'linear' - LogisticRegressionCV if discrete_outcome=True else WeightedLassoCVWrapper
-            - 'forest' - RandomForestClassifier if discrete_outcome=True else RandomForestRegressor
-        - If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
-          and then use the best estimator for fitting.
-        - If 'auto', model will select over linear and forest models
+        - If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

-        User-supplied estimators should support 'fit' and 'predict' methods,
-        and additionally 'predict_proba' if discrete_outcome=True.
+        - Otherwise, see :ref:`model_selection` for the range of supported options;
+          if a single model is specified it should be a classifier if `discrete_outcome` is True
+          and a regressor otherwise

-    model_t: estimator, {'linear', 'forest'}, list of str/estimator, or 'auto', default 'auto'
-        Determines how to fit the treatment to the features. str in a sentence
-
-        - If an estimator, will use the model as is for fitting.
-        - If str, will use model associated with the keyword.
+    model_t: estimator, default ``'auto'``
+        Determines how to fit the treatment to the features.

-            - 'linear' - LogisticRegressionCV if discrete_treatment=True else WeightedLassoCVWrapper
-            - 'forest' - RandomForestClassifier if discrete_treatment=True else RandomForestRegressor
-        - If list, will perform model selection on the supplied list, which can be a mix of str and estimators, \
-          and then use the best estimator for fitting.
-        - If 'auto', model will select over linear and forest models
+        - If ``'auto'``, the model will be the best-fitting of a set of linear and forest models

-        User-supplied estimators should support 'fit' and 'predict' methods,
-        and additionally 'predict_proba' if discrete_treatment=True.
+        - Otherwise, see :ref:`model_selection` for the range of supported options;
+          if a single model is specified it should be a classifier if `discrete_treatment` is True
+          and a regressor otherwise

     featurizer : :term:`transformer`, optional
         Must support fit_transform and transform. Used to create composite features in the final CATE regression.
@@ -569,10 +557,10 @@ class CausalForestDML(_BaseDML):
         est.fit(y, T, X=X, W=None)

     >>> est.effect(X[:3])
-    array([0.88518..., 1.25061..., 0.81112...])
+    array([0.62947..., 1.64576..., 0.68496... ])
     >>> est.effect_interval(X[:3])
-    (array([0.40163..., 0.75023..., 0.46629...]),
-     array([1.36873..., 1.75099..., 1.15596...]))
+    (array([0.19136... , 1.17143..., 0.10789...]),
+     array([1.06758..., 2.12009..., 1.26203...]))

     Attributes
     ----------