A modern implementation of the Super Learner algorithm for ensemble learning and model stacking

Authors: Jeremy Coyle, Nima Hejazi, Ivana Malenica, Oleg Sofrygin

`sl3` is a modern implementation of the Super Learner algorithm of @vdl2007super. The Super Learner algorithm performs ensemble learning in one of two fashions:
- The "discrete" Super Learner can be used to select the best prediction algorithm among a supplied library of learning algorithms ("learners" in the
sl3
nomenclature) -- that is, that algorithm which minimizes the cross-validated risk with respect to some appropriate loss function. - The "ensemble" Super Learner can be used to assign weights to specified learning algorithms (in a user-supplied library) in order to create a combination of these learners that minimizes the cross-validated risk with respect to an appropriate loss function. This notion of weighted combinations has also been called stacked regression [@breiman1996stacked].
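The sketch below is purely illustrative and is not taken from the package documentation: it assumes an `sl3_Task` named `task` (like the one constructed in the example further down) and uses two arbitrary candidate learners together with `Lrnr_sl`, `sl3`'s Super Learner learner, and a non-negative least squares metalearner.

```r
# Minimal sketch of the "ensemble" Super Learner via Lrnr_sl. `task` is assumed
# to be an sl3_Task like the one built in the example below; the candidate
# learners here are placeholders.
library(sl3)

candidates <- list(Lrnr_glm$new(), Lrnr_mean$new())
sl <- Lrnr_sl$new(
  learners = candidates,
  metalearner = Lrnr_nnls$new()  # ensemble weights via non-negative least squares
)
# sl_fit <- sl$train(task)   # cross-validates the candidates and fits the weights
# head(sl_fit$predict())     # ensemble predictions
```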
Install the most recent stable release from GitHub via `devtools`:

```r
devtools::install_github("jeremyrcoyle/sl3")
```
If you encounter any bugs or have any specific feature requests, please file an issue.
`sl3` makes it essentially trivial to apply screening algorithms and learning algorithms, combine both into a stacked regression model, and cross-validate that entire process. The best way to understand this is to see the `sl3` package in action:
```r
set.seed(49753)
suppressMessages(library(data.table))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(SuperLearner)
#> Loading required package: nnls
#> Super Learner
#> Version: 2.0-22
#> Package created on 2017-07-18
library(origami)
#> origami: Generalized Cross-Validation Framework
#> Version: 1.0.0
library(sl3)
# load example data set
data(cpp)
cpp <- cpp %>%
dplyr::filter(!is.na(haz)) %>%
mutate_all(funs(replace(., is.na(.), 0)))
# use covariates of interest and the outcome to build a task object
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
"sexn")
task <- sl3_Task$new(cpp, covariates = covars, outcome = "haz")
# set up screeners and learners via built-in functions and pipelines
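# Lrnr_pkg_SuperLearner_screener wraps a SuperLearner screening wrapper;
# screen.glmnet retains covariates with nonzero lasso coefficients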
slscreener <- Lrnr_pkg_SuperLearner_screener$new("screen.glmnet")
glm_learner <- Lrnr_glm$new()
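# a Pipeline chains learners: the screener selects covariates, and the GLM is
# then fit using only the covariates the screener selected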
screen_and_glm <- Pipeline$new(slscreener, glm_learner)
SL.glmnet_learner <- Lrnr_pkg_SuperLearner$new(SL_wrapper = "SL.glmnet")
# stack learners into a model (including screeners and pipelines)
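# a Stack trains each of its learners on the same task and returns their
# predictions side by side, one column per learner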
learner_stack <- Stack$new(SL.glmnet_learner, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
#> Loading required package: glmnet
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-13
preds <- stack_fit$predict()
head(preds)
#> Lrnr_pkg_SuperLearner_SL.glmnet Lrnr_glm_TRUE
#> 1: 0.35345519 0.36298498
#> 2: 0.35345519 0.36298498
#> 3: 0.24554305 0.25993072
#> 4: 0.24554305 0.25993072
#> 5: 0.24554305 0.25993072
#> 6: 0.02953193 0.05680264
#> Lrnr_pkg_SuperLearner_screener_screen.glmnet___Lrnr_glm_TRUE
#> 1: 0.36228209
#> 2: 0.36228209
#> 3: 0.25870995
#> 4: 0.25870995
#> 5: 0.25870995
#> 6: 0.05600958
```
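The stack above was trained on the full data. To cross-validate the entire ensemble of learners, screeners, and pipelines, the stack can be wrapped in a cross-validation learner. The following is a rough sketch, reusing the `task` and `learner_stack` objects from above and assuming the default folds of `Lrnr_cv`:

```r
# Sketch: wrap the stack in Lrnr_cv to obtain cross-validated (out-of-fold)
# predictions for every learner in the stack.
cv_stack <- Lrnr_cv$new(learner_stack)
cv_fit <- cv_stack$train(task)
cv_preds <- cv_fit$predict()  # one column of out-of-fold predictions per learner
head(cv_preds)
```

Cross-validated risks computed from such predictions are what the discrete and ensemble Super Learners minimize when selecting or weighting learners.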
It is our hope that `sl3` will grow to be widely used for creating stacked regression models and cross-validating the pipelines that make up such models, as well as for the variety of other applications in which the Super Learner algorithm plays a role. To that end, contributions are very welcome, though we ask that interested contributors consult our contribution guidelines prior to submitting a pull request.
© 2017 Jeremy R. Coyle, Nima S. Hejazi, Ivana Malenica, Oleg Sofrygin

The contents of this repository are distributed under the GPL-3 license. See the file LICENSE for details.