# Super (Machine) Learning {#sl3}
_Ivana Malenica_ and _Rachael Phillips_
Based on the [`sl3` `R` package](https://github.com/tlverse/sl3) by _Jeremy
Coyle, Nima Hejazi, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin_.
Updated: `r Sys.Date()`
## Learning Objectives {-}
By the end of this chapter you will be able to:
1. Select a loss function that is appropriate for the functional parameter to
be estimated.
2. Assemble an ensemble of learners based on the properties that identify what
features they support.
3. Customize learner hyperparameters to incorporate a diversity of different
settings.
4. Select a subset of available covariates and pass only those variables to the
modeling algorithm.
5. Fit an ensemble with nested cross-validation to obtain an estimate of the
performance of the ensemble itself.
6. Obtain `sl3` variable importance metrics.
7. Interpret the discrete and continuous Super Learner fits.
8. Rationalize the need to remove bias from the Super Learner to make an
optimal bias–variance tradeoff for the parameter of interest.
## Motivation {-}
- A common task in data analysis is prediction --- using the observed data to
learn a function, which can be used to map new input variables into a
predicted outcome.
<!--
Oftentimes, the scientific question of interest translates to a statistical
question that requires (causal) effect estimation. Even in these scenarios,
where prediction is not at the forefront, there are often prediction steps
embedded in the procedure.
--->
- For some data, algorithms that can fit a complex function are necessary to
adequately model the data. For other data, a main terms regression model might
fit the data quite well.
- The Super Learner (SL), an ensemble learner, solves this issue by allowing a
combination of algorithms, from the simplest (intercept-only) to the most
complex (neural nets, random forests, SVMs, etc.).
- It works by using cross-validation in a manner which guarantees that the
resulting fit will be as good as possible, given the learners provided.
## Introduction {-}
In [Chapter 1](#intro), we introduced the [_Roadmap for Targeted
Learning_](#roadmap) as a general template to translate real-world data
applications into formal statistical estimation problems. The first steps of
this roadmap define the *statistical estimation problem*, which establishes:
1. Data as a random variable, or equivalently, a realization of a
particular experiment/study.
2. A statistical model as the set of possible probability distributions that
could have given rise to the observed data. We could know, for example, that
the observations are independent and identically distributed.
3. A translation of the scientific question, which is often causal, into a
target estimand.
Note that if the estimand is causal, step 3 also requires establishing
identifiability of the estimand from the observed data, under possibly
non-testable assumptions that may not necessarily be reasonable. Still, the
target quantity does have a valid statistical interpretation. See [causal
target parameters](#causal) for more detail on causal models and
identifiability.
Now that we have defined the statistical estimation problem, we are ready to
construct the TMLE: an asymptotically linear and efficient substitution
estimator of this estimand. The first step in this estimation procedure is an
initial estimate of the data-generating distribution, or of the relevant part
of this distribution that is needed to evaluate the target parameter. For this
initial estimation, we use the Super Learner (SL) [@vdl2007super].
The SL provides an important step in creating a robust estimator. It is a
loss-function-based tool that uses cross-validation to obtain the best
prediction of our target parameter, based on a weighted average of a library
of machine learning algorithms. The library of machine learning algorithms
consists of
functions ("learners" in the `sl3` nomenclature) that we think might be
consistent with the true data-generating distribution. By "consistent with the
true data-generating distribution", we mean that the algorithms selected should
not violate subject-matter knowledge about the experiment that generated the
data. Also, the library should contain a diversity of algorithms that range from
parametric regression models to multi-step algorithms involving screening
covariates, penalizations, optimizing tuning parameters, etc.
The ensembling of the collection of algorithms with weights ("metalearning" in
the `sl3` nomenclature) has been shown to be adaptive and robust, even in small
samples [@polley2010super]. The SL is proven to perform asymptotically as well
as the best possible prediction algorithm in the library [@vdl2003unified;
@vaart2006oracle].
### Why use the Super Learner? {-}
- For prediction, one can use the cross-validated risk to empirically determine
the relative performance of SL and competing methods.
- When we have tested different algorithms on actual data and looked at their
  performance (e.g., MSE of prediction), no single algorithm always wins (see
  below).
- The figure below shows the results of one such study, comparing the fits of
  several different learners, including the SL algorithms.
```{r why_sl, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/misc/ericSL.pdf")
```
- SL performs asymptotically as well as the best possible weighted combination.
- By including all competitors in the library of candidate estimators (GLMs,
  neural nets, SVMs, random forests, etc.), the SL will asymptotically
  outperform any of its competitors --- even if the set of competitors is
  allowed to grow polynomially in sample size.
- Motivates the name "Super Learner": it provides a system of combining many
estimators into an improved estimator.
### General Overview of the Algorithm {-}
#### What is cross-validation and how does it work?
- There are many different cross-validation schemes, which are designed to
accommodate different study designs, data structures, and prediction
problems.
- The figure below shows an example of $V$-fold cross-validation with $V=10$
folds.
```{r cv_fig, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/misc/vs.pdf")
```
#### General step-by-step overview of the Super Learner algorithm
- Break up the sample evenly into $V$ folds (say $V=10$).
- For each of these folds, set that portion of the sample aside (kept out as
  the validation sample); the remainder will be used to fit the learners (the
  training sample).
- Fit each learner on the training sample (note that some learners will have
  their own internal cross-validation procedure or other methods for selecting
  tuning parameters).
- For each observation in the corresponding validation sample, predict the
  outcome using each of the learners, so if there are $p$ learners, there will
  be $p$ predictions per observation.
- Repeat with the next validation sample, until each of the $V$ sets of data
  has been held out.
- Compare the cross-validated fit of the learners across all observations,
  based on the specified loss function (e.g., squared error, negative
  log-likelihood), by calculating the corresponding average loss (risk).
- Either:
  + choose the learner with the smallest risk and apply that learner to the
    entire data set (the resulting discrete SL fit), or
  + take the weighted average of the learners that minimizes the
    cross-validated risk (i.e., construct an ensemble of learners), by
    re-fitting the learners on the original data set and using the weights
    above to obtain the SL fit.
Note that this entire procedure can itself be cross-validated to obtain a
consistent estimate of the future performance of the SL fit.
```{r cv_sl_alg, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/misc/SLKaiserNew.pdf")
```
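To make these steps concrete, below is a minimal sketch of the core
computation: estimating the cross-validated risk of two simple candidate
learners with $V$-fold cross-validation, using only base `R` on simulated
data. Everything in this snippet (the data, learner choices, and names) is
illustrative; it is not how `sl3` implements the SL internally.

```{r cv-risk-sketch, eval = FALSE}
# minimal sketch: V-fold CV risk for two candidate learners, base R only
# (illustrative simulated data; not sl3 internals)
set.seed(67)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.5)

V <- 10
fold_id <- sample(rep(seq_len(V), length.out = n))

# two candidates: a main terms linear model and the intercept-only mean
candidates <- list(
  glm_cand = function(train) lm(y ~ x, data = train),
  mean_cand = function(train) lm(y ~ 1, data = train)
)

cv_risk <- sapply(candidates, function(fit_fun) {
  fold_loss <- sapply(seq_len(V), function(v) {
    train <- sim[fold_id != v, ]
    valid <- sim[fold_id == v, ]
    preds <- predict(fit_fun(train), newdata = valid)
    mean((valid$y - preds)^2) # average squared-error loss on validation fold
  })
  mean(fold_loss) # cross-validated risk: average loss across the V folds
})
cv_risk # the discrete SL picks the candidate with the smallest CV risk
```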
#### How to pick a library of candidate learners?
- A library is simply a collection of algorithms.
- The algorithms in the library should come from contextual knowledge and a
large set of "default" algorithms.
- The algorithms may range from a simple linear regression model to multi-step
algorithms involving screening covariates, penalizations, optimizing tuning
parameters, etc.
### Summary of Super Learner's Foundations {-}
- We use a *loss function* $L$ to assign a measure of performance to each
learner $\psi$ when applied to the data $O$, and subsequently compare
performance across the learners. More generally, $L$ maps every
$\psi$ to a function $L(\psi): O \mapsto L(\psi)(O)$. We use the terms
"learner", "algorithm", and "estimator" interchangeably.
+ It is important to recall that $\psi$ is an estimator of $\psi_0$, the
unknown and true parameter value under $P_0$.
+ A valid loss function will have expectation (risk) that is minimized at
the true value of the parameter $\psi_0$. Thus, minimizing the expected
loss will bring an estimator $\psi$ closer to the true $\psi_0$.
+ For example, say we observe a learning data set of
$i = 1, \ldots, n$ independent and identically distributed observations
$O_i = (Y_i, X_i)$, where $Y_i$ is a continuous outcome of interest and
$X_i$ is a set of covariates. Also, let our objective be to estimate the
function $\psi_0: X \mapsto \psi_0(X) = \mathbb{E}_0(Y \mid X)$, which is the
conditional mean outcome given the covariates. This function can be expressed
as the minimizer of the expected squared-error loss:
$\psi_0 = \text{argmin}_{\psi} \mathbb{E}[L(O, \psi(X))]$, where
$L(O, \psi(X)) = (Y - \psi(X))^2$.
+ We can estimate the loss by substituting the empirical distribution of
the data $P_n$ for the true and unknown distribution of the observed data
$P_0$.
+ Also, we can use the cross-validated risk to empirically determine the
relative performance of an estimator (i.e., a candidate learner), and
perhaps how its performance compares to that of other estimators.
+ Once we have tested different estimators on actual data and looked at the
performance (e.g., MSE of predictions across all learners), we can see
which algorithm (or weighted combination) has the lowest risk, and thus
is closest to the true $\psi_0$.
- The *cross-validated empirical risk* of an algorithm is defined as the
empirical mean over a validation sample of the loss of the algorithm fitted
on the training sample, averaged across the splits of the data.
- The *discrete Super Learner*, or *cross-validation selector*, is the algorithm
in the library that minimizes the cross-validated empirical risk.
- The *continuous/ensemble Super Learner*, often referred to simply as the
*Super Learner*, is a weighted average of the library of algorithms, where the
weights are chosen to minimize the cross-validated empirical risk of the
library.
Restricting the weights to be positive and sum to one (i.e., a convex
combination) has been shown to improve upon the discrete Super Learner
[@polley2010super; @vdl2007super]. This notion of weighted combinations was
introduced in @wolpert1992stacked for neural networks and adapted for
regressions in @breiman1996stacked.
- Cross-validation is proven to be optimal for selection among estimators. This
result was established through the oracle inequality for the cross-validation
selector among a collection of candidate estimators [@vdl2003unified;
@vaart2006oracle]. The only condition is that the loss function is uniformly
bounded, which is guaranteed in `sl3`.
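To tie these definitions together, here is a tiny sketch of the squared-error
loss and its empirical risk. The function names here are ours, for
illustration only; `sl3` provides `loss_squared_error` for this purpose.

```{r loss-risk-sketch, eval = FALSE}
# squared-error loss for observations O = (X, Y) and predictions psi(X)
loss_sq <- function(y, psi_x) (y - psi_x)^2

# empirical risk: substitute the empirical distribution P_n for the true P_0
# by averaging the loss over the observed data
empirical_risk <- function(y, psi_x) mean(loss_sq(y, psi_x))
```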
<!--
The *oracle results* prove that, if the number of algorithms in the library is
polynomial in sample size, then the cross-validation selector (i.e., the
discrete Super Learner) (1) is asymptotically equivalent to the oracle selector
(based on the sample size of the training samples), or (2) achieves the
parametric rate ($\log(n)/n$) of convergence with respect to the loss-based
dissimilarity (risk) between a candidate estimate $\psi$ and the true parameter
value $\psi_0$.
### Super Learner for Prediction {-}
Below, we show the results of
such a study, comparing the fits of several different learners, including the SL
algorithms.
r cv_fig3, results="asis", echo = FALSE
knitr::include_graphics("img/misc/ericSL.pdf")
For more detail on Super Learner we refer the reader to @vdl2007super and
@polley2010super. The optimality results for the cross-validation selector
among a family of algorithms were established in @vdl2003unified and extended
in @vaart2006oracle.
-->
## `sl3` "Microwave Dinner" Implementation {-}
We begin by illustrating the core functionality of the SL algorithm as
implemented in `sl3`.
The `sl3` implementation consists of the following steps:
0. Load the necessary libraries and data
1. Define the machine learning task
2. Make an SL by creating library of base learners and a metalearner
3. Train the SL on the machine learning task
4. Obtain predicted values
### WASH Benefits Study Example {-}
Using the WASH Benefits Bangladesh data, we are interested in predicting
weight-for-height z-score `whz` using the available covariate data. More
information on this dataset, and on all other data that we will work with, is
described in [this chapter of the `tlverse`
handbook](https://tlverse.org/tlverse-handbook/data.html). Let's begin!
### 0. Load the necessary libraries and data {-}
First, we will load the relevant `R` packages, set a seed, and load the data.
<!--
If you would like to use newer `sl3` functionality that is available in the
devel branch of the `sl3` GitHub repository, you need to install that version
of the package (i.e., `usethis::install_github(tlverse/sl3@devel)`), re-start
your `R` session, and then re-load the `sl3` package.
-->
```{r setup}
library(data.table)
library(dplyr)
library(readr)
library(ggplot2)
library(SuperLearner)
library(origami)
library(sl3)
library(knitr)
library(kableExtra)
# load data set and take a peek
washb_data <- fread(
paste0(
"https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
"wash-benefits/washb_data.csv"
),
stringsAsFactors = TRUE
)
head(washb_data) %>%
  kable() %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```
### 1. Define the machine learning task {-}
To define the machine learning **"task"** (predict weight-for-height Z-score
`whz` using the available covariate data), we need to create an `sl3_Task`
object.
The `sl3_Task` keeps track of the roles the variables play in the machine
learning problem, the data, and any metadata (e.g., observation-level
weights, IDs, offset).
Also, if we had missing outcomes, we would need to set `drop_missing_outcome =
TRUE` when we create the task. In the next analysis, with the [IST stroke trial
data](#ist), we do have a missing outcome. In the following chapter, we need to
estimate this "missingness mechanism", which is the conditional probability
that the outcome is observed, given the history (i.e., variables that were
measured before the missingness). Estimating the missingness mechanism requires
learning a prediction function that outputs the predicted probability that a
unit is missing, given their history.
```{r task, warning=TRUE}
# specify the outcome and covariates
outcome <- "whz"
covars <- colnames(washb_data)[-which(names(washb_data) == outcome)]
# create the sl3 task
washb_task <- make_sl3_Task(
data = washb_data,
covariates = covars,
outcome = outcome
)
```
*This warning is important.* The task just imputed missing covariates for us.
Specifically, for each covariate column with missing values, `sl3` uses the
median to impute missing continuous covariates, and the mode to impute binary
and categorical covariates.
Also, for each covariate column with missing values, `sl3` adds an additional
column indicating whether or not the value was imputed, which is particularly
handy when the missingness in the data might be informative.
Also, notice that we did not specify the number of folds or the loss function
in the task. The default cross-validation scheme is $V$-fold cross-validation
with $V=10$ folds.
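If the defaults do not suit the problem, the fold structure can be customized
when the task is created. For example, here is a sketch of the same task with
a 5-fold cross-validation structure (we use this same pattern later in the
chapter, when we cross-validate the SL):

```{r task-custom-folds, eval = FALSE}
# same task as above, but with a 5-fold cross-validation structure
washb_task_V5 <- make_sl3_Task(
  data = washb_data,
  covariates = covars,
  outcome = outcome,
  folds = origami::make_folds(washb_data, fold_fun = folds_vfold, V = 5)
)
```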
Let's visualize our `washb_task`:
```{r task-examine}
washb_task
```
We can't see it when we print the task, but the default cross-validation fold
structure ($V$-fold cross-validation with $V=10$ folds) was created when we
defined the task.
```{r task-folds-examine}
length(washb_task$folds) # how many folds?
head(washb_task$folds[[1]]$training_set) # row indexes for fold 1 training
head(washb_task$folds[[1]]$validation_set) # row indexes for fold 1 validation
any(
washb_task$folds[[1]]$training_set %in% washb_task$folds[[1]]$validation_set
)
```
`R6` Tip: If you type `washb_task$` and then press the "tab" button (you will
need to press "tab" twice if you're not in RStudio), you can view all of the
active and public fields and methods that can be accessed from the `washb_task`
object.
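For example, here are a couple of fields we can pull from the task this way
(a short sketch; see the `sl3_Task` documentation for the full list of fields
and methods):

```{r task-fields, eval = FALSE}
washb_task$nodes$outcome # the role assignments, here the outcome variable name
head(washb_task$Y) # the outcome vector itself
```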
### 2. Make a Super Learner {-}
Now that we have defined our machine learning problem with the `sl3_Task`, we
are ready to **"make"** the Super Learner (SL). This requires specification of
* A set of candidate machine learning algorithms, also commonly referred to as
a "library" of "learners". The set should include a diversity of algorithms
that are believed to be consistent with the true data-generating distribution.
* A metalearner, to ensemble the base learners.
We might also incorporate
* Feature selection, to pass only a subset of the predictors to the algorithm.
* Hyperparameter specification, to tune base learners.
Learners have properties that indicate what features they support. We may use
`sl3_list_properties()` to get a list of all properties supported by at least
one learner.
```{r list-properties}
sl3_list_properties()
```
Since we have a continuous outcome, we may identify the learners that support
this outcome type with `sl3_list_learners()`.
```{r list-learners}
sl3_list_learners("continuous")
```
Now that we have an idea of some learners, we can construct them using the
`make_learner` function or the `new` method.
```{r baselearners}
# choose base learners
lrn_glm <- make_learner(Lrnr_glm)
lrn_mean <- Lrnr_mean$new()
```
We can customize learner hyperparameters to incorporate a diversity of different
settings. Documentation for the learners and their hyperparameters can be found
in the [`sl3` Learners
Reference](https://tlverse.org/sl3/reference/index.html#section-sl-learners).
```{r extra-lrnr-awesome}
lrn_lasso <- make_learner(Lrnr_glmnet) # alpha default is 1
lrn_ridge <- Lrnr_glmnet$new(alpha = 0)
lrn_enet.5 <- make_learner(Lrnr_glmnet, alpha = 0.5)
lrn_polspline <- Lrnr_polspline$new()
lrn_ranger100 <- make_learner(Lrnr_ranger, num.trees = 100)
lrn_hal_faster <- Lrnr_hal9001$new(max_degree = 2, reduce_basis = 0.05)
xgb_fast <- Lrnr_xgboost$new() # default with nrounds = 20 is pretty fast
xgb_50 <- Lrnr_xgboost$new(nrounds = 50)
```
We can use `Lrnr_define_interactions` to define interaction terms among
covariates. The interactions should be supplied as a list of character vectors,
where each vector specifies an interaction. For example, we specify
interactions below between (1) `tr` (whether or not the subject received the
WASH intervention) and `elec` (whether or not the subject had electricity); and
between (2) `tr` and `hfiacat` (the subject's level of food security).
```{r interaction-learner}
interactions <- list(c("elec", "tr"), c("tr", "hfiacat"))
# main terms as well as the interactions above will be included
lrn_interaction <- make_learner(Lrnr_define_interactions, interactions)
```
What we just defined above is incomplete. In order to fit learners with these
interactions, we need to create a `Pipeline`. A `Pipeline` is a set of learners
to be fit sequentially, where the fit from one learner is used to define the
task for the next learner. We need to create a `Pipeline` with the
interaction-defining learner and another learner that incorporates these terms
when fitting a model. Let's create a learner pipeline that will fit a linear
model with the combination of main terms and interaction terms, as specified
in `lrn_interaction`.
```{r interaction-pipe}
# we already instantiated a linear model learner, no need to do that again
lrn_glm_interaction <- make_learner(Pipeline, lrn_interaction, lrn_glm)
lrn_glm_interaction
```
We can also include learners from the `SuperLearner` `R` package.
```{r extra-lrnr-woah}
lrn_bayesglm <- Lrnr_pkg_SuperLearner$new("SL.bayesglm")
```
Here is a fun trick to create customized learners over a grid of parameters.
```{r extra-lrnr-mindblown-svm, eval = FALSE}
# I like to crock pot my SLs
grid_params <- list(
cost = c(0.01, 0.1, 1, 10, 100, 1000),
gamma = c(0.001, 0.01, 0.1, 1),
kernel = c("polynomial", "radial", "sigmoid"),
degree = c(1, 2, 3)
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
svm_learners <- apply(grid, MARGIN = 1, function(tuning_params) {
do.call(Lrnr_svm$new, as.list(tuning_params))
})
```
```{r extra-lrnr-mindblown-xgboost}
grid_params <- list(
max_depth = c(2, 4, 6),
eta = c(0.001, 0.1, 0.3),
nrounds = 100
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
grid
xgb_learners <- apply(grid, MARGIN = 1, function(tuning_params) {
do.call(Lrnr_xgboost$new, as.list(tuning_params))
})
xgb_learners
```
Did you see `Lrnr_caret` when we called `sl3_list_learners("continuous")`? All
we need to specify in order to use this popular algorithm as a candidate in our
SL is the `algorithm` we want to tune, which is passed as `method` to
`caret::train()`. The default parameter-selection method is set to `"CV"`
instead of the `caret::train()` default `"boot"`. The summary metric used to
select the optimal model is `RMSE` for continuous outcomes and `Accuracy` for
categorical and binomial outcomes.
```{r carotene, eval = FALSE}
# Unlike xgboost, I have no idea how to tune a neural net or BART machine, so
# I let caret take the reins
lrnr_caret_nnet <- make_learner(Lrnr_caret, algorithm = "nnet")
lrnr_caret_bartMachine <- make_learner(Lrnr_caret,
algorithm = "bartMachine",
method = "boot", metric = "Accuracy",
tuneLength = 10
)
```
In order to assemble the library of learners, we need to **"stack"** them
together.
A `Stack` is a special learner and it has the same interface as all other
learners. What makes a stack special is that it combines multiple learners by
training them simultaneously, so that their predictions can be either combined
or compared.
```{r stack}
stack <- make_learner(
Stack, lrn_glm, lrn_polspline, lrn_enet.5, lrn_ridge, lrn_lasso, xgb_50
)
stack
```
We can also stack the learners by first creating a vector, and then
instantiating the stack. I prefer this method, since it easily allows us to
modify the names of the learners.
```{r alt-stack}
# named vector of learners first
learners <- c(
lrn_glm, lrn_polspline, lrn_enet.5, lrn_ridge, lrn_lasso, xgb_50
)
names(learners) <- c(
"glm", "polspline", "enet.5", "ridge", "lasso", "xgboost50"
)
# next make the stack
stack <- make_learner(Stack, learners)
# now the names are pretty
stack
```
We're jumping ahead a bit, but let's check something out quickly. It takes
just one more straightforward step to set up this stack so that all of the
learners will be trained in a cross-validated manner.
```{r alt-stack-cv}
cv_stack <- Lrnr_cv$new(stack)
cv_stack
```
#### Screening Algorithms for Feature Selection {-}
We can optionally select a subset of available covariates and pass only
those variables to the modeling algorithm. The current set of learners that
can be used for prescreening covariates is included below.
- `Lrnr_screener_importance` selects `num_screen` (default = 5) covariates
  based on the variable importance ranking provided by the `learner`. Any
  learner with an "importance" method can be used in
  `Lrnr_screener_importance`; this currently includes `Lrnr_ranger`,
  `Lrnr_randomForest`, and `Lrnr_xgboost`.
- `Lrnr_screener_coefs`, which provides screening of covariates based on the
  magnitude of their estimated coefficients in a (possibly regularized) GLM.
  The `threshold` (default = 1e-3) defines the minimum absolute coefficient
  size for a covariate to be kept. Also, a `max_retain` argument can
  optionally be provided to restrict the number of selected covariates to be
  no more than `max_retain`.
- `Lrnr_screener_correlation` provides covariate screening procedures by
  running a test of correlation (Pearson default), and then selecting either
  (1) the top-ranked variables (default), or (2) the variables with a p-value
  lower than some pre-specified threshold.
- `Lrnr_screener_augment` augments a set of screened covariates with additional
covariates that should be included by default, even if the screener did not
select them. An example of how to use this screener is included below.
Let's consider screening covariates based on their random forest variable
importance ranking (here, ranked by `ranger`'s corrected impurity importance).
To select the top 5 most important covariates according to this ranking, we
can combine `Lrnr_screener_importance` with `Lrnr_ranger` (limiting the number
of trees by setting `num.trees = 20`).
Hang on! Before you think it -- I will confess: Bob Ross and I both know that 20
trees makes for a lonely forest, and I shouldn't consider it, but these are the
sacrifices I have to make for this chapter to build in time!
```{r screener}
miniforest <- Lrnr_ranger$new(
num.trees = 20, write.forest = FALSE,
importance = "impurity_corrected"
)
# learner must already be instantiated, we did this when we created miniforest
screen_rf <- Lrnr_screener_importance$new(learner = miniforest, num_screen = 5)
screen_rf
# which covariates are selected on the full data?
screen_rf$train(washb_task)
```
An example of how to format `Lrnr_screener_augment` is included below for
clarity.
```{r screener-augment}
keepme <- c("aged", "momage")
# screener must already be instantiated, we did this when we created screen_rf
screen_augment_rf <- Lrnr_screener_augment$new(
screener = screen_rf, default_covariates = keepme
)
screen_augment_rf
```
Selecting covariates with non-zero lasso coefficients is quite common. Let's
construct a `Lrnr_screener_coefs` screener that does just that, and test it
out.
```{r screener-coefs}
# we already instantiated a lasso learner above, no need to do it again
screen_lasso <- Lrnr_screener_coefs$new(learner = lrn_lasso, threshold = 0)
screen_lasso
```
To **"pipe"** only the selected covariates to the modeling algorithm, we need to
make a `Pipeline`, similar to the one we built for the regression model with
interaction terms.
```{r screener-pipe}
screen_rf_pipe <- make_learner(Pipeline, screen_rf, stack)
screen_lasso_pipe <- make_learner(Pipeline, screen_lasso, stack)
```
Now, these learners, which have no internal screening, will be preceded by a
screening step.
We also consider the original `stack`, to compare how the feature selection
methods perform against the methods without feature selection.
Analogous to what we have seen before, we have to stack the pipelines and the
original `stack` together, so we may use them as base learners in our SL.
```{r screeners-stack}
# pretty names again
learners2 <- c(learners, screen_rf_pipe, screen_lasso_pipe)
names(learners2) <- c(names(learners), "randomforest_screen", "lasso_screen")
fancy_stack <- make_learner(Stack, learners2)
fancy_stack
```
We will use the [default
metalearner](https://tlverse.org/sl3/reference/default_metalearner.html),
which uses
[`Lrnr_solnp`](https://tlverse.org/sl3/reference/Lrnr_solnp.html) to
provide fitting procedures for a pairing of [loss
function](https://tlverse.org/sl3/reference/loss_functions.html) and
[metalearner
function](https://tlverse.org/sl3/reference/metalearners.html). This
default metalearner selects a loss and metalearner pairing based on the outcome
type. Note that any learner can be used as a metalearner.
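Since any learner can serve as the metalearner, we could also specify one
explicitly instead of relying on the default. For example, here is a sketch
using non-negative least squares (`Lrnr_nnls`, which ships with `sl3`) as the
metalearner:

```{r make-sl-nnls, eval = FALSE}
# a sketch: SL with an explicitly specified NNLS metalearner
sl_nnls <- Lrnr_sl$new(
  learners = fancy_stack,
  metalearner = make_learner(Lrnr_nnls)
)
```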
Now that we have made a diverse stack of base learners, we are ready to make the
SL. The SL algorithm fits a metalearner on the validation set
predictions/losses across all folds.
```{r make-sl}
sl <- make_learner(Lrnr_sl, learners = fancy_stack)
```
We can also use `Lrnr_cv` to build a SL, cross-validate a stack of
learners to compare performance of the learners in the stack, or cross-validate
any single learner (see "Cross-validation" section of this [`sl3`
introductory tutorial](https://tlverse.org/sl3/articles/intro_sl3.html)).
Furthermore, we can [Define New `sl3`
Learners](https://tlverse.org/sl3/articles/custom_lrnrs.html) which can be used
in all the places you could otherwise use any other `sl3` learners, including
`Pipelines`, `Stacks`, and the SL.
Recall that the discrete SL, or cross-validation selector, is a metalearner
that assigns a weight of 1 to the learner with the lowest cross-validated
empirical risk, and a weight of 0 to all other learners. This metalearner
specification can be invoked with `Lrnr_cv_selector`.
```{r make-sl-discrete}
discrete_sl_metalrn <- Lrnr_cv_selector$new()
discrete_sl <- Lrnr_sl$new(
learners = fancy_stack,
metalearner = discrete_sl_metalrn
)
```
<!--
In the plot below, we visualize the steps for executing the Super Learner in
the `tlverse/delayed` framework. For those like myself who are not particularly
keen on understanding the intricacies of `delayed`, let's focus on the main
point of this figure: we can see there are 10 realizations of the stack which
represent the 10 cross-validation folds and there is a separate hold-out
(top branch of the figure) that will not be used to fit the Super Learner.
```{r make-sl-plot}
dt_sl <- delayed_learner_train(sl, washb_task)
plot(dt_sl, color = FALSE, height = "400px", width = "90%")
```
--->
### 3. Train the Super Learner on the machine learning task {-}
The SL algorithm fits a metalearner on the validation-set predictions in a
cross-validated manner, thereby avoiding overfitting.
Now we are ready to **"train"** our SL on our `sl3_task` object, `washb_task`.
```{r sl}
set.seed(4197)
sl_fit <- sl$train(washb_task)
```
### 4. Obtain predicted values {-}
Now that we have fit the SL, we are ready to calculate the predicted outcome
for each subject.
```{r sl-predictions}
# we did it! now we have SL predictions
sl_preds <- sl_fit$predict()
head(sl_preds)
```
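As a quick sanity check, we might compare these predictions to the observed
outcomes, e.g., via the in-sample mean squared error. This is just a sketch;
note that such a re-substitution estimate is optimistic, for the reasons
discussed below.

```{r sl-preds-mse, eval = FALSE}
# re-substitution (in-sample) MSE of the SL predictions; optimistic, since
# the same data were used to fit the SL
mean((washb_data$whz - sl_preds)^2, na.rm = TRUE)
```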
<!--
Below we visualize the observed versus predicted values.
For fun, we will also include the cross-validated predictions from most popular
learner on the block, main terms linear regression.
```{r, plot-predvobs-woohoo, eval=FALSE}
# df_plot <- data.frame(Observed = washb_data[["whz"]], Predicted = sl_preds,
# count = seq(1:nrow(washb_data))
# df_plot_melted <- melt(df_plot, id.vars = "count",
# measure.vars = c("Observed", "Predicted"))
# ggplot(df_plot_melted, aes(value, count, color = variable)) + geom_point()
```
-->
We can also obtain a summary of the results.
```{r, sl-summary}
sl_fit_summary <- sl_fit$print()
```
From the table of the printed SL fit, we note that the SL had a mean risk of `r
round(sl_fit_summary$risk[length(sl_fit_summary$learner)], 4)` and that
this ensemble weighted the `ranger` and `glmnet` learners highest while not
weighting the `mean` learner highly.
We can also see that the `glmnet` learner had the lowest cross-validated mean
risk, thus making it the cross-validation selector (i.e., the _discrete_ SL).
The mean risk of the SL is calculated using all of the data, and not a separate
hold-out sample, so the SL's mean risk reported here is likely an
underestimate.
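The same summary can also be pulled out programmatically as a table of
cross-validated risk estimates. A sketch, assuming the fit's `cv_risk` method
is available in your version of `sl3`:

```{r sl-cv-risk-table, eval = FALSE}
# table of cross-validated risk estimates for each learner and the SL
cv_risk_table <- sl_fit$cv_risk(loss_squared_error)
cv_risk_table
```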
## Cross-validated Super Learner {-}
We can cross-validate the SL to see how well the SL performs on unseen data, and
obtain an estimate of the cross-validated risk of the SL.
This estimation procedure requires an "outer/external" layer of
cross-validation, also called nested cross-validation, which involves setting
aside a separate holdout sample that we don't use to fit the SL. This external
cross-validation procedure could also use 10 folds (the default in `sl3`);
however, we use 2 outer/external folds here for computational efficiency.
We also need to specify a loss function to evaluate the SL. Documentation for the
available loss functions can be found in the [`sl3` Loss Function
Reference](https://tlverse.org/sl3/reference/loss_functions.html).
```{r CVsl}
washb_task_new <- make_sl3_Task(
data = washb_data,
covariates = covars,
outcome = outcome,
folds = origami::make_folds(washb_data, fold_fun = folds_vfold, V = 2)
)
CVsl <- CV_lrnr_sl(
lrnr_sl = sl_fit, task = washb_task_new, loss_fun = loss_squared_error
)
CVsl %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```
<!-- Explain summary!!!! -->
## Variable Importance Measures with `sl3` {-}
Variable importance can be interesting and informative. It can also be
contradictory and confusing. Nevertheless, we like it, and so do our
collaborators, so we created a variable importance function in `sl3`! The `sl3`
`importance` function returns a table with variables listed in decreasing order
of importance (i.e., most important on the first row).
The measure of importance in `sl3` is based on a risk ratio, or risk difference,
between the learner fit with a removed, or permuted, covariate and the learner
fit with the true covariate, across all covariates. In this manner, the larger
the risk difference, the more important the variable is in the prediction.
The intuition of this measure is that it calculates the risk (in terms of the
average loss in predictive accuracy) of losing one covariate, while keeping
everything else fixed, and compares it to the risk if the covariate was not
lost. If this risk ratio is one, or risk difference is zero, then losing that
covariate had no impact, and is thus not important by this measure. We do this
across all of the covariates. As stated above, we can remove the covariate and
refit the SL without it, or we just permute the covariate (faster, more risky)
and hope for the shuffling to distort any meaningful information that was
present in the covariate. This idea of permuting instead of removing saves a lot
of time, and is also incorporated in the `randomForest` variable importance
measures. However, the permutation approach is risky, so the importance function
default is to remove and refit.
Let's explore the `sl3` variable importance measurements for the `washb` data.
```{r varimp}
washb_varimp <- importance(sl_fit, loss = loss_squared_error, type = "permute")
washb_varimp %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```
```{r varimp-plot, out.width = "100%"}
# plot variable importance
importance_plot(
washb_varimp,
main = "sl3 Variable Importance for WASH Benefits Example Data"
)
```
<!-- Explain summary!!!! -->
## Exercises {#sl3-exercises}
### Predicting Myocardial Infarction with `sl3` {#sl3ex1}
Follow the steps below to predict myocardial infarction (`mi`) using the
available covariate data. We thank Prof. David Benkeser at Emory University for
making this Cardiovascular Health Study (CHS) data accessible.
```{r ex-setup}
# load the data set
db_data <- url(
paste0(
"https://raw.githubusercontent.com/benkeser/sllecture/master/",
"chspred.csv"
)
)
chspred <- read_csv(file = db_data, col_names = TRUE)
# take a quick peek
head(chspred) %>%
  kable(digits = 4) %>%
  kableExtra::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```
1. Create an `sl3` task, setting myocardial infarction `mi` as the outcome and
using all available covariate data.
2. Make a library of seven relatively fast base learning algorithms (i.e., do
not consider BART or HAL). Customize hyperparameters for one of your
learners. Feel free to use learners from `sl3` or `SuperLearner`. You may
use the same base learning library that is presented above.
3. Incorporate at least one pipeline with feature selection. Any screener and
learner(s) can be used.
4. Fit the metalearning step with the default metalearner.
5. With the metalearner and base learners, make the Super Learner (SL) and
train it on the task.
6. Print your SL fit by calling its `print()` method with `$`.
7. Cross-validate your SL fit to see how well it performs on unseen
data. Specify a valid loss function to evaluate the SL.
8. Use the `importance()` function to identify the "most important" predictor of
myocardial infarction, according to `sl3` importance metrics.
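If you get stuck on the first step, a minimal task definition might look like
the following sketch, which mirrors the WASH Benefits example above:

```{r ex1-task-hint, eval = FALSE}
# step 1 help (sketch)
chspred_task <- make_sl3_Task(
  data = chspred,
  covariates = colnames(chspred)[-which(names(chspred) == "mi")],
  outcome = "mi"
)
```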
### Predicting Recurrent Ischemic Stroke in an RCT with `sl3` {#sl3ex2}
For this exercise, we will work with a random sample of 5,000 patients who
participated in the International Stroke Trial (IST). This data is described in
[Chapter 3.2 of the `tlverse`
handbook](https://tlverse.org/tlverse-handbook/data.html#ist).
1. Train a SL to predict recurrent stroke `DRSISC` with the available covariate
data (the 25 other variables). Of course, you can consider feature selection
in the machine learning algorithms. In this data, the outcome is
occasionally missing, so be sure to specify `drop_missing_outcome = TRUE`
when defining the task.
2. Use the SL-based predictions to calculate the area under the ROC curve (AUC).
3. Calculate the cross-validated AUC to evaluate the performance of the
SL on unseen data.
4. Which covariates are the most predictive of 14-day recurrent stroke,
according to `sl3` variable importance measures?
```{r ex-setup2}
ist_data <- paste0(
"https://raw.githubusercontent.com/tlverse/",
"tlverse-handbook/master/data/ist_sample.csv"
) %>% fread()
# number 3 help
ist_task_CVsl <- make_sl3_Task(
data = ist_data,
outcome = "DRSISC",
covariates = colnames(ist_data)[-which(names(ist_data) == "DRSISC")],
drop_missing_outcome = TRUE,
folds = origami::make_folds(
n = sum(!is.na(ist_data$DRSISC)),
fold_fun = folds_vfold,
V = 5
)
)
```
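For number 2, one option is to compute the AUC from the SL predictions with
the `ROCR` package. In the sketch below, `ist_sl_fit` and `ist_task` are
hypothetical objects standing in for the SL fit and task you create in
number 1:

```{r ex2-auc-hint, eval = FALSE}
# number 2 help (sketch): AUC of SL predictions via ROCR
# ist_sl_fit and ist_task are assumed to come from your answer to number 1
library(ROCR)
preds <- ist_sl_fit$predict()
rocr_pred <- ROCR::prediction(predictions = preds, labels = ist_task$Y)
ROCR::performance(rocr_pred, measure = "auc")@y.values[[1]]
```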
<!--
## Super Learning of a Conditional Density
### Super learning of a conditional density
Suppose we want to construct a Super Learner of the conditional probability
distribution $g_0(a\mid W)=P_0(A=a\mid W)$, where $a\in {\cal A}$.
Let's denote the values of $a$ with $\{0,1,\ldots,K\}$. A valid loss function
for the conditional density is
\[
L(g)(O)=-\log g(A\mid W).\]
That is, $g_0=\arg\min_g P_0L(g)$, i.e., $g_0$ is the minimizer of the
expectation of the log-likelihood loss.
**Candidate estimators**
1. Candidate estimators based on multinomial logistic regression: To start
with, one can use existing parametric-model-based MLE and machine learning
algorithms in `R` that fit a multinomial regression. For example, parametric
multinomial logistic regression is available in `R`, so one can already build
a rich library of such estimators based on different candidate parametric
models. In addition, `polyclass()` is a multinomial logistic regression
machine learning algorithm in `R`.
2. Candidate estimators based on machine learning for multinomial logistic
regression: Secondly, one can use a machine learning algorithm such as
`polyclass()` in `R` that data adaptively fits a multinomial logistic
regression, which itself has tuning parameters, again generating a class of
candidate estimators.
3. Incorporating screening: Note that one can also marry any of these choices
with a screening algorithm, thereby creating more candidate estimators of
interest. The screening can be particularly important when there are many
variables.
4. Candidate estimators by fitting separate logistic regressions and using
post-normalization
* Code $A$ in terms of Bernoullis $B_k=I(A=k)$, $k=0,\ldots,K$.
* Construct an estimator $\bar{g}_{nk}$ of $\bar{g}_{0k}(W)\equiv P_0(B_k=1\mid
W)$ using any of the logistic regression algorithms, for all $k=0,\ldots,K$.
* This implies an estimator
\[
g_n(a\mid W)=\frac{\bar{g}_{na}(W)}{\sum_{k=0}^K \bar{g}_{nk}(W)}.\]
* In other words, we simply normalize these separate logistic regression
estimators so that we obtain a valid conditional distribution.
* This generates an enormous amount of interesting algorithms, since we have
available the whole machine learning literature for binary outcome regression.
5. Candidate estimators by estimating the conditional "hazard" with pooled
logistic regression.
Note that
\[
g_0(a\mid W)=\lambda_0(a\mid W) S_0(a\mid W),\]
where \[
\lambda_0(a\mid W)=P_0(A=a\mid A\geq a,W),\]
and $S_0(a\mid W)=\prod_{s<a}(1-\lambda_0(s\mid W))$ is the conditional
survival function $P_0(A\geq a\mid W)$. So we have now parameterized the
conditional distribution of $A$, given $W$, by a conditional hazard
$\lambda_0(a\mid W)$: $g_0=g_{\lambda_0}$.
* We could now focus on constructing candidate estimators of
$\lambda_0(a\mid W)$, which implies candidate estimators of $g_0$.
* For every observation $A_i$, we can create $A_i+1$ rows of data
$(W,s,I(A_i=s))$, $s=0,\ldots,A_i$, $i=1,\ldots,n$. We now run a logistic
regression estimator based on the pooled data set, ignoring ID, where we
regress the binary outcome $I(A_i=s)$ on the covariates $(W,s)$.
* If one assumes a parametric model, then this is nothing else than using the
maximum likelihood estimator, demonstrating that ignoring the ID is not
inefficient.
* This defines now an estimator of $\lambda_0(s\mid W)=P_0(A=s\mid W,A\geq s)$
as a function of $(s,W)$.
* Different choices of logistic regression based estimators will define
different estimators.
* The pooling across $s$ is not very sensible if $A$ is not an ordered variable