06-sl3.Rmd

# Super (Machine) Learning {#sl3}

_Ivana Malenica_ and _Rachael Phillips_

Based on the [`sl3` `R` package](https://github.com/tlverse/sl3) by _Jeremy
Coyle, Nima Hejazi, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin_.

Updated: `r Sys.Date()`

## Learning Objectives {-}

By the end of this lesson you will be able to: 

- Ensemble a combination of algorithms into an improved "super learner" 
  algorithm using the `sl3` `R` package, and explain why this combination is 
  better than selecting a single algorithm to colleagues.

## Motivation {-}

- A common task in data analysis is prediction --- using the observed data 
  (input variables and outcomes) to learn a function that can map new input 
  variables into a predicted outcome.
  <!--
  Oftentimes, the scientific question of interest translates to a statistical
  question that requires (causal) effect estimation. Even in these scenarios,
  where prediction is not in the forefront, there are often prediction steps
  embedded in the procedure.
  --->
- For some data, algorithms that learn complex relationships between variables 
  are necessary to adequately model the data. For other data, main terms 
  regression models might fit the data quite well. 
- It is generally impossible to know a priori which algorithm will be the best 
  for a given data set and prediction problem. 
- The super learner solves this issue of algorithm selection by creating an
  ensemble of many algorithms, from the simplest (intercept-only) to most 
  complex (neural nets, tree-based methods, support vector machines, etc.).
- Super learner works by using cross-validation in a manner that theoretically
  (in large samples) guarantees the resulting fit will be as good as possible, 
  given the algorithms provided.

## Introduction 

In [Chapter 1](#intro), we introduced the [_Roadmap for Targeted
Learning_](#roadmap) as a general template to translate real-world data
applications into formal statistical estimation problems. The first steps of
this roadmap define the *statistical estimation problem*, which establish

1. **The data \$O$ as a random variable, or equivalently, a realization of a** 
   **particular experiment/study, which has probability distribution $P_0$.** 
   This is written $O \sim P_0$, and $P_0$ is also commonly referred to as the 
   data-generating process (DGP) and also the data-generating distribution 
   (DGD). The data structure $O$ is comprised of variables, such as a 
   vector of covariates $W$, a treatment or exposure $A$, and an outcome $Y$, 
   $O=(W,A,Y) \sim P_0$. The observation $O$ represents a *random variable*; 
   when we repeat the common experiment $n$ times, we "realize" this random 
   variable $O$ a total of $n$ times. For example, $O_1,\ldots, O_n$ 
   random variables could be the result of a random sample of $n$ subjects from 
   a population, where we collect/realize/observe/measure the following 
   variables: baseline characteristics $W$, treatment $A$ (and let's pretend 
   treatment was assigned based on a subset of $W$, say age and sex), and a 
   post-treatment outcome $Y$.
2. **A statistical model $\M$ as a set of possible probability distributions** 
   **that could have given rise to the data.** It's essential for $\M$ to only 
   constrained by factual, subject-matter knowledge in order to guarantee $P_0$ 
   resides in the statistical model, written $P_0 \in \M$. Continuing 
   the example from step 1, the following restrictions could be placed on the 
   statistical model: the $O_1, \ldots, O_n$ observations in the data are 
   *independent and identically distributed* (i.i.d.), and the assignment of 
   treatment $A$ depended on a subset of $W$ variables: age and sex. 
3. **A translation of the scientific question of interest into a function of** 
   **$P_0$, the target statistical estimand $\Psi(P_0)$.** For example, we might 
   be interested in the average difference in mean outcomes under treatment 
   $A=1$ versus placebo $A=0$, which corresponds to the target statistical 
   estimand $\Psi(P_0)=E_{P_0}\Big[E_{P_0}(Y|A=1,W)−E_{P_0}(Y|A=0,W)\Big]$. Note 
   that, if the scientific question is causal, then it's translation will 
   produce a target *causal* estimand; another layer of translation, 
   identifiability, is required to express the target causal estimand as a 
   function of the observed data distribution $P_0$. See [causal target 
   parameters](#causal) for more information on causal quantities, causal models 
   and identifiability.
Note that if the target estimand is causal, step 3 also requires establishing 
so-called "identifiability" of this estimand from the observed data. See [causal 
target parameters](#causal) for more detail on causal models and identifiability.

After finalizing steps 1--3 above, the estimation procedure can be specified. 
We advocate for the use of the super learner (SL) algorithm in the estimation 
procedure it is flexible and grounded in optimality theory [@vdl2007super].  

### Why use the super learner? {-}

- It offers a system for combining/ensembling many machine learning (ML) 
  algorithms into an improved ML algorithm. 
- In large samples, SL is proven to perform at least as well as the unknown 
  best candidate ML algorithm [@vdl2003unified; @vaart2006oracle]. 
- When we have tested different ML algorithms on actual data and looked at the 
  performance, never does one algorithm always win (see below).

```{r why_sl, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/png/ericSL.png")
```
The figure above shows the performance of several different ML algorithms, 
including the SL, across many datasets. Here, the performance was assessed with 
the relative mean squared error and the target estimand that was being estimated 
for all datasets was the conditional mean outcome, given covariates 
[@polley2010super].

### General overview of the super learner algorithm

- SL uses cross-validation and an objective function (e.g., loss function) to 
  optimize the fit of the target parameter, based on a weighted combination of a 
  so-called "library" of candidate ML algorithms. 
- The library of ML algorithms consists of functions ("learners" in the SL 
  nomenclature). These functions should align with the statistical model, both 
  in terms of it's vastness and any potential constraints, where the 
  statistical model's constraints are restrictions based on subject-matter 
  knowledge regarding the process that generated the data.
- The so-called "metalearning", or ensembling of the library of ML algorithms 
  has been shown to be adaptive and robust, even in small 
  samples [@polley2010super]. 

#### Cross-validation

- There are many different cross-validation schemes, which are designed to
  accommodate different study designs, data structures, and prediction
  problems. See [cross-validation](#origami) for more detail.

```{r cv_fig, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/png/vs.png")
```
The figure above shows an example of $V$-fold cross-validation with $V=10$ folds, 
and this is the default cross-validation structure in the `sl3` `R` 
package. The darker boxes represent the so-called "validation data" and the 
lighter boxes represent the so-called "training data". The following details 
are important to notice: 

- Across all folds, there are $V$ copies of the dataset. The only difference 
  between each copy is the coloring, which distinguishes the subset of the data 
  that's considered as the training data from the subset that's considered as 
  the validation data. 
- Within each fold $1/V$ of the data is the validation data.
- Across all folds, all of the data will be considered as validation data and 
  no observation will be included twice as validation data. Therefore, the 
  total number of validation data observations across all of the folds is 
  equal to the total number of observations in the data.
  
#### Procedure with V-fold cross-validation

1. Fit each learner (say there are $K$ learners) on the whole dataset. We refer
   to these learners that are trained on the whole dataset as "full-fit" 
   learners. 
2. Break up the data evenly into $V$ disjoint subsets. Separately, create 
   $V$ copies of the data. For each copy $v$, where $v=1,\ldots,V$, create the 
   $V$ folds by labelling the portion of the data that was included in subset 
   $v$ as the validation sample, and the labelling what's remaining of the data 
   as the training sample. 
3. For each fold $v$, $v=1,\ldots,V$, fit each learner (say there are $K$ 
   learners) on the training sample and predict the validation sample outcomes 
   by providing each fitted learner with the validation sample covariates as 
   input. Notice that each learner will be fit $V$ times. We refer to these 
   learners that are trained across the $V$ cross-validation folds as 
   "cross-validated fit" learners. 
4. Combine the validation sample predictions from all folds and all learners to 
   create the so-called $K$ column matrix of "cross-validated predictions". 
   This matrix is also commonly referred to as the $Z$ matrix. Notice that it 
   contains, for each learner, out-of-sample predictions for all of the 
   observations in the data. 
5. Train the metalearner (e.g., a non-negative least squares regression) on 
   data with predictors and outcomes being the $Z$ matrix and the observed data 
   outcomes, respectively. The metalearner --- just like any ordinary ML 
   algorithm --- estimates the parameters of it's model using the training data 
   and afterwards, the fitted model can be used to obtain predicted outcomes 
   from new input data. What's special about the metalearner is that it's 
   estimated model parameters (e.g., regression coefficients) correspond to 
   it's predictors, which are the variables in the $Z$ matrix, the $K$ learners' 
   predictions. Once the metalearner is fit, it can be used to obtain predicted 
   outcomes from new input data; that is, new $K$ learners predictions' can be 
   supplied to the fitted metalearner in order to obtain predicted outcomes.
6. The fitted metalearner and the full-fit learners define the weighted 
   combination of the $K$ learners, finalizing the super learner (SL) fit. To 
   obtain SL predictions the full-fit learners' predictions are first obtained 
   and then fed as input to the fitted metalearner; the metalearner's output 
   is the SL predictions. 

```{r cv_sl_alg, fig.show="hold", echo = FALSE}
knitr::include_graphics("img/png/SLKaiserNew.png")
```

#### How to pick the library of candidate ML algorithms?

- The library of candidate ML algorithms should be chosen based on contextual 
  knowledge regarding the study/experiment that generated the data, and on the 
  information available in the data.  
- Having a "go-to" library to use as a "default" when the sample size is 
  relatively large can be useful in practice. 
- The algorithms may range from a simple linear regression model to multi-step
  algorithms involving screening covariates, penalizations, optimizing tuning
  parameters, etc.
  
### Theoretical foundations

For more detail on super learner algorithm we refer the reader to 
@polley2010super and @vdl2007super. The optimality results for the 
cross-validation selector among a family of algorithms were established in 
@vdl2003unified and extended in @vaart2006oracle.

<!--
### Summary of Super Learner's Foundations {-}

- We use the terms "learner", "algorithm", and "estimator" interchangeably.
- We introduce the following notation:
    + Let $O_1,\ldots, O_n$ be $n$ independent and identically distributed 
      (i.i.d.) observations of $O \sim P_0$, where $P_0$ is known to be an 
      element of a statistical model $\M$, $P_0 \in \M$.
    + Let $\psi_0$ denote the unknown and true prediction function, which is 
      defined with respect to unknown and true $P_0$ that generated the data.
    + Let $\psi$ denote a candidate prediction function in $\Psi$, the space of 
      candidate prediction functions $\Psi = {\psi(P): P \in \M}$.
    + Let $\hat{\psi}_k=\psi_k(P_n)\in \Psi, k=1,\ldots,K(n)$ denote a 
      a collection of estimators (where the number of estimators considered 
      $K(n)$ is defined with respect to the number of i.i.d. observations 
      $n$) of the prediction function of interest that can be learned from the 
      data, and thus are defined with respect to $P_n$, the empirical 
      distribution of $O_1,\ldots, O_n$. Recall from [step 1 in the Roadmap 
      chapter](#intro-roadmap) that $P_n$ is an approximation of $P_0$.  

- A *loss function* $L$ is a mapping of (i.e., takes as input) a candidate 
  prediction function $\psi$ and observation $O$ into (i.e., and outputs) a real 
  number $\mathbb{R}$, $(O,\psi) \mapsto $L(O,\psi) \in \mathbb{R}$. 
- A *valid loss function* will have expectation (risk) that is minimized at 
  $\psi_0$; that is, $\psi_0=\argmin_{\psi\in\Psi}E_0L(O,\psi)$. Oftentimes, 
  this criteria defines a class of valid loss functions.
- We use the *loss-based dissimilarity* as a means to measure the difference
  in the performance of a candidate prediction function relative to the true 
  prediction function, $d_L(\psi,\psi_0)=E_0L(O,\psi)-E_0L(O,\psi_0)$. Notice 
  that $d_L(\psi,\psi_0) = 0$ if an only if $\psi = \psi_0$, and that the 
  loss-based dissimilarity is the risk difference between $\psi$ and $\psi_0$
  with respect to the loss. 
- The so-called *"oracle selector"* is an optimal benchmark selector that 
  selects the estimator $\hat{\psi}_k$ that minimizes the dissimilarity to 
  $\psi_0$, 
  $\psi_0=\argmin_{k}d_n(\hat{\psi}_k,\psi_0)=\argmin_{k}E_0L(O,\hat{\psi}_k)$.
  Notice that the oracle selector also minimizes the true risk. Also, since 
  the oracle selector depends on the unknown $P_0$ it cannot be known in 
  practice. 
- The super learner optimizes it's estimation of $\psi_0$ with respect to  
  the loss-based dissimilarity. For example, the squared error loss implies the
  following dissimilarity that a SL would aim to minimize: 
  $d_L(\psi,\psi_0) = E_0(\psi(X)-\psi_0(X))^2$, for some input covariate data 
  $X$. 
- Among the class of valid loss functions available for each $\psi_0$, the loss 
  function that should be chosen is the one that implies a dissimilarity 
  measure that identifies the desired measure of performance. 
- We often use an expected loss function as our objective function to assign a 
  measure of performance to each fitted learner $\hat{\psi}_k$ when applied to 
  the data $O$, and subsequently compare performance across the learners. 
- The *cross-validated empirical risk* with respect to the loss-based 
  dissimilarity of an algorithm is defined as the empirical mean over a 
  validation sample of the loss of the algorithm fitted on the training sample, 
  averaged across the splits of the data.


  Thus, minimizing the expected loss will bring an estimator $\psi_n$ closer to the true $\psi_0$.
    + For example, say we observe a learning data set $O_i=(Y_i,X_i)$, of
      $i=1, \ldots, n$ independent and identically distributed observations,
      where $Y_i$ is a continuous outcome of interest, $X_i$ is a set of
      covariates. Also, let our goal be to estimate the function
      $\psi_0: X \mapsto \psi_0(X) = \mathbb{E}_0(Y \mid X)$, which is the
      conditional mean outcome, given covariates. This function can be expressed
      as the minimizer of the expected squared error loss:
      $\psi_0 = \text{argmin}_{\psi} \mathbb{E}[L(O,\psi(X))]$, where
      $L(O,\psi(X)) = (Y − \psi(X))^2$.
    + We can estimate the loss by substituting the empirical distribution of
      the data $P_n$ for the true and unknown distribution of the observed data
      $P_0$.
    + Also, we can use the cross-validated risk to empirically determine the
      relative performance of an estimator (i.e., a candidate learner), and
      perhaps how it's performance compares to other estimators.
    + Once we have tested different estimators on actual data and looked at the
      performance (e.g., MSE of predictions across all learners), we can see
      which algorithm (or weighted combination) has the lowest risk, and thus
      is closest to the true $\psi_0$.
      - Occasionally, loss functions are indexed by unknown nuisance parameters 
        $\nu_0$, $L(O,\psi \mid \nu_0)$, which are defined with respect to $P_0$ and 
        thus will need to be estimated from the data. An example of such a loss is 
        the augmented inverse probability of treatment-weighted loss, where the 
        nuisance parameter is $\nu_0=P_0(A|W)$, the conditional probability 
        distribution of treatment $A$, given covariates $W$.
- The *cross-validated empirical risk* of an algorithm is defined as the
  empirical mean over a validation sample of the loss of the algorithm fitted
  on the training sample, averaged across the splits of the data.
- The *discrete Super Learner*, or *cross-validation selector*, is the algorithm
  in the library that minimizes the cross-validated empirical risk.
- The *continuous/ensemble Super Learner*, often referred to as *Super Learner*
  is a weighted average of the library of algorithms, where the weights are
  chosen to minimize the cross-validated empirical risk of the library.
  Restricting the weights to be positive and sum to one (i.e., a convex
  combination) has been shown to improve upon the discrete Super Learner
  [@polley2010super; @vdl2007super]. This notion of weighted combinations was
  introduced in @wolpert1992stacked for neural networks and adapted for
  regressions in @breiman1996stacked. 
- Cross-validation is proven to be optimal for selection among estimators. This
  result was established through the oracle inequality for the cross-validation
  selector among a collection of candidate estimators [@vdl2003unified;
  @vaart2006oracle]. The only condition is that loss function is uniformly
  bounded, which is guaranteed in `sl3`.


The *oracle results* prove that, if the number of algorithms in the library are
polynomial in sample size, then the cross-validation selector (i.e., discrete
Super Learner) (1) is equivalent with the oracle selector asymptotically (based
on sample of size of training samples), or (2) achieves the parametric rate
(log $n/n$) for convergence with respect to the loss-based dissimilarity (risk)
between a candidate estimate $\psi$ and the true parameter value $\psi_0$.


### Super Learner for Prediction {-}

Below, we show the results of
such a study, comparing the fits of several different learners, including the SL
algorithms.

r cv_fig3, results="asis", echo = FALSE
knitr::include_graphics("img/png/ericSL.png")


For more detail on Super Learner we refer the reader to @vdl2007super and
@polley2010super. The optimality results for the cross-validation selector
among a family of algorithms were established in @vdl2003unified and extended
in @vaart2006oracle.
-->
## `sl3` "Microwave Dinner" Implementation

We begin by illustrating the core functionality of the SL algorithm as
implemented in `sl3`.

The `sl3` implementation consists of the following steps:

0. Load the necessary libraries and data
1. Define the machine learning task
2. Make an SL by creating library of base learners and a metalearner
3. Train the SL on the machine learning task
4. Obtain predicted values

### WASH Benefits Study Example {-}

Using the WASH Benefits Bangladesh data, we are interested in predicting
weight-for-height z-score `whz` using the available covariate data. More
information on this dataset, and all other data that we will work with, are
described in [this chapter of the `tlverse`
handbook](ihttps://tlverse.org/tlverse-handbook/data.html). Let's begin!

### 0. Load the necessary libraries and data {-}

First, we will load the relevant `R` packages, set a seed, and load the data.

<!--
If you would like to use newer `sl3` functionality that is available in the
devel branch of the `sl3` GitHub repository, you need to install that version
of the package (i.e., `usethis::install_github(tlverse/sl3@devel)`), re-start
your `R` session, and then re-load the `sl3` package.
-->

```{r setup}
library(data.table)
library(dplyr)
library(readr)
library(ggplot2)
library(SuperLearner)
library(origami)
library(sl3)
library(knitr)
library(kableExtra)

# load data set and take a peek
washb_data <- fread(
  paste0(
    "https://raw.githubusercontent.com/tlverse/tlverse-data/master/",
    "wash-benefits/washb_data.csv"
  ),
  stringsAsFactors = TRUE
)
head(washb_data) %>%
  kable() %>%
  kableExtra:::kable_styling(fixed_thead = T) %>%
  scroll_box(width = "100%", height = "300px")
```

### 1. Define the machine learning task {-}

To define the machine learning `task` (predict weight-for-height Z-score
`whz` using the available covariate data), we need to create an `sl3_Task`
object.

The `sl3_Task` keeps track of the roles the variables play in the machine
learning problem, the data, and any metadata (e.g., observational-level
weights, IDs, offset).

Also, if we had missing outcomes, we would need to set `drop_missing_outcome =
TRUE` when we create the task. In the next analysis, with the [IST stroke trial
data](#ist), we do have a missing outcome. In the following chapter, we need to
estimate this missingness mechanism; which is the conditional probably that
the outcome is observed, given the history (i.e., variables that were measured
before the missingness). Estimating the missingness mechanism requires learning
a prediction function that outputs the predicted probability that a unit
is missing, given their history.

```{r task, warning=TRUE}
# specify the outcome and covariates
outcome <- "whz"
covars <- colnames(washb_data)[-which(names(washb_data) == outcome)]

# create the sl3 task
washb_task <- make_sl3_Task(
  data = washb_data,
  covariates = covars,
  outcome = outcome
)
```
*This warning is important.* The task just imputed missing covariates for us.
Specifically, for each covariate column with missing values, `sl3` uses the
median to impute missing continuous covariates, and the mode to impute binary
and categorical covariates.

Also, for each covariate column with missing values, `sl3` adds an additional
column indicating whether or not the value was imputed, which is particularly
handy when the missingness in the data might be informative.

Also, notice that we did not specify the number of folds, or the loss function
in the task. The default cross-validation scheme is V-fold, with $V=10$ number
of folds.

Let's visualize our `washb_task`:

```{r task-examine}
washb_task
```

We can't see when we print the task, but the default cross-validation fold
structure ($V$-fold cross-validation with $V$=10 folds) was created when we
defined the task.

```{r task-folds-examine}
length(washb_task$folds) # how many folds?

head(washb_task$folds[[1]]$training_set) # row indexes for fold 1 training
head(washb_task$folds[[1]]$validation_set) # row indexes for fold 1 validation

any(
  washb_task$folds[[1]]$training_set %in% washb_task$folds[[1]]$validation_set
)
```

Tip: If you type `washb_task$` and then press the tab  button (you will
need to press tab twice if you're not in RStudio), you can view all of the
active and public fields and methods that can be accessed from the `washb_task`
object.

### 2. Make a super learner {-}

Now that we have defined our machine learning problem with the `sl3_Task`, we
are ready to make the super learner (SL). This requires specification of

* A set of candidate machine learning algorithms, also commonly referred to as
  a library of learners. The set should include a diversity of algorithms
  that are believed to be consistent with the true data-generating distribution.
* A metalearner, to ensemble the base learners.

We might also incorporate

* Feature selection, to pass only a subset of the predictors to the algorithm.
* Hyperparameter specification, to tune base learners.

Learners have properties that indicate what features they support. We may use
`sl3_list_properties()` to get a list of all properties supported by at least
one learner.

```{r list-properties}
sl3_list_properties()
```
Since we have a continuous outcome, we may identify the learners that support
this outcome type with `sl3_list_learners()`.

```{r list-learners}
sl3_list_learners("continuous")
```

Now that we have an idea of some learners, we can construct them using the
`make_learner` function or the `new` method.

```{r baselearners}
# choose base learners
lrn_glm <- make_learner(Lrnr_glm)
lrn_mean <- Lrnr_mean$new()
```
We can customize learner hyperparameters to incorporate a diversity of different
settings. Documentation for the learners and their hyperparameters can be found
in the [`sl3` Learners
Reference](https://tlverse.org/sl3/reference/index.html#section-sl-learners).

```{r extra-lrnr-awesome}
lrn_lasso <- make_learner(Lrnr_glmnet) # alpha default is 1
lrn_ridge <- Lrnr_glmnet$new(alpha = 0)
lrn_enet.5 <- make_learner(Lrnr_glmnet, alpha = 0.5)

lrn_polspline <- Lrnr_polspline$new()

lrn_ranger100 <- make_learner(Lrnr_ranger, num.trees = 100)

lrn_hal_faster <- Lrnr_hal9001$new(max_degree = 2, reduce_basis = 0.05)

xgb_fast <- Lrnr_xgboost$new() # default with nrounds = 20 is pretty fast
xgb_50 <- Lrnr_xgboost$new(nrounds = 50)
```
We can use `Lrnr_define_interactions` to define interaction terms among
covariates. The interactions should be supplied as list of character vectors,
where each vector specifies an interaction. For example, we specify
interactions below between (1) `tr` (whether or not the subject received the
WASH intervention) and `elec` (whether or not the subject had electricity); and
between (2) `tr` and `hfiacat` (the subject's level of food security).

```{r interaction-learner}
interactions <- list(c("elec", "tr"), c("tr", "hfiacat"))
# main terms as well as the interactions above will be included
lrn_interaction <- make_learner(Lrnr_define_interactions, interactions)
```
What we just defined above is incomplete. In order to fit learners with these
interactions, we need to create a `Pipeline`. A `Pipeline` is a set of learners
to be fit sequentially, where the fit from one learner is used to define the
task for the next learner. We need to create a `Pipeline` with the interaction
defining learner and another learner that incorporate these terms when fitting
a model. Let's create a learner pipeline that will fit a linear model with the
combination of main terms and interactions terms, as specified in
`lrn_interaction`.

```{r interaction-pipe}
# we already instantiated a linear model learner, no need to do that again
lrn_glm_interaction <- make_learner(Pipeline, lrn_interaction, lrn_glm)
lrn_glm_interaction
```

We can also include learners from the `SuperLearner` `R` package.

```{r extra-lrnr-woah}
lrn_bayesglm <- Lrnr_pkg_SuperLearner$new("SL.bayesglm")
```

Here is a fun trick to create customized learners over a grid of parameters.

```{r extra-lrnr-mindblown-svm, eval = FALSE}
# I like to crock pot my SLs
grid_params <- list(
  cost = c(0.01, 0.1, 1, 10, 100, 1000),
  gamma = c(0.001, 0.01, 0.1, 1),
  kernel = c("polynomial", "radial", "sigmoid"),
  degree = c(1, 2, 3)
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
svm_learners <- apply(grid, MARGIN = 1, function(tuning_params) {
  do.call(Lrnr_svm$new, as.list(tuning_params))
})
```
```{r extra-lrnr-mindblown-xgboost}
grid_params <- list(
  max_depth = c(2, 4, 6),
  eta = c(0.001, 0.1, 0.3),
  nrounds = 100
)
grid <- expand.grid(grid_params, KEEP.OUT.ATTRS = FALSE)
grid

xgb_learners <- apply(grid, MARGIN = 1, function(tuning_params) {
  do.call(Lrnr_xgboost$new, as.list(tuning_params))
})
xgb_learners
```

Did you see `Lrnr_caret` when we called `sl3_list_learners(c("binomial"))`?  All
we need to specify to use this popular algorithm as a candidate in our SL is
the `algorithm` we want to tune, which is passed as `method` to `caret::train()`.

```{r carotene, eval = FALSE}
# Unlike xgboost, I have no idea how to tune a neural net or BART machine, so
# I let caret take the reins
lrnr_caret_nnet <- make_learner(Lrnr_caret, algorithm = "nnet")
lrnr_caret_bartMachine <- make_learner(Lrnr_caret,
  algorithm = "bartMachine",
  method = "boot", metric = "Accuracy",
  tuneLength = 10
)
```

In order to assemble the library of learners, we need to `Stack` them
together.

A `Stack` is a special learner and it has the same interface as all other
learners. What makes a stack special is that it combines multiple learners by
training them simultaneously, so that their predictions can be either combined
or compared.

```{r stack}
stack <- make_learner(
  Stack, lrn_glm, lrn_polspline, lrn_enet.5, lrn_ridge, lrn_lasso, xgb_50
)
stack
```

We can also stack the learners by first creating a vector, and then
instantiating the stack. I prefer this method, since it easily allows us to
modify the names of the learners.

```{r alt-stack}
# named vector of learners first
learners <- c(
  lrn_glm, lrn_polspline, lrn_enet.5, lrn_ridge, lrn_lasso, xgb_50
)
names(learners) <- c(
  "glm", "polspline", "enet.5", "ridge", "lasso", "xgboost50"
)
# next make the stack
stack <- make_learner(Stack, learners)
# now the names are pretty
stack
```

We're jumping ahead a bit, but let's check something out quickly. It's
straightforward, and just one more step, to set up this stack such that
all of the learners will train in a cross-validated manner.

```{r alt-stack-cv}
cv_stack <- Lrnr_cv$new(stack)
cv_stack
```

#### Screening algorithms for feature selection {-}

We can optionally select a subset of available covariates and pass only
those variables to the modeling algorithm. The current set of learners that
can be used for prescreening covariates is included below.

- `Lrnr_screener_importance` selects `num_screen` (default = 5) covariates
  based on the variable importance ranking provided by the `learner`. Any
  learner with an importance method can be used in `Lrnr_screener_importance`;
  and this currently includes `Lrnr_ranger`, `Lrnr_randomForest`, and
  `Lrnr_xgboost`.
- `Lrnr_screener_coefs`, which provides screening of covariates based on the
  magnitude of their estimated coefficients in a (possibly regularized) GLM.
  The `threshold` (default = 1e-3) defines the minimum absolute size of the
  coefficients, and thus covariates, to be kept. Also, a `max_retain` argument
  can be optionally provided to restrict the number of selected covariates to be
  no more than `max_retain`.
- `Lrnr_screener_correlation` provides covariate screening procedures by
  running a test of correlation (Pearson default), and then selecting the (1)
  top ranked variables (default), or (2) the variables with a pvalue lower than
  some pre-specified threshold.
- `Lrnr_screener_augment` augments a set of screened covariates with additional
  covariates that should be included by default, even if the screener did not
  select them. An example of how to use this screener is included below.

Let's consider screening covariates based on their `randomForest` variable
importance ranking (ordered by mean decrease in accuracy). To select the top
5 most important covariates according to this ranking, we can combine
`Lrnr_screener_importance` with `Lrnr_ranger` (limiting the number of trees by
setting `ntree = 20`).

Hang on! Before you think it --- we will confess: Bob Ross and us both know that 
20 trees makes for a lonely forest, and we shouldn't consider it, but these are 
the sacrifices we make for this chapter to be built in time!

```{r screener}
miniforest <- Lrnr_ranger$new(
  num.trees = 20, write.forest = FALSE,
  importance = "impurity_corrected"
)

# learner must already be instantiated, we did this when we created miniforest
screen_rf <- Lrnr_screener_importance$new(learner = miniforest, num_screen = 5)
screen_rf

# which covariates are selected on the full data?
screen_rf$train(washb_task)
```

An example of how to format `Lrnr_screener_augment` is included below for
clarity.
```{r screener-augment}
keepme <- c("aged", "momage")
# screener must already be instantiated, we did this when we created screen_rf
screen_augment_rf <- Lrnr_screener_augment$new(
  screener = screen_rf, default_covariates = keepme
)
screen_augment_rf
```

Selecting covariates with non-zero lasso coefficients is quite common. Let's
construct `Lrnr_screener_coefs` screener that does just that, and test it
out.

```{r screener-coefs}
# we already instantiated a lasso learner above, no need to do it again
screen_lasso <- Lrnr_screener_coefs$new(learner = lrn_lasso, threshold = 0)
screen_lasso
```

To pipe only the selected covariates to the modeling algorithm, we need to
make a `Pipeline`, similar to the one we built for the regression model with
interaction terms.

```{r screener-pipe}
screen_rf_pipe <- make_learner(Pipeline, screen_rf, stack)
screen_lasso_pipe <- make_learner(Pipeline, screen_lasso, stack)
```
Now, these learners will be preceded by a screening step.

We also consider the original `stack`, to compare how the feature selection
methods perform in comparison to the methods without feature selection.

Analogous to what we have seen before, we have to stack the pipeline and
original `stack` together, so we may use them as base learners in our super
learner.

```{r screeners-stack}
# pretty names again
learners2 <- c(learners, screen_rf_pipe, screen_lasso_pipe)
names(learners2) <- c(names(learners), "randomforest_screen", "lasso_screen")

fancy_stack <- make_learner(Stack, learners2)
fancy_stack
```

We will use the [default
metalearner](https://tlverse.org/sl3/reference/default_metalearner.html),
which uses
[`Lrnr_solnp`](https://tlverse.org/sl3/reference/Lrnr_solnp.html) to
provide fitting procedures for a pairing of [loss
function](https://tlverse.org/sl3/reference/loss_functions.html) and
[metalearner
function](https://tlverse.org/sl3/reference/metalearners.html). This
default metalearner selects a loss and metalearner pairing based on the outcome
type. Note that any learner can be used as a metalearner.

Now that we have made a diverse stack of base learners, we are ready to make the
SL. The SL algorithm fits a metalearner on the validation set
predictions/losses across all folds.

```{r make-sl}
sl <- make_learner(Lrnr_sl, learners = fancy_stack)
```
We can also use `Lrnr_cv` to build a SL, cross-validate a stack of
learners to compare performance of the learners in the stack, or cross-validate
any single learner (see "Cross-validation" section of this [`sl3`
introductory tutorial](https://tlverse.org/sl3/articles/intro_sl3.html)).

Furthermore, we can [Define New `sl3`
Learners](https://tlverse.org/sl3/articles/custom_lrnrs.html) which can be used
in all the places you could otherwise use any other `sl3` learners, including
`Pipelines`, `Stacks`, and the SL.

Recall that the discrete SL, or cross-validated selector, is a metalearner that
assigns a weight of 1 to the learner with the lowest cross-validated empirical
risk, and weight of 0 to all other learners. This metalearner specification can
be invoked with `Lrnr_cv_selector`.

```{r make-sl-discrete}
discrete_sl_metalrn <- Lrnr_cv_selector$new()
discrete_sl <- Lrnr_sl$new(
  learners = fancy_stack,
  metalearner = discrete_sl_metalrn
)
```
<!--
In the plot below, we visualize the steps for executing the Super Learner in
the `tlverse/delayed` framework. For those like myself who are not particularly
keen on understanding the intricacies of `delayed`, let's focus on the main
point of this figure: we can see there are 10 realizations of the stack which
represent the 10 cross-validation folds and there is a separate hold-out
(top branch of the figure) that will not be used to fit the Super Learner.

```{r make-sl-plot}
dt_sl <- delayed_learner_train(sl, washb_task)
plot(dt_sl, color = FALSE, height = "400px", width = "90%")
```
--->

### 3. Train the super learner on the machine learning task {-}

The SL algorithm fits a metalearner on the validation-set predictions in a
cross-validated manner, thereby avoiding overfitting.

Now we are ready to `train` our SL on our `sl3_task` object, `washb_task`.

```{r sl}
set.seed(4197)
sl_fit <- sl$train(washb_task)
```

### 4. Obtain predicted values {-}

Now that we have fit the SL, we are ready to calculate the predicted outcome
for each subject.

```{r sl-predictions}
# we did it! now we have SL predictions
sl_preds <- sl_fit$predict()
head(sl_preds)
```
<!--
Below we visualize the observed versus predicted values.

For fun, we will also include the cross-validated predictions from most popular
learner on the block, main terms linear regression.


```{r, plot-predvobs-woohoo, eval=FALSE}

# df_plot <- data.frame(Observed = washb_data[["whz"]], Predicted = sl_preds,
#                        count = seq(1:nrow(washb_data))

# df_plot_melted <- melt(df_plot, id.vars = "count",
#                         measure.vars = c("Observed", "Predicted"))

# ggplot(df_plot_melted, aes(value, count, color = variable)) + geom_point()
```
-->

We can also obtain a summary of the results.

```{r, sl-summary}
sl_fit$cv_risk(loss_fun = loss_squared_error)
```

## Cross-validated super learner

We can cross-validate the SL to see how well the SL performs on unseen data, and
obtain an estimate of the cross-validated risk of the SL.

This estimation procedure requires an outer/external layer of
cross-validation, also called nested cross-validation, which involves setting
aside a separate holdout sample that we don’t use to fit the SL. This external
cross-validation procedure may also incorporate 10 folds, which is the default
in `sl3`. However, we will incorporate 2 outer/external folds of
cross-validation for computational efficiency.

We also need to specify a loss function to evaluate SL. Documentation for the
available loss functions can be found in the [`sl3` Loss Function
Reference](https://tlverse.org/sl3/reference/loss_functions.html).

```{r CVsl}
washb_task_new <- make_sl3_Task(
  data = washb_data,
  covariates = covars,
  outcome = outcome,
  folds = origami::make_folds(washb_data, fold_fun = folds_vfold, V = 2)
)
CVsl <- CV_lrnr_sl(
  lrnr_sl = sl_fit, task = washb_task_new, loss_fun = loss_squared_error
)
CVsl %>%
  kable(digits = 4) %>%
  kableExtra:::kable_styling(fixed_thead = T) %>%
  scroll_box(width = "100%", height = "300px")
```
<!-- Explain summary!!!! -->

## Variable importance measures with `sl3`

Variable importance can be interesting and informative. It can also be
contradictory and confusing. Nevertheless, we like it, and so do our
collaborators, so we created a variable importance function in `sl3`! The `sl3`
`importance` function returns a table with variables listed in decreasing order
of importance (i.e., most important on the first row).

The measure of importance in `sl3` is based on a risk ratio, or risk difference,
between the learner fit with a removed, or permuted, covariate and the learner
fit with the true covariate, across all covariates. In this manner, the larger
the risk difference, the more important the variable is in the prediction.

The intuition of this measure is that it calculates the risk (in terms of the
average loss in predictive accuracy) of losing one covariate, while keeping
everything else fixed, and compares it to the risk if the covariate was not
lost. If this risk ratio is one, or risk difference is zero, then losing that
covariate had no impact, and is thus not important by this measure. We do this
across all of the covariates. As stated above, we can remove the covariate and
refit the SL without it, or we just permute the covariate (faster)
and hope for the shuffling to distort any meaningful information that was
present in the covariate. This idea of permuting instead of removing saves a lot
of time, and is also incorporated in the `randomForest` variable importance
measures. However, the permutation approach is risky, so the importance function
default is to remove and refit.

Let's explore the `sl3` variable importance measurements for the `washb` data.

```{r varimp}
washb_varimp <- importance(sl_fit, loss = loss_squared_error, type = "permute")
washb_varimp %>%
  kable(digits = 4) %>%
  kableExtra:::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```

```{r varimp-plot, out.width = "100%"}
# plot variable importance
importance_plot(
  washb_varimp,
  main = "sl3 Variable Importance for WASH Benefits Example Data",
  nvar = 15
)
```
<!-- Explain summary!!!! -->

## Exercises {#sl3-exercises}

### Predicting Myocardial Infarction with `sl3` {#sl3ex1}

Follow the steps below to predict myocardial infarction (`mi`) using the
available covariate data. We thank Prof. David Benkeser at Emory University for
making the this Cardiovascular Health Study (CHS) data accessible.

```{r ex1-setup}
# load the data set
db_data <- url(
  paste0(
    "https://raw.githubusercontent.com/benkeser/sllecture/master/",
    "chspred.csv"
  )
)
chspred <- read_csv(file = db_data, col_names = TRUE)

# take a quick peek
head(chspred) %>%
  kable(digits = 4) %>%
  kableExtra:::kable_styling(fixed_thead = TRUE) %>%
  scroll_box(width = "100%", height = "300px")
```

1. Create an `sl3` task, setting myocardial infarction `mi` as the outcome and
   using all available covariate data.
2. Make a library of seven relatively fast base learning algorithms. Customize 
   tuning parameters for one of your learners. Feel free to use learners from 
   `sl3` or `SuperLearner`. You may use the same base learning library that is 
   presented above.
3. Incorporate at least one pipeline with feature selection. Any screener and
   learner(s) can be used.
4. With the default metalearner and base learners, make the super learner (SL) 
   and train it on the task.
5. Print your SL fit by calling `print()` with `$`.
6. Cross-validate your SL fit to see how well it performs on unseen
   data. Specify a valid loss function to evaluate the SL.
7. Use the `importance()` function to identify the "most important" predictor of
   myocardial infarction, according to `sl3` importance metrics.

### Exercise Solutions {-}
After all exercises are submitted, the solutions will be made available here:
[Super Learning Exercise Solutions](#solutions-sl).

<!--
## Super Learning of a Conditional Density

### Super learning of a conditional density

Suppose we want to construct a Super Learner of the conditional probability
distribution $g_0(a\mid W)=P_0(A=a\mid W)$, where $a\in {\cal A}$.
Let's denote the values of $a$ with $\{0,1,\ldots,K\}$. A valid loss function
for the conditional density is
\[
L(g)(O)=-\log g(A\mid W).\]
That is, $g_0=\arg\min_g P_0L(g)$, i.e., $g_0$ is the minimizer of the
expectation of the log-likelihood loss.

**Candidate estimators**

1. Candidate estimators based on multinomial logistic regression: To start
with, one can use existing parametric model based MLE and machine learning
algorithms in `R` that fit a multinomial regression. For example, parametric
model multinomial logistic regression is available in `R` so that one can
already build a rich library of such estimators based on  different candidate
parametric models. In addition, `polyclass()` is a multinomial logistic
regression machine learning algorithm in `R`.

2. Candidate estimators based on machine learning for multinomial logistic
regression: Secondly, one can use a machine learning algorithm such as
`polyclass()` in `R` that data adaptively fits a multinomial logistic
regression, which itself has tuning parameters, again generating a class of
candidate estimators.

3. Incorporating screening: Note that one can also marry any of these choices
with a screening algorithm, thereby creating more candidate estimators of
interest. The screening can be particularly important when there are many
variables.

4. Candidate estimators by fitting separate logistic regressions and using
post-normalization

* Code $A$ in terms of Bernoullis $B_k=I(A=k)$, $k=0,\ldots,K$.
* Construct an estimator $\bar{g}_{nk}$ of $\bar{g}_{0k}(W)\equiv P_0(B_k=1\mid
  W)$ using any of the logistic regression algorithms, for all $k=0,\ldots,K$.
* This implies an estimator
\[
g_n(a\mid W)=\frac{\bar{g}_{na}(W)}{\sum_{k=0}^K \bar{g}_{nk}(W)}.\]
* In other words, we simply normalize these separate logistic regression
estimators so that we obtain a valid conditional distribution.
* This generates an enormous amount of interesting algorithms, since we have
available the whole machine learning literature for binary outcome regression.

5. Candidate estimators by estimating the conditional "hazard" with pooled
logistic regression.
Note that
\[
g_0(a\mid W)=\lambda_0(a\mid W) S_0(a\mid W),\]
where \[
\lambda_0(a\mid W)=P_0(A=a\mid A\geq a,W),\]

and $S_0(a\mid W)=\prod_{s\leq a}(1-\lambda_0(s\mid W))$ is the conditional
survival function $P_0(A>a\mid W)$. So we have now parameterized the
conditional distribution of $A$, given $W$, by a conditional hazard
$\lambda_0(a\mid W)$: $g_0=g_{\lambda_0}$.

* We could now focus on constructing candidate estimators of
$\lambda_0(a\mid W)$, which implies candidate estimators of $g_0$.

* For every observation $A_i$, we can create $A_i+1$ rows of data
$(W,s,I(A_i=s))$, $s=0,\ldots,A_i$, $i=1,\ldots,n$. We now run a logistic
regression estimator based on the pooled data set, ignoring ID, where we
regress the binary outcome $I(A_i=s)$ on the covariates $(W,s)$.

* If one assumes a parametric model, then this is nothing else then using the
maximum likelihood estimator, demonstrating that ignoring the ID is not
inefficient.

* This defines now an estimator of $\lambda_0(s\mid W)=P_0(A=s\mid W,A\geq s)$
as a function of $(s,W)$.

* Different choices of logistic regression based estimators will define
different estimators.

* The pooling across $s$ is not very sensible if $A$ is not an ordered variable
If $A$ is categorical, we recommend to compute  a separate logistic regression
estimator of $\lambda_0(a\mid W)$ for each $a$ (i.e., stratify by $s$ in the
  above pooled data set).

* For non-categorical $A$, one could include both stratified (by level) as well
as pooled (across levels) based logistic regression estimators.


## Exercise 2 -- Estimating the Propensity Score with `sl3` {#sl3ex-pscore}

exercise where we can look at positivity and maybe modify target population,
address issues related to this

## Super Learning of an Optimal Individualized Treatment Rule

* Data $O=(W,A,Y)$, and nonparametric model \mathcal{M} potentially containing
assumptions on the conditional probability distribution of $A$ given $W$
$g_0(A\mid W)$.
* Target: Optimal treatment rule $\psi_0(W)=I(B_0(W)>0)$, where
$B_0(W)=E_0(Y\mid A=1,W)-E_0(Y\mid A=0,W)$, the conditional treatment effect.
* Possible loss function for $\psi_0$ is an IPCW-loss:
\[
L_{g_0}(\psi)=\frac{I(A=\psi(W))}{g(A\mid W)}Y.\]

Indeed, $\psi_0$ is the minimizer of $EL_{g_0}(\psi)$ over all rules $\psi$.
* Construct library of candidate estimators of $\psi_0=I(B_0>0)$. This can
include estimators based on plugging in an estimator of $B_0$.
* One could also include a candidate estimator $I(B_n>0)$ where $B_n$ is a
Super Learner of $B_0$, e.g. based on loss function
\[
L_{g_0}(B)=\big(\frac{2A-1}{/g(A\mid W)}Y-B(W)\big)^2\]
that directly targets $B_0=\arg\min_B P_0L_{g_0}(B)$. This loss function is
still a squared error loss but its minimized by the true $B_0$.
* Estimate $g_0$ if not known.
* Compute cross-validation selector:
\[
k_n=\arg\min_k E_{B_n}P_{n,B_n}^1 L_{\hat{g}(P_{n,B_n}^0)}
(\hat{\Psi}_k(P_{n,B_n}^0)).\]
where $B_n = \{0,1\}^n$ is used for a binary vector of $n$ defining sample
splits, where the validation sample is ${i:B_n(i) = 1}$ and ${i:B_n(i) = 0}$ is
the training sample. The empirical distribution $P_{n,B_n}^0$ corresponds to
the split $B_n$ of the training sample and the empirical distribution of the
validation sample is $P_{n,B_n}^1$.
* Super-learner of optimal rule $\psi_0$: $\hat{\Psi}_{k_n}(P_n)$.

## Exercise 3 -- Estimating the Blip {#sl3ex3}
-->

## Concluding Remarks

- Super learner (SL) is a general approach that can be applied to a diversity of
  estimation and prediction problems which can be defined by a loss function.

- It would be straightforward to plug in the estimator returned by SL into the
  target parameter mapping.
    + For example, suppose we are after the average treatment effect (ATE) of a
      binary treatment intervention:
      $\Psi_0 = \mathbb{E}_{0,W}[\mathbb{E}_0(Y \mid A=1,W) -
        \mathbb{E}_0(Y \mid A=0,W)]$.
    + We could use the SL that was trained on the original data (let's call
      this `sl_fit`) to predict the outcome for all subjects under each
      intervention. All we would need to do is take the average difference
      between the counterfactual outcomes under each intervention of interest.
    + Considering $\Psi_0$ above, we would first need two $n$-length vectors of
      predicted outcomes under each intervention. One vector would represent the
      predicted outcomes under an intervention that sets all subjects to
      receive $A=1$, $Y_i \mid A_i=1,W_i$ for all $i=1,\ldots,n$. The other
      vector would represent the predicted outcomes under an intervention that
      sets all subjects to receive $A=0$, $Y_i \mid A_i=0,W_i$ for all
      $i=1,\ldots,n$.
    + After obtaining these vectors of counterfactual predicted outcomes, all
      we would need to do is average and then take the difference in order to
      plug-in the SL estimator into the target parameter mapping.
    + In `sl3` and with our current ATE example, this could be achieved with
      `mean(sl_fit$predict(A1_task)) - mean(sl_fit$predict(A0_task))`;
      where `A1_task$data` would contain all 1's (or the level that pertains to
      receiving the treatment) for the treatment column in the data (keeping all
      else the same), and `A0_task$data` would contain all 0's (or the level
      that pertains to not receiving the treatment) for the treatment column in
      the data.
- It's a worthwhile exercise to obtain the predicted counterfactual outcomes
  and create these counterfactual `sl3` tasks. It's too biased; however, to
  plug the SL fit into the target parameter mapping, (e.g., calling the result
  of `mean(sl_fit$predict(A1_task)) - mean(sl_fit$predict(A0_task))` the
  estimated ATE. We would end up with an estimator for the ATE that was
  optimized for estimation of the prediction function, and not the ATE!
- Ultimately, we want an estimator that is optimized for our target estimand of 
  interest. Here, we cared about doing a good job estimating the ATE. The 
  SL is an essential step to help us get there. In fact, we will use the 
  counterfactual predicted outcomes that were explained at length above. 
  However, SL might not be not the end of the estimation procedure. 
  Plugging in the super learner in the target parameter representation would 
  generally not result in an asymptotically linear estimator of the target 
  estimand. This begs the question, why is it important for an estimator to 
  possess these properties?

  + An asymptotically linear estimator converges to the estimand at 
    $\frac{1}{\sqrt{n}}$ rate --- and thus behaves as sample mean --- so it's 
    sampling distribution can be estimated in order to conduct formal 
    statistical inference (i.e., confidence intervals and $p$-values).
  + Substitution, or plug-in, estimators of the estimand are desirable because
    they respect both the local and global constraints of the statistical model
    (e.g., bounds), and have they have better finite-sample properties.
  + An efficient estimator is optimal in the sense that it has the lowest
    possible variance, and is thus the most precise. An estimator is efficient
    if and only if is asymptotically linear with influence curve equal to the
    canonical gradient.

      + The canonical gradient is a mathematical object that is specific to
        the target estimand, and it provides information on the level of
        difficulty of the estimation problem. Various canonical gradient are
        shown in the chapters that follow.
      + Practitioner's do not need to know how to calculate a canonical
        gradient in order to understand efficiency and use Targeted Maximum
        Likelihood Estimation (TMLE). Metaphorically, you do not need to be
        Yoda in order to be a Jedi.

- TMLE is a general strategy that succeeds in constructing efficient and
  asymptotically linear plug-in estimators.
- SL is fantastic for pure prediction, and for obtaining an initial
  estimate in the first step of TMLE, but we need the second step of TMLE to
  have the desirable statistical properties mentioned above.
- In the chapters that follow, we focus on the Targeted Maximum Likelihood
  Estimator and it's generalization to Targeted Minimum Loss-based Estimator, 
  both referred to as TMLE.

<!--
We could just plug-in the estimator returned by Super Learner; however, this is
problematic because the Super Learner estimators are trading off bias and
variance in an optimal way and as a result their bias is essentially the rate
of convergence of these algorithms, which is always slower than $1/\sqrt{n}$.
Therefore, if we plug-in the estimator returned by Super Learner into the
target parameter mapping, we would end up with an estimator which has the same
bias as what we plugged in, which is greater than $1/\sqrt{n}$. Thus, we end up
with an estimator which is not asymptotically normal, since it does not
converge to the estimand at $1/\sqrt{n}$ rate.

An asymptotically linear estimator has no meaningful bias ($ < 1/\sqrt{n}$),
and can be written as an empirical mean in first order of a function of the
data, the influence curve, plus some negligible remainder term. Once an
estimator is asymptotically linear with an influence curve it’s normally
distributed, so the standardized estimator converges to a normal distribution
with mean 0 and variance is the variance of the influence curve. Thus, it is
advantageous to construct asymptotically linear estimators since they permit
formal statistical inference. Among the class of regular asymptotically linear
estimators, there is an optimal estimator which is an efficient estimator, and
that’s the one with influence curve equal to the canonical gradient of the
path-wise derivative of the target parameter. The canonical gradient is the
direction of the path through the data distribution where the parameter is
steepest. An estimator is efficient if and only if is asymptotically linear
with influence curve equal to the canonical gradient. One can calculate the
canonical gradient with the statistical model and the statistical target
parameter.  Techniques for calculating the canonical gradient entail projecting
an initial gradient on the tangent space of the model at the particular
distribution in the model in which you want to calculate the canonical
gradient.

Now we know what it takes to construct an efficient estimator. Namely, we need
to construct an estimator which is asymptotically linear with influence curve
the canonical gradient. There are three general classes of estimators which
succeed in constructing asymptotically linear estimators: (1) the one-step
estimator, but it is not a plug-in estimator; (2) the targeted maximum
likelihood estimator, which is a Super Learner targeted towards the target
parameter and it is a plug-in estimator; and (3) estimating equation based
estimators, which use the canonical gradient but as an estimating function in
the target parameter. In the chapters that follow, we focus on the targeted
maximum likelihood estimator and the targeted minimum loss-based estimator,
both referred to as TMLE.
-->