By the end of this chapter you will be able to:
- Translate scientific questions to statistical questions.
- Define a statistical model based on the knowledge of the experiment that generated the data.
- Identify a causal parameter as a function of the observed data distribution.
- Explain the following causal and statistical assumptions and their implications: i.i.d., consistency, interference, positivity, SUTVA.
The roadmap of statistical learning is concerned with the translation from real-world data applications to a mathematical and statistical formulation of the relevant estimation problem. This involves data as a random variable having a probability distribution, scientific knowledge represented by a statistical model, a statistical target parameter representing an answer to the question of interest, and the notion of an estimator and sampling distribution of the estimator.
The roadmap proceeds in five stages:
- The data, a random variable with a probability distribution, $O \sim P_0$.
- The statistical model $\mathcal{M}$ such that $P_0 \in \mathcal{M}$.
- The statistical target parameter $\Psi$ and estimand $\Psi(P_0)$.
- The estimator $\hat{\Psi}$ and estimate $\hat{\Psi}(P_n)$.
- A measure of uncertainty for the estimate $\hat{\Psi}(P_n)$.
The data set we're confronted with is the result of an experiment, and we can view the data as a random variable: if the experiment were repeated, we would obtain a different realization. We write $O \sim P_0$, where $O$ denotes the observed data unit and $P_0$ its true, unknown probability distribution. Once we have observed $n$ i.i.d. copies $O_1, \ldots, O_n$ of $O$, the empirical distribution $P_n$, which places probability $1/n$ on each observed $O_i$, summarizes the data at hand.
In order to start learning something, we need to ask "What do we know about the probability distribution of the data?" This brings us to Step 2.
The statistical model $\mathcal{M}$ is the set of possible probability distributions for $O$; it represents what we actually know about the experiment that generated the data. When the model places essentially no restrictions on the form of $P_0$, it is nonparametric. Alternatively, if the probability distribution of the data at hand is described by a finite number of parameters, then the statistical model is parametric. In this case, we subscribe to the belief that the random variable $O$ follows a distribution of known functional form indexed by a finite-dimensional parameter (e.g., a normal distribution with unknown mean and variance).
Sadly, the assumption that the data-generating distribution has a specific parametric form is all too common, even when such an assumption is a leap of faith. This practice of oversimplification in the current culture of data analysis typically derails any attempt to answer the scientific question at hand; alas, statements such as the ever-popular quip of Box that "all models are wrong, but some are useful" encourage the data analyst to make arbitrary choices, even when those choices force significant differences in answers to the same estimation problem. The Targeted Learning paradigm does not suffer from this bias, since it defines the statistical model through a representation of the true data-generating distribution corresponding to the observed data.
Now, on to Step 3: "What are we trying to learn from the data?"
The statistical target parameter, $\Psi$, is a mapping from the statistical model $\mathcal{M}$ into the parameter space (e.g., the real line); it defines the feature of the data distribution that answers the question of interest.
For a simple example, consider a data set that contains an observed survival time on every subject, and suppose our question of interest is "What is the probability that someone lives longer than five years?" We then have \begin{equation*} \Psi(P_0) = \mathbb{P}_{P_0}(O > 5). \end{equation*}
The answer to this question is the estimand, $\Psi(P_0)$: the value of the target parameter evaluated at the true distribution $P_0$.
To obtain a good approximation of the estimand, we need an estimator, an a priori-specified algorithm defined as a mapping from the set of possible empirical distributions $P_n$ to the parameter space.
Where the estimator may be seen as an operator that maps the observed data and its corresponding empirical distribution to a value in the parameter space, the numerical output produced by that mapping is the estimate. Thus, the estimate is an element of the parameter space based on the empirical probability distribution of the observed data. If we plug in a realization of $P_n$, based on a concrete data set, we obtain the corresponding numerical value of the estimate, $\hat{\Psi}(P_n)$.
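To make the estimator/estimate distinction concrete, here is a minimal R sketch of the survival example above: the plug-in estimator of $\Psi(P_0) = \mathbb{P}_{P_0}(O > 5)$, applied to the empirical distribution, is simply the sample proportion of survival times exceeding five years. The data-generating distribution and sample size below are illustrative assumptions, not part of the example in the text.

```r
# Plug-in estimation of Psi(P_0) = P(O > 5) from the empirical distribution.
# The data-generating distribution (exponential with mean 6) is illustrative.
set.seed(511)
n <- 1000
o <- rexp(n, rate = 1 / 6)   # n i.i.d. survival times O_1, ..., O_n
psi_hat <- mean(o > 5)       # the estimate Psi-hat(P_n)
psi_hat
```

Here `mean(o > 5)` evaluates the same mapping $\Psi$ at the empirical distribution $P_n$ in place of the unknown $P_0$.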
In order to quantify the uncertainty in our estimate of the target parameter (i.e., to construct statistical inference), an understanding of the sampling distribution of our estimator will be necessary. This brings us to Step 5.
Since the estimator $\hat{\Psi}$ is a function of the data, and the data are random, the estimator itself is a random variable with its own probability distribution, known as its sampling distribution.
Central Limit Theorems (CLTs) are statements regarding the convergence of the sampling distribution of an estimator to a normal distribution. In general, we will construct estimators whose sampling distributions may be shown to be approximately normal as the sample size increases. For large enough $n$, a Wald-style 95% confidence interval may then be constructed in the familiar form: the estimate plus or minus 1.96 standard errors.
Note: we will typically have to estimate the standard error of the estimator, since it depends on the unknown true distribution $P_0$.
A 95% confidence interval means that, if we were to draw many different samples of size $n$ and construct a confidence interval from each, then approximately 95% of those intervals would contain the true value of the estimand.
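Continuing the survival example, the interval construction can be sketched in a few lines of R; the simulated data and the binomial variance formula for a sample proportion are illustrative assumptions.

```r
# 95% Wald confidence interval for the plug-in estimate of P(O > 5).
# Simulated survival times (exponential with mean 6) are illustrative only.
set.seed(95)
n <- 1000
o <- rexp(n, rate = 1 / 6)
psi_hat <- mean(o > 5)
se_hat <- sqrt(psi_hat * (1 - psi_hat) / n)       # estimated standard error
ci <- psi_hat + c(-1, 1) * qnorm(0.975) * se_hat  # estimate +/- 1.96 * SE
ci
```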
Consider data $O = (W, A, Y) \sim P_0$, where $W$ denotes a vector of baseline covariates, $A$ an exposure or treatment, and $Y$ an outcome of interest.
After formalizing the data and the statistical model, we can define a causal model to express causal parameters of interest. Directed acyclic graphs (DAGs) are one useful tool to express what we know about the causal relations among variables. Ignoring exogenous error terms, a simple causal structure for $O = (W, A, Y)$ may be depicted as follows.
[DAG with nodes $W$, $A$, and $Y$, and directed edges $W \rightarrow A$, $W \rightarrow Y$, and $A \rightarrow Y$.]
While directed acyclic graphs (DAGs) like above provide a convenient means by
which to visualize causal relations between variables, the same causal
relations among variables can be represented via a set of structural equations,
which define the non-parametric structural equation model (NPSEM):
\begin{align*}
W &= f_W(U_W) \\
A &= f_A(W, U_A) \\
Y &= f_Y(W, A, U_Y),
\end{align*}
where $U = (U_W, U_A, U_Y)$ are exogenous random variables (unmeasured background factors), and $f_W$, $f_A$, and $f_Y$ are deterministic but otherwise unspecified functions.
The first hypothetical experiment we will consider is assigning exposure to the whole population and observing the outcome, and then assigning no exposure to the whole population and observing the outcome. In terms of the nonparametric structural equation model, this corresponds to a comparison of the outcome distribution in the population under two interventions:
- $A$ is set to $1$ for all individuals, and
- $A$ is set to $0$ for all individuals.
These interventions imply two new nonparametric structural equation models. For the case $A = a$, the equation for $A$ is replaced by the constant $a$:
\begin{align*}
W &= f_W(U_W) \\
A &= a \\
Y(a) &= f_Y(W, a, U_Y).
\end{align*}
In these equations, $Y(a)$ denotes the counterfactual outcome that would have been observed for a given individual had the exposure been set to $A = a$.
Note, we can define much more complicated interventions on NPSEMs, such as interventions based upon rules (themselves based upon covariates), stochastic rules, etc.; each results in a different target parameter and entails different identifiability assumptions, discussed below.
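The interventions above can be mimicked in a simulation: choose concrete structural equations for $f_W$, $f_A$, and $f_Y$ (the choices below are purely illustrative; in practice these functions are unknown), and generate counterfactual outcomes by setting $A$ to $1$ or $0$ while holding the exogenous errors fixed.

```r
# Simulating interventions on a concrete NPSEM with structure W -> A -> Y and
# W -> Y. The structural equations below are illustrative choices.
set.seed(69)
n <- 1e4
u_w <- runif(n); u_a <- runif(n); u_y <- runif(n)  # exogenous errors U
w <- as.numeric(u_w < 0.5)                         # W = f_W(U_W)
a <- as.numeric(u_a < plogis(-0.5 + w))            # A = f_A(W, U_A)
y <- as.numeric(u_y < plogis(-1 + a + 0.5 * w))    # Y = f_Y(W, A, U_Y)
# Interventions: set A to 1 (or 0) for all individuals, errors held fixed
y1 <- as.numeric(u_y < plogis(-1 + 1 + 0.5 * w))   # Y(1)
y0 <- as.numeric(u_y < plogis(-1 + 0 + 0.5 * w))   # Y(0)
ate_true <- mean(y1 - y0)  # causal ATE in this simulated population
ate_true
```

Because both $Y(1)$ and $Y(0)$ are generated for every unit, the simulation can evaluate a causal contrast that is never directly observable in real data.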
Because we can never observe both $Y(0)$ and $Y(1)$ on the same unit, causal parameters such as the average treatment effect (ATE), $\mathbb{E}_0[Y(1) - Y(0)]$, cannot be identified from the observed data without further assumptions:
1. The causal graph implies $Y(a) \perp A$ for all $a \in \mathcal{A}$, which is the randomization assumption. In the case of observational data, the analogous assumption is strong ignorability or no unmeasured confounding: $Y(a) \perp A \mid W$ for all $a \in \mathcal{A}$.
2. Although not represented in the causal graph, also required is the assumption of no interference between units; that is, the outcome for unit $i$, $Y_i$, is not affected by the exposure for unit $j$, $A_j$, unless $i = j$.
3. Consistency of the treatment mechanism is also required; i.e., the outcome for unit $i$ is $Y_i(a)$ whenever $A_i = a$, an assumption also known as "no other versions of treatment".
4. It is also necessary that all observed units, across strata defined by $W$, have a bounded (non-deterministic) probability of receiving treatment; that is, $0 < \mathbb{P}(A = a \mid W) < 1$ for all $a$ and $W$. This assumption is referred to as positivity or overlap.
Remark: Together, (2) and (3), the assumptions of no interference and consistency, respectively, are jointly referred to as the stable unit treatment value assumption (SUTVA).
Given these assumptions, the ATE may be re-written as a function of $P_0$ alone, via the g-computation formula:
\begin{equation*}
ATE = \mathbb{E}_0[Y(1) - Y(0)] = \mathbb{E}_0\left[\mathbb{E}_0(Y \mid A = 1, W) - \mathbb{E}_0(Y \mid A = 0, W)\right],
\end{equation*}
where the outer expectation is over the distribution of $W$. The right-hand side is a statistical estimand, i.e., a function of the observed data distribution.
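As a sketch, a plug-in g-computation estimator can be evaluated on simulated data; the logistic regression below is an illustrative working model for the outcome regression, not the estimator developed later in the handbook, and the data-generating choices are assumptions made for this example.

```r
# Plug-in g-computation for the ATE on simulated data with structure
# W -> A, W -> Y, A -> Y. All data-generating choices are illustrative.
set.seed(80)
n <- 5000
w <- rbinom(n, 1, 0.5)                       # baseline covariate
a <- rbinom(n, 1, plogis(-0.5 + w))          # exposure depends on W
y <- rbinom(n, 1, plogis(-1 + a + 0.5 * w))  # outcome depends on A and W
fit <- glm(y ~ a + w, family = binomial())   # working outcome regression
# Predict each unit's outcome under A = 1 and A = 0, then average over W
y1_hat <- predict(fit, newdata = data.frame(a = 1, w = w), type = "response")
y0_hat <- predict(fit, newdata = data.frame(a = 0, w = w), type = "response")
ate_hat <- mean(y1_hat - y0_hat)
ate_hat
```

Averaging the two counterfactual predictions over the empirical distribution of $W$ is exactly the plug-in evaluation of the g-computation formula above.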
The data come from a study of the effect of water quality, sanitation, hand washing, and nutritional interventions on child development in rural Bangladesh (WASH Benefits Bangladesh): a cluster-randomised controlled trial [@luby2018effects]. The study enrolled pregnant women in their first or second trimester from the rural villages of Gazipur, Kishoreganj, Mymensingh, and Tangail districts of central Bangladesh, with an average of eight women per cluster. Groups of eight geographically adjacent clusters were block-randomised, using a random number generator, into six intervention groups (all of which received weekly visits from a community health promoter for the first 6 months and every 2 weeks for the next 18 months) and a double-sized control group (no intervention or health promoter visit). The six intervention groups were:
- chlorinated drinking water;
- improved sanitation;
- hand-washing with soap;
- combined water, sanitation, and hand washing;
- improved nutrition through counseling and provision of lipid-based nutrient supplements; and
- combined water, sanitation, handwashing, and nutrition.
In the workshop, we concentrate on child growth (size for age) as the outcome of interest. For reference, this trial was registered with ClinicalTrials.gov as NCT01590095.
library(tidyverse)
# read in data
dat <- read_csv("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv")
dat
# A tibble: 4,695 × 28
whz tr fracode month aged sex momage momedu momheight hfiacat Nlt18
<dbl> <chr> <chr> <dbl> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 0 Control N05265 9 268 male 30 Prima… 146. Food S… 3
2 -1.16 Control N05265 9 286 male 25 Prima… 149. Modera… 2
3 -1.05 Control N08002 9 264 male 25 Prima… 152. Food S… 1
4 -1.26 Control N08002 9 252 female 28 Prima… 140. Food S… 3
5 -0.59 Control N06531 9 336 female 19 Secon… 151. Food S… 2
# … with 4,690 more rows, and 17 more variables: Ncomp <dbl>, watmin <dbl>,
# elec <dbl>, floor <dbl>, walls <dbl>, roof <dbl>, asset_wardrobe <dbl>,
# asset_table <dbl>, asset_chair <dbl>, asset_khat <dbl>, asset_chouki <dbl>,
# asset_tv <dbl>, asset_refrig <dbl>, asset_bike <dbl>, asset_moto <dbl>,
# asset_sewmach <dbl>, asset_mobile <dbl>
For the purposes of this workshop, we start by treating the data as independent and identically distributed (i.i.d.) random draws from a very large target population. We could, with available options, account for the clustering of the data (within sampled geographic units); but, for simplicity, we avoid these details in the workshop presentations, noting that modifications of our methodology for biased samples, repeated measures, and related complications are available.
We have 28 variables measured, of which one is set to be the outcome of interest: the weight-for-height Z-score (whz in dat). The treatment of interest is the intervention assignment (tr in dat), and the remaining baseline covariates form the adjustment set $W$.
Using the skimr
package, we can
quickly summarize the variables measured in the WASH Benefits data set:
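The summary shown below can be reproduced with a call along these lines (assuming the skimr package is installed and the data URL used earlier is reachable):

```r
# Summarize all variables in the WASH Benefits data with skimr
library(readr)
library(skimr)
dat <- read_csv("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv")
skim(dat)
```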
Table: (#tab:skim_washb_data) Data summary

Name | dat |
---|---|
Number of rows | 4695 |
Number of columns | 28 |
Column type frequency: | |
character | 5 |
numeric | 23 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
tr | 0 | 1 | 3 | 15 | 0 | 7 | 0 |
fracode | 0 | 1 | 2 | 6 | 0 | 20 | 0 |
sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
momedu | 0 | 1 | 12 | 15 | 0 | 3 | 0 |
hfiacat | 0 | 1 | 11 | 24 | 0 | 4 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
whz | 0 | 1.00 | -0.59 | 1.03 | -4.67 | -1.28 | -0.6 | 0.08 | 4.97 | ▁▆▇▁▁ |
month | 0 | 1.00 | 6.45 | 3.33 | 1.00 | 4.00 | 6.0 | 9.00 | 12.00 | ▇▇▅▇▇ |
aged | 0 | 1.00 | 266.32 | 52.17 | 42.00 | 230.00 | 266.0 | 303.00 | 460.00 | ▁▂▇▅▁ |
momage | 18 | 1.00 | 23.91 | 5.24 | 14.00 | 20.00 | 23.0 | 27.00 | 60.00 | ▇▇▁▁▁ |
momheight | 31 | 0.99 | 150.50 | 5.23 | 120.65 | 147.05 | 150.6 | 154.06 | 168.00 | ▁▁▆▇▁ |
Nlt18 | 0 | 1.00 | 1.60 | 1.25 | 0.00 | 1.00 | 1.0 | 2.00 | 10.00 | ▇▂▁▁▁ |
Ncomp | 0 | 1.00 | 11.04 | 6.35 | 2.00 | 6.00 | 10.0 | 14.00 | 52.00 | ▇▃▁▁▁ |
watmin | 0 | 1.00 | 0.95 | 9.48 | 0.00 | 0.00 | 0.0 | 1.00 | 600.00 | ▇▁▁▁▁ |
elec | 0 | 1.00 | 0.60 | 0.49 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▆▁▁▁▇ |
floor | 0 | 1.00 | 0.11 | 0.31 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
walls | 0 | 1.00 | 0.72 | 0.45 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▃▁▁▁▇ |
roof | 0 | 1.00 | 0.99 | 0.12 | 0.00 | 1.00 | 1.0 | 1.00 | 1.00 | ▁▁▁▁▇ |
asset_wardrobe | 0 | 1.00 | 0.17 | 0.37 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▂ |
asset_table | 0 | 1.00 | 0.73 | 0.44 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▃▁▁▁▇ |
asset_chair | 0 | 1.00 | 0.73 | 0.44 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▃▁▁▁▇ |
asset_khat | 0 | 1.00 | 0.61 | 0.49 | 0.00 | 0.00 | 1.0 | 1.00 | 1.00 | ▅▁▁▁▇ |
asset_chouki | 0 | 1.00 | 0.78 | 0.41 | 0.00 | 1.00 | 1.0 | 1.00 | 1.00 | ▂▁▁▁▇ |
asset_tv | 0 | 1.00 | 0.30 | 0.46 | 0.00 | 0.00 | 0.0 | 1.00 | 1.00 | ▇▁▁▁▃ |
asset_refrig | 0 | 1.00 | 0.08 | 0.27 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
asset_bike | 0 | 1.00 | 0.32 | 0.47 | 0.00 | 0.00 | 0.0 | 1.00 | 1.00 | ▇▁▁▁▃ |
asset_moto | 0 | 1.00 | 0.07 | 0.25 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
asset_sewmach | 0 | 1.00 | 0.06 | 0.25 | 0.00 | 0.00 | 0.0 | 0.00 | 1.00 | ▇▁▁▁▁ |
asset_mobile | 0 | 1.00 | 0.86 | 0.35 | 0.00 | 1.00 | 1.0 | 1.00 | 1.00 | ▁▁▁▁▇ |
A convenient summary of the relevant variables is given just above, complete with a small visualization describing the marginal characteristics of each covariate. Note that the asset variables reflect socioeconomic status of the study participants. Notice also the uniform distribution of the treatment groups (with twice as many controls); this is, of course, by design.
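The observation about the treatment arms can be checked directly (again assuming the data URL used earlier is reachable): six intervention groups of similar size, plus a control group roughly twice as large, matching the trial design.

```r
# Tabulate the randomized arms of the WASH Benefits trial
library(dplyr)
library(readr)
dat <- read_csv("https://raw.githubusercontent.com/tlverse/tlverse-data/master/wash-benefits/washb_data.csv")
dat %>% count(tr)
```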