---
title: "Multilevel models"
author:
- affiliation: Centre for Applied Vision Research, City, University of London
  name: Matteo Lisi
date: "GDR Vision - Paris 4/10/2018"
output:
  ioslides_presentation:
    css: ./style.css
    incremental: yes
    logo: logocitynew.pdf.png
    smaller: yes
    transition: 0
subtitle: frequentist and Bayesian approaches
---
```{r setup, include=FALSE}
#knitr::opts_chunk$set(echo = FALSE, collapse = TRUE, comment=NA, tidy.opts=list(width.cutoff=60),tidy=TRUE)
library(knitr)
opts_chunk$set(collapse = TRUE, comment=NA, tidy.opts=list(width.cutoff=55),tidy=TRUE)
library(lme4)
library(mlisi)
library(rstan)
```
## Today's program
> 1. Multilevel models from a frequentist perspective
> 2. The Bayesian approach
> 3. Markov Chain Monte Carlo (MCMC)
> 4. Examples
## Linear models
> - The dependent variable $y$ is modeled as a weighted combination of the independent variables, plus an additive error $\epsilon$ $$
y_i=\beta_0 + \beta_1x_{1i} + \ldots +\beta_nx_{ni} + \epsilon_i \\
\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)
$$
## Linear models
> - The dependent variable $y$ is modeled as a weighted combination of the independent variables, plus an additive error $\epsilon$ $$
\textbf{Y} = \textbf{X}\beta + \epsilon \\
\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)
$$
----
Matrix notation:
$$
\bf{Y} = \bf{X}\beta + \epsilon
$$
$$
\left( \begin{array}{c} y_1 \\ \vdots \\ y_m \end{array} \right) =
\left( \begin{array}{cccc}
1 & x_{11} & \ldots & x_{1n}\\
\vdots & \vdots & \vdots & \vdots\\
1 & x_{m1} & \ldots & x_{mn}
\end{array} \right)
\left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_n \end{array} \right) +
\left( \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_m \end{array} \right)
$$
Matrix multiplication:
$$
\left( \begin{array}{cc} a & b \\ c& d \end{array} \right)
\left( \begin{array}{c} 1 \\ 2 \end{array} \right) =
1 \left( \begin{array}{c} a \\ c \end{array} \right) +
2 \left( \begin{array}{c} b \\ d \end{array} \right) =
\left( \begin{array}{c} a + 2b \\ c+ 2d \end{array} \right)
$$
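As a quick illustration (a minimal sketch with made-up data, not part of the original example), the design matrix $\textbf{X}$ that R builds for a linear model can be inspected with `model.matrix()`:
```{r, eval=F}
d <- data.frame(y = rnorm(6), x1 = rnorm(6), x2 = rnorm(6))
X <- model.matrix(y ~ x1 + x2, data = d)  # first column of 1s corresponds to the intercept
X
fit <- lm(y ~ x1 + x2, data = d)          # least-squares estimates of beta
coef(fit)
```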
## Linear mixed-effects models
- 'Classical' linear models are _fixed-effects_ only:
    - the independent variables are all experimental manipulations (they are not random).
    - the only random source of variation is the residual error $\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)$.
- However, _observations_ (e.g. trials) are often grouped into _observational clusters_ (e.g. subjects), which are random samples from a larger population about which we'd like to make inferences.
- To generalize properly from the sample to the population, these grouping variables should be treated as _random effects_. Mixed-effects models allow us to do that by explicitly modelling the population distribution.
----
- A model containing both fixed and random effects is usually called a _mixed-effects_ model (but also: hierarchical, multilevel, etc.).
- Random effects are treated as random variations around a population mean. These random variations are usually assumed to have a Gaussian distribution.
- A simple example: a _random-intercept_ model. The regression has the same slope in each of the $J$ groups ($j=1,\dots,J$), but the intercept varies randomly across groups (an `lme4` call for such a model is sketched below) $$
y_{ij} = \beta_0 + b_j + \beta_1x_{ij} + \epsilon_{i}\\
\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)\\
b \sim \mathcal{N}\left( 0, \sigma_b^2 \right)
$$
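A minimal sketch of such a model in `lme4` syntax, using simulated data with hypothetical variable names (`y`, `x`, `group`):
```{r, eval=F}
set.seed(1)
dat <- data.frame(group = factor(rep(1:10, each = 20)), x = rnorm(200))
dat$y <- 2 + rep(rnorm(10, sd = 1), each = 20) + 0.5 * dat$x + rnorm(200, sd = 0.3)  # random intercepts
m_ri <- lmer(y ~ 1 + x + (1 | group), data = dat)  # (1 | group) = a random intercept per group
summary(m_ri)
```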
----
- General formulation in matrix notation
$$
\textbf{Y} = \textbf{X}\beta + \textbf{Z}b + \epsilon
$$
where $\textbf{X}$ and $\textbf{Z}$ are the known fixed-effects and random-effects regressor matrices.
- The components of the residual error vector $\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right)$ are assumed to be *i.i.d.* (independent and identically distributed).
- The random-effect components, $b \sim \mathcal{N}\left( 0, \Omega \right)$, are assumed to be normally distributed with mean 0; however, they are not necessarily independent (the components $b_j$ can be correlated, and the correlations can be estimated). Example: $$
\left[ \begin{array}{c} b_0 \\ b_1 \end{array} \right] \sim
\mathcal{N} \left( \left[ \begin{array}{c} 0 \\ 0 \end{array} \right],
\Omega = \left[ \begin{array}{cc}
\mathrm{Var} \left( b_0 \right) & \mathrm{cov} \left( b_0, b_1 \right)\\
\mathrm{cov} \left( b_0, b_1 \right) & \mathrm{Var} \left( b_1 \right)
\end{array} \right] \right)
$$
## Likelihood function
- Parameters ($\beta$, $\sigma^2$ and $\Omega$) are estimated by maximizing the likelihood function, which is the probability of the data, given the parameters (but treated as a function of the parameters, keeping the data fixed).
- The probability of the data _conditional on the random effects_ is integrated with respect to the marginal density of the random effects to obtain the marginal density of the data $$
L\left(\beta,\sigma^2,\Omega \mid \text{data}\right) = \int p \left(\text{data} \mid \beta,\sigma^2, b\right) \, p\left(b \mid \Omega\right) \, db
$$
## ML vs REML
- The $\textsf{R}$ library `lme4` provides two methods for estimating parameters: Maximum Likelihood (ML) and Restricted Maximum Likelihood (REML).
- ML estimates tend to be biased and underestimate the variance components (e.g. $\Omega$).
- REML provides less biased variance estimates: conceptually it can be seen as analogous to Bessel's correction for the sample variance (using $n-1$ instead of $n$ in the denominator). Both methods are available in `lmer()` (see the sketch below).
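A minimal sketch of the two estimation methods in `lme4` (using the `sleepstudy` dataset introduced on the next slides):
```{r, eval=F}
m_reml <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = TRUE)   # REML (the default)
m_ml   <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy, REML = FALSE)  # maximum likelihood
VarCorr(m_reml)  # variance components are typically slightly larger than under ML
VarCorr(m_ml)
```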
## Advantages of multilevel models
- Improved estimates for _repeated sampling_ (e.g. repeated-measures design).
- Improved estimates for _imbalances in sampling_ (e.g. unequal number of trials across subjects).
- Avoid averaging (pre-averaging of data removes variation and can manufacture false confidence).
- Subject-specific standard error is taken into account in group-level estimates.
- Variation among groups or individuals is modelled explicitly.
- Outperform classical methods in predictive ability.
## Example 1 `sleepstudy` {.build}
`sleepstudy` is a dataset in the `lme4` package, containing reaction time data from 18 subjects whose sleep was restricted to 3 hours per night for 10 days.
Questions:
> - how do reaction times change with each sleep-deprived night?
- are individual differences in baseline response times related to individual differences in the effect of sleep deprivation?
```{r}
str(sleepstudy)
```
----
```{r, include=FALSE}
library(ggplot2)
nice_theme <- theme_bw()+theme(text=element_text(family="Helvetica",size=9),panel.border=element_blank(),strip.background = element_rect(fill="white",color="white",size=0),strip.text=element_text(size=rel(0.8)),panel.grid.major.x=element_blank(),panel.grid.major.y=element_blank(),panel.grid.minor=element_blank(),axis.line.x=element_line(size=.4),axis.line.y=element_line(size=.4),axis.text.x=element_text(size=7,color="black"),axis.text.y=element_text(size=7,color="black"),axis.line=element_line(size=.4), axis.ticks=element_line(color="black"))
```
<div class="centered">
```{r, echo=F,fig.height=4,fig.width=7}
ggplot(sleepstudy, aes(x=Days,y=Reaction))+geom_point()+geom_smooth(method="lm",lwd=1)+facet_wrap(~Subject,ncol=9)+nice_theme+scale_x_continuous(breaks=seq(0,10,2))+labs(x="Days of sleep deprivation",y="Mean reaction times [ms]")
```
</div>
----
> - The model we want to fit includes both random intercepts and random slopes $$
y_{ij} = \beta_0 + b_{0j} + \left( \beta_1 + b_{1j} \right) \times {\rm{Days}}_i + \epsilon_i \\
\epsilon \sim \mathcal{N}\left( 0, \sigma^2 \right) \\
b\sim\cal N \left( 0, \Omega\right)
$$
> - In `lme4`
```{r, warning=F}
m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days|Subject), sleepstudy)
```
```{r, include=F}
load("sleepCI.RData")
```
----
```{r}
summary(m.1)
```
----
The function `confint()` makes it easy to compute bootstrapped 95% CIs for the parameters
```{r, eval=FALSE}
CI_fixef <- confint(m.1, method="boot", nsim=500, oldNames=F)
```
```{r}
print(CI_fixef, digits=2)
```
## {.build}
- Tests of fixed and random effects can also be conducted using likelihood ratio tests, by comparing the likelihood of two _nested_ models.
- If $L_1$ and $L_2$ are the maximised likelihoods of two nested models with $k_1 < k_2$ parameters, the test statistic is $2\text{log}\left( L_2/ L_1 \right)$, which is approximately $\chi^2$ with $k_2 - k_1$ degrees of freedom.
- Example: test whether there is substantial variation across subjects in the effect of `Days`
```{r}
# m.1 <- lmer(Reaction ~ 1 + Days + (1 + Days|Subject), sleepstudy)
m.2 <- lmer(Reaction ~ 1 + Days + (1 |Subject), sleepstudy)
anova(m.1, m.2)
```
----
_Shrinkage_: the _predicted_ ${\hat b}_j$ (conditional modes of the random effects) can be seen as a "compromise" between the within-subject estimates and the population mean.
<div class="centered">
```{r, echo=F,fig.height=5.5,fig.width=5}
# analyze shrinking
m.single <- coef(lmList(Reaction ~ Days | Subject, sleepstudy))
par.mixed <- as.matrix(ranef(m.1)$Subject) + repmat(t(as.matrix(fixef(m.1))),18,1)
# plot parameters from mixed model and individual fit
plot(m.single[,1], m.single[,2], xlab="Intercept",ylab="Slope",pch=19,cex.lab=1.2,xlim=c(190,310),ylim=c(-5,25),col="dark grey")
# draw ellipse illustrating covariance of random effects
vcov_m.1 <- matrix(as.vector(VarCorr(m.1)$Subject),ncol=2)
mixtools::ellipse(c(mean(par.mixed[,1]), mean(par.mixed[,2])), sigma=vcov_m.1,alpha=0.05, col="grey", lty=2)
points(mean(m.single[,1]), mean(m.single[,2]),pch=19,col="dark grey",cex=2)
points(mean(par.mixed[,1]), mean(par.mixed[,2]),pch=21,col="black",cex=2,lwd=2)
text(m.single[,1], m.single[,2],labels=rownames(m.single),pos=1,cex=0.6)
points(par.mixed[,1], par.mixed[,2])
arrows(m.single[,1], m.single[,2],par.mixed[,1], par.mixed[,2], length=0.1)
legend("bottomright",c("single-subject fits","multilevel model"),pch=c(19,21),col=c("dark grey", "black"),bty="n",cex=0.6)
```
</div>
<div class="notes">
They are sometimes called BLUPs (_Best Linear Unbiased Predictors_) since they are treated as random variables, not as unknown parameters, and in statistical jargon we say we _predict_ (rather than _estimate_) random variables. A more appropriate term is probably conditional modes of the random effects.
</div>
## Diagnostics
As for any linear model, it is important to check that the residual errors are well-behaved
<div class="centered">
```{r, echo=F,fig.height=4,fig.width=6}
# diagnostic plot: visualize the model residuals
par(mfrow=c(1,2))
#plot(fitted(m.1),resid(m.1), xlab="fitted values", ylab="residuals") # against fitted values if more than one predictor
plot(jitter(sleepstudy$Days),resid(m.1), xlab="Days", ylab="residuals") # here we plot them against the predictor Days
#boxplot(resid(m.1)~sleepstudy$Days, xlab="Days", ylab="residuals")
abline(h=0,lty=2)
qqnorm(resid(m.1))
qqline(resid(m.1),lty=2)
```
</div>
## The _frequentist_ approach
- So far, we've seen multilevel models from a **frequentist** point of view:
    - parameters are seen as unknown _fixed_ quantities.
    - the goal is to find best guesses (point estimates) of the parameters.
    - uncertainty in a parameter's value is not treated probabilistically.
    - uncertainty in the estimates is characterized in relation to the data-generating process (think of the highly un-intuitive definition of a confidence interval).
    - probability is interpreted as the expected _long-run frequency_ of an event in an (imaginary) very large sample.
## The _Bayesian_ approach
- What is different in the Bayesian approach?
    - probability is interpreted as the _degree of belief_ in the occurrence of an event, or in a proposition.
    - parameters are seen as unknown random variables, with uncertainty in their values represented by a _prior_ probability distribution.
    - the goal is to update the prior distribution on the basis of the available data, leading to a _posterior_ probability distribution.
    - this is done with Bayes' theorem; however, the Bayesian approach _is not_ distinguished by Bayes' theorem (which is just a trivial implication of probability theory).
    - _"The essential characteristic of Bayesian methods is their explicit use of probability for quantifying uncertainty in inferences based on statistical analysis"_ (Gelman et al. 2013).
## Bayesian models
- In a typical Bayesian setting we have a likelihood $p\left(\text{data} \mid \theta\right)$, conditional on the parameter $\theta$, and a prior distribution $p\left(\theta\right)$. The posterior distribution for the parameter is $$ p\left(\theta \mid \text{data}\right) = \frac{p\left(\text{data} \mid \theta\right) p\left(\theta\right)}{p\left( \text{data} \right)} $$
## Bayesian models
> - In a typical Bayesian setting we have a likelihood $p\left(\text{data} \mid \theta\right)$, conditional on the parameter $\theta$, and a prior distribution $p\left(\theta\right)$. The posterior distribution for the parameter is $$ p\left(\theta \mid \text{data}\right) = \frac{p\left(\text{data} \mid \theta\right)p\left(\theta\right)}{\int p\left( \text{data} \mid \theta \right) p\left(\theta\right) d\theta}$$
<div class="notes">
This is called averaged likelihood because it is averaged over the prior (also called the evidence or the probability of the data). Also called marginal likelihood because it is marginalised over all possible values of $\theta$ weighted by their probability.
</div>
## Bayesian models
> - In a typical Bayesian setting we have a likelihood $p\left(\text{data} \mid \theta\right)$, conditional on the parameter $\theta$, and a prior distribution $p\left(\theta\right)$. The posterior distribution for the parameter is $$ p\left(\theta \mid \text{data} \right) = \frac{p\left(\text{data} \mid \theta\right)p\left(\theta\right)}{\int p\left( \text{data} \mid \theta \right) p\left(\theta\right) d\theta}\\
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{average likelihood}} $$
<div class="notes">
This is called averaged likelihood because it is averaged over the prior (also called the evidence or the probability of the data). Also called marginal likelihood because it is marginalised over all possible values of $\theta$ weighted by their probability.
</div>
## Bayesian models
> - In a typical Bayesian setting we have a likelihood $p\left(\text{data} \mid \theta\right)$, conditional on the parameter $\theta$, and a prior distribution $p\left(\theta\right)$. The posterior distribution for the parameter is $$ p\left(\theta \mid \text{data}\right) \propto p\left(\text{data} \mid \theta\right)p\left(\theta\right)\\ \text{posterior} \propto \text{likelihood} \times \text{prior} $$
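As a toy numerical illustration of $\text{posterior} \propto \text{likelihood} \times \text{prior}$ (a made-up binomial example computed on a grid, not part of the original slides):
```{r, eval=F}
theta <- seq(0, 1, length.out = 200)                # grid of parameter values
prior <- dbeta(theta, 2, 2)                         # a mildly informative prior
lik   <- dbinom(7, size = 10, prob = theta)         # likelihood: 7 successes in 10 trials
post  <- lik * prior                                # un-normalized posterior
post  <- post / sum(post * (theta[2] - theta[1]))   # normalize numerically
plot(theta, post, type = "l", xlab = expression(theta), ylab = "posterior density")
```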
## Bayesian multilevel models
- In the case of a multilevel model, each cluster $j$ has a parameter $\theta_j$, which is an independent sample from a population distribution governed by some _hyperparameter_ $\phi$.
- $\phi$ is also not known, and thus has its prior distribution $p\left(\phi\right)$ (_hyperprior_).
- We have a joint prior distribution: $p\left(\phi,\theta\right)=p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)$
- and a joint posterior distribution: $$p\left(\phi,\theta \mid \text{data}\right) \propto \underbrace{p\left(\phi\right) \prod_{j=1}^{J} p\left(\theta_j \mid \phi\right)}_\text{prior} \times \underbrace{ p\left(\text{data} \mid \theta_j\right)}_\text{likelihood} $$
## Bayesian parameter inference
- The information we want about a parameter $\phi$ is contained in its marginal posterior distribution $p\left(\phi \mid \text{data} \right)=\int p\left(\phi,\theta \mid \text{data} \right) d\theta$.
- The calculation can be difficult once a model has many parameters (to describe the uncertainty in a parameter of interest one must average or _marginalise_ over the uncertainty in all other parameters).
- Monte Carlo techniques and computer simulations (Markov-Chain Monte Carlo sampling, MCMC) can be used to draw samples from the posterior distribution.
- Once you have samples from the posterior distribution, summarizing the marginal posterior distribution becomes simply a matter of counting values.
## Metropolis algorithm {.build}
Say you want to sample from a density $P({\bf x})$, but you can only evaluate a function $f({\bf x})$ that is proportional to the density $P({\bf x})$:
- Initialization:
    1. choose an arbitrary starting point ${\bf x}_0$
    2. choose a probability density to generate proposals $g({\bf x}_{n+1}|{\bf x}_n)$ (_proposal density_)
- For each iteration $i$:
    1. generate a candidate ${\bf x'}$ by sampling from $g({\bf x}_{i+1}|{\bf x}_{i})$
    2. calculate the _acceptance ratio_ $a = \frac{f({\bf x}')}{f({\bf x}_{i})}$
    3. *accept/reject*: generate a uniform number $u$ in $[0,1]$
        - if $u \le a$, accept and set ${\bf x}_{i+1}={\bf x'}$ (_"move to the new value"_)
        - if $u > a$, reject and set ${\bf x}_{i+1}={\bf x}_{i}$ (_"stay where you are"_)
## MCMC example
Let's use the sampling algorithm to infer parameter values for one participant of the `sleepstudy` dataset.
<div class="centered">
```{r, echo=F, fig.width=2.8, fig.height=3.2}
with(sleepstudy[sleepstudy$Subject=="308",],plot( Days,Reaction,pch=19,main="308",xlim=c(-1,11),ylim=c(200,500)))
```
</div>
- To sample from the posterior distribution of the parameters it is sufficient to compute the un-normalized posterior: $\text{prior} \times \text{likelihood}$
## MCMC example
First, write a function that computes the (log-)likelihood, keeping the data fixed.
```{r}
x <- sleepstudy$Days[sleepstudy$Subject=="308"]
y <- sleepstudy$Reaction[sleepstudy$Subject=="308"]
loglik <- function(par){
  # par = [intercept, slope, log(residual Std.)]
  pred <- par[1] + par[2]*x
  return(sum(dnorm(y, mean = pred, sd = exp(par[3]), log = T)))
}
```
## MCMC example
Next, choose your priors
```{r, echo=F, fig.width=8, fig.height=3}
par(mfrow=c(1,3))
curve(dnorm(x, mean=250, sd=180),from=0, to=1000, xlab="Intercept",ylab="prior density", col="blue")
curve(dnorm(x, mean=20, sd=20),from=-50, to=50, xlab="Slope",ylab="prior density", col="blue")
x_pSD <- seq(-1,6, length.out = 500)
y_pSD <- dnorm(x_pSD , mean=4,sd=1)
plot(exp(x_pSD),y_pSD, type="l", xlab=expression(sigma[epsilon]),ylab="prior density", col="blue")
```
## MCMC example
Next, choose your priors
```{r, echo=F, fig.width=8, fig.height=3}
par(mfrow=c(1,3))
curve(dnorm(x, mean=250, sd=180),from=0, to=1000, xlab="Intercept",ylab="prior density", col="blue")
curve(dnorm(x, mean=20, sd=20),from=-50, to=50, xlab="Slope",ylab="prior density", col="blue")
x_pSD <- seq(-1,6, length.out = 500)
y_pSD <- dnorm(x_pSD , mean=4,sd=1)
plot(x_pSD,y_pSD, type="l", xlab=expression(paste("log ",sigma[epsilon])),ylab="prior density", col="red")
```
## MCMC example
Write a function that computes the log of the joint prior density for a combination of parameter values
```{r}
logprior <- function(par){
  intercept_prior <- dnorm(par[1], mean=250, sd=180, log=T)
  slope_prior <- dnorm(par[2], mean=20, sd=20, log=T)
  sd_prior <- dnorm(par[3], mean=4, sd=1, log=T)
  return(intercept_prior + slope_prior + sd_prior)
}
```
## MCMC example
Put together likelihood and prior to obtain the (log) posterior density
```{r}
logposterior <- function(par){
  return(loglik(par) + logprior(par))
}
```
## MCMC example
Choose an arbitrary starting point and the proposal density $g({\bf x}_{n+1}|{\bf x}_n)$
```{r}
# initial parameters
startvalue <- c(250,20,5)
# proposal density
proposalfunction <- function(par){
  return(rnorm(3, mean = par, sd = c(15,5,0.2)))
}
```
## MCMC example
This function iteratively executes the steps we have seen before (an illustrative run is sketched after the function definition)
```{r}
run_metropolis_MCMC <- function(startvalue, iterations){
  chain <- array(dim = c(iterations+1,3))
  chain[1,] <- startvalue
  for (i in 1:iterations){
    # draw a random proposal
    proposal <- proposalfunction(chain[i,])
    # ratio of posterior density between new and old values
    a <- exp(logposterior(proposal) - logposterior(chain[i,]))
    # accept/reject
    if (runif(1) < a){
      chain[i+1,] <- proposal
    }else{
      chain[i+1,] <- chain[i,]
    }
  }
  return(chain)
}
```
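An illustrative run of the sampler (the following slides use a chain computed in advance; results will vary slightly with the random seed):
```{r, eval=F}
set.seed(3)
chain <- run_metropolis_MCMC(startvalue, iterations = 20000)
burnIn <- 5000                                             # discard initial samples
acceptance <- 1 - mean(duplicated(chain[-(1:burnIn), ]))   # proportion of accepted proposals
acceptance
```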
## MCMC example
```{r, echo=F, fig.width=8, fig.height=5}
#set.seed(3)
#chain <- run_metropolis_MCMC(startvalue, 20000)
# #save(chain, file="chain_ex.RData")
load("chain_ex.RData")
burnIn <- 5000
acceptance <- 1-mean(duplicated(chain[-(1:burnIn),]))
LSfit <- lm(y~x)
interceptLS <- coef(LSfit)[1]
slopeLS <- coef(LSfit)[2]
sigmaLS <- summary(LSfit)$sigma
par(mfrow = c(2,3))
hist(chain[-(1:burnIn),1],main="Intercept",border="white",col="dark grey", breaks=20)
abline(v = interceptLS, col="red",lwd=2)
hist(chain[-(1:burnIn),2],main="Slope",border="white",col="dark grey", breaks=20)
abline(v = slopeLS , col="red",lwd=2)
hist(exp(chain[-(1:burnIn),3]),main=expression(sigma[epsilon]),border="white",col="dark grey", breaks=20)
abline(v = sigmaLS, col="red" ,lwd=2)
plot(chain[-(1:burnIn),1], type = "l", main = "Chain values")
abline(h = interceptLS, col="red",lwd=2)
plot(chain[-(1:burnIn),2], type = "l" , main = "Chain values")
abline(h = slopeLS, col="red",lwd=2)
plot(exp(chain[-(1:burnIn),3]), type = "l" , main = "Chain values")
abline(h = sigmaLS, col="red",lwd=2)
```
## MCMC example
Once we have samples from the posterior distribution, summarising uncertainty is simply a matter of counting values. Here I calculate a 95% Bayesian credible interval for the slope.
```{r}
# remove initial 'burn in' samples
burnIn <- 5000
slope_samples <- chain[-(1:burnIn),2]
# mean of posterior distribution
mean(slope_samples)
# 95% Bayesian credible interval
alpha <- 0.05
round(c(quantile(slope_samples, probs = alpha/2),quantile(slope_samples, probs = 1-alpha/2)),digits=2)
```
----
- The Metropolis algorithm is the grandparent of modern MCMC techniques for drawing samples from unknown posterior distributions
- Its proposals are random, which can make it very inefficient, especially as the complexity of the model (and thus of the posterior) increases
- Modern algorithms use cleverer proposals to get a better picture of the posterior distribution with fewer samples
- A state-of-the-art algorithm is Hamiltonian Monte Carlo (HMC), which is used by the free software Stan ([mc-stan.org](http://mc-stan.org/))
## Stan
- Stan (named after Stanislaw Ulam, 1909-1984, co-inventor of MCMC methods) is a probabilistic programming language that enables users to easily perform Bayesian inference on complex models.
- Once a model is defined, Stan compiles a C++ program that uses HMC to draw samples from the posterior density
- Stan is free and open-source, has comprehensive documentation, and offers interfaces for the most popular computing environments (R, Matlab, Python, Mathematica)
- We shall see how to build a Bayesian multilevel model in Stan using again the `sleepstudy` dataset
- I will use Stan's R interface (`RStan`, see installation instructions at [github.com/stan-dev/rstan/wiki/RStan-Getting-Started](https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started))
## Bayesian LMM {.build}
- The model can be expressed as $$
y_{j} \sim \text{Normal}\left( \beta_0 + b_{0j} + \left( \beta_1 + b_{1j} \right){\rm{Days}}, \,\, \sigma_{\epsilon}^2 \right)\\
b \sim \text{Normal}\left( 0, \,\, \Omega \right)\\
$$
- Priors$$
\beta_0 \sim \text{Normal} \left(0.3,1 \right)\\
\beta_1 \sim \text{Normal} \left(0.01,0.5 \right)\\
\sigma_{\epsilon} \sim \text{HalfCauchy} \left(0, 0.5\right)\\
$$
## Bayesian LMM
> - For efficiency and numerical stability, the matrix $\Omega$ is usually parametrized by a vector of standard deviations $\sigma_0, \sigma_1$ and a correlation matrix $\textbf{R}$
$$
\Omega = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{R} \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right)
$$
## Variance-covariance and correlation matrices
$$
\begin{align*}
\Omega & = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{R} \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right)\\
& = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \\
& = \left( \begin{array}{cc} \sigma_0 & \rho\sigma_0 \\ \rho\sigma_1 & \sigma_1 \end{array} \right) \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \\
& = \left( \begin{array}{cc} \sigma_0^2 & \rho\sigma_0\sigma_1 \\ \rho\sigma_0\sigma_1 & \sigma_1^2 \end{array} \right)\\
\end{align*}
$$
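A quick numerical check of this identity in R, with made-up values for the standard deviations and the correlation:
```{r, eval=F}
sigma <- c(2, 0.5)                       # hypothetical standard deviations
rho   <- -0.3                            # hypothetical correlation
R <- matrix(c(1, rho, rho, 1), nrow = 2)
diag(sigma) %*% R %*% diag(sigma)
# same result as writing the variance-covariance matrix directly:
matrix(c(sigma[1]^2, rho * prod(sigma), rho * prod(sigma), sigma[2]^2), nrow = 2)
```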
## Bayesian LMM
- Hyperprior$$
\Omega = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{R} \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right)\\
{\bf \sigma}_{\Omega} \sim \text{HalfCauchy} \left(0, 0.5\right)\\
\textbf{R} \sim \text{LKJcorr} \left( 2 \right)\\
$$
## Prior for correlation matrix
The LKJ prior (Lewandowski, Kurowicka, & Joe, 2009) is a prior distribution over correlation matrices; it has a single parameter that controls how sceptical the prior is about extreme correlation values.
<div class="centered">
```{r, echo=F, fig.width=4, fig.height=3}
par(xpd=TRUE)
R <- rethinking::rlkjcorr(1e5, K=2, eta=6)
rethinking::dens(R[,1,2], xlab=expression(paste("correlation [",rho,"]")),col="red",bty="L",xlim=c(-1.05,1.05))
R <- rethinking::rlkjcorr(1e5, K=2, eta=3)
rethinking::dens(R[,1,2],col="dark grey", add=T)
R <- rethinking::rlkjcorr(1e5, K=2, eta=1.5)
rethinking::dens(R[,1,2],col="blue", add=T)
R <- rethinking::rlkjcorr(1e5, K=2, eta=1)
rethinking::dens(R[,1,2],col="dark green", add=T)
legend("topleft",legend=c(6,3,1.5,1),lwd=1,col=c("red","dark grey","blue","dark green"),bty="n",inset=c(0.75,0))
```
</div>
## Coding the model in Stan language {.build}
- Stan models are typically specified in separate text files with `.stan` extension
- Model specification consists of several blocks, e.g.
    - `data`: declaration of all dependent/independent variables
    - `parameters`: declare the free parameters of the model
    - `model`: define likelihood function and priors
- Other useful block types:
    - `transformed parameters`: apply any transformation to the parameters, before computing the likelihood
    - `generated quantities`: derive any quantities of interest based on data or parameters
## `data`
```
data {
  int<lower=1> N;                  // number of observations
  real RT[N];                      // reaction time (transformed in seconds)
  int<lower=0,upper=9> Days[N];    // predictor
  int<lower=1> J;                  // number of subjects
  int<lower=1,upper=J> Subject[N]; // subject id
}
```
RStan requires the dataset to be organized as a list object
```{R}
d_stan <- list(Subject=as.numeric(factor(sleepstudy$Subject,
                 labels=1:length(unique(sleepstudy$Subject)))),
               Days=sleepstudy$Days,
               RT = sleepstudy$Reaction/1000,
               N=nrow(sleepstudy),
               J=length(unique(sleepstudy$Subject)) )
```
## `parameters`
```
parameters {
  vector[2] beta;              // fixed-effects parameters
  real<lower=0> sigma_e;       // residual std
  vector<lower=0>[2] sigma_u;  // random-effects standard deviations
  cholesky_factor_corr[2] L_u; // L_u is the Cholesky factor of the correlation matrix
  matrix[2,J] z_u;             // random-effects matrix
}
transformed parameters {
  matrix[2,J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u; // use Cholesky to set correlation
}
----
For efficiency and numerical stability (see Stan manual) the correlation matrix $\textbf{R}$ is parametrized by its Cholesky factor $\textbf{L}$, which can be seen as the square root of $\textbf{R}$, i.e.
$$\textbf{L}\textbf{L}^T=\textbf{R} = \left( \begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array} \right) $$
If $\textbf{Z}_{\text{uncorr}}$ is a $2 \times n$ matrix whose rows are 2 samples from uncorrelated random variables (with unit variance), then
$$
\textbf{Z}_{\text{corr}} = \left( \begin{array}{cc} \sigma_0 & 0 \\ 0 & \sigma_1 \end{array} \right) \textbf{L} \, \textbf{Z}_{\text{uncorr}}
$$
has the desired variances $\sigma_0^2,\sigma_1^2$ and correlation $\rho$ ($\textbf{L}$ is the lower-triangular Cholesky factor).
----
Why does that work? Say the two uncorrelated samples have zero mean and unit variance; then their covariance matrix is the identity matrix, $\mathbb{E}\left( Z Z^T \right) - \mathbb{E}\left( Z \right) \mathbb{E}\left(Z \right)^T = \mathbb{E} \left( Z Z^T \right) = I$, which is also their correlation matrix.
Say the desired correlation matrix is $\textbf{R}$; since it is symmetric and positive definite, it is possible to obtain the Cholesky factorization $L{L^T} = \textbf{R}$.
If we transform the samples with $X=LZ$, the new covariance (correlation) matrix is
$$
\begin{align}
\mathbb{E} \left(XX^T\right) &= \mathbb{E} \left((LZ)(LZ)^T \right) \\
&= \mathbb{E} \left(LZ Z^T L^T\right) \\
&= L \mathbb{E} \left(ZZ^T \right) L^T \\
&= LIL^T = LL^T = \textbf{R} \\
\end{align}
$$
The third step is justified because expectation is a linear operator, so $\mathbb{E}(cX) = c\mathbb{E}(X)$ for a constant $c$; also $(AB)^T = B^T A^T$, i.e. transposition reverses the order of the factors.
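The same trick can be checked numerically in R (a sketch with made-up values; `diag(sigma_u) %*% L` plays the role of `diag_pre_multiply(sigma_u, L_u)` in the Stan code):
```{r, eval=F}
sigma_u <- c(0.05, 0.01)                 # hypothetical SDs of random intercepts and slopes
rho <- 0.3                               # hypothetical correlation
R <- matrix(c(1, rho, rho, 1), nrow = 2)
L <- t(chol(R))                          # chol() returns the upper factor; transpose to get L
z_u <- matrix(rnorm(2 * 1e5), nrow = 2)  # uncorrelated standard normal samples
u <- diag(sigma_u) %*% L %*% z_u         # correlated random effects
cor(u[1, ], u[2, ])                      # close to rho
apply(u, 1, sd)                          # close to sigma_u
```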
## `model`
```
model {
  real mu; // conditional mean of the dependent variable
  //priors
  L_u ~ lkj_corr_cholesky(2);   // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0,1); // before Cholesky, random effects are normal variates with SD=1
  sigma_u ~ cauchy(0, 0.5);     // SD of random effects (vectorized)
  sigma_e ~ cauchy(0, 0.5);     // prior for residual standard deviation
  beta[1] ~ normal(0.3, 1);     // prior for fixed-effect intercept
  beta[2] ~ normal(0.01, 0.5);  // prior for fixed-effect slope
  //likelihood
  for (i in 1:N){
    mu = beta[1] + u[1,Subject[i]] + (beta[2] + u[2,Subject[i]])*Days[i];
    RT[i] ~ normal(mu, sigma_e);
  }
}
```
## Running the model
> - Run sampling through RStan
```{r, eval=F}
library(rstan)
options(mc.cores = parallel::detectCores()) # tell Stan to use multiple cores if available
sleep_model <- stan(file = "sleep_model_v2.stan", data = d_stan, iter = 2000, chains = 4)
```
- In case of warnings check 'A brief guide to Stan's warnings' for advice (http://mc-stan.org/misc/warnings.html)
```{r, echo=F}
sleep_model <- readRDS("sleep_fit_ppc.rds")
```
## Evaluating convergence
> - Visualize chains to check for convergence
```{r, fig.height=4,fig.width=8}
traceplot(sleep_model, pars = c("beta","sigma_e","sigma_u"), inc_warmup = F)
```
## Evaluating convergence
> - Check that $\hat R \approx 1\pm0.1$ (essentially, $\hat R$ compares the between-chain variance to the within-chain variance; a programmatic check is sketched below)
```{r}
print(sleep_model, pars = c("beta","sigma_e"), probs = c(0.025, 0.975),digits=3)
```
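The $\hat R$ values can also be extracted programmatically from the fit summary (a minimal sketch):
```{r, eval=F}
fit_summary <- summary(sleep_model, pars = c("beta", "sigma_e", "sigma_u"))$summary
round(fit_summary[, "Rhat"], 3)  # values close to 1 indicate convergence
```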
----
Joint posterior distributions of the model parameters.
```{r, eval=F}
pairs(sleep_model, pars = c("beta","sigma_e","sigma_u"))
```
<div class="centered">
```{r, echo=F,fig.height=5,fig.width=5}
pairs(sleep_model, pars = c("beta","sigma_e","sigma_u"))
```
</div>
----
Examine random effects correlation.
```{r}
# Use the L matrices to compute the correlation matrices
L_u <- extract(sleep_model, pars = "L_u")$L_u
cor_u <- apply(L_u, 1, function(x) tcrossprod(x)[1, 2])
print(signif(quantile(cor_u, probs = c(0.025, 0.5, 0.975))))
```
## Posterior predictive check
> - _'simulating replicated data under the fitted model and then comparing these to the observed data'_ (Gelman and Hill, 2007).
```
generated quantities {
  real y_rep[N];
  matrix[2,J] u_rep; // matrix of simulated random effects
  for (j in 1:J) {
    u_rep[1:2,j] = multi_normal_cholesky_rng(rep_vector(0, 2),
                                             diag_matrix(sigma_u) * (L_u));
  }
  for (i in 1:N){ // simulate the model
    y_rep[i] = normal_rng(beta[1] + u_rep[1,Subject[i]] +
                          (beta[2] + u_rep[2,Subject[i]])*Days[i],
                          sigma_e);
  }
}
```
## Posterior predictive check
> - If the model fits, then replicated data generated under the model should look similar to
observed data.
- Generating random observations from new simulated observers allows approximating the _posterior predictive distribution_, $p \left( y^{\text{rep}} \mid y \right) = \int p \left( y^{\text{rep}} \mid \theta \right) p\left(\theta \mid y\right) d \theta$
<div class="centered">
```{r, echo=F,fig.width=4,fig.height=4}
ndag <- readRDS("sleep_ppc_CI.rds")
plot(sleepstudy$Days, sleepstudy$Reaction/1000, xlab="Days",ylab="Reaction",ylim=c(0.17,0.5),type="n")
polygon(c(ndag$Days,rev(ndag$Days)), c(ndag$ci_ub,rev(ndag$ci_lb)), col="skyblue",border=NA)
points(jitter(sleepstudy$Days,factor=0.4), sleepstudy$Reaction/1000, pch=19,col=rgb(0,0,0,0.5),cex=1.2)
```
</div>
----
Comparison with `lme4` random effects estimates.
```{r, eval=F}
# extract samples
samples_rf0 <- extract(sleep_model, pars="u")$u[,1,]
samples_rf1 <- extract(sleep_model, pars="u")$u[,2,]
# compute posterior mean
rf_int <- apply(samples_rf0, 2, mean)
rf_slp <- apply(samples_rf1, 2, mean)
par(mfrow=c(1,2)) # plot
plot(rf_int, ranef(m.1)$Subject[,1],pch=19,xlab="Stan", ylab="lme4",main="Intercept")
plot(rf_slp, ranef(m.1)$Subject[,2],pch=19,xlab="Stan", ylab="lme4",main="Slope")
```
<div class="centered">
```{r, echo=F, fig.height=2.75,fig.width=5}
rf_int <- apply(extract(sleep_model, pars="u")$u[,1,], 2, mean)
rf_slp <- apply(extract(sleep_model, pars="u")$u[,2,], 2, mean)
par(mfrow=c(1,2),cex=0.7)
plot(rf_int, ranef(m.1)$Subject[,1],pch=19,xlab="Stan", ylab="lme4",main="Intercept")
abline(a=0,b=1000,lty=2)
plot(rf_slp, ranef(m.1)$Subject[,2],pch=19,xlab="Stan", ylab="lme4",main="Slope")
abline(a=0,b=1000,lty=2)
```
</div>
## `rstanarm`
>- It is not strictly necessary to write code from scratch for standard models
- The package `rstanarm` automatically prepares Stan code from a model specified in standard R syntax, e.g.
```{r, eval=F}
library(rstanarm)
sleepstudy$Reaction_s <- sleepstudy$Reaction/1000
sleep_model_2 <- stan_lmer(Reaction_s ~ Days + (Days|Subject),
                           data = sleepstudy,
                           cores = 2,
                           iter = 2000,
                           chains = 4)
```
```{r, echo=F}
sleep_model_2 <- readRDS("sleep_fit_rstanarm.rds")
```
## `rstanarm`
- Priors can be assigned explicitly (see the function `prior_summary()` to examine the defaults, as sketched below)
- Another option is the package `brms` (Bürkner, 2017)
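For example, assuming `rstanarm` is loaded, the priors actually used by `stan_lmer()` can be inspected with:
```{r, eval=F}
prior_summary(sleep_model_2)  # shows the (possibly rescaled) default priors
```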
----
```{r}
summary(sleep_model_2, pars=c("(Intercept)","Days"),digits=3)
```
## Interim summary
> - Multilevel models are characterised by more than one source of "random" variation
- The distinguishing feature of Bayesian methods is the explicit use of probability distributions to quantify the uncertainty about unknown quantities
- Using samples to approximate posterior distributions makes it much easier to summarize the results
- Bayesian models estimated by MCMC sampling do not produce convergence warnings of the kind `lme4` gives, but it is important to check that sampling was successful (e.g. $\hat R$, ...)
## MCMC: taming a wild chain
> - When the posterior density has broad, flat regions, chains can wander erratically
```{r, eval=F}
dat <- list(y=c(-1,1))
stan_code <- "
data{
  real y[2];
}
parameters{
  real mu;
  real<lower=0> sigma;
}
model{
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code=stan_code, data = dat, iter = 2000, chains = 4)
```
```{r, echo=F}
m_ <- readRDS("wildchain1.rds")
```
## MCMC: taming a wild chain
> - When the posterior density has broad, flat regions, chains can wander erratically
<div class="centered">
```{r, fig.height=3.5,fig.width=6}
rstan::traceplot(m_, pars = c("mu","sigma"), inc_warmup = F)
```
</div>
## MCMC: taming a wild chain
> - Weakly informative priors can gently guide the Markov chain toward reasonable parameter values
```{r, eval=F}
dat <- list(y=c(-1,1))
stan_code <- "
data{
  real y[2];
}
parameters{
  real mu;
  real<lower=0> sigma;
}
model{
  mu ~ normal(1, 10);
  sigma ~ cauchy(0, 1);
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code=stan_code, data = dat, iter = 2000, chains = 4)
```
```{r, echo=F}
m_ <- readRDS("wildchain2.rds")
```
## MCMC: taming a wild chain
> - Weakly informative priors can gently guide the Markov chain toward reasonable parameter values
<div class="centered">
```{r, fig.height=3.5,fig.width=6}
rstan::traceplot(m_, pars = c("mu","sigma"), inc_warmup = F)
```
</div>
## MCMC: non-identifiable parameters
> - Markov chains run into problems if parameters are non-identifiable (e.g. highly correlated predictors)
```{r, eval=F}
dat <- list(y=rnorm(100, mean=0, sd=1))
stan_code <- "
data{
  real y[100];
}
parameters{
  vector[2] alpha;
  real<lower=0> sigma;
}
model{
  real mu;
  mu = alpha[1] + alpha[2];
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code=stan_code, data = dat, iter = 4000, chains = 2)
```
```{r, echo=F}
m_ <- readRDS("unidentified1.rds")
```
## MCMC: non-identifiable parameters
> - Markov chains run into problems if parameters are non-identifiable (e.g. highly correlated predictors)
<div class="centered">
```{r, fig.height=3.5,fig.width=7}
rstan::traceplot(m_, pars = c("alpha","sigma"), inc_warmup = F)
```
</div>
## MCMC: non-identifiable parameters
> - Weakly informative priors can also help in this case
```{r, eval=F}
dat <- list(y=rnorm(100, mean=0, sd=1))
stan_code <- "
data{
  real y[100];
}
parameters{
  vector[2] alpha;
  real<lower=0> sigma;
}
model{
  real mu;
  alpha ~ normal(0, 10);
  mu = alpha[1] + alpha[2];
  y ~ normal(mu, sigma);
}
"
m_ <- stan(model_code=stan_code, data = dat, iter = 4000, chains = 2)
```
```{r, echo=F}
m_ <- readRDS("unidentified2.rds")
```
## MCMC: non-identifiable parameters
> - Weakly informative priors can also help in this case.
<div class="centered">
```{r, fig.height=3.5,fig.width=7}
rstan::traceplot(m_, pars = c("alpha","sigma"), inc_warmup = F)
```
</div>
## Priors {.build}
- Priors express the initial belief about the plausibility of possible parameter values.