---
title-block-banner: true
title: "Do not overthink 'outliers': use a Student's t-distribution instead"
date: today
date-format: full
author:
  - name: "Daniel Manrique-Castano"
    orcid: 0000-0002-1912-1764
    affiliation: Université Laval (Laboratory of neurovascular interactions)
keywords:
  - Student's t-distribution
  - Bayesian modeling
  - brms
license: "CC BY"
format:
  pdf:
    toc: true
    number-sections: true
    colorlinks: true
  html:
    code-fold: true
    embed-resources: true
    toc: true
    toc-depth: 2
    toc-location: left
    number-sections: true
    theme: spacelab
knitr:
  opts_chunk:
    warning: false
    message: false
csl: science.csl
bibliography: References.bib
---
For many researchers, outliers are rogue waves that can dramatically alter the course of the analysis or "confound" some expected effects. I prefer to use the term "extreme observations" and leave the term outlier for observations that are not truly part of the population being studied. For example, in my field (brain ischemia research), an outlier is an animal that does not have ischemia (when it should have), while extreme observations are animals with small or large ischemias that are very different from the others.
Traditional (frequentist) statistical models are built on the strong foundation of Gaussian distributions. This carries a significant limitation: an inherent assumption that all data points cluster around a central mean in a predictable pattern (based on the central limit theorem). This may be true in Plato's world of ideas, but we scientists in the biomedical field know how challenging it is to rely on this assumption given the limited sampling (number of animals) we have available to make observations.
Gaussian distributions are very sensitive to extreme observations, and their use leads scientists to believe that eliminating extreme observations is the best way to get "clearer" or "cleaner" results (whatever that means). As I once commented in an article as reviewer 2: "The problem is not the extreme observations that may 'hide' your effects, but the fact that you are using a statistical model that (I believe) is inappropriate for your purposes." It should be noted that no statistical model is the "right" or "appropriate" one, but we can estimate that, given the data, certain statistical models are more likely to have generated the observed data (generative models) than others.
Fortunately, nothing forces us to be bound by the assumptions of Gaussian models, right? We have other options, such as the **Student's t-distribution** [@ahsanullah2014]. I see it as a more adaptable vessel to navigate the turbulent seas of real-world biomedical data. The Student's t-distribution provides a robust alternative that acknowledges our data may contain extreme observations that are normal biological responses we can expect in any context. There may be patients or animals that do not respond, or that overreact, to treatment, and it is valuable that our modeling approach recognizes these responses as part of the spectrum. Therefore, this tutorial explores modeling strategies based on the Student's t-distribution through the lens of the **`brms`** package for R [@brms]---a powerful ally for Bayesian modeling.
# What's behind a Student's t-distribution?
A [Student's t-distribution](https://mathworld.wolfram.com/Studentst-Distribution.html) is nothing more than a Gaussian distribution with heavier tails. In other words, the Gaussian distribution is a special case of the Student's t-distribution. The Gaussian distribution is defined by the mean (μ) and the standard deviation (σ). The Student's t-distribution, on the other hand, adds an additional parameter, the degrees of freedom (df), which controls the "thickness" of the tails and assigns greater probability to events far from the mean. This feature is particularly useful for small sample sizes, such as in biomedicine, where the assumption of normality is questionable. Note that as the degrees of freedom increase, the Student's t-distribution approaches the Gaussian distribution.
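For reference, the density of the standardized Student's t-distribution is:

$$
f(x \mid \nu) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu \pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}
$$

where $\nu$ (the df above) denotes the degrees of freedom. As $\nu \to \infty$, this density converges to the standard Gaussian, and a location-scale version replaces $x$ with $(x - \mu)/\sigma$. We can visualize this convergence using density plots: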
```{r}
#| label: fig-Fig1
#| include: true
#| warning: false
#| message: false
#| fig-cap: Comparison of Gaussian and Student t-Distributions with different degrees of freedom.
#| fig-height: 5
#| fig-width: 6
# Load necessary libraries
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Define the distributions
x <- seq(-4, 4, length.out = 200)
y_gaussian <- dnorm(x)
y_t3 <- dt(x, df = 3)
y_t10 <- dt(x, df = 10)
y_t30 <- dt(x, df = 30)
# Create a data frame for plotting
df <- data.frame(x, y_gaussian, y_t3, y_t10, y_t30)
# Plot the distributions
ggplot(df, aes(x)) +
geom_line(aes(y = y_gaussian, color = "Gaussian")) +
geom_line(aes(y = y_t3, color = "t, df=3")) +
geom_line(aes(y = y_t10, color = "t, df=10")) +
geom_line(aes(y = y_t30, color = "t, df=30")) +
labs(title = "Comparison of Gaussian and Student t-Distributions",
x = "Value",
y = "Density") +
scale_color_manual(values = c("Gaussian" = "blue", "t, df=3" = "red", "t, df=10" = "green", "t, df=30" = "purple")) +
theme_classic()
```
Note in @fig-Fig1 that the peak around the mean gets lower as the degrees of freedom decrease, because probability mass moves to the thicker tails. This property is what gives the Student's t-distribution its reduced sensitivity to outliers. For more details on this matter, you can check [this](https://online.stat.psu.edu/stat414/lesson/26/26.4) resource.
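To make the heavy tails concrete, here is a quick tail-probability check (a small illustrative snippet; the values are approximate):

```{r}
# Probability of an observation falling more than 3 units from the mean:
# the t-distribution with 3 df places roughly 20 times more mass in the tails
2 * pnorm(-3)       # Gaussian: ~0.0027
2 * pt(-3, df = 3)  # Student's t (df = 3): ~0.058
```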
## Load packages and themes
We load the required libraries and create a visual theme for our plots.
```{r}
#| label: LoadPack
#| include: true
#| warning: false
#| message: false
library(ggplot2)
library(brms)
library(ggdist)
library(easystats)
library(dplyr)
library(tibble)
library(ghibli)
Plot_theme <- theme_classic() +
theme(
plot.title = element_text(size=18, hjust = 0.5, face="bold"),
plot.subtitle = element_text(size = 10, color = "black"),
plot.caption = element_text(size = 12, color = "black"),
axis.line = element_line(colour = "black", size = 1.5, linetype = "solid"),
axis.ticks.length=unit(7,"pt"),
axis.title.x = element_text(colour = "black", size = 16),
axis.text.x = element_text(colour = "black", size = 16, angle = 0, hjust = 0.5),
axis.ticks.x = element_line(colour = "black", size = 1),
axis.title.y = element_text(colour = "black", size = 16),
axis.text.y = element_text(colour = "black", size = 16),
axis.ticks.y = element_line(colour = "black", size = 1),
legend.position="right",
legend.direction="vertical",
legend.title = element_text(colour="black", face="bold", size=12),
legend.text = element_text(colour="black", size=10),
plot.margin = margin(t = 10, # Top margin
r = 2, # Right margin
b = 10, # Bottom margin
l = 10) # Left margin
)
```
# Exploratory data visualization
So, let's skip data simulations and get serious. We'll work with real data I acquired from mice performing the rotarod test.
## Load the dataset
First, we load the dataset into our environment and set the corresponding factor levels.
```{r}
#| label: LoadData
#| include: true
#| warning: false
#| message: false
#| column: margin
data <- read.csv("Data/Rotarod.csv")
data$Day <- factor(data$Day, levels = c("1", "2"))
data$Genotype <- factor(data$Genotype, levels = c("WT", "KO"))
head(data)
```
The dataset contains IDs for the animals, a grouping variable (Genotype), an indicator for the two days on which the test was performed (Day), and different trials for the same day. For this article, we model only one of the trials (Trial3). We will save the other trials for a future article on modeling variation.
As the data handling implies, our modeling strategy will be based on Genotype and Day as categorical predictors of the distribution of `Trial3`. In biomedical science, categorical predictors, or grouping factors, are more common than continuous predictors: scientists in this field like to divide their samples into groups or conditions and apply different treatments.
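As a quick sanity check, we can tabulate the number of observations per design cell (a small optional snippet):

```{r}
# Count observations in each Genotype-by-Day cell
xtabs(~ Genotype + Day, data = data)
```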
## Plot the data
Let's have an initial view of the data using **Raincloud plots**, as shown by [\@amorimfranchi](https://medium.com/@amorimfranchi) in [this](https://medium.com/@amorimfranchi/raincloud-plots-for-clear-precise-and-efficient-data-communication-4c71d0a37c23#:~:text=Raincloud%20plots%20for%20clear%2C%20precise%20and%20efficient%20data%20communication,-Guilherme%20A.&text=Raw%20data%20visualization%20involves%20presenting,quality%20assessment%20of%20your%20data.) great blog post.
```{r}
#| label: fig-Fig2
#| include: true
#| warning: false
#| message: false
#| fig-cap: Exploratory data visualization.
#| fig-height: 5
#| fig-width: 6
edv <- ggplot(data, aes(x = Day, y = Trial3, fill=Genotype)) +
scale_fill_ghibli_d("SpiritedMedium", direction = -1) +
geom_boxplot(width = 0.1,
outlier.color = "red") +
xlab('Day') +
ylab('Time (s)') +
ggtitle("Rorarod performance") +
theme_classic(base_size=18, base_family="serif")+
theme(text = element_text(size=18),
axis.text.x = element_text(angle=0, hjust=.1, vjust = 0.5, color = "black"),
axis.text.y = element_text(color = "black"),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5),
legend.position="bottom")+
scale_y_continuous(breaks = seq(0, 100, by=20),
limits=c(0,100)) +
# Line below adds dot plots from {ggdist} package
stat_dots(side = "left",
justification = 1.12,
binwidth = 1.9) +
# Line below adds half-violin from {ggdist} package
stat_halfeye(adjust = .5,
width = .6,
justification = -.2,
.width = 0,
point_colour = NA)
edv
```
@fig-Fig2 looks different from the original by [\@amorimfranchi](https://medium.com/@amorimfranchi) because we are plotting two factors instead of one, but the nature of the plot is the same. Pay attention to the red dots: these are the observations that can be considered extreme, and they tilt the measures of central tendency (especially the mean) in one direction. We also observe that the variances differ between groups, so modeling sigma as well can give better estimates. Our task now is to model the outcome using the `brms` package.
# Fitting statistical models with brms
Here we fit our model with `Day` and `Genotype` as interacting categorical predictors of the distribution of `Trial3`. Let's first fit a typical Gaussian model, which is analogous to an ordinary least squares (OLS) model from the frequentist framework, since we are using the default flat `brms` [priors](https://paul-buerkner.github.io/brms/reference/set_prior.html). Priors are beyond the scope of this article, but I promise we'll cover them in a future blog.
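That said, if you are curious which priors `brms` assigns by default for this formula, you can inspect them before fitting (a quick optional check):

```{r}
# Show the default priors brms would use for this model
get_prior(Trial3 ~ Day * Genotype, data = data, family = gaussian())
```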
Once we have results from the Gaussian model, we can compare them to those from the Student's t model. We then add `sigma` to the equation to account for the differences in the variance of the data.
## Fitting a "typical" (frequentists) model in Gaussian land
Our Gaussian model is built under the typical (and often incorrect) assumption of homoscedasticity [@yang2019]. In other words, we assume that all groups have the same (or a very similar) variance. As a researcher, I do not recall ever seeing this in real data.
```{r}
#| label: GaussianFit
#| include: true
#| warning: false
#| message: false
#| results: false
Gaussian_Fit1 <- brm(Trial3 ~ Day * Genotype,
data = data,
family = gaussian(),
# seed for reproducibility purposes
seed = 8807,
control = list(adapt_delta = 0.99),
# this is to save the model in my laptop
file = "Models/20240222_OutliersStudent-t/Gaussian_Fit1.rds",
file_refit = "never")
# Add loo for model comparison
Gaussian_Fit1 <-
add_criterion(Gaussian_Fit1, c("loo", "waic", "bayes_R2"))
```
### Model diagnostics
Before proceeding, it's a good idea to do some simple model diagnostics to compare the actual observations with the predictions made by our model. We can do this in several ways, but the most common is to plot full densities. We can achieve this using the `pp_check` function from `brms`.
```{r}
#| label: fig-GaussianDiag1
#| include: true
#| warning: false
#| message: false
#| fig-cap: Diagnostics for the Gaussian model
#| fig-height: 4
#| fig-width: 5
set.seed(8807)
pp_check(Gaussian_Fit1, ndraws = 100) +
labs(title = "Gaussian model") +
theme_classic()
```
@fig-GaussianDiag1 suggests that our observations (dark blue) are not meaningfully different from the model predictions. Below, I leave you with additional code to check other `pp_check` alternatives with their respective graphs.
```{r}
#| label: fig-GaussianDiag2
#| include: true
#| warning: false
#| message: false
#| fig-height: 5
#| fig-width: 5
#| column: screen-inset-shaded
#| layout-nrow: 1
set.seed(88071)
pp_check(Gaussian_Fit1, group = "Genotype", type = "dens_overlay_grouped", ndraws = 100) +
labs(title = "Density by Genotype") +
theme_classic()
pp_check(Gaussian_Fit1, type = "stat_grouped", group = "Genotype", stat = "var", binwidth = 3) +
coord_cartesian(xlim = c(0, 300)) +
ggtitle("Grouped variance") +
theme_classic()
pp_check(Gaussian_Fit1, type = "stat", stat = "var", binwidth = 3) +
coord_cartesian(xlim = c(0, 600)) +
ggtitle("How well we captured the variace") +
theme_classic()
pp_check(Gaussian_Fit1, type = "stat", stat = "mean", binwidth = 2) +
coord_cartesian(xlim = c(0, 50)) +
ggtitle("How well we captured the mean") +
theme_classic()
```
### Checking the results for the Gaussian distribution
Now, we use the `describe_posterior` function from the `bayestestR` package [@bayestestR] to see the results:
```{r}
#| label: Gaussian_Posterior
#| include: true
#| warning: false
#| message: false
describe_posterior(Gaussian_Fit1,
centrality = "mean",
dispersion = TRUE,
ci_method = "HDI",
test = "rope",
)
```
Let's focus here on the 'Intercept', which is the value for WT animals on day 1, and 'GenotypeKO', the estimated difference for KO animals at the same time point. We see that WT animals spend about 37 seconds on the rotarod, while their KO counterparts spend less than a second (0.54) more. As a researcher in this field, I can say that this difference is meaningless and that genotype has no effect on rotarod performance. Even the effect of day, which is 2.9, seems meaningless to me under this model. We can easily visualize these estimates using the wonderful `conditional_effects` function from `brms`.
```{r}
#| label: fig-GaussianEff
#| include: true
#| warning: false
#| message: false
#| fig-cap: Conditional effects for the Gaussian model
#| fig-height: 5
#| fig-width: 7
# We create the graph for convex hull
Gaussian_CondEffects <-
conditional_effects(Gaussian_Fit1)
Gaussian_CondEffects <- plot(Gaussian_CondEffects,
plot = FALSE)[[3]]
Gaussian_CondEffects +
geom_point(data=data, aes(x = Day, y = Trial3, color = Genotype), inherit.aes=FALSE) +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
```
In @fig-GaussianEff we can see the estimates and uncertainty for the interaction terms. I have customized the plot with a number of ggplot elements, which you can check in the original [Quarto Notebook](https://github.com/daniel-manrique/MediumBlog/blob/main/20240222_OutliersStudent-t.qmd). Note the similar uncertainty for both time points, even though the dispersion is larger on day 1 than on day 2. We will address this point in a small snippet at the end of this article.
Now let's see how much our understanding changes when we model the same data using a Student's t-distribution.
## Fitting our guest: a model with a Student's t-distribution
It's time to use the Student's t-distribution in our `brms` model.
```{r}
#| label: StudentFit
#| include: true
#| warning: false
#| message: false
#| results: false
Student_Fit <- brm(Trial3 ~ Day * Genotype,
data = data,
family = student,
# seed for reproducibility purposes
seed = 8807,
control = list(adapt_delta = 0.99),
# this is to save the model in my laptop
file = "Models/20240222_OutliersStudent-t/Student_Fit.rds",
file_refit = "never")
# Add loo for model comparison
Student_Fit <-
add_criterion(Student_Fit, c("loo", "waic", "bayes_R2"))
```
### Model diagnostics
We plot the model diagnostics as done before:
```{r}
#| label: fig-StudentDiag1
#| include: true
#| warning: false
#| message: false
#| fig-cap: Model diagnostics for student-t distribution
#| fig-height: 4
#| fig-width: 5
set.seed(8807)
pp_check(Student_Fit, ndraws = 100) +
labs(title = "Student-t model") +
theme_classic()
```
@fig-StudentDiag1 shows that the overall shape and the peak of the observations and the predictions match. It's important to note that our model seems to predict values below 0. This is an important research issue that we will skip for now, but it motivates the use of informative priors or distribution families that set a lower bound at 0, such as `lognormal`, `hurdle_lognormal`, or `zero_inflated_poisson`, depending on the case. Andrew Heiss [@heiss2021] offers a [great example](https://www.andrewheiss.com/blog/2021/11/08/beta-regression-guide/) in this regard.
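As a minimal sketch of that alternative (not fitted here; it assumes `Trial3` is strictly positive, and the object name `Fit_lognormal` is just illustrative), a lognormal model would look like:

```{r}
#| eval: false
# Sketch: same formula, but with a family bounded below at 0
Fit_lognormal <- brm(Trial3 ~ Day * Genotype,
                     data = data,
                     family = lognormal(),
                     seed = 8807,
                     control = list(adapt_delta = 0.99))
```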
### Checking the results for the Student's t-distribution
Let's take a look at the posterior distribution:
```{r}
#| label: Student_Posterior
#| include: true
#| warning: false
#| message: false
describe_posterior(Student_Fit,
centrality = "mean",
dispersion = TRUE,
ci_method = "HDI",
test = "rope",
)
```
Under this model, our estimates have changed moderately. The estimate for the intercept (WT on day 1) is reduced by about 7 seconds. Why? Because the extreme values we spotted at the beginning have less influence on the measures of central tendency. This is, therefore, a more accurate estimate of the "typical" WT animal on day 1. We also observe a substantial increase in the effect of day, almost 10 seconds more than our initial Gaussian estimate. Importantly, the effect of the KO genotype appears more pronounced, increasing about 10-fold from 0.52 in the Gaussian model to 5.5 in the Student's t model. Given the context of these data, the contrast between the two models is considerable.
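One way to quantify how heavy-tailed the fitted distribution actually is (a quick optional check; the `variable` argument assumes a recent `brms` version) is to summarize the posterior of the degrees-of-freedom parameter `nu`:

```{r}
# Posterior summary for the estimated degrees of freedom (nu);
# small values (e.g., below ~10) indicate heavy tails
posterior_summary(Student_Fit, variable = "nu")
```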
Let's see it in graphical terms using `conditional_effects`:
```{r}
#| label: fig-StudentEff
#| include: true
#| warning: false
#| message: false
#| fig-cap: Conditional effects for the Student-t model
#| fig-height: 5
#| fig-width: 7
Student_CondEffects <-
conditional_effects(Student_Fit)
Student_CondEffects <- plot(Student_CondEffects,
plot = FALSE)[[3]]
Student_CondEffects +
geom_point(data=data, aes(x = Day, y = Trial3, color = Genotype), inherit.aes=FALSE) +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
```
Can we get better estimates? For this particular example, I think we can. From the start, it was easy to notice the difference in the variance of the data, especially when comparing the first- and second-day visuals. We improved our estimates using the Student's t-distribution, and we can improve them further by building a model for heteroscedasticity that predicts sigma (the residual variance). In this way, the model does not assume that the residual variance is equal across grouping variables; instead, it becomes a response that can be modeled by predictors. This is the little point we left for the end.
## Predicting sigma using a Student's t-distribution
We include sigma as a response variable using the `bf` function from `brms`. In this case, we model this parameter using the same predictors, `Day` and `Genotype`.
```{r}
#| label: StudentFitSigma
#| include: true
#| warning: false
#| message: false
#| results: false
Student_Mdl2 <- bf (Trial3 ~ Day * Genotype,
sigma ~ Day * Genotype)
Student_Fit2 <- brm(
formula = Student_Mdl2,
data = data,
family = student,
# seed for reproducibility purposes
seed = 8807,
control = list(adapt_delta = 0.99),
# this is to save the model in my laptop
file = "Models/20240222_OutliersStudent-t/Student_Fit2.rds",
file_refit = "never")
# Add loo for model comparison
Student_Fit2 <-
add_criterion(Student_Fit2, c("loo", "waic", "bayes_R2"))
```
### Model diagnostics
```{r}
#| label: fig-StudentDiag2
#| include: true
#| warning: false
#| message: false
#| fig-cap: Model diagnostics for student-t distribution with sigma
#| fig-height: 4
#| fig-width: 5
set.seed(8807)
pp_check(Student_Fit2, ndraws = 100) +
labs(title = "Student-t") +
theme_classic()
```
@fig-StudentDiag2 looks good, except for the uncomfortable predictions below 0. In this case, I judge that they do not strongly bias the estimates and their uncertainty. Still, this is an aspect I would take into account in actual research.
### Checking the results for the Student's t-distribution with predicted sigma
Now, let's take a look at the posterior distribution.
```{r}
#| label: Student_Posterior2
#| include: true
#| warning: false
#| message: false
describe_posterior(Student_Fit2,
centrality = "mean",
dispersion = TRUE,
ci_method = "HDI",
test = "rope",
)
```
We see more parameters than in the other two fitted models because sigma is now modeled by its own predictors. Under this scheme, we see that the intercepts are closer to those of the Gaussian model and the effect of genotype (`GenotypeKO`) is reduced by half.
There is one aspect to note, however. In our first Student's t model, the uncertainty for the intercept was 24.1–37.4. In this last model, it increases to 24.3–46.1. This means that when we take the different variances into account, we are less certain about this (and other) parameters. The same is true for day, which changes from 1.2–18.9 to -5.6–18.1: we are now less certain that the second day is associated with an increase in time spent on the rotarod. Don't worry: the purpose of statistical modeling is to provide the best possible quantification of the uncertainty in a measurement, and that is what we are doing right now. Of course, our uncertainty increases when extreme values are part of our sample and, therefore, part of our population. In this example, we see that accounting for the different variances in our data gives us a very different picture of our results.
Finally, we can see that sigma, plotted on the log scale, varies meaningfully with day and genotype:
```{r}
#| label: fig-StudentCondEffects2
#| include: true
#| warning: false
#| message: false
#| fig-cap: Conditional effects for the Student-t model with sigma
#| fig-height: 5
#| fig-width: 5
#| layout-nrow: 1
Student_CondEffects2 <-
conditional_effects(Student_Fit2)
Student_CondEffects2 <- plot(Student_CondEffects2,
plot = FALSE)[[3]]
Student_CondEffects2 +
geom_point(data=data, aes(x = Day, y = Trial3, color = Genotype), inherit.aes=FALSE) +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
Student_CondEffects3 <-
conditional_effects(Student_Fit2, dpar = "sigma")
Student_CondEffects3 <- plot(Student_CondEffects3,
plot = FALSE)[[3]]
Student_CondEffects3 +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")
```
What we see in the second graph is sigma, which effectively accounts for how this parameter varies between days and genotypes. We see much higher uncertainty on day 1, especially for WT mice, while the parameter is comparable between groups on day 2.
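Since `brms` models sigma with a log link by default, the values in that panel are on the log scale. As a rough check (a small optional snippet, assuming the default link), we can exponentiate the fixed effects to read the residual standard deviation in seconds:

```{r}
# Population-level effects for sigma are on the log scale (default log link);
# exponentiate the intercept to express it in seconds
exp(fixef(Student_Fit2)["sigma_Intercept", "Estimate"])
```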
We can conclude this article by comparing the three models for out-of-sample predictions.
## Model comparison
We perform model comparison using the WAIC criterion [@gelman2013] to estimate the out-of-sample prediction error. By considering both the log-likelihood of the observed data and the effective number of parameters, WAIC provides a balance between model fit and complexity. Unlike some other criteria, WAIC inherently accounts for the posterior distribution of the parameters rather than relying on point estimates, making it particularly suited to Bayesian analyses.
Given a data set and a Bayesian model, the WAIC is calculated as:
$$
\text{WAIC} = -2 \times \left( \text{LLPD} - p_{\text{WAIC}} \right)
$$
where $\text{LLPD}$ is the log pointwise predictive density, calculated as the average log-likelihood for each observed data point across the posterior samples, and $p_{\text{WAIC}}$ is the effective number of parameters, computed as the difference between the average of the log-likelihoods and the log-likelihood of the averages across posterior samples.
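In practice, we do not compute this by hand. Since we already stored the criteria with `add_criterion`, we can read off the WAIC of a single model directly (a quick optional check):

```{r}
# WAIC for one model, as computed by add_criterion above
waic(Gaussian_Fit1)
```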
We use the `compare_performance` function from the `performance` package, part of the `easystats` environment [@performance; @makowski2019; @bayestestR].
```{r}
#| label: Models_Compare
#| include: true
#| warning: false
#| message: false
#| results: false
Fit_Comp <-
compare_performance(
Gaussian_Fit1,
Student_Fit,
Student_Fit2,
metrics = "all")
Fit_Comp
```
The output shows that our Student's t model predicting sigma is the least penalized (WAIC = 497) for out-of-sample prediction. Note that there is no estimate for sigma in this model because sigma was included as a response variable. The table also shows that the Student's t model has a smaller residual variance (sigma) than the Gaussian model, which means that more of the variance is explained by the predictors. We can visualize the same results as a graph:
```{r}
#| label: fig-Models_Graph
#| include: true
#| warning: false
#| message: false
#| results: false
#| fig-cap: Model comparison by WAIC
#| fig-height: 5
#| fig-width: 8
Fit_Comp_W <-
loo_compare(
Gaussian_Fit1,
Student_Fit,
Student_Fit2,
criterion = "waic")
# Generate WAIC graph
Fit_Comp_WAIC <-
Fit_Comp_W[, 7:8] %>%
data.frame() %>%
rownames_to_column(var = "model_name") %>%
ggplot(
aes(x = model_name,
y = waic,
ymin = waic - se_waic,
ymax = waic + se_waic)
) +
geom_pointrange(shape = 21) +
scale_x_discrete(
breaks=c("Gaussian_Fit1",
"Student_Fit",
"Student_Fit2"),
labels=c("Gaussian_Fit1",
"Student_Fit",
"Student_Fit2")
) +
coord_flip() +
labs(x = "",
y = "WAIC (score)",
title = "") +
Plot_theme
Fit_Comp_WAIC
```
@fig-Models_Graph shows that our last model is the one least penalized for out-of-sample prediction.
You can find an updated version of this post on my [GitHub site](https://github.com/daniel-manrique/MediumBlog/blob/main/20240222_OutliersStudent-t.qmd). Let me know if this journey was useful to you, and if you have any constructive comments to add to this exercise.
# References
::: {#refs}
:::
```{r}
sessionInfo()
```