08-linear-mixed-models.Rmd

# Linear Mixed Models

## Dependent Data

Forms of dependent data:

-   Multivariate measurements on different individuals: (e.g., a person's blood pressure, fat, etc are correlated)
-   Clustered measurements: (e.g., blood pressure measurements of people in the same family can be correlated).
-   Repeated measurements: (e.g., measurement of cholesterol over time can be correlated) "If data are collected repeatedly on experimental material to which treatments were applied initially, the data is a repeated measure." [@Schabenberger_2001]
-   Longitudinal data: (e.g., individual's cholesterol tracked over time are correlated): "data collected repeatedly over time in an observational study are termed longitudinal." [@Schabenberger_2001]
-   Spatial data: (e.g., measurement of individuals living in the same neighborhood are correlated)

Hence, we like to account for these correlations.

**Linear Mixed Model** (LMM), also known as **Mixed Linear Model** has 2 components:

-   **Fixed effect** (e.g, gender, age, diet, time)

-   **Random effects** representing individual variation or auto correlation/spatial effects that imply **dependent (correlated) errors**

Review [Two-Way Mixed Effects ANOVA]

We choose to model the random subject-specific effect instead of including dummy subject covariates in our model because:

-   reduction in the number of parameters to estimate
-   when you do inference, it would make more sense that you can infer from a population (i.e., random effect).

**LLM Motivation**

In a repeated measurements analysis where $Y_{ij}$ is the response for the $i$-th individual measured at the $j$-th time,

$i =1,…,N$ ; $j = 1,…,n_i$

$$
\mathbf{Y}_i = 
\left(
\begin{array}
{c}
Y_{i1} \\
. \\
.\\
.\\
Y_{in_i}
\end{array}
\right)
$$

is all measurements for subject $i$.

[*Stage 1: (Regression Model)*]{.underline} how the response changes over time for the $i$-th subject

$$
\mathbf{Y_i = Z_i \beta_i + \epsilon_i}
$$

where

-   $Z_i$ is an $n_i \times q$ matrix of known covariates
-   $\beta_i$ is an unknown $q \times 1$ vector of subjective -specific coefficients (regression coefficients different for each subject)
-   $\epsilon_i$ are the random errors (typically $\sim N(0, \sigma^2 I)$)

We notice that there are two many $\beta$ to estimate here. Hence, this is the motivation for the second stage

[*Stage 2: (Parameter Model)*]{.underline}

$$
\mathbf{\beta_i = K_i \beta + b_i}
$$

where

-   $K_i$ is a $q \times p$ matrix of known covariates
-   $\beta$ is a $p \times 1$ vector of unknown parameter
-   $\mathbf{b}_i$ are independent $N(0,D)$ random variables

This model explain the observed variability between subjects with respect to the subject-specific regression coefficients, $\beta_i$. We model our different coefficient ($\beta_i$) with respect to $\beta$.

Example:

Stage 1:

$$
Y_{ij} = \beta_{1i} + \beta_{2i}t_{ij} + \epsilon_{ij}
$$

where

-   $j = 1,..,n_i$

In the matrix notation,

$$
\mathbf{Y_i} = 
\left(
\begin{array}
{c}
Y_{i1} \\
.\\
Y_{in_i}
\end{array}
\right); \mathbf{Z}_i = 
\left(
\begin{array}
{cc}
1 & t_{i1} \\
. & . \\
1 & t_{in_i} 
\end{array}
\right)
$$

$$
\beta_i =
\left(
\begin{array}
{c}
\beta_{1i} \\
\beta_{2i}
\end{array}
\right); \epsilon_i = 
\left(
\begin{array}
{c}
\epsilon_{i1} \\
. \\
\epsilon_{in_i}
\end{array}
\right)
$$

Thus,

$$
\mathbf{Y_i = Z_i \beta_i + \epsilon_i}
$$

Stage 2:

$$
\begin{aligned}
\beta_{1i} &= \beta_0 + b_{1i} \\
\beta_{2i} &= \beta_1 L_i + \beta_2 H_i + \beta_3 C_i + b_{2i}
\end{aligned}
$$

where $L_i, H_i, C_i$ are indicator variables defined to 1 as the subject falls into different categories.

Subject specific intercepts do not depend upon treatment, with $\beta_0$ (the average response at the start of treatment), and $\beta_1 , \beta_2, \beta_3$ (the average time effects for each of three treatment groups).

$$
\begin{aligned}
\mathbf{K}_i &= \left(
\begin{array}
{cccc}
1 & 0 & 0 & 0 \\
0 & L_i & H_i & C_i 
\end{array}
\right) \\ 
\beta &= (\beta_0 , \beta_1, \beta_2, \beta_3)' \\ 
\mathbf{b}_i &= 
\left(
\begin{array}
{c}
b_{1i} \\
b_{2i} \\
\end{array}
\right) \\ 
\beta_i &= \mathbf{K_i \beta + b_i}
\end{aligned}
$$

To get $\hat{\beta}$, we can fit the model sequentially:

1.  Estimate $\hat{\beta_i}$ in the first stage
2.  Estimate $\hat{\beta}$ in the second stage by replacing $\beta_i$ with $\hat{\beta}_i$

However, problems arise from this method:

-   information is lost by summarizing the vector $\mathbf{Y}_i$ solely by $\hat{\beta}_i$
-   we need to account for variability when replacing $\beta_i$ with its estimate
-   different subjects might have different number of observations.

To address these problems, we can use **Linear Mixed Model** [@laird1982random]

Substituting stage 2 into stage 1:

$$
\mathbf{Y}_i = \mathbf{Z}_i \mathbf{K}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \mathbf{\epsilon}_i
$$

Let $\mathbf{X}_i = \mathbf{Z}_i \mathbf{K}_i$ be an $n_i \times p$ matrix . Then, the LMM is

$$
\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \mathbf{\epsilon}_i
$$

where

-   $i = 1,..,N$
-   $\beta$ are the fixed effects, which are common to all subjects
-   $\mathbf{b}_i$ are the subject specific random effects. $\mathbf{b}_i \sim N_q (\mathbf{0,D})$
-   $\mathbf{\epsilon}_i \sim N_{n_i}(\mathbf{0,\Sigma_i})$
-   $\mathbf{b}_i$ and $\epsilon_i$ are independent
-   $\mathbf{Z}_{i(n_i \times q})$ and $\mathbf{X}_{i(n_i \times p})$ are matrices of known covariates.

Equivalently, in the hierarchical form, we call **conditional** or **hierarchical** formulation of the linear mixed model

$$
\begin{aligned}
\mathbf{Y}_i | \mathbf{b}_i &\sim N(\mathbf{X}_i \beta+ \mathbf{Z}_i \mathbf{b}_i, \mathbf{\Sigma}_i) \\
\mathbf{b}_i &\sim N(\mathbf{0,D})
\end{aligned}
$$

for $i = 1,..,N$. denote the respective functions by $f(\mathbf{Y}_i |\mathbf{b}_i)$ and $f(\mathbf{b}_i)$

In general,

$$
\begin{aligned}
f(A,B) &= f(A|B)f(B) \\
f(A) &= \int f(A,B)dB = \int f(A|B) f(B) dB
\end{aligned}
$$

In the LMM, the marginal density of $\mathbf{Y}_i$ is

$$
f(\mathbf{Y}_i) = \int f(\mathbf{Y}_i | \mathbf{b}_i) f(\mathbf{b}_i) d\mathbf{b}_i
$$

which can be shown

$$
\mathbf{Y}_i \sim N(\mathbf{X_i \beta, Z_i DZ'_i + \Sigma_i})
$$

This is the **marginal** formulation of the linear mixed model

Notes:

We no longer have $Z_i b_i$ in the mean, but add error in the variance (marginal dependence in Y). kinda of averaging out the common effect. Technically, we shouldn't call it averaging the error b (adding it to the variance covariance matrix), it should be called adding random effect

Continue with our example

$$
Y_{ij} = (\beta_0 + b_{1i}) + (\beta_1L_i + \beta_2 H_i + \beta_3 C_i + b_{2i})t_{ij} + \epsilon_{ij}
$$

for each treatment group

$$
Y_{ik}= 
\begin{cases}
\beta_0 + b_{1i} + (\beta_1 + \ b_{2i})t_{ij} + \epsilon_{ij} & L \\
\beta_0 + b_{1i} + (\beta_2 + \ b_{2i})t_{ij} + \epsilon_{ij} & H\\
\beta_0 + b_{1i} + (\beta_3 + \ b_{2i})t_{ij} + \epsilon_{ij} & C
\end{cases}
$$

-   Intercepts and slopes are all subject specific
-   Different treatment groups have different slops, but the same intercept.

**In the hierarchical model form**

$$
\begin{aligned}
\mathbf{Y}_i | \mathbf{b}_i &\sim N(\mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i, \mathbf{\Sigma}_i)\\
\mathbf{b}_i &\sim N(\mathbf{0,D})
\end{aligned}
$$

X will be in the form of

$$
\beta = (\beta_0, \beta_1, \beta_2, \beta_3)'
$$

$$
\begin{aligned}
\mathbf{X}_i &= \mathbf{Z}_i \mathbf{K}_i \\
&= 
\left[
\begin{array}
{cc}
1 & t_{i1} \\
1 & t_{i2} \\
. & . \\
1 & t_{in_i}
\end{array}
\right]
\times
\left[
\begin{array}
{cccc}
1 & 0 & 0 & 0 \\
0 & L_i & H_i & C_i \\
\end{array}
\right] \\
&=
\left[ 
\begin{array}
{cccc}
1 & t_{i1}L_i & t_{i1}H_i & T_{i1}C_i \\
1 & t_{i2}L_i & t_{i2}H_i & T_{i2}C_i \\
. &. &. &. \\
1 & t_{in_i}L_i & t_{in_i}H_i & T_{in_i}C_i \\
\end{array}
\right]\end{aligned}
$$

$$
\mathbf{b}_i = 
\left(
\begin{array}
{c}
b_{1i} \\
b_{2i}
\end{array}
\right)
$$

$$
D = 
\left(
\begin{array}
{cc}
d_{11} & d_{12}\\
d_{12} & d_{22}
\end{array}
\right)
$$

Assuming $\mathbf{\Sigma}_i = \sigma^2 \mathbf{I}_{n_i}$, which is called **conditional independence**, meaning the response on subject i are independent conditional on $\mathbf{b}_i$ and $\beta$

**In the marginal model form**

$$
Y_{ij} = \beta_0 + \beta_1 L_i t_{ij} + \beta_2 H_i t_{ij} + \beta_3 C_i t_{ij} + \eta_{ij}
$$

where $\eta_i \sim N(\mathbf{0},\mathbf{Z}_i\mathbf{DZ}_i'+ \mathbf{\Sigma}_i)$

Equivalently,

$$
\mathbf{Y_i \sim N(X_i \beta, Z_i DZ_i' + \Sigma_i})
$$

In this case that $n_i = 2$

$$
\begin{aligned}
\mathbf{Z_iDZ_i'} &= 
\left(
\begin{array}
{cc}
1 & t_{i1} \\
1 & t_{i2} 
\end{array}
\right)
\left(
\begin{array}
{cc}
d_{11} & d_{12} \\
d_{12} & d_{22} 
\end{array}
\right)
\left(
\begin{array}
{cc}
1 & 1 \\
t_{i1} & t_{i2} 
\end{array}
\right) \\
&=
\left(
\begin{array}
{cc}
d_{11} + 2d_{12}t_{i1} + d_{22}t_{i1}^2 & d_{11} + d_{12}(t_{i1} + t_{i2}) + d_{22}t_{i1}t_{i2} \\
d_{11} + d_{12}(t_{i1} + t_{i2}) + d_{22} t_{i1} t_{i2} & d_{11} + 2d_{12}t_{i2} + d_{22}t_{i2}^2  
\end{array}
\right)
\end{aligned}
$$

$$
var(Y_{i1}) = d_{11} + 2d_{12}t_{i1} + d_{22} t_{i1}^2 + \sigma^2
$$

On top of correlation in the errors, the marginal implies that the variance function of the response is quadratic over time, with positive curvature $d_{22}$

### Random-Intercepts Model

If we remove the random slopes,

-   the assumption is that all variability in subject-specific slopes can be attributed to treatment differences
-   the model is random-intercepts model. This has subject specific intercepts, but the same slopes within each treatment group.

$$
\begin{aligned}
\mathbf{Y}_i | b_i &\sim N(\mathbf{X}_i \beta + 1 b_i , \Sigma_i) \\
b_i &\sim N(0,d_{11})
\end{aligned}
$$

The marginal model is then ($\mathbf{\Sigma}_i = \sigma^2 \mathbf{I}$)

$$
\mathbf{Y}_i \sim N(\mathbf{X}_i \beta, 11'd_{11} + \sigma^2 \mathbf{I})
$$

The marginal covariance matrix is

$$
\begin{aligned}
cov(\mathbf{Y}_i)  &= 11'd_{11} + \sigma^2I \\
&=
\left(
\begin{array}
{cccc}
d_{11}+ \sigma^2 & d_{11} & ... & d_{11} \\
d_{11} & d_{11} + \sigma^2 & d_{11} & ... \\
. & . & . & . \\
d_{11} & ... & ... & d_{11} + \sigma^2
\end{array}
\right)
\end{aligned}
$$

the associated correlation matrix is

$$
corr(\mathbf{Y}_i) = 
\left(
\begin{array}
{cccc}
1 & \rho & ... & \rho \\
\rho & 1 & \rho & ... \\
. & . & . & . \\
\rho & ... & ... & 1 \\
\end{array}
\right)
$$

where $\rho \equiv \frac{d_{11}}{d_{11} + \sigma^2}$

Thu, we have

-   constant variance over time
-   equal, positive correlation between any two measurements from the same subject
-   a covariance structure that is called **compound symmetry**, and $\rho$ is called the **intra-class correlation**
-   that when $\rho$ is large, the **inter-subject variability** ($d_{11}$) is large relative to the intra-subject variability ($\sigma^2$)

### Covariance Models

If the conditional independence assumption, ($\mathbf{\Sigma_i= \sigma^2 I_{n_i}}$). Consider, $\epsilon_i = \epsilon_{(1)i} + \epsilon_{(2)i}$, where

-   $\epsilon_{(1)i}$ is a "serial correlation" component. That is, part of the individual's profile is a response to time-varying stochastic processes.
-   $\epsilon_{(2)i}$ is the measurement error component, and is independent of $\epsilon_{(1)i}$

Then

$$
\mathbf{Y_i = X_i \beta + Z_i b_i + \epsilon_{(1)i} + \epsilon_{(2)i}}
$$

where

-   $\mathbf{b_i} \sim N(\mathbf{0,D})$

-   $\epsilon_{(2)i} \sim N(\mathbf{0,\sigma^2 I_{n_i}})$

-   $\epsilon_{(1)i} \sim N(\mathbf{0,\tau^2H_i})$

-   $\mathbf{b}_i$ and $\epsilon_i$ are mutually independent

To model the structure of the $n_i \times n_i$ correlation (or covariance ) matrix $\mathbf{H}_i$. Let the (j,k)th element of $\mathbf{H}_i$ be $h_{ijk}= g(t_{ij}t_{ik})$. that is a function of the times $t_{ij}$ and $t_{ik}$ , which is assumed to be some function of the "distance' between the times.

$$
h_{ijk} = g(|t_{ij}-t_{ik}|)
$$

for some decreasing function $g(.)$ with $g(0)=1$ (for correlation matrices).

Examples of this type of function:

-   Exponential function: $g(|t_{ij}-t_{ik}|) = \exp(-\phi|t_{ij} - t_{ik}|)$
-   Gaussian function: $g(|t_{ij} - t_{ik}|) = \exp(-\phi(t_{ij} - t_{ik})^2)$

Similar structures could also be used for $\mathbf{D}$ matrix (of $\mathbf{b}$)

Example: Autoregressive Covariance Structure

A first order Autoregressive Model (AR(1)) has the form

$$
\alpha_t = \phi \alpha_{t-1} + \eta_t
$$

where $\eta_t \sim iid N (0,\sigma^2_\eta)$

Then, the covariance between two observations is

$$
cov(\alpha_t, \alpha_{t+h}) = \frac{\sigma^2_\eta \phi^{|h|}}{1- \phi^2}
$$

for $h = 0, \pm 1, \pm 2, ...; |\phi|<1$

Hence,

$$
corr(\alpha_t, \alpha_{t+h}) = \phi^{|h|}
$$

If we let $\alpha_T = (\alpha_1,...\alpha_T)'$, then

$$
corr(\alpha_T) = 
\left[
\begin{array}
{ccccc}
1 & \phi^1 & \phi^2 & ... & \phi^2 \\
\phi^1 & 1 & \phi^1 & ... & \phi^{T-1} \\
\phi^2 & \phi^1 & 1 & ... & \phi^{T-2} \\
. & . & . & . &. \\
\phi^T & \phi^{T-1} & \phi^{T-2} & ... & 1
\end{array}
\right]
$$

Notes:

-   The correlation decreases as time lag increases
-   This matrix structure is known as a **Toeplitz** structure
-   More complicated covariance structures are possible, which is critical component of spatial random effects models and time series models.
-   Often, we don't need both random effects $\mathbf{b}$ and $\epsilon_{(1)i}$

More in the [Time Series] section

## Estimation

$$
\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i
$$

where $\beta, \mathbf{b}_i, \mathbf{D}, \mathbf{\Sigma}_i$ we must obtain estimation from the data

-   $\mathbf{\beta}, \mathbf{D}, \mathbf{\Sigma}_i$ are unknown, but fixed, parameters, and must be estimated from the data
-   $\mathbf{b}_i$ is a random variable. Thus, we can't estimate these values, but we can predict them. (i.e., you can't estimate a random thing).

If we have

-   $\hat{\beta}$ as an estimator of $\beta$
-   $\hat{\mathbf{b}}_i$ as a predictor of $\mathbf{b}_i$

Then,

-   The population average estimate of $\mathbf{Y}_i$ is $\hat{\mathbf{Y}_i} = \mathbf{X}_i \hat{\beta}$
-   The subject-specific prediction is $\hat{\mathbf{Y}_i} = \mathbf{X}_i \hat{\beta} + \mathbf{Z}_i \hat{b}_i$

According to [@henderson1975best], estimating equations known as the mixed model equations:

$$
\left[
\begin{array}
{c}
\hat{\beta} \\
\hat{\mathbf{b}}
\end{array}
\right]
=
\left[
\begin{array}
{cc}
\mathbf{X'\Sigma^{-1}X} & \mathbf{X'\Sigma^{-1}Z} \\
\mathbf{Z'\Sigma^{-1}X} & \mathbf{Z'\Sigma^{-1}Z +B^{-1}}
\end{array}
\right]
\left[
\begin{array}
{cc}
\mathbf{X'\Sigma^{-1}Y} \\
\mathbf{Z'\Sigma^{-1}Y}
\end{array}
\right]
$$

where

$$
\begin{aligned}
\mathbf{Y}
&=
\left[
\begin{array}
{c}
\mathbf{y}_1 \\
. \\
\mathbf{y}_N
\end{array}
\right] ;
\mathbf{X}
=
\left[
\begin{array}
{c}
\mathbf{X}_1 \\
. \\
\mathbf{X}_N
\end{array}
\right];
\mathbf{b} = 
\left[
\begin{array}
{c}
\mathbf{b}_1 \\
. \\
\mathbf{b}_N
\end{array}
\right] ;
\epsilon = 
\left[
\begin{array}
{c}
\epsilon_1 \\
. \\
\epsilon_N
\end{array}
\right]
\\
cov(\epsilon) &= \mathbf{\Sigma},
\mathbf{Z} = 
\left[
\begin{array}
{cccc}
\mathbf{Z}_1 & 0 &  ... & 0 \\
0 & \mathbf{Z}_2 & ... & 0 \\
. & . & . & . \\
0 & 0 & ... & \mathbf{Z}_n
\end{array}
\right],
\mathbf{B} =
\left[
\begin{array}
{cccc}
\mathbf{D} & 0 & ... & 0 \\
0 & \mathbf{D} & ... & 0 \\
. & . & . & . \\
0 & 0 & ... & \mathbf{D}
\end{array}
\right]
\end{aligned}
$$

The model has the form

$$
\begin{aligned}
\mathbf{Y} &= \mathbf{X \beta + Z b + \epsilon} \\
\mathbf{Y} &\sim N(\mathbf{X \beta, ZBZ' + \Sigma})
\end{aligned}
$$

If $\mathbf{V = ZBZ' + \Sigma}$, then the solutions to the estimating equations can be

$$
\begin{aligned}
\hat{\beta} &= \mathbf{(X'V^{-1}X)^{-1}X'V^{-1}Y} \\
\hat{\mathbf{b}} &= \mathbf{BZ'V^{-1}(Y-X\hat{\beta}})
\end{aligned}
$$

The estimate $\hat{\beta}$ is a generalized least squares estimate.

The predictor, $\hat{\mathbf{b}}$ is the best linear unbiased predictor (BLUP), for $\mathbf{b}$

$$
\begin{aligned}
E(\hat{\beta}) &= \beta \\
var(\hat{\beta}) &= (\mathbf{X'V^{-1}X})^{-1} \\
E(\hat{\mathbf{b}}) &= 0
\end{aligned}
$$

$$
var(\mathbf{\hat{b}-b}) = \mathbf{B-BZ'V^{-1}ZB + BZ'V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}B}
$$

The variance here is the variance of the prediction error (mean squared prediction error, MSPE), which is more meaningful than $var(\hat{\mathbf{b}})$, since MSPE accounts for both variance and bias in the prediction.

To derive the mixed model equations, consider

$$
\mathbf{\epsilon = Y - X\beta - Zb}
$$

Let $T = \sum_{i=1}^N n_i$ be the total number of observations (i.e., the length of $\mathbf{Y},\epsilon$) and $Nq$ the length of $\mathbf{b}$. The joint distribution of $\mathbf{b, \epsilon}$ is

$$
f(\mathbf{b,\epsilon})= \frac{1}{(2\pi)^{(T+ Nq)/2}}
\left|
\begin{array}
{cc}
\mathbf{B} & 0 \\
0 & \mathbf{\Sigma}
\end{array}
\right| ^{-1/2}
\exp
\left(
-\frac{1}{2}
\left[
\begin{array}
{c}
\mathbf{b} \\
\mathbf{Y - X \beta - Zb}
\end{array}
\right]'
\left[
\begin{array}
{cc}
\mathbf{B} & 0 \\
0 & \mathbf{\Sigma}
\end{array}
\right]^{-1}
\left[
\begin{array}
{c}
\mathbf{b} \\
\mathbf{Y - X \beta - Zb}
\end{array}
\right]
\right)
$$

Maximization of $f(\mathbf{b},\epsilon)$ with respect to $\mathbf{b}$ and $\beta$ requires minimization of

$$
\begin{aligned}
Q &= 
\left[
\begin{array}
{c}
\mathbf{b} \\
\mathbf{Y - X \beta - Zb}
\end{array}
\right]'
\left[
\begin{array}
{cc}
\mathbf{B} & 0 \\
0 & \mathbf{\Sigma}
\end{array}
\right]^{-1}
\left[
\begin{array}
{c}
\mathbf{b} \\
\mathbf{Y - X \beta - Zb}
\end{array}
\right] \\
&= \mathbf{b'B^{-1}b+(Y-X \beta-Zb)'\Sigma^{-1}(Y-X \beta-Zb)}
\end{aligned}
$$

Setting the derivatives of Q with respect to $\mathbf{b}$ and $\mathbf{\beta}$ to zero leads to the system of equations:

$$
\begin{aligned}
\mathbf{X'\Sigma^{-1}X\beta + X'\Sigma^{-1}Zb} &= \mathbf{X'\Sigma^{-1}Y}\\
\mathbf{(Z'\Sigma^{-1}Z + B^{-1})b + Z'\Sigma^{-1}X\beta} &= \mathbf{Z'\Sigma^{-1}Y}
\end{aligned}
$$

Rearranging

$$
\left[
\begin{array}
{cc}
\mathbf{X'\Sigma^{-1}X} & \mathbf{X'\Sigma^{-1}Z} \\
\mathbf{Z'\Sigma^{-1}X} & \mathbf{Z'\Sigma^{-1}Z + B^{-1}}
\end{array}
\right]
\left[
\begin{array}
{c}
\beta \\
\mathbf{b}
\end{array}
\right]
= 
\left[
\begin{array}
{c}
\mathbf{X'\Sigma^{-1}Y} \\
\mathbf{Z'\Sigma^{-1}Y}
\end{array}
\right]
$$

Thus, the solution to the mixed model equations give:

$$
\left[
\begin{array}
{c}
\hat{\beta} \\
\hat{\mathbf{b}}
\end{array}
\right]
= 
\left[
\begin{array}
{cc}
\mathbf{X'\Sigma^{-1}X} & \mathbf{X'\Sigma^{-1}Z} \\
\mathbf{Z'\Sigma^{-1}X} & \mathbf{Z'\Sigma^{-1}Z + B^{-1}}
\end{array}
\right] ^{-1}
\left[
\begin{array}
{c}
\mathbf{X'\Sigma^{-1}Y} \\
\mathbf{Z'\Sigma^{-1}Y}
\end{array}
\right]
$$

Equivalently,

Bayes' theorem

$$
f(\mathbf{b}| \mathbf{Y}) = \frac{f(\mathbf{Y}|\mathbf{b})f(\mathbf{b})}{\int f(\mathbf{Y}|\mathbf{b})f(\mathbf{b}) d\mathbf{b}}
$$

where

-   $f(\mathbf{Y}|\mathbf{b})$ is the "likelihood"
-   $f(\mathbf{b})$ is the prior
-   the denominator is the "normalizing constant"
-   $f(\mathbf{b}|\mathbf{Y})$ is the posterior distribution

In this case

$$
\begin{aligned}
\mathbf{Y} | \mathbf{b} &\sim N(\mathbf{X\beta+Zb,\Sigma}) \\
\mathbf{b} &\sim N(\mathbf{0,B})
\end{aligned}
$$

The posterior distribution has the form

$$
\mathbf{b}|\mathbf{Y} \sim N(\mathbf{BZ'V^{-1}(Y-X\beta),(Z'\Sigma^{-1}Z + B^{-1})^{-1}})
$$

Hence, the best predictor (based on squared error loss)

$$
E(\mathbf{b}|\mathbf{Y}) = \mathbf{BZ'V^{-1}(Y-X\beta)}
$$

### Estimating $\mathbf{V}$

If we have $\tilde{\mathbf{V}}$ (estimate of $\mathbf{V}$), then we can estimate:

$$
\begin{aligned}
\hat{\beta} &= \mathbf{(X'\tilde{V}^{-1}X)^{-1}X'\tilde{V}^{-1}Y} \\
\hat{\mathbf{b}} &= \mathbf{BZ'\tilde{V}^{-1}(Y-X\hat{\beta})}
\end{aligned}
$$

where ${\mathbf{b}}$ is **EBLUP** (estimated BLUP) or **empirical Bayes estimate**

Note:

-   $\hat{var}(\hat{\beta})$ is a consistent estimator of $var(\hat{\beta})$ if $\tilde{\mathbf{V}}$ is a consistent estimator of $\mathbf{V}$
-   However, $\hat{var}(\hat{\beta})$ is biased since the variability arises from estimating $\mathbf{V}$ is not accounted for in the estimate.
-   Hence, $\hat{var}(\hat{\beta})$ underestimates the true variability

Ways to estimate $\mathbf{V}$

-   [Maximum Likelihood Estimation (MLE)](#maximum-likelihood-estimation-mle)
-   [Restricted Maximum Likelihood (REML)](#restricted-maximum-likelihood-reml)
-   [Estimated Generalized Least Squares]
-   [Bayesian Hierarchical Models (BHM)](#bayesian-hierarchical-models-bhm)

#### Maximum Likelihood Estimation (MLE) {#maximum-likelihood-estimation-mle}

Grouping unknown parameters in $\Sigma$ and $B$ under a parameter vector $\theta$. Under MLE, $\hat{\theta}$ and $\hat{\beta}$ maximize the likelihood $\mathbf{y} \sim N(\mathbf{X\beta, V(\theta))}$. Synonymously, $-2\log L(\mathbf{y;\theta,\beta})$:

$$
-2l(\mathbf{\beta,\theta,y}) = \log |\mathbf{V(\theta)}| + \mathbf{(y-X\beta)'V(\theta)^{-1}(y-X\beta)} + N \log(2\pi)
$$

-   Step 1: Replace $\beta$ with its maximum likelihood (where $\theta$ is known $\hat{\beta}= (\mathbf{X'V(\theta)^{-1}X)^{-1}X'V(\theta)^{-1}y}$
-   Step 2: Minimize the above equation with respect to $\theta$ to get the estimator $\hat{\theta}_{MLE}$
-   Step 3: Substitute $\hat{\theta}_{MLE}$ back to get $\hat{\beta}_{MLE} = (\mathbf{X'V(\theta_{MLE})^{-1}X)^{-1}X'V(\theta_{MLE})^{-1}y}$
-   Step 4: Get $\hat{\mathbf{b}}_{MLE} = \mathbf{B(\hat{\theta}_{MLE})Z'V(\hat{\theta}_{MLE})^{-1}(y-X\hat{\beta}_{MLE})}$

Note:

-   $\hat{\theta}$ are typically negatively biased due to unaccounted fixed effects being estimated, which we could try to account for.

#### Restricted Maximum Likelihood (REML) {#restricted-maximum-likelihood-reml}

REML accounts for the number of estimated mean parameters by adjusting the objective function. Specifically, the likelihood of linear combination of the elements of $\mathbf{y}$ is accounted for.

We have $\mathbf{K'y}$, where $\mathbf{K}$ is any $N \times (N - p)$ full-rank contrast matrix, which has columns orthogonal to the $\mathbf{X}$ matrix (that is $\mathbf{K'X} = 0$). Then,

$$
\mathbf{K'y} \sim N(0,\mathbf{K'V(\theta)K})
$$

where $\beta$ is no longer in the distribution

We can proceed to maximize this likelihood for the contrasts to get $\hat{\theta}_{REML}$, which does not depend on the choice of $\mathbf{K}$. And $\hat{\beta}$ are based on $\hat{\theta}$

Comparison REML and MLE

-   Both methods are based upon the likelihood principle, and have desired properties for the estimates:

    -   consistency

    -   asymptotic normality

    -   efficiency

-   ML estimation provides estimates for fixed effects, while REML can't

-   In balanced models, REML is identical to ANOVA

-   REML accounts for df for the fixed effects int eh model, which is important when $\mathbf{X}$ is large relative to the sample size

-   Changing $\mathbf{\beta}$ has no effect on the REML estimates of $\theta$

-   REML is less sensitive to outliers than MLE

-   MLE is better than REML regarding model comparisons (e.g., AIC or BIC)

#### Estimated Generalized Least Squares

MLE and REML rely upon the Gaussian assumption. To overcome this issue, EGLS uses the first and second moments.

$$
\mathbf{Y}_i = \mathbf{X}_i \beta + \mathbf{Z}_i \mathbf{b}_i + \epsilon_i
$$

where

-   $\epsilon_i \sim (\mathbf{0,\Sigma_i})$
-   $\mathbf{b}_i \sim (\mathbf{0,D})$
-   $cov(\epsilon_i, \mathbf{b}_i) = 0$

Then the EGLS estimator is

$$
\begin{aligned}
\hat{\beta}_{GLS} &= \{\sum_{i=1}^n \mathbf{X'_iV_i(\theta)^{-1}X_i}  \}^{-1} \sum_{i=1}^n \mathbf{X'_iV_i(\theta)^{-1}Y_i} \\
&=\{\mathbf{X'V(\theta)^{-1}X} \}^{-1} \mathbf{X'V(\theta)^{-1}Y}
\end{aligned}
$$

depends on the first two moments

-   $E(\mathbf{Y}_i) = \mathbf{X}_i \beta$
-   $var(\mathbf{Y}_i)= \mathbf{V}_i$

EGLS use $\hat{\mathbf{V}}$ for $\mathbf{V(\theta)}$

$$
\hat{\beta}_{EGLS} = \{ \mathbf{X'\hat{V}^{-1}X} \}^{-1} \mathbf{X'\hat{V}^{-1}Y}
$$

Hence, the fixed effects estimators for the MLE, REML, and EGLS are of the same form, except for the estimate of $\mathbf{V}$

In case of non-iterative approach, EGLS can be appealing when $\mathbf{V}$ can be estimated without much computational burden.

#### Bayesian Hierarchical Models (BHM) {#bayesian-hierarchical-models-bhm}

Joint distribution cane be decomposed hierarchically in terms of the product of conditional distributions and a marginal distribution

$$
f(A,B,C) = f(A|B,C) f(B|C)f(C)
$$

Applying to estimate $\mathbf{V}$

$$
\begin{aligned}
f(\mathbf{Y, \beta, b, \theta}) &= f(\mathbf{Y|\beta,b, \theta})f(\mathbf{b|\theta,\beta})f(\mathbf{\beta|\theta})f(\mathbf{\theta}) & \text{based on probability decomposition} \\
&= f(\mathbf{Y|\beta,b, \theta})f(\mathbf{b|\theta})f(\mathbf{\beta})f(\mathbf{\theta}) & \text{based on simplifying modeling assumptions}
\end{aligned}
$$

elaborate on the second equality, if we assume conditional independence (e.g., given $\theta$, no additional info about $\mathbf{b}$ is given by knowing $\beta$), then we can simply from the first equality

Using Bayes' rule

$$
f(\mathbf{\beta, b, \theta|Y}) \propto f(\mathbf{Y|\beta,b, \theta})f(\mathbf{b|\theta})f(\mathbf{\beta})f(\mathbf{\theta})
$$

where

$$
\begin{aligned}
\mathbf{Y| \beta, b, \theta} &\sim \mathbf{N(X\beta+ Zb, \Sigma(\theta))} \\
\mathbf{b | \theta} &\sim \mathbf{N(0, B(\theta))}
\end{aligned}
$$

and we also have to have prior distributions for $f(\beta), f(\theta)$

With normalizing constant, we can obtain the posterior distribution. Typically, we can't get analytical solution right away. Hence, we can use Markov Chain Monte Carlo (MCMC) to obtain samples from the posterior distribution.

Bayesian Methods:

-   account for the uncertainty in parameters estimates and accommodate the propagation of that uncertainty through the model
-   can adjust prior information (i.e., priori) in parameters
-   Can extend beyond Gaussian distributions
-   but hard to implement algorithms and might have problem converging

## Inference

### Parameters $\beta$

#### Wald test {#wald-test-GLMM}

We have

$$
\begin{aligned}
\mathbf{\hat{\beta}(\theta)} &= \mathbf{\{X'V^{-1}(\theta) X\}^{-1}X'V^{-1}(\theta) Y} \\
var(\hat{\beta}(\theta)) &= \mathbf{\{X'V^{-1}(\theta) X\}^{-1}}
\end{aligned}
$$

We can use $\hat{\theta}$ in place of $\theta$ to approximate Wald test

$$
H_0: \mathbf{A \beta =d} 
$$

With

$$
W = \mathbf{(A\hat{\beta} - d)'[A(X'\hat{V}^{-1}X)^{-1}A']^{-1}(A\hat{\beta} - d)}
$$

where $W \sim \chi^2_{rank(A)}$ under $H_0$ is true. However, it does not take into account variability from using $\hat{\theta}$ in place of $\theta$, hence the standard errors are underestimated

#### F-test

Alternatively, we can use the modified F-test, suppose we have $var(\mathbf{Y}) = \sigma^2 \mathbf{V}(\theta)$, then

$$
F^* = \frac{\mathbf{(A\hat{\beta} - d)'[A(X'\hat{V}^{-1}X)^{-1}A']^{-1}(A\hat{\beta} - d)}}{\hat{\sigma}^2 \text{rank}(A)}
$$

where $F^* \sim f_{rank(A), den(df)}$ under the null hypothesis. And den(df) needs to be approximated from the data by either:

-   Satterthwaite method
-   Kenward-Roger approximation

Under balanced cases, the Wald and F tests are similar. But for small sample sizes, they can differ in p-values. And both can be reduced to t-test for a single $\beta$

#### Likelihood Ratio Test

$$
H_0: \beta \in \Theta_{\beta,0}
$$

where $\Theta_{\beta, 0}$ is a subspace of the parameter space, $\Theta_{\beta}$ of the fixed effects $\beta$ . Then

$$
-2\log \lambda_N = -2\log\{\frac{\hat{L}_{ML,0}}{\hat{L}_{ML}}\}
$$

where

-   $\hat{L}_{ML,0}$ , $\hat{L}_{ML}$ are the maximized likelihood obtained from maximizing over $\Theta_{\beta,0}$ and $\Theta_{\beta}$
-   $-2 \log \lambda_N \dot{\sim} \chi^2_{df}$ where df is the difference in the dimension (i.e., number of parameters) of $\Theta_{\beta,0}$ and $\Theta_{\beta}$

This method is not applicable for REML. But REML can still be used to test for covariance parameters between nested models.

### Variance Components

-   For ML and REML estimator, $\hat{\theta} \sim N(\theta, I(\theta))$ for large samples

-   Wald test in variance components is analogous to the fixed effects case (see \@ref(wald-test-GLMM) )

    -   However, the normal approximation depends largely on the true value of $\theta$. It will fail if the true value of $\theta$ is close to the boundary of the parameter space $\Theta_{\theta}$ (i.e., $\sigma^2 \approx 0$)

    -   Typically works better for covariance parameter, than variance parameters.

-   The likelihood ratio tests can also be used with ML or REML estimates. However, the same problem of parameters

## Information Criteria

-   account for the likelihood and the number of parameters to assess model comparison.

### Akaike's Information Criteria (AIC)

Derived as an estimator of the expected Kullback discrepancy between the true model and a fitted candidate model

$$
AIC = -2l(\hat{\theta}, \hat{\beta}) + 2q
$$

where

-   $l(\hat{\theta}, \hat{\beta})$ is the log-likelihood
-   q = the effective number of parameters; total of fixed and those associated with random effects (variance/covariance; those not estimated to be on a boundary constraint)

Note:

-   In comparing models that differ in their random effects, this method is not advised to due the inability to get the correct number of effective parameters).
-   We prefer smaller AIC values.
-   If your program uses $l-q$ then we prefer larger AIC values (but rarely).
-   can be used for mixed model section, (e.g., selection of the covariance structure), but the sample size must be very large to have adequate comparison based on the criterion
-   Can have a large negative bias (e.g., when sample size is small but the number of parameters is large) due to the penalty term can't approximate the bias adjustment adequately

### Corrected AIC (AICC)

-   developed by [@hurvich1989regression]
-   correct small-sample adjustment
-   depends on the candidate model class
-   Only if you have fixed covariance structure, then AICC is justified, but not general covariance structure

### Bayesian Information Criteria (BIC)

$$
BIC = -2l(\hat{\theta}, \hat{\beta}) + q \log n
$$

where n = number of observations.

-   we prefer smaller BIC value
-   BIC and AIC are used for both REML and MLE if we have the same mean structure. Otherwise, in general, we should prefer MLE

With our example presented at the beginning of [Linear Mixed Models],

$$
Y_{ik}= 
\begin{cases}
\beta_0 + b_{1i} + (\beta_1 + \ b_{2i})t_{ij} + \epsilon_{ij} & L \\
\beta_0 + b_{1i} + (\beta_2 + \ b_{2i})t_{ij} + \epsilon_{ij} & H\\
\beta_0 + b_{1i} + (\beta_3 + \ b_{2i})t_{ij} + \epsilon_{ij} & C
\end{cases}
$$

where

-   $i = 1,..,N$
-   $j = 1,..,n_i$ (measures at time $t_{ij}$)

Note:

-   we have subject-specific intercepts,

$$
\begin{aligned}
\mathbf{Y}_i |b_i &\sim N(\mathbf{X}_i \beta + 1 b_i, \sigma^2 \mathbf{I}) \\
b_i &\sim N(0,d_{11})
\end{aligned}
$$

here, we want to estimate $\beta, \sigma^2, d_{11}$ and predict $b_i$

## Split-Plot Designs

-   Typically used in the case that you have two factors where one needs much larger units than the other.

Example:

A: 3 levels (large units)

B: 2 levels (small units)

-   A and B levels are randomized into 4 blocks.
-   But it differs from [Randomized Block Designs]. In each block, both have one of the 6 (3x2) treatment combinations. But [Randomized Block Designs] assign in each block randomly, while split-plot does not randomize this step.
-   Moreover, because A needs to be applied in large units, factor A is applied only once in each block while B can be applied multiple times.

Hence, we have our model

If A is our factor of interest

$$
Y_{ij} = \mu + \rho_i + \alpha_j + e_{ij}
$$

where

-   $i$ = replication (block or subject)
-   $j$ = level of Factor A
-   $\mu$ = overall mean
-   $\rho_i$ = variation due to the $i$-th block
-   $e_{ij} \sim N(0, \sigma^2_e)$ = whole plot error

If B is our factor of interest

$$
Y_{ijk} = \mu + \phi_{ij} + \beta_k + \epsilon_{ijk}
$$

where

-   $\phi_{ij}$ = variation due to the $ij$-th main plot
-   $\beta_k$ = Factor B effect
-   $\epsilon_{ijk} \sim N(0, \sigma^2_\epsilon)$ = subplot error
-   $\phi_{ij} = \rho_i + \alpha_j + e_{ij}$

Together, the split-plot model

$$
Y_{ijk} = \mu + \rho_i + \alpha_j + e_{ij} + \beta_k + (\alpha \beta)_{jk} + \epsilon_{ijk}
$$

where

-   $i$ = replicate (blocks or subjects)
-   $j$ = level of factor A
-   $k$ = level of factor B
-   $\mu$ = overall mean
-   $\rho_i$ = effect of the block
-   $\alpha_j$ = main effect of factor A (fixed)
-   $e_{ij} = (\rho \alpha)_{ij}$ = block by factor A interaction (the whole plot error, random)
-   $\beta_k$ = main effect of factor B (fixed)
-   $(\alpha \beta)_{jk}$ = interaction between factors A and B (fixed)
-   $\epsilon_{ijk}$ = subplot error (random)

We can approach sub-plot analysis based on

-   the ANOVA perspective

    -   Whole plot comparisons

        -   Compare factor A to the whole plot error (i.e., $\alpha_j$ to $e_{ij}$)

        -   Compare the block to the whole plot error (i.e., $\rho_i$ to $e_{ij}$)

    -   Sub-plot comparisons:

        -   Compare factor B to the subplot error ($\beta$ to $\epsilon_{ijk}$)

        -   Compare the AB interaction to the subplot error ($(\alpha \beta)_{jk}$ to $\epsilon_{ijk}$)

-   the mixed model perspective

$$
\mathbf{Y = X \beta + Zb + \epsilon}
$$

### Application

#### Example 1

$$
y_{ijk} = \mu + i_i + v_j + (iv)_{ij} + f_k + \epsilon_{ijk}
$$

where

-   $y_{ijk}$ = observed yield
-   $\mu$ = overall average yield
-   $i_i$ = irrigation effect
-   $v_j$ = variety effect
-   $(iv)_{ij}$ = irrigation by variety interaction
-   $f_k$ = random field (block) effect
-   $\epsilon_{ijk}$ = residual
-   because variety-field combination is only observed once, we can't have the random interaction effects between variety and field

```{r}
library(ggplot2)
data(irrigation, package = "faraway")
summary(irrigation)
head(irrigation, 4)
ggplot(irrigation,
       aes(
         x     = field,
         y     = yield,
         shape = irrigation,
         color = variety
       )) +
  geom_point(size = 3)

```

```{r}
sp_model <-
    lmerTest::lmer(yield ~ irrigation * variety 
                   + (1 |field), irrigation)
summary(sp_model)

anova(sp_model, ddf = c("Kenward-Roger"))
```

Since p-value of the interaction term is insignificant, we consider fitting without it.

```{r}
library(lme4)
sp_model_additive <- lmer(yield ~ irrigation + variety 
                          + (1 | field), irrigation)
anova(sp_model_additive,sp_model,ddf = "Kenward-Roger")
```

Since $p$-value of $\chi^2$ test is insignificant, we can't reject the additive model is already sufficient. Looking at AIC and BIC, we can also see that we would prefer the additive model

**Random Effect Examination**

`exactRLRT` test

-   $H_0$: Var(random effect) (i.e., $\sigma^2$)= 0
-   $H_a$: Var(random effect) (i.e., $\sigma^2$) \> 0

```{r}
sp_model <- lme4::lmer(yield ~ irrigation * variety 
                       + (1 | field), irrigation)
library(RLRsim)
exactRLRT(sp_model)
```

Since the p-value is significant, we reject $H_0$

## Repeated Measures in Mixed Models

$$
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \delta_{i(k)}+ \epsilon_{ijk}
$$

where

-   $i$-th group (fixed)
-   $j$-th (repeated measure) time effect (fixed)
-   $k$-th subject
-   $\delta_{i(k)} \sim N(0,\sigma^2_\delta)$ (k-th subject in the $i$-th group) and $\epsilon_{ijk} \sim N(0,\sigma^2)$ (independent error) are random effects ($i = 1,..,n_A, j = 1,..,n_B, k = 1,...,n_i$)

hence, the variance-covariance matrix of the repeated observations on the k-th subject of the i-th group, $\mathbf{Y}_{ik} = (Y_{i1k},..,Y_{in_Bk})'$, will be

$$
\begin{aligned}
\mathbf{\Sigma}_{subject} &=
\left(
\begin{array}
{cccc}
\sigma^2_\delta + \sigma^2 & \sigma^2_\delta & ... & \sigma^2_\delta \\
\sigma^2_\delta & \sigma^2_\delta +\sigma^2 & ... & \sigma^2_\delta \\
. & . & . & . \\
\sigma^2_\delta & \sigma^2_\delta & ... & \sigma^2_\delta + \sigma^2 \\
\end{array}
\right) \\
&= (\sigma^2_\delta + \sigma^2)
\left(
\begin{array}
{cccc}
1 & \rho & ... & \rho \\
\rho & 1 & ... & \rho \\
. & . & . & . \\
\rho & \rho & ... & 1 \\
\end{array}
\right) 
& \text{product of a scalar and a correlation matrix}
\end{aligned}
$$

where $\rho = \frac{\sigma^2_\delta}{\sigma^2_\delta + \sigma^2}$, which is the compound symmetry structure that we discussed in [Random-Intercepts Model]

But if you only have repeated measurements on the subject over time, AR(1) structure might be more appropriate

Mixed model for a repeated measure

$$
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk}
$$

where

-   $\epsilon_{ijk}$ combines random error of both the whole and subplots.

In general,

$$
\mathbf{Y = X \beta + \epsilon}
$$

where

-   $\epsilon \sim N(0, \sigma^2 \mathbf{\Sigma})$ where $\mathbf{\Sigma}$ is block diagonal if the random error covariance is the same for each subject

The variance covariance matrix with AR(1) structure is

$$
\mathbf{\Sigma}_{subject} =
\sigma^2
\left(
\begin{array}
{ccccc}
1  & \rho & \rho^2 & ... & \rho^{n_B-1} \\
\rho & 1 & \rho & ... & \rho^{n_B-2} \\
. & . & . & . & . \\
\rho^{n_B-1} & \rho^{n_B-2} & \rho^{n_B-3} & ... & 1 \\
\end{array}
\right)
$$

Hence, the mixed model for a repeated measure can be written as

$$
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha \beta)_{ij} + \epsilon_{ijk}
$$

where

-   $\epsilon_{ijk}$ = random error of whole and subplots

Generally,

$$
\mathbf{Y = X \beta + \epsilon} 
$$

where $\epsilon \sim N(0, \mathbf{\sigma^2 \Sigma})$ and $\Sigma$ = block diagonal if the random error covariance is the same for each subject.

## Unbalanced or Unequally Spaced Data

Consider the model

$$
Y_{ikt} = \beta_0 + \beta_{0i} + \beta_{1}t + \beta_{1i}t + \beta_{2} t^2 + \beta_{2i} t^2 + \epsilon_{ikt}
$$

where

-   $i = 1,2$ (groups)
-   $k = 1,…, n_i$ ( individuals)
-   $t = (t_1,t_2,t_3,t_4)$ (times)
-   $\beta_{2i}$ = common quadratic term
-   $\beta_{1i}$ = common linear time trends
-   $\beta_{0i}$ = common intercepts

Then, we assume the variance-covariance matrix of the repeated measurements collected on a particular subject over time has the form

$$
\mathbf{\Sigma}_{ik} = \sigma^2
\left(
\begin{array}
{cccc}
1 & \rho^{t_2-t_1} & \rho^{t_3-t_1} & \rho^{t_4-t_1} \\
\rho^{t_2-t_1} & 1 & \rho^{t_3-t_2} & \rho^{t_4-t_2} \\
\rho^{t_3-t_1} & \rho^{t_3-t_2} & 1 & \rho^{t_4-t_3} \\
\rho^{t_4-t_1} & \rho^{t_4-t_2} & \rho^{t_4-t_3} & 1
\end{array}
\right)
$$

which is called "power" covariance model

We can consider $\beta_{2i} , \beta_{1i}, \beta_{0i}$ accordingly to see whether these terms are needed in the final model

## Application

R Packages for mixed models

-   `nlme`

    -   has nested structure

    -   flexible for complex design

    -   not user-friendly

-   `lme4`

    -   computationally efficient

    -   user-friendly

    -   can handle non-normal response

    -   for more detailed application, check [Fitting Linear Mixed-Effects Models Using lme4](https://arxiv.org/abs/1406.5823)

-   Others

    -   Bayesian setting: `MCMCglmm`, `brms`

    -   For genetics: `ASReml`

### Example 1 (Pulps)

Model:

$$
y_{ij} = \mu + \alpha_i + \epsilon_{ij}
$$

where

-   $i = 1,..,a$ groups for random effect $\alpha_i$
-   $j = 1,...,n$ individuals in each group
-   $\alpha_i \sim N(0, \sigma^2_\alpha)$ is random effects
-   $\epsilon_{ij} \sim N(0, \sigma^2_\epsilon)$ is random effects
-   Imply compound symmetry model where the intraclass correlation coefficient is: $\rho = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\epsilon}$
-   If factor $a$ does not explain much variation, low correlation within the levels: $\sigma^2_\alpha \to 0$ then $\rho \to 0$
-   If factor $a$ explain much variation, high correlation within the levels $\sigma^2_\alpha \to \infty$ hence, $\rho \to 1$

```{r}
data(pulp, package = "faraway")
plot(
    y    = pulp$bright,
    x    = pulp$operator,
    xlab = "Operator",
    ylab = "Brightness"
)
pulp %>% dplyr::group_by(operator) %>% 
    dplyr::summarise(average = mean(bright))

```

`lmer` application

```{r}
library(lme4)
mixed_model <-
    lmer(
        # pipe (i..e, | ) denotes random-effect terms
        formula = bright ~ 1 + (1 |operator), 
         data = pulp)
summary(mixed_model)
coef(mixed_model)
fixef(mixed_model)   # fixed effects
confint(mixed_model) # confidence interval
ranef(mixed_model)   # random effects
VarCorr(mixed_model) # random effects standard deviation
re_dat = as.data.frame(VarCorr(mixed_model))

# rho based on the above formula
rho = re_dat[1, 'vcov'] / (re_dat[1, 'vcov'] + re_dat[2, 'vcov'])
rho
```

To Satterthwaite approximation for the denominator df, we use `lmerTest`

```{r}
library(lmerTest)
summary(lmerTest::lmer(bright ~ 1 + (1 | operator), pulp))$coefficients
confint(mixed_model)[3, ]
```

In this example, we can see that the confidence interval computed by `confint` in `lmer` package is very close is `confint` in `lmerTest` model.

`MCMglmm` application

under the Bayesian framework

```{r}
library(MCMCglmm)
mixed_model_bayes <-
    MCMCglmm(
        bright ~ 1,
        random =  ~ operator,
        data = pulp,
        verbose = FALSE
    )
summary(mixed_model_bayes)$solutions
```

this method offers the confidence interval slightly more positive than `lmer` and `lmerTest`

#### Prediction

```{r}
# random effects prediction (BLUPs)
ranef(mixed_model)$operator

# prediction for each categories
fixef(mixed_model) + ranef(mixed_model)$operator 

# equivalent to the above method
predict(mixed_model, newdata = data.frame(operator = c('a', 'b', 'c', 'd'))) 

```

use `bootMer()` to get bootstrap-based confidence intervals for predictions.

Another example using GLMM in the context of blocking

Penicillin data

```{r}
data(penicillin, package = "faraway")
summary(penicillin)
library(ggplot2)
ggplot(penicillin, aes(
    y     = yield,
    x     = treat,
    shape = blend,
    color = blend
)) + 
    # treatment = fixed effect
    # blend = random effects
    geom_point(size = 3) +
    xlab("Treatment")

library(lmerTest) # for p-values
mixed_model <- lmerTest::lmer(yield ~ treat + (1 | blend),
                              data = penicillin)
summary(mixed_model)

#The BLUPs for the each blend
ranef(mixed_model)$blend
```

Examine treatment effect

```{r}
anova(mixed_model) # p-value based on lmerTest
```

Since the p-value is greater than 0.05, we can't reject the null hypothesis that there is no treatment effect.

```{r}
library(pbkrtest)
# REML is not appropriate for testing fixed effects, it should be ML
full_model <-
    lmer(yield ~ treat + (1 | blend), 
         penicillin, 
         REML = FALSE) 
null_model <- lmer(yield ~ 1 + (1 | blend), 
                   penicillin, 
                   REML = FALSE)

# use  Kenward-Roger approximation for df
KRmodcomp(full_model, null_model) 
```

Since the p-value is greater than 0.05, and consistent with our previous observation, we conclude that we can't reject the null hypothesis that there is no treatment effect.

### Example 2 (Rats)

```{r}
rats <- read.csv(
    "images/rats.dat",
    header = F,
    sep = ' ',
    col.names = c('Treatment', 'rat', 'age', 'y')
)

# log transformed age
rats$t <- log(1 + (rats$age - 45) / 10) 
```

We are interested in whether treatment effect induces changes over time.

```{r}
rat_model <-
    # treatment = fixed effect, rat = random effects
    lmerTest::lmer(y ~ t:Treatment + (1 | rat), 
                   data = rats) 
summary(rat_model)
anova(rat_model)
```

Since the p-value is significant, we can be confident concluding that there is a treatment effect

### Example 3 (Agridat)

```{r}
library(agridat)
library(latticeExtra)
dat <- harris.wateruse
# Compare to Schabenberger & Pierce, fig 7.23
useOuterStrips(
    xyplot(
        water ~ day | species * age,
        dat,
        as.table = TRUE,
        group = tree,
        type = c('p', 'smooth'),
        main = "harris.wateruse 2 species, 2 ages (10 trees each)"
    )
)
```

Remove outliers

```{r}
dat <- subset(dat, day!=268)
```

Plot between age and species

```{r}
xyplot(
    water ~ day | tree,
    dat,
    subset   = age == "A2" & species == "S2",
    as.table = TRUE,
    type     = c('p', 'smooth'),
    ylab     = "Water use profiles of individual trees",
    main     = "harris.wateruse (Age 2, Species 2)"
)
```

```{r}
# Rescale day for nicer output, 
# and convergence issues, add quadratic term
dat <- transform(dat, ti = day / 100)
dat <- transform(dat, ti2 = ti * ti)
# Start with a subgroup: age 2, species 2
d22 <- droplevels(subset(dat, age == "A2" & species == "S2"))
```

`lme` function from `nlme` package

```{r}
library(nlme)

## We use pdDiag() to get uncorrelated random effects
m1n <- lme(
    water ~ 1 + ti + ti2,
    #intercept, time and time-squared = fixed effects
    data = d22,
    na.action = na.omit,
    random = list(tree = pdDiag(~ 1 + ti + ti2)) 
    # random intercept, time 
    # and time squared per tree = random effects
)
ranef(m1n)
```

```{r}
fixef(m1n)
summary(m1n)

```

`lmer` function from `lme4` package

```{r}
m1lmer <-
    lmer(water ~ 1 + ti + ti2 + (ti + ti2 ||
                                     tree),
         data = d22,
         na.action = na.omit)
ranef(m1lmer)

```

Notes:

-   `||` double pipes= uncorrelated random effects

-   To remove the intercept term:

    -   `(0+ti|tree)`

    -   `(ti-1|tree)`

```{r}
fixef(m1lmer)
m1l <-
    lmer(water ~ 1 + ti + ti2 
         + (1 | tree) + (0 + ti | tree) 
         + (0 + ti2 | tree), data = d22)
ranef(m1l)
fixef(m1l)
```

To include structured covariance terms, we can use the following way

```{r}
m2n <- lme(
    water ~ 1 + ti + ti2,
    data = d22,
    random = ~ 1 | tree,
    cor = corExp(form =  ~ day | tree),
    na.action = na.omit
)
ranef(m2n)
fixef(m2n)
summary(m2n)
```