---
title: "GWAS background"
author: "Lieven Clement"
date: "statOmics, Ghent University (https://statomics.github.io)"
output:
  bookdown::html_document2:
    code_download: true
    df_print: paged
    theme: flatly
    highlight: tango
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
---
# Background
## DNA
- 6 billion base pairs: 3 billion from the father and 3 billion from the mother
- Organised in $2 \times 23$ chromosomes of length 50 - 250 million bp
## Transcription / translation
- DNA is transcribed into messenger RNA (mRNA), which is translated into protein
## Variation in DNA
- About 90% of the variation in DNA consists of SNPs (single nucleotide polymorphisms): a different base at a single position in the DNA
- Humans: $\pm$ 5 million SNPs
- Most of them are neutral: high redundancy in the genetic code
- Sometimes they are not neutral and affect a trait
- Genomic recombination of parental chromosomes when producing germ cells.
- Linkage disequilibrium: SNPs that lie close together on a chromosome are rarely separated by recombination, so they tend to be inherited together!
## GWAS at the University of Bergen
- GWAS: genome-wide association study
- Studies in large cohorts
- Use SNPs to identify genes associated with a particular trait, e.g. birth weight, placenta weight, BMI, ...
# Linear models for GWAS
In GWAS one often corrects for population stratification using the following linear model.
$$
\tag{1}
\mathbf{y} = \mathbf{x}_\text{test}\beta_\text{test} + \mathbf{X}_c\boldsymbol{\beta}_c + \mathbf{X}_\text{PCA} \boldsymbol{\beta}_\text{PCA} +\boldsymbol{\epsilon}
$$
with
- $\mathbf{y}$ an $N\times1$ vector of the phenotype
- $\mathbf{x}_\text{test}$ an $N\times1$ vector with the genotype for the candidate SNP
- $\beta_\text{test}$ the association of candidate SNP and the phenotype
- $\mathbf{X}_c$ an $N\times C$ matrix with the covariate pattern for $C$ known covariates (vector of ones (intercept), age, gender, batch, ...)
- $\boldsymbol{\beta}_c$ the $C\times 1$ vector of parameters modeling the association between the $C$ covariates and the phenotype
- $\mathbf{X}_\text{PCA}$ an $N\times p$ matrix with the scores on the $p$ PCs used to correct for population stratification
- $\boldsymbol{\beta}_\text{PCA}$ the $p\times 1$ vector of parameters for the $p$ PCs
- $\boldsymbol{\epsilon}$ an $N\times 1$ vector with residuals that are assumed to be i.i.d. $\epsilon_i \sim N(0,\sigma_\epsilon^2)$ with $i = 1, \ldots, N$
Let $\mathbf{Z}$ be the $N\times M$ matrix with the normalised genotypes of all $M$ SNPs, from which the genetic relationship matrix (GRM) is built.
Using the singular value decomposition (SVD) we can decompose $\mathbf{Z}$ as
$$
\mathbf{Z} = \mathbf{U}\boldsymbol{\Delta}\mathbf{V}^T
$$
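Note that the columns of $\mathbf{V}$ are the loadings of the PCs of a PCA on $\mathbf{Z}$, and that the columns of $\mathbf{U}\boldsymbol{\Delta}$ are the corresponding scores.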
So we can approximate $\mathbf{Z}$ using a truncated PCA, e.g. by using the first $p$ PCs.
$$
\mathbf{Z}_p = \mathbf{U}_{p} \boldsymbol{\Delta}_p\mathbf{V}^T_p
$$
with
$$
\mathbf{X}_\text{PCA} = \mathbf{U}_{p} \boldsymbol{\Delta}_p
$$
the scores on the first p PCs that can be used to correct for population stratification.
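
The construction of $\mathbf{X}_\text{PCA}$ can be illustrated with a minimal sketch on simulated data. The dimensions `N`, `M` and `p`, the simulated allele frequencies and the phenotype without signal are all assumptions chosen for illustration. They are not taken from a real cohort.

```r
set.seed(1)
N <- 200   # number of individuals (hypothetical)
M <- 500   # number of SNPs (hypothetical)
p <- 5     # number of PCs that are kept (hypothetical)

# Simulated genotypes: allele counts 0/1/2 with a SNP-specific allele frequency
maf <- runif(M, 0.1, 0.5)
G <- sapply(maf, function(f) rbinom(N, size = 2, prob = f))

# Normalise each SNP (column) to mean 0 and variance 1
Z <- scale(G)

# SVD: Z = U Delta V^T
s <- svd(Z)

# Scores on the first p PCs: X_PCA = U_p Delta_p
X_pca <- s$u[, 1:p] %*% diag(s$d[1:p])

# The same scores are obtained by projecting Z on the first p loadings
X_pca_proj <- Z %*% s$v[, 1:p]

# Linear model for one candidate SNP, with intercept and PCs as covariates
y <- rnorm(N)        # placeholder phenotype without genetic signal
x_test <- G[, 1]     # genotype of the candidate SNP
fit <- lm(y ~ x_test + X_pca)
coef(summary(fit))["x_test", ]
```

In a real analysis the model would be refitted for every candidate SNP, while $\mathbf{X}_\text{PCA}$ is computed only once.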
# Linear mixed model for GWAS
## Specification
$$
\tag{2}
\mathbf{Y} = \mathbf{x}_\text{test}\beta_\text{test} + \mathbf{X}_c\boldsymbol{\beta}_c + \mathbf{Z}_\text{GRM}\mathbf{u} +\boldsymbol{\epsilon}
$$
with
- $\mathbf{Y}$ an $N\times1$ vector of the phenotype
- $\mathbf{x}_\text{test}$ an $N\times1$ vector with the genotype for the candidate SNP
- $\beta_\text{test}$ the association of candidate SNP and the phenotype
- $\mathbf{X}_c$ an $N\times C$ matrix with the covariate pattern for $C$ known covariates (vector of ones (intercept), age, gender, batch, ...)
- $\boldsymbol{\beta}_c$ the $C\times 1$ vector of parameters modeling the association between the $C$ covariates and the phenotype
- $\mathbf{Z}_\text{GRM}$ the $N\times M$ matrix with the normalised genotypes of all $M$ SNPs that is used to build the genetic relationship matrix (GRM)
- $\mathbf{u}$ an $M\times 1$ vector with i.i.d. random effect for each SNP $\mathbf{u}\sim \text{MVN}(0,\mathbf{I}\sigma_u^2)$
- $\boldsymbol{\epsilon}$ an $N\times 1$ vector with environmental residuals that are assumed to be independent of $\mathbf{u}$ and i.i.d. $\boldsymbol{\epsilon}\sim \text{MVN}(0,\mathbf{I}\sigma_\epsilon^2)$
Random effects are used to model the correlation structure in the data. They imply a certain covariance structure of $\mathbf{Y}$.
## Covariance structure
Covariance structure of $\mathbf{Y}$ implied by the GWAS mixed model:
$$
\begin{array}{ccl}
\text{var}\left[\mathbf{Y}\right] &=& \text{var}\left[\mathbf{x}_\text{test}\beta_\text{test} + \mathbf{X}_c\boldsymbol{\beta}_c + \mathbf{Z}_\text{GRM}\mathbf{u} +\boldsymbol{\epsilon}\right]\\\\
&\updownarrow& \text{fixed effects are constants, } \mathbf{u} \perp \boldsymbol{\epsilon}\\\\
&=& \text{var}[\mathbf{Z}_\text{GRM}\mathbf{u}] + \text{var}[\boldsymbol{\epsilon}]\\\\
&=&\mathbf{Z}_\text{GRM}\text{var}[\mathbf{u}]\mathbf{Z}_\text{GRM}^T + \text{var}[\boldsymbol{\epsilon}]\\\\
&=&\mathbf{Z}_\text{GRM}\mathbf{I}\sigma^2_u\mathbf{Z}_\text{GRM}^T + \mathbf{I} \sigma^2_\epsilon \\\\
&=&\mathbf{Z}_\text{GRM}\mathbf{Z}_\text{GRM}^T \sigma^2_u+ \mathbf{I} \sigma^2_\epsilon
\end{array}
$$
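
The implied covariance can be checked numerically with a small Monte Carlo sketch. All values below (`N`, `M`, `B`, the variance components and the Gaussian stand-in for the normalised genotypes) are assumptions chosen for illustration. The empirical covariance of many simulated draws of $\mathbf{Z}_\text{GRM}\mathbf{u} + \boldsymbol{\epsilon}$ should be close to $\mathbf{Z}_\text{GRM}\mathbf{Z}_\text{GRM}^T \sigma^2_u + \mathbf{I}\sigma^2_\epsilon$.

```r
set.seed(2)
N <- 4            # individuals (hypothetical, kept tiny so the matrices can be printed)
M <- 6            # SNPs (hypothetical)
B <- 1e5          # number of Monte Carlo replicates
sigma2_u   <- 0.5 # variance of the SNP random effects (hypothetical)
sigma2_eps <- 1   # residual variance (hypothetical)

# Stand-in for the N x M matrix with normalised genotypes
Z <- matrix(rnorm(N * M), N, M)

# Covariance implied by the mixed model
V_theory <- sigma2_u * tcrossprod(Z) + sigma2_eps * diag(N)

# Simulate the random part Zu + eps; the fixed effects add no variance
U <- matrix(rnorm(M * B, sd = sqrt(sigma2_u)), M, B)
E <- matrix(rnorm(N * B, sd = sqrt(sigma2_eps)), N, B)
Y_rand <- Z %*% U + E          # N x B, one replicate per column

# Empirical covariance over the replicates
V_emp <- cov(t(Y_rand))

# Largest absolute deviation, relative to the largest entry of V_theory
rel_err <- max(abs(V_emp - V_theory)) / max(abs(V_theory))
rel_err
```

The off-diagonal elements of `V_theory` are non-zero: individuals with similar genotypes have correlated phenotypes under this model.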
Note that the model is often also written in another way:
$$
\tag{3}
\mathbf{Y} = \mathbf{x}_\text{test}\beta_\text{test} + \mathbf{X}_c\boldsymbol{\beta}_c + \mathbf{g} +\boldsymbol{\epsilon}
$$
with
- $\mathbf{g} = \mathbf{Z}_\text{GRM}\mathbf{u}$ the $N\times 1$ vector of polygenic effects, $\mathbf{g} \sim \text{MVN}(\mathbf{0},\mathbf{K}\sigma^2_g)$
- $\mathbf{K}$ the $N \times N$ empirical kinship matrix
$$
\mathbf{K} = \frac{\mathbf{Z}_\text{GRM}\mathbf{Z}^T_\text{GRM}}{M}
$$
- $\sigma_g^2$ the polygenic variance $\sigma_g^2=M\sigma_u^2$
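
That both formulations imply the same covariance can be verified in a few steps. The sketch below takes $\mathbf{g} = \mathbf{Z}_\text{GRM}\mathbf{u}$ and only uses the definitions of $\mathbf{K}$ and $\sigma^2_g$ given above:

$$
\begin{array}{ccl}
\text{var}\left[\mathbf{g}\right] &=& \text{var}\left[\mathbf{Z}_\text{GRM}\mathbf{u}\right]\\\\
&=& \mathbf{Z}_\text{GRM}\mathbf{Z}_\text{GRM}^T \sigma^2_u\\\\
&=& \frac{\mathbf{Z}_\text{GRM}\mathbf{Z}_\text{GRM}^T}{M} M\sigma^2_u\\\\
&=& \mathbf{K}\sigma^2_g
\end{array}
$$

Hence $\text{var}\left[\mathbf{Y}\right] = \mathbf{K}\sigma^2_g + \mathbf{I}\sigma^2_\epsilon$ in both formulations.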
## Main advantages of LMM method
1. Better control of false positive associations by correcting for population or relatedness structure
2. An increase in power:
    - through the correction for this structure
    - by conditioning on associated loci other than the candidate locus
## Pitfalls of LMM
1. Computational complexity:
    - $M > 500\,000$ SNPs, $N > 70\,000$ individuals
    - Building the GRM (an $N \times N$ matrix computed from $M$ SNPs)
    - Estimating the mean and variance components for each of the $M$ candidate SNPs!
    - Computing association statistics for each variant (for each SNP!)
2. Loss in power when the candidate marker is included in the GRM
3. Using a small subset of markers in the GRM can compromise correction for stratification
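
To get a feeling for the computational burden, the back-of-the-envelope sketch below uses the lower bounds for $N$ and $M$ mentioned above. It assumes dense storage in double precision (8 bytes per value), which is an assumption made for illustration. Dedicated GWAS software typically stores genotypes far more compactly.

```r
N <- 70000     # individuals (lower bound mentioned above)
M <- 500000    # SNPs (lower bound mentioned above)
bytes_per_double <- 8

# Memory for a dense N x N kinship matrix, in GB
grm_gb <- N^2 * bytes_per_double / 1e9

# Memory for a dense N x M genotype matrix stored as doubles, in GB
geno_gb <- N * M * bytes_per_double / 1e9

# Approximate number of multiply-add operations to compute Z Z^T
grm_ops <- N^2 * M

c(grm_gb = grm_gb, geno_gb = geno_gb, grm_ops = grm_ops)
```

Under these assumptions the kinship matrix alone takes about 39 GB and the genotype matrix about 280 GB, before any model has been fitted.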