-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathUnderstanding_heritability_0.9.1.Rmd
1388 lines (1044 loc) · 64.8 KB
/
Understanding_heritability_0.9.1.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Risk in relatives, heritability, SNP-based heritability and genetic correlations in psychiatric disorders: a review"
authors: "Bart Baselmans, Loic Yengo, Wouter van Rheenen & Naomi Wray"
# output: pdf_document
# date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
#date: "`r format(Sys.time(), '%d %B, %Y')`"
toc: true
toc_depth: 2
toc_float: true
number_sections: true
smooth_scroll : true
code_folding: "hide"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This supplementary note provides more details on the content discussed in the paper "*Risk in relatives, heritability, SNP-based heritability and genetic correlations in psychiatric disorders: a review*", by Baselsman, Yengo, van Rheenen & Wray published in *Biological Psychiatry*. If you use this code, please cite that paper. In the .html version of this document click on "Code"" to expand to see the R code. “There is an accompanying ShinyApp, [CHARRGE (Calculating Heritabilities and Relative Risks and GEnetic correlations)](https://shiny.cnsgenomics.com/CHARRGe/). The Rmarkdown version of this file as well as the code for the ShinyApp can be found at [github](https://github.com/BartBaselmans/CHARRGe)
links: \
**ShinyApp**: https://shiny.cnsgenomics.com/CHARRGe/ \
**Github**: https://github.com/BartBaselmans/CHARRGe)
```{r echo=FALSE, message=FALSE, warning=FALSE}
# You need these libraries for html
library(rmarkdown) # install.packages("rmarkdown")
library(data.table)
library(knitr)
#To knit to PDF you need to install: MaxTeX (http://www.tug.org/mactex/) or TeX (https://www.tug.org/texlive/)
```
# Heritability
## Parameters for psychiatric disorders
In Figure 1 of the paper we show estimates of the population lifetime risk ($K$) and the risk in first degree relatives of those with one affected parent ($K_{1}$), and the risk ratio ($RR_{1} = \lambda_{1}=K_{1}/K$) for common psychiatric disorders. Although $K$ and the risk in relatives of different types ($K_R$) are measurable, often only the heritability estimated from one or more types of relatives are reported. Heritability is defined as the proportion of variance in liability attributed to additive genetic factors. Here, we present the $K$ and heritabilities for the major psychiatric disorders (see Supplementary Table 1 for references). The estimates using twin data are usually higher than those estimated from family data (different types of relatives) although the latter are often based on larger sample sizes. Here, we use "round" $h^2$ numbers (approximate average of $h^2_{twin}$ and $h^2_{family}$) in subsequent calculations. We also provide SNP-based heritability ($h^2_{SNP}$) estimates (see section 3), where SNP-based heritability, by definition is smaller than the estimates based on family records, as it tracks only the proportion of variance attributable to additive genetic values tagged by common SNPs (or other measurable common DNA variants). In contrast, total heritability tracks the contribution to variance of genetic variants across the allelic frequency spectrum.
Phenotype | $h^2_{twin}$ (%) | $h^2_{family}$ (%) |$h^2_{round}$ (%) | $h^2_{SNP(sBayesS)}$ (%) |$h^2_{SNP(ldsc)}$ (%) |$K$ |
|:----------|:----------|:----------|:----------|:----------|:----------|:----------|
**Schizophrenia** | 81 | 64 |70 |30|26 |0.01 |
**Bipolar Disorder** | 75 | 59 | 65 |22 |20 | 0.01 |
**Major Depressive Disorder** | 37 | 32 |35| 9 |10 |0.15
**ADHD** | 75 | - |75| 24| 21|0.05 |
**Anorexia Nervosa** |56 | 43|50| 15 |14 |0.01
**Autism Spectrum Syndrome** | 80 | 85 |80| 11 |12 |0.01
$h^2_{round}$ is an approximate average of $h^2_{family}$ and $h^2_{twin}$.
See Supplementary Table 1 for the references from which these estimates were derived.
## Liability Threshold model
When many factors contribute to the risk of disease it can be helpful to consider a model of disease where there is a latent distribution of liability to disease representing the dichotomous Case/Control status.(e.g. [Falconer, 1965](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1965.tb00500.x) or [Reich 1972](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1972.tb00767.x)). Since this latent liability comprises many genetic and other risks it is reasonable to assume that the liability distribution is approximately normally distributed (since many things added together will make a bell-shaped distribution in a population sample) and that those affected by disease have the combination of genetic and other factors that it places them in the top end of the liability distribution.
```{r, liability_model, fig.height=4, fig.width=10, fig.align="center", echo=T}
h2x = 0.7
h2y = 0.35
Kx = 0.01
Ky = 0.15
disorder1 = "Schizophrenia"
disorder2 = "Major Depressive Disorder"
plot_liability_model <- function(h2x,h2y,Kx,Ky, disorder1, disorder2){
Tx = -qnorm(Kx, 0,1)
Ty = -qnorm(Ky, 0,1)
Ks <- c(Kx,Ky) #Lifetime risk
Heritability <- c(h2x,h2y)
Threshold <-c(Tx,Ty)
disorder <-c(disorder1,disorder2)
layout(matrix(c(1,2),1, 2, byrow = TRUE))
for(j in 1:2){
h2 <- c(as.expression(bquote(italic(h)^2~'='~.(Heritability[j]))),
as.expression(bquote(italic(h)^2~'='~.(Heritability[j]))))
K_pop <- c(as.expression(bquote(italic(K)~'='~.(Ks[j]))),
as.expression(bquote(italic(K)~'='~.(Ks[j]))))
T0 = Threshold[j]
z = dnorm(T0)
i = z / Ks[j] # mean phenotypic liability of those with disease
mean = -(T0) ;sd=1
lb=0; ub=4
x <- seq(T0-6,T0+3,length=1000)*sd + mean
hx <- dnorm(x,mean,sd)
plot(x, hx, type="n", xlab="", ylab="",
main="", axes=FALSE)
i <- x >= lb & x <= ub
l <- x <= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="darkblue")
polygon(c(-5,x[l],lb), c(0,hx[l],0), col="lightgrey")
#axis(1, at=seq(-5, 2, 1), pos=0, cex.axis =1)
abline(h=0)
abline(v=0,h=-2)
mtext(K_pop[j], side = 3, line = -2.1,at = 2,font = 2, cex = 0.8, adj=1)
mtext(disorder[j], side = 1, line = 1.5,at = 0,font = 2, cex = 0.8, adj=1)
}
}
plot_liability_model(h2x,h2y,Kx,Ky, disorder1, disorder2)
```
Hence, those whose liability to disease is above the threshold are affected, and so this model is sometimes called the liability threshold model. The threshold can be calculated using the lifetime risk ($K$) parameter as $K$ represents the proportion of individuals in the general population have the disorder of interest. The threshold (or boundary) that determines the area of $K$ (indicated in blue) in a normal distribution is given by:
$$
T=\Phi^{-1}(1-K)
$$
Conversely, the life-time risk ($K$) is derived from the threshold parameter
$$
K = 1-\Phi(T)
$$
where $\Phi^{-1}(x)$ is the inverse of the cummulative standard normal distribution function.
An example with R code using lifetime risk $K = 0.15$:
```{r}
K = 0.15
Threshold_prevalence <- function(K){
Tx = -qnorm(K,0,1)
Kx = 1-pnorm(Tx,0,1)
cat("Output Parameters:\n")
cat("----------------------------\n")
cat("Tx: ", round(Tx,2),"\n")
cat("Kx: ", round(Kx,2),"\n")
cat("----------------------------\n")
}
Threshold_prevalence(K)
```
## Risk in relatives and heritability of liability
The lifetime risk of a disease in relatives ($K_{R}$) of those affected by the disease is expected to be greater, or equal to, the lifetime risk of the disease in a population sample. It is logical to assume that the threshold in liability associated with disease has the same value in the relatives of those affected. As a result, the liability distribution in first degree members must be shifted (in the direction of increased liability) compared to the general population to an extent consistent with the observed higher risk of $K_{R}$.\
We can then ask: what proportion of the variation in liability must be attributable to genetic factors (i.e., what is the heritability?) in order to generate this observed increased risk in relatives, given the known coefficient of relationship between the relatives, $a_R$, i.e., 0.5 for 1st degree relatives. First, we introduce some properties of the normal distribution.
```{r, liability_modelx, fig.height=4, fig.width=5, fig.align="center", echo=T}
h2x = 0.7
Kx = 0.15
plot_liability_model <- function(h2x,Kx){
Tx = -qnorm(Kx, 0,1)
Ks <- c(Kx) #Lifetime risk
Heritability <- c(h2x)
Threshold <-c(Tx)
disorder <-c(disorder1)
#layout(matrix(c(1,2),1, 2, byrow = TRUE))
h2 <- c(as.expression(bquote(italic(h)^2~'='~.(Heritability))))
K_pop <- c(as.expression(bquote(italic(K)~'='~.(Ks))))
T0 = Threshold
z = dnorm(T0)
i = z / Ks # mean phenotypic liability of those with disease
mean = -(T0) ;sd=1
lb=0; ub=4
x <- seq(T0-6,T0+3,length=1000)*sd + mean
hx <- dnorm(x,mean,sd)
plot(x, hx, type="n", xlab="", ylab="",
main="", axes=FALSE)
i <- x >= lb & x <= ub
l <- x <= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="darkblue")
polygon(c(-5,x[l],lb), c(0,hx[l],0), col="lightgrey")
mtext(bquote(italic("K")), side = 1, line = -2.5,at = 0.6,font = 2, cex = 1.2, adj=1, col = "white")
mtext(bquote(italic("T")), side = 1, line = -1.6,at = -0.12,font = 2, cex = 1.2, adj=1, col = "black")
mtext(bquote(italic("z")), side = 1, line = -7.3,at = 0.3,font = 2, cex = 1.2, adj=1)
mtext(bquote(italic("i")), side = 1, line = 0.45,at = 1,font = 2, cex = 1, adj=1)
xtick <- seq(1,1, by =1)
axis(side=1, at = xtick, labels =F,pos = 0)
}
plot_liability_model(h2x,Kx)
```
We define $z$ as the height of the standard normal curve at threshold $T$, that corresponds to the proportion $K$, and $i\sigma_p$ is the mean phenotypic liability of those with disease. Since the phenotypic variance of liability ($\sigma^2_{p}$ ) is by definition 1, the mean phenotypic liability is simply $i$.From mathematical propertis of the normal distribution $i$ = $z/K$.
The mean liability of relatives of those ascertained to have disease, accounts for the coefficient of relationship between the relatives and the proportion of variance that is genetic: $a_Rih^2$. Hence, for a trait with no genetic contribution ($h^2$=0), the mean liability in relatives is the same as in the general population. Overall, the shift in mean liability between population and 1st degree relatives is higher for diseases with higher heritability and for diseases that are less common.
The difference in mean liabilities between population ($M_{pop}$ = 0) and relatives ($M_{R}$), can be shown to be equivalent to the difference in thresholds when the liability distribution of relatives is scaled back to a N(0,1) distribution. i.e.,$M_{R}$ - $M_{pop}$ = $a_Rih^2$ = $T_R - T$, where $T_R=\Phi^{-1}(1-K_R)$ [Falconer (1965)](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1965.tb00500.x). Hence,
$h^2 = \frac{T - T_R}{a_Ri}$, i.e. everything on the right hand side of the equation is derived from measurable statistics of $K$, $K_R$ and $a_R$.
However, Falconer's derivation assumed that the genetic (and hence liability) variance amongst the relatives was not changed by ascertainment on disease in the probands. [Reich et al.,(1972)](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1972.tb00767.x) showed that while it is OK to assume that the distribution of disease liability in relatives of those affected is approximately normal, it should be recognised that the variance in liability is slightly reduced in the relatives as a result of ascertaining the proband as been affected. They showed that the variance in liability is reduced by a factor of $a_{R}^2h^4i(i-T)$
so that the scaling of the $T_R$ to a standard N(0,1) distribution has a denominator of the standard deviation of liability in the relatives given that the probands have disease:
$$
T_{R} = \frac{T -a_Rih^2}{\sqrt{1-a_{R}^2h^4i(i-T)}}
$$
Since heritability is in the numerator and denominator, making $h^2$ the subject of the equation requires solving of a quadratic equation to give:
$$
h^2 = \frac{T - T_R \sqrt{1 -(1-\frac{T}{i})(T^2-T^2_R)}}{a_R(i+(i-T)T^2_R)}
$$
Although, this equation looks complicated, in fact everything on the right hand side again can be calculated from two observations $K$ and $K_R$, and $a_R$ (and holds as long as $a_R$ < 1), and if ascertainment is based on the disease status of one proband. A special case is when ascertainment is based on both parents being affected, hence the $K$ and $K_{2PAR}$ is estimated, with risk ratio $RR_{2PAR} = \frac{K_{2PAR}}{K}$
In this case it can be shown in [Wray & Gottesman (2012)](https://www.frontiersin.org/articles/10.3389/fgene.2012.00118/full) that
$$
T_{2PAR} = \frac{T-ih^2}{\sqrt{1-0.5h^4i(i-T)}}
$$
and solving this quadratic equation gives:
$$
h^2 = 2T-\sqrt{2} T_{2PAR} \frac{\sqrt{2 -(T^2 - T_{2PAR})(1-\frac{T}{i})}}{2i+(i-T)T_{2PAR}}
$$
\
\
To visualize the increase in liability for relatives having one or two affected parents consider examples of schizophrenia with $K=0.01$, $h^2=0.7$ or major depressive disorder with $K=0.15$, $h^2=0.35$ a sample of 100 individuals drawn from the general population, and a sample of 100 children who have one affected parent, or a sample of 100 children who have two affected parent(s).
```{r, liability_model2, fig.height=7, fig.width=8, fig.align="center", echo=T}
h2x = c(0.7,0.35)
Kx = c(0.01,0.15)
a = 0.5
disorder = c("Schizophrenia","Major Depressive Disorder")
#################################################################################################
liability <- c()
liability_function <- function(h2x, Kx, a, disorder){
Tx = -qnorm(Kx, 0,1)
z = dnorm(Tx)
i = z/Kx
#One affected parent
Tx1 = (Tx - a * i * h2x) / (sqrt(1 - a * a * h2x * h2x * i * (i-Tx)))
Kx1 = 1 - pnorm(Tx1)
RRx1 = Kx1 / Kx
#Two affected parents
Tx2 = (Tx -i*h2x) / sqrt(1-(0.5*h2x*h2x*i)*(i-Tx))
Kx2 = 1 - pnorm(Tx2)
RRx2 = Kx2 / Kx
x <- as.data.frame(c(disorder,Tx,Tx1, Tx2, Kx, Kx1, Kx2, NA,RRx1, RRx2 ))
return(x)
}
#Run function for number of included disorders
for(i in 1:length(disorder)){
liability[i] <- liability_function(h2x[i], Kx[i], a, disorder[i])
}
#convert list to dataframe
df <- data.frame(matrix(unlist(liability), nrow=length(liability), byrow=T), stringsAsFactors = F)
#Threshold extraction
Threshold <- df[2:4]
Threshold_row <-cbind(Threshold[1,], Threshold[2,])
Threshold_row <- as.numeric(Threshold_row)
# Prevalence extraction
K_prevalence <- df[5:7]
K_prevalence_row <- as.numeric(cbind(K_prevalence[1,], K_prevalence[2,]))
K100_prevalence_row <- round((K_prevalence_row*100),0)
#Make plot
layout(matrix(c(1,2,3,7,8,9,4,5,6,10,11,12), 3, 4, byrow = F))
par(mar=c(5.1, 4.1, 4.1, 2.2))
for(j in 1:6){
#Normal distribution Threshold
Ks <- as.matrix(sapply((K100_prevalence_row/100), as.numeric))
K_print <- c(as.expression(bquote(italic(K)[SCZ]~'='~.(Ks[j]))),
as.expression(bquote(italic(K)[1][PAR][-SCZ]~'='~.(Ks[j]))),
as.expression(bquote(italic(K)[2][PAR][-SCZ]~'='~.(Ks[j]))),
as.expression(bquote(italic(K)[MDD]~'='~.(Ks[j]))),
as.expression(bquote(italic(K)[1][PAR][-MDD]~'='~.(Ks[j]))),
as.expression(bquote(italic(K)[2][PAR][-MDD]~'='~.(Ks[j]))))
heritability1 <- c(paste0("= ",h2x[1]))
Phenotype1<- c("","",as.expression((bquote(bold(bold(.(disorder[1])~'('*bold(italic(h)^2)~.(heritability1)*')')))),"",""))
heritability2 <- c(paste0("= ",h2x[2]))
Phenotype2 <- c("","","","","",as.expression((bquote(bold(bold(.(disorder[2])~'('*bold(italic(h)^2)~.(heritability2)*')')))),"",""))
Txp = Threshold_row[j]
z = dnorm(Txp)
i = z / Ks[j]
mean = -(Txp) ;sd=1
lb=0; ub=4
x <- seq(Txp-6,Txp+3,length=1000)*sd + mean
hx <- dnorm(x,mean,sd)
plot(x, hx, type="n", xlab="", ylab="",
main="", axes=FALSE)
i <- x >= lb & x <= ub
l <- x <= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="darkblue")
polygon(c(-5,x[l],lb), c(0,hx[l],0), col="grey")
axis(1, at=seq(-5, 2, 1), pos=0, cex.axis =1)
abline(h=0)
abline(v=0,h=-2)
mtext(K_print[j], side = 3, line = -1.1,at = 10,font = 2, cex = 0.8, adj=1)
mtext(Phenotype1[j], side = 3, line = -12,at = -2.5,font = 2, cex = 0.7, adj = 0)
mtext(Phenotype2[j], side = 3, line = -12,at = -2.1,font = 2, cex = 0.7, adj = 0)
}
############################
##########################
#Square indicating number of affected individuals
result <- matrix(nrow = 6, ncol = 100)
#layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow = F))
for(j in 1:6){
result[j,] <- sample(as.matrix(c(rep("b",K100_prevalence_row[j]),rep("a",100-K100_prevalence_row[j]))))
#sum(result[1,]=="a")
# input parameters - nr * nc should equal length(x)
cols <- c("grey", "darkblue")
nr <- 10
nc <- 10
# create data.frame of positions and colors
m <- matrix(cols[factor(result[j,])], nr, nc)
DF <- data.frame(row = c(row(m)), col = c(col(m)[, nc:1]), value = c(m), gender = sample(c(rep("male",50),rep("female",50))),
stringsAsFactors = FALSE)
title_plot <- c("Sample drawn from \n the population", "Sample drawn from \n those with one affected parent", "Sample drawn from \n those with two affected parents","Sample drawn from \n the population", "Sample drawn from \n those with one affected parent", "Sample drawn from \n those with two affected parents")
plot(col ~ row, DF, col = DF$value, pch = 15, cex = 1, asp = 1,
xlim = c(0, nr), ylim = c(0, nc),
axes = FALSE, xlab = "", ylab = "")
title(xlab=title_plot[j], line=0.2, cex.lab=1)
}
```
For the major psychiatric disorders the increased risk ratio are:
```{r, results= 'asis', echo = T}
h2x = c(0.70,0.65,0.35,0.75,0.5,0.8)
Kx = c(0.01,0.01,0.15,0.05,0.01,0.01)
a = 0.5
disorder = c("Schizophrenia","Bipolar Disorder", "Major Depressive Disorder", "ADHD", "Anorexia Nervosa", "Autism Spectrum Disorder")
liability <- c()
liability_function <- function(h2x, Kx, a, disorder){
Tx = -qnorm(Kx, 0,1)
z = dnorm(Tx)
i = z/Kx
#One affected parent
Tx1 = (Tx - a * i * h2x) / (sqrt(1 - a * a * h2x * h2x * i * (i-Tx)))
Kx1 = 1 - pnorm(Tx1)
RRx1 = round(Kx1 / Kx,1)
#Two affected parents
Tx2 = (Tx -i*h2x) / sqrt(1-(0.5*h2x*h2x*i)*(i-Tx))
Kx2 = 1 - pnorm(Tx2)
RRx2 = round(Kx2 / Kx,0)
x <- as.data.frame(c(disorder,(h2x*100), round(Kx,2), round(Kx1,2), round(Kx2,2),round(RRx1,2), round(RRx2,2)))
return(x)
}
#Run function for number of included disorders
for(i in 1:length(disorder)){
liability[i] <- liability_function(h2x[i], Kx[i], a, disorder[i])
}
#convert list to dataframe
Threshold_model_df <- data.frame(matrix(unlist(liability), nrow=length(liability), byrow=T), stringsAsFactors = F)
colnames(Threshold_model_df) <- c("Disorder","heritability","K", "K1", "K(2par)", "RR", "RR(2par)")
library(knitr)
kable(Threshold_model_df, caption ="Risk ratio in relatives")
```
*$h^2$ estimates taken from family studies except ADHD ($h^2_{twin}$)*
The R code to calculate $T$, $K$, and $RR$ is provided below
```{r T-K-RR}
h2x=0.7
Kx=0.01
a=0.5
disorder="Schizophrenia"
Liability_threshold <- function(h2x, Kx, a, disorder){
cat("Input Parameters:\n")
cat("----------------------------\n")
cat("Disorder:", disorder, "\n")
cat("h2x : \t",h2x,"\t heritability \n")
cat("Kx : \t",Kx,"\t lifetime risk of disease \n")
cat("aR : \t",a, "\t coefficient of relationship \n")
cat("----------------------------\n")
Tx = -qnorm(Kx, 0,1)
z = dnorm(Tx)
i = z/Kx
Tx1 = (Tx - a * i * h2x) / (sqrt(1 - a * a * h2x * h2x * i * (i-Tx)))
Kx1 = 1 - pnorm(Tx1)
RR = Kx1 / Kx
Tx2 = (Tx -i*h2x) / sqrt(1-(0.5*h2x*h2x*i)*(i-Tx))
Kx2 = 1 - pnorm(Tx2)
RR2 = Kx2 / Kx
# return(matrix(data = c(Tx,Kx,NA, Tx1, Kx1,RR, Tx2, Kx2,RR2), nrow = 3, ncol = 3 ))
cat("Output Parameters:\n")
cat("----------------------------\n")
cat("K(1) : ",round(Kx1,2),"\t lifetime risk in 1st degree relatives of affected person \n")
cat("K(2par) : ",round(Kx2,2), "\t lifetime risk in children with 2 affected parents \n")
cat("RR : ", round(RR,1), "\t relative risk in 1st degree relatives of affected person \n")
cat("RR(2par): ", round(RR2,0), "\t relative risk in children with 2 affected parents\n")
cat("T : ",round(Tx,2),"\t normal distribution threshold corresponding to proportion K \n")
cat("T(1) : ",round(Tx1,2),"\t normal distribution threshold corresponding to proportion K(1) \n")
cat("T(2par) : ",round(Tx2,2), "\t normal distribution threshold corresponding to proportion K(2par) \n")
cat("----------------------------\n")
}
Liability_threshold(h2x,Kx,a, disorder)
```
## Different views of the liability threshold distribution
The liability model assumes that genetic and non-genetic factors contribute additively to liability to disease. This implies a non-linear relationship between phenotypic and genetic liability with probability of disease.
If we assume that individuals either have disease (probability of disease =1) or do not have disease (probability of disease =0), then when considering the relationship between phenotypic liability and disease, a vertical line will correspond to the threshold $T$ corresponding to the lifetime risk of the disease, $K$, that bisects risk of disease (as visualized above). For schizophrenia (red), and major depressive disorder (black) with $K = 0.01$ and $K=0.15$ the relationship between phenotypic liability and probability of developing the disorder looks like:
```{r, phenotypic_liability, echo=T, fig.width=4, fig.height=4, fig.align="center"}
r2x <- 1
Kxp <- 0.01
r2y <- 1
Kyp <- 0.15
a <- 0.5
disorder1 <- "Schizophrenia"
disorder2 <- "Major Depressive Disorder"
xrange <- seq(-4,+6,len=100)
ProbDisease<- function(K,r2,xrange){
sapply(xrange,function(x) pnorm( (x-qnorm(1-K))/sqrt(1-r2) ))
}
par(mar=c(5.1, 4.1, 4.1, 2.1))
l3<- c(as.expression((bquote(SCZ~'('*italic(K)~'= 0.01'*')'))))
l4<- c(as.expression((bquote(MDD~'('*italic(K)~'= 0.15'*')'))))
xrange <- seq(-4,+6,len=100)
SCZ<- ProbDisease(K=Kx,r2=r2x,xrange)
MDD <- ProbDisease(K=Ky,r2=r2y,xrange)
matplot(xrange,cbind(SCZ,MDD), mgp=c(3,1,0),
frame.plot = FALSE, type="l",lty=1:2,col=1:2,lwd=3,
xlab = "Liability (phenotypic sd units)",cex.axis=0.6, cex.lab =0.5,
ylab = "Probability developing disorder")
legend("topleft", legend = c(l3,l4) ,
col =1:2, lty = 1:2, lwd=3, cex = 0.5,bty = "n")
```
Now, if we visualize the relationship between genetic liability and risk/probability of disease you will observe a very non-linear relationship with the gradient of relationship becomes steeper when heritability becomes higher. The position along the x-axis when the rise in probability starts is higher for diseases that are less common. Slatkin (2008) in his paper [*Exchangeable models of complex inherited disease*](https://www.genetics.org/content/179/4/2253.article-info) nicely demonstrates that any models of polygenic disease have to lead to this non-linear relationship between genetic liability and probability of disease, and that the liability threshold model is the most convenient and mathematically tractable because it use the nice proprties of the normal distribution theory and is based on only two parameters, $h^2$ and $K$.
```{r, probability, echo=T, fig.width=4, fig.height=4, fig.align="center"}
r2x <- 0.7
Kxp <- 0.01
r2y <- 0.34
Kyp <- 0.15
a <- 0.5
disorder1 <- "Schizophrenia"
disorder2 <- "Major Depressive Disorder"
xrange <- seq(-4,+6,len=100)
ProbDisease<- function(K,r2,xrange){
sapply(xrange,function(x) pnorm( (x-qnorm(1-K))/sqrt(1-r2) ))
}
par(mar=c(5.1, 4.1, 4.1, 2.1))
l3<- c(as.expression((bquote(SCZ~'('*italic(K)~'= 0.01, h2 = 0.70'*')'))))
l4<- c(as.expression((bquote(MDD~'('*italic(K)~'= 0.15, h2 = 0.34'*')'))))
xrange <- seq(-4,+6,len=100)
SCZ<- ProbDisease(K=Kx,r2=r2x,xrange)
MDD <- ProbDisease(K=Ky,r2=r2y,xrange)
matplot(xrange,cbind(SCZ,MDD), mgp=c(3,1,0),
frame.plot = FALSE, type="l",lty=1:2,col=1:2,lwd=3,
xlab = "Liability (genetic sd units)",cex.axis=0.6, cex.lab =0.5,
ylab = "Probability developing disorder")
legend("topleft", legend = c(l3,l4) ,
col =1:2, lty = 1:2, lwd=3, cex = 0.5,bty = "n")
```
# SNP-based heritability
Heritability is the ratio of additive genetic variance $V_A$ divided by the sum of $V_A$ and the residual variance $V_E$ (and where phenotypic variance, by definition is $V_P$ = $V_A$ +$V_E$. NB. Before we used $\sigma^2$ for variance, because we were thinking about parameters, we use V for variance because we are now considering estimates from data):
$$
h^2 = \frac{V_A}{V_A+V_E}
$$
Estimates of heritability are derived from knowledge of sharing between relatives, therefore, the genetic factors unique to an individual but which impact of disease status for that individual (i.e., *de novo* mutations) would be partitioned into the residual variance. Estimates of $h^2$ made from family record represent contributions from both rare and common genetic variants. In contrast $h^2_{SNP}$ represents only the proportion of variance associated with measured common DNA variants (typically of allele frequency >1%):
$$
{h^2}_{SNP} = \frac{V_{A_{SNP}}}{V_A+V_E}
$$
Hence $h^2_{SNP}$ is, by definition, lower than heritability gained from family and twin designs. SNP-based heritability is estimated from genome-wide association study (GWAS) data, either directly from individual level genotype data or from GWAS summary statistics
## GREML
To estimate the proportion of phenotypic variance captured by genotyped SNPs from individual-level data, Genome-based Restricted Maximum-likelihood analysis (GREML) is frequently used applied to a linear mixed model:
$y$ = covariates + $g$ + $e$, with V($y$)= **A**$g\sigma^2_g$ + **I**$\sigma^2_e$
Where $y$ is the phenotype, covariates could include age, sex and ancestry principal components (PCs) *etc*., $g$ is the aggregate effect of all genome-wide SNPs for an individual, and $e$ is the residual effects for an individual, which includes genetic factors not captured by common SNPs. $V~(y)$ is the variance of $y$, $\sigma^2_g$ is the variance of the g values, with the matrix **A**$g$ being the genetic relationship matrix (GRM) (which can be estimated from SNP data) since $g$ values are correlated between people, and **I** is the identity matrix, since residual effects are assumed to be uncorrelated between individuals. Assuming that indivdiduals are unrelated in the classical sense (< ~3rd degree relatives) then $h^2_{SNP}$ is then estimated as $h^2_{SNP} = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_e}$.
## GRM matrix
A key component of the GREML approach is the GRM (**A**g) matrix:
Humans have diploid genomes, so at each (biallelic) SNP individuals can be homozygous (AA) for the reference allele, heterozygous (AC) or homozygous (CC) for the non-reference (or alternative) allele, and genotypes for each SNP for an individual can be coded as 0,1,2 with respect to the count of alternative alleles. Under a random mating assumption, the number of alternative alleles for a SNP and individual follows a binomial distribution with 2 draws (since we are diploid) and probability equal to the minor allele frequency of that SNP.
If we assume that the minor allele frequency (MAF) of a given number of SNPs ($M$) come from a uniform distribution between 0 and 0.5. We then can generate genotypes for one person using:
```{r, echo =T}
set.seed(666)
m <- 1000 # number of SNPs
maf <- runif(m, 0, .5) # random MAF for each SNP
x012 <- rbinom(m, 2, maf)
n <- 500 # number of individuals
X012 <- t(replicate(n, rbinom(m, 2, maf))) # n x m genotype matrix (500 individuals X 1000 SNPs)
```
Monomorphic SNPs (SNPs of which frequencies do not vary between individuals in the sample) can cause problems in GRM analyses and should be removed. Therefore, to end up with only polymorphics SNPS, more SNPs are generated below and subsequently monomorphic SNPs will be removed.
```{r}
X012 = t(replicate(n, rbinom(2*m, 2, c(maf, maf))))
polymorphic = apply(X012, 2, var) > 0
X012 = X012[,polymorphic][,1:m]
maf = c(maf, maf)[polymorphic][1:m]
```
```{r}
X012[1:5, 1:10]
```
A GRM is constructed from scaled genotype matrix in which the genotypes of each SNP in our sample have mean 0 and variance 1, therefore the genotype matrix created above has to be scaled
```{r}
X = scale(X012, scale=TRUE)
```
With the scaled genotype matrix, a GRM can be calculated as follows, where $M$ is the number of SNPs:
$$
A = \frac{XX^t}{M}
$$
```{r}
grm = (X %*% t(X))/m
round(grm[1:5,1:10],3)
```
Here, the offdiagonal elements represent the relative relatedness between two indivuals. Positive values indicate closer genetic relationship than on average, and negative values indicate a less close genetic relationship than average. The diagonal elements represents the relatedness of an individual with itself, which is 1+ the average homozygosity or the level of inbreeding. If we visualize this:
```{r}
layout(matrix(c(1,2),1,2))
hist(grm[upper.tri(grm)], ylab = "count", xlab = "value", col ="darkblue", main = "relatedness")
hist(diag(grm), ylab = "count", xlab = "value", col = "darkblue", main = "average homozygosity")
```
We see that the relatedness (GRM off-diagonals) are centered around 0 (left panel) and the relatedness with itself (GRM diagonals) is centered around 1.
## LD Score regression
Linkage disequilibrium (LD) score regression (LDSC) was the first of now many methods to estimate SNP-based heritability from GWAS summary statistics. LDSC assumes a polygenic model of genetic effects on a trait. Under this assumption, and recognising that there is local correlation between DNA variants, along chromosomes (a reflection of past population growth, selection, drift), a SNP that is correlated to many other DNA variants (has a high linkage LD score, and) is expected to have a higher association test statistic, on average, compared to SNPs with low LD scores.
Theory shows that the expected ($E[~]$) relationship between the chi-square test statistics $(\chi^2)$) of SNPs and the LD score of SNPs is a function of the SNP-based heritability ($h^2_{SNP}$), sample size ($N$) and the number of SNPs ($M$)
$$
E[\chi^2] = \frac{Nh^2_{SNP}}{M}l_j + intercept
$$
Using GWAS summary statistics, $h^2_{SNP}$ can be estimated from the regression coefficient (e.g. $\frac{Nh^2_{SNP}}{M}$) when the observed GWAS $\chi^2$ test statistics are regressed on the LD scores ($l_j$).
## Linkage disequilibrium (LD) Matrix
SNPs are often correlated with one another. Especially when they are physically close, since it is unlikely that recombination breaks up any correlation between them. This correlation between a pair of SNPs (two columns in our genotype matrix created earlier) is called linkage disequilibrium (LD) and can be estimated by calculating the correlation coefficient between the SNP genotypes.
The LD matrix contains the correlations of all SNP pairs in the genotype matrix and has therefore dimensions $M$×$M$.
$$
cor(x,y) = \frac{cov(x,y)}{\sqrt{var(x)var(y)}}
$$
Since correlation implicitly scales by the variance it actually does not matter whether the scaled (X) or unscaled (X_012) genotype matrix is used. Further, when $X$ is already scaled, the covariance matrix is the same as the correlation matrix, the LD matrix can be calculated as:
$$
LD = \frac{X^tX}{N}
$$
Therefore, if we calculate the LD matrices from our scaled or unscaled genotype matrices, we end up with the same results:
```{r}
ld1 = cor(X012)
ld2 = cor(X)
mean(ld1)
mean(ld2)
```
These values are close to zero, because we simulated the SNPs to be independent
## LD Scores
LD scores. $l_j$ are defined as the sum of squared correlations of a SNP $j$ with al other SNPs within a predefined region (usually, 1-Cm):
$$
l_j = \sum_{k= 1}^{M} cor^2(X_j,X_k)
$$
As $X$ is standardized, we can calculate the sum of the squared sample correlations:
$$
\tilde{l_j} =\frac{1}{N^2}X^t_jXX^tX_j
$$
To correct for bias (just as in the LDSC formula) we can:
$$
l_j = \frac{\tilde{l_j}N -M}{N + 1}
$$
```{r,fig.width=7, fig.height=4, fig.align="center"}
ldscores_sample = colSums(ld1^2)
ldscores = (ldscores_sample*n - m) / (n + 1)
layout(matrix(c(1,2),1,2))
hist(ldscores, col = "darkblue", xlab = "Corrected scores", main = "")
hist(ldscores_sample, col = "lightgrey", xlab = "Uncorrected scores", main = "")
```
# Scale Transformation
## Transformation of heritability estimates from observed to the liability scale
SNP-based heritability estimates can be calculated from GWAS summary statistics. The estimates are made on the case/control scale $(h^2_{occ})$. To be interpretable they are converted to the liability scale $(h^2_l)$, which depends on the lifetime risk of disease (K) and the proportion of the sample that are controls ($P$). The standard transformation equation is
$$
h^2_l = \frac{h^2_{occ}K^2(1-K)^2}{P(1-P)z^2}
$$
where $z$ is the height of the normal curve at the threshold derived from $K$ (see section 2.2). See [Lee et al., 2010](https://www.sciencedirect.com/science/article/pii/S0002929711000206), [Zhou et al., 2013](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1003264), [Golan et al., 2015](https://www-pnas-org.vu-nl.idm.oclc.org/content/111/49/E5272) for derivations.
```{r}
h2occtoh2l<-function(h2occ, K, ncase, ncont) {
P = ncase/(ncase+ncont)
T0<--qnorm(K,0,1) # threshold
z<-dnorm(T0) #height of normal curve at threshold
h2l=h2occ*K*(1-K)*K*(1-K)/(P*(1-P)*z*z) #heritability on liability scale
cat("SNP-based heritability transformation:\n")
cat("----------------------------\n")
cat("h2occ : ",round(h2occ[1],2),"\t SNP-based heritability as estimated from a linear model\n")
cat("h2occ_se : ",round(h2occ[2],2),"\t SNP-based heritability standard error\n")
cat("K : ",round(K,2), "\t assumed lifetime risk of disease \n")
cat("ncase : ", round(ncase,1), "\t number of cases in GWAS used to estimate SNP-based heritability \n")
cat("ncontrol : ", round(ncont,0), "\t number of controls in GWAS used to estimate SNP-based heritability\n")
cat("P : ",round(P,2),"\t proportion of the GWAS sample that are cases \n")
cat("h2liab : ",round(h2l[1],2),"\t SNP-based heritability on the liability scale \n")
cat("h2liab_se: ",round(h2l[2],2),"\t SNP-based heritability standard error\n")
cat("----------------------------\n")
}
h2occ=c(0.45,0.06)
ncase=34241
ncont=45064
K=0.01
h2occtoh2l(h2occ, K, ncase, ncont)
```
## Screened and unscreened controls
It is noteworthy that the standard scale transformation for SNP-based heritability [Lee et al., (2011)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3059431/)) assumes that controls are screened. When $K$ is small (for example, for schizophrenia, bipolar disorder or autism) there is little impact on estimates if controls are unscreened. However, for more common disorders the impact can be substantial and updated transformations are needed to account for unscreened or for super-screened controls.
Consider the following Liability Threshold model for a disorder with life time risk $(K=0.1)$.
```{r,screened/unscreened, echo=T, fig.width=6, fig.height=4, fig.align="center"}
mx=4
z <- seq( from=-mx, to=+mx, by=.01)
dens <- dnorm(z)
KU=0.10
KL=0.5
plot( z, dens, type="l", lwd=1,ylab="",yaxt="n",xlab="",
frame.plot=F,main="")
t=qnorm(1-KU)
x.shade <- seq(t,mx,0.01)
n=length(x.shade)
polygon(c(rev(x.shade),x.shade),c(rep(0,n),dnorm(x.shade,0,1)),col="darkblue")
text(1.55,0.07,"KU", col = "white")
t=qnorm(1-KL)
x.shade <- seq(-mx,-t,0.01)
n=length(x.shade)
polygon(c(rev(x.shade),x.shade),c(rep(0,n),dnorm(x.shade,0,1)),col="lightgrey")
text(-0.75,0.07,"KL")
```
Here:
$K_U$ = proportion of the general population that are cases (Corresponding to $K$) \
$K_L$ = proportion of the general population selected at the lower-end as controls.
Using $K_U$ and $K_L$ in general, for example:
If $K_L$=1, then controls are **Unscreened**
If $K_L$=1-$K_U$ then controls are **Screeened**
If $K_L$ = 0.5, then controls are **Heavily screened**
If $K_L$=$K_U$, then controls are **Super-screened**
$$
h^2_l=h^2_{occ}P(1-P)(\frac{Z_U}{K_U} +\frac{Z_L}{K_L})^2
$$
See [Gianola (1979)](https://europepmc.org/article/med/17248982), [Golan et al.,2015](www.pnas.org/cgi/doi/10.1073/pnas.1419064111) and [Yap et al., 2018](https://www.nature.com/articles/s41467-018-04807-3) for derivations. For a given value estimate of SNP-based heritability calculated from a linear model on the observed scale, we can calculate the SNP-based heritability according to the different designs from which it could be estimated.
```{r}
h2occtoh2l_screen<-function(h2occ, K_U, ncase, ncont) {
P = ncase/(ncase+ncont)
z_U=dnorm( qnorm(K_U))
K_L=1
z_L=dnorm( qnorm(K_L))
trans= P*(1-P)*(z_U/K_U + z_L/K_L)^2
h2l_uns= h2occ / trans
K_L=1-K_U
z_L=dnorm( qnorm(K_L))
trans= P*(1-P)*(z_U/K_U + z_L/K_L)^2
h2l_scr= h2occ / trans
K_L=0.5
z_L=dnorm( qnorm(K_L))
trans= P*(1-P)*(z_U/K_U + z_L/K_L)^2
h2l_heavy= h2occ / trans
K_L=K_U
z_L=dnorm( qnorm(K_L))
trans= P*(1-P)*(z_U/K_U + z_L/K_L)^2
h2l_super= h2occ / trans
cat("SNP-based heritability transformation:\n")
cat("----------------------------\n")
cat("h2occ : ",round(h2occ,2),"\t SNP-based heritability as estimated from a linear model\n")
cat("K : ",round(K,2), "\t assumed lifetime risk of disease, K=KU \n")
cat("ncase : ", round(ncase,1), "\t number of cases in GWAS used to estimate SNP-based heritability \n")
cat("ncontrol : ", round(ncont,0), "\t number of controls in GWAS used to estimate SNP-based heritability\n")
cat("P : ",round(P,2),"\t proportion of the GWAS sample that are cases \n")
cat("h2 on liability scale when controls are: \n")
cat("unscreend :",round(h2l_uns,2),"\t \n")
cat("screened :",round(h2l_scr,2),"\t standard transformation \n")
cat("heavily screened:",round(h2l_heavy,2),"\t KL=0.5 \n")
cat("super-screened :",round(h2l_super,2),"\t KL=KU \n")
cat("NB: In a real analysis only one transformation would apply depending on the ascertainment criteria for controls")
cat("----------------------------\n")
}
h2occ=0.45
ncase=34241
ncont=45064
K=0.01
h2occtoh2l_screen(h2occ, K, ncase, ncont)
```
## Nagelkerke's $R^2$
Careful consideration of the scale of estimates is required in SNP-based genetic studies of disease traits. For example, estimates of SNP-based heritability are derived from a linear model, and a transformation must be applied, in order for them to be interpretable on the liability scale, which is a scale interpretable across studies. This transformation can also be applied in the context of polygenic risk score prediction, where polygenic risk scores are the weighted count of the DNA variants, for which both the DNA variants to include in the score and their weight are derived from GWAS summary statistics.
\
\
The predictive ability of polygenic risk scores are evaluated in case-control cohorts that are independent of the samples contributing to the GWAS summary statistics. For quantitative traits, the predictive ability is evaluated by the $R^2$ statistic (the proportion of phenotypic variance explained by the risk score). For disease traits the Nagelkerke’s $R^2$ derived from a logistic regression is commonly reported. However, for a given variance in liability explained by a predictor, the measured Nagelkerke’s $R^2$ depends on both $P$ and $K$. \
\
The relationship between Nagelkerke's R^2 calculation and R2 on the liability scale was described by: [Lee et al., (2012)](https://onlinelibrary.wiley.com/doi/abs/10.1002/gepi.21614) \
Here, Nagelkerke's R^2 is calculated as:
$$
R^2_N = \frac{R^2_{C\&S}}{R^2_{max}}
$$
where $R^2_{C\&S}$ is the Cox and Snell's $R^2$ on the observed scale for which the theoretical expectation can be estimated as:
$$
R^2_{C\&S} = h^2_l\frac{z^2}{K(1-K)}
$$
where $h^2_l$ is the heritability on the liability scale, $z$ is the height of the normal curve of threshold $T$, and $K(1-K)$ is the phenotypic variance (in the absence of covariates), from binominal theory
Subsequently, $R^2_{max}$ is the maximum value $R^2_{C\&S}$ can ever attain and is estimated as:
$$
R^2_{max} = 1-K^{2K} (1-K)^{2(1-K)}
$$
The code below shows the effect of the proportion of cases in the target sample $(P)$ on the estimated Nagelkerke's $R^2$.
For simplicity, we assume that the variance in liability explained by the predictor (sometimes called $h^2_{PRS}$) is constant at $R^2=0.1$ and we vary the proportion of cases between 0.1 and 0.9. Furthermore, we choose two scenarios, in which $K<P$, with $K=0.01$, $K=0.05$, and $=0.15$), and one scenario where $K=P$.
```{r ,Nagelkerke_R2, echo=T, fig.width=5, fig.height=5, fig.align="center"}
xrange <- seq(0.1,0.9,len=100)
h2l = 0.1
P = 0
#Function to calculate Nagelkerke's R2
nagelkerke <- function(h2l,K){
P=P+xrange
# K=P
x= qnorm(1-K)
z= dnorm(x)
i=z/K
C= K*(1-K)*K*(1-K)/(z^2*P*(1-P))
theta= i*((P-K)/(1-K))*(i*((P-K)/(1-K))-x)
CS=h2l/(C-h2l*theta*C)
rmax=1-P^(2*P)*(1-P)^(2*(1-P))
NR2=CS/rmax
}
#K=0.01
K = 0.01
K.0.01 <- sapply(0,function(x) nagelkerke(h2l,K))#[,1]
#K=0.1
K = 0.05
K.0.05 <- sapply(0,function(x) nagelkerke(h2l,K))
#K=0.1
K = 0.15
K.0.15 <- sapply(0,function(x) nagelkerke(h2l,K))
#K=P
nagelkerkeKP <- function(h2l,K){
P=P+xrange
K=P
x= qnorm(1-K)
z= dnorm(x)
i=z/K
C= K*(1-K)*K*(1-K)/(z^2*P*(1-P))
theta= i*((P-K)/(1-K))*(i*((P-K)/(1-K))-x)
CS=h2l/(C-h2l*theta*C)
rmax=1-P^(2*P)*(1-P)^(2*(1-P))
NR2=CS/rmax
}
KP <- sapply(0,function(x) nagelkerkeKP(h2l,K))
#Visualize Nagelkerke's R2
l3<- c(as.expression((bquote("K = 0.01"))),
as.expression((bquote("K = 0.05"))),
as.expression((bquote("K = 0.15"))),
as.expression((bquote("K = P"))))
par(mar=c(5.1, 5.1, 4.1, 2.1))
matplot(xrange,cbind(K.0.01,K.0.05,K.0.15,KP),
mgp=c(3,1,0),
frame.plot = FALSE, type="l",lty=1:10,col=1:4,lwd=2,
xlab = "Proportion of cases in the target sample (P)", cex.lab =1,
ylab = "", xaxt="n", yaxt="n") #, ylim = c(0,0.3)
axis(1, at = seq(0,1,0.1), cex.axis=1)
axis(2, at = seq(0,0.3,by = 0.05) , cex.axis=1, las=2)
mtext(as.expression((bquote("Nagelkerke's"~R^2))),side = 2, line = 2.5,at = 0.15, cex =1)
legend("topleft", legend = l3 ,
col =1:6, lty = 1:6, lwd=4, cex = 0.7,bty = "n")
```
As shown in the above figure, estimates of Nagelkerke's $R^2$, are highest when the proportion of cases is 0.5, and this effect is stronger in disorders with a low lifetime prevalence (e.g. $K=0.01$) compared to disorders that are more frequent in the population (e.g. $K=0.15$) \
Instead of Nagelkerke's R, the transformation derived for SNP-based heritability estimates described in [Lee et al. (2011)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3059431/) can be applied to the $R^2$ from a linear regression, allowing the predictive ability of polygenic risk scores to be evaluated on the liability scale and hence be comparable with both heritabilities and SNP-based heritabilities [see Lee et al. (2012)](https://onlinelibrary.wiley.com/doi/abs/10.1002/gepi.21614).
\
# Cross-disorder risk and genetic correlation
## Cross-disorder risk in relatives
Just as epidemiological studies can investigate the increased risk of disorder *x* in relatives of those with disorder *x*, they can also collect the data to estimate the increased risk of disorder *y* in relatives of those with disorder *x*.
Calculating the cross-disorder risk in relatives can be done in a rather similar way as described in section 2 above following [Falconer, 1965](https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1469-1809.1965.tb00500.x) and [Wray and Gottesman, (2012)](https://www.frontiersin.org/articles/10.3389/fgene.2012.00118/full) . Here, the threshold $T_{Rx,y}$ bisects the normal distribution for the proportion $K_{Rx,y}$ which is the lifetime risk of disease $y$ in relatives of probands with disease $x$, and which relates to the threshold for disease *x* ($T_x) defined as:
$$
T_{Rx,y} = \frac{T_x - a_{R}r_{g}h_{x}h_{y}i_{y}}{\sqrt{1-a_{R}^2 r^2_{g}h^2_{x}h^2_{y}i_y(i_{y}-T_{y})}},
$$
where $r_{g}$ is the genetic correlation between the traits, and $r_{g}h_{x}h_{y}$ is the co-heritability between the traits (sometimes denoted $h_{x,y}$), and where and $i_x$ and $i_y$ , are the mean phenotypic liabilities of the two diseases. If the heritabilities of two diseases, their lifetime risks and the genetic correlation between them are known, then this equation can be used to estimate $K_{Rx,y}$ as:
$$
K_{Rx,y} = \Phi^{-1}(T_{Rx,y})
$$
where $\Phi^{-1}(x)$ is the inverse of the cummulative standard normal distribution function. And the risk ratio for relatives is defined as:
$$
RR_{Rx,y} = \frac{K_{Rx,y}}{K_x}
$$
In R, you can calculate the cross disorder life-time risk and risk ratio of disease $x$ in relatives of probands with disease $y$ using the code below
```{r function get_CDRR}
get_CDRR = function(h2x, h2y, Kx, Ky, rg, a, disorder1, disorder2){
cat("Input Parameters:\n")
cat("----------------------------\n")
cat("aR :\t",a, "\t coefficient of relationship, e.g. 0.5 for parent/offspring \n")
cat("rg :\t",rg, "\t genetic correlation between disorders \n")
cat("disorder-x: \t", disorder1, "\n")
cat("h2x :\t",h2x,"\t heritability of disorder-x \n")
cat("Kx :\t",Kx,"\t lifetime risk of disorder-x \n")
cat("disorder-y: \t", disorder2, "\n")
cat("h2y : \t",h2y,"\t heritability of disorder-y \n")
cat("Ky : \t",Ky,"\t lifetime risk of disorder-y \n")
cat("----------------------------\n")
Tx = -qnorm(Kx, 0, 1)
Ty = -qnorm(Ky, 0, 1)
zx = dnorm(Tx)
zy = dnorm(Ty)
ix = zx / Kx
iy = zy / Ky
#risk of disorder x in relatives of those with disorder x
Txx = (Tx - a * ix * h2x) / (sqrt(1 - a * a * h2x * h2x * ix * (ix-Tx)))
Kxx = 1 - pnorm(Txx)
RRxx = Kxx / Kx
#risk of disorder y in relatives of those with disorder y
Tyy = (Ty - a * iy * h2y) / (sqrt(1 - a * a * h2y * h2y * iy * (iy-Ty)))
Kyy = 1 - pnorm(Tyy)
RRyy = Kyy / Ky
# calculate the genetic covariance:
covg = rg * sqrt(h2x * h2y)
# calculate risk ratio for relatives
Txy = (Tx - a * covg * iy) / sqrt(1 - a^2 * covg^2 * iy * (iy-Ty))
Kxy = 1 - pnorm(Txy)
RRxy = Kxy/Kx # relative risk of disease x given that parent has disease y
# calculate risk ratio for relatives other way around
Tyx = (Ty - a * covg * ix) / sqrt(1 - a^2 * covg^2 * ix * (ix-Tx))
Kyx = 1 - pnorm(Tyx)
RRyx = Kyx/Ky # relative risk of disease x given that parent has disease y
cat("Output Parameters:\n")
cat("----------------------------\n")
cat("Disorder x, consistent with heritability",h2x," and lifetime risk",Kx,":\n")
cat("Kxx :\t",round(Kxx,2), "\t lifetime risk of disorder x in relatives (aR) of those with disorder x \n")
cat("RRxx :\t",round(RRxx,2), "\t lifetime risk of disorder in relatives (aR) of those with disorder x \n")
cat("Disorder y, consistent with heritability",h2y," and lifetime risk",Ky,":\n")
cat("Kyy :\t",round(Kyy,2), "\t lifetime risk of disorder x in relatives (aR) of those with disorder y \n")
cat("RRyy :\t",round(RRyy,2), "\t lifetime risk of disorder x in relatives (aR) of those with disorder y \n")
cat("Cross-disorder, consistent with above and rg",rg,": \n")
cat("Kxy :\t",round(Kxy,2), "\t lifetime risk of disorder y in relatives (aR) of those with disorder x \n")
cat("RRxy :\t",round(RRxy,2), "\t lifetime risk of disorder y in relatives (aR) of those with disorder x \n")
cat("Kyx :\t",round(Kyx,2), "\t lifetime risk of disorder x in relatives (aR) of those with disorder y \n")
cat("RRyx :\t",round(RRyx,2), "\t lifetime risk of disorder x in relatives (aR) of those with disorder y \n")
cat("----------------------------\n")
}
h2x <- 0.7
h2y <- 0.35
Kx <- 0.01
Ky <- 0.15
rg <- 0.34
a <- 0.5
disorder1 <- "schizophrenia"
disorder2 <- "major depressive disorder"
get_CDRR(h2x, h2y, Kx, Ky, rg, a, disorder1, disorder2)
```