-
Notifications
You must be signed in to change notification settings - Fork 5
/
prep_data_code_notes.Rmd
187 lines (111 loc) · 7.56 KB
/
prep_data_code_notes.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
---
title: "Data preparation for GPLS article"
author: "Derek Beaton"
date: "`r toupper(format(Sys.time(), '%Y%b%d'))`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This document provides instructions and code to help others reproduce the manuscript. The data come from two sources in the Alzheimer's Disease Neuroimaging Initiative (ADNI)'s database: the TADPOLE challenge and the genome-wide PLINK files for the ADNI-GO/2 cohort.
Below I provide some instructions, screenshots, and links to obtain data and additional software, and I embed code here in this RMarkdown document with explanations of what each chunk does. To note: the code in this RMarkdown file is not run because it references specific folders. Please make sure to make those changes for your own use.
# ADNI
The ADNI project's primary website is here: http://adni.loni.usc.edu/. If you want access to the ADNI data see this link: http://adni.loni.usc.edu/data-samples/access-data/. Once you have been granted access to ADNI (if qualified), you can access data here: https://ida.loni.usc.edu/login.jsp?project=ADNI
Once logged in you will see the following page
![](./ADNI_page.PNG)
## Genetic data
The genetic data are found by clicking "Genetic data" in the above image. You will then go to a page with all of the genetic data available. You need to click ADNIGO/2 and download two archived/compressed PLINK files. The links in the next screenshot refer to those files as "sets 1-9" and "sets 10-15"
![](./ADNIGO2_genetics.PNG)
When downloaded, these two files will be named "ADNI_GO_2_OmniExpress.zip" and "ADNI_GO2_2nd.tar.gz" respectively. Next you'll need to extract these files from their archives. You will then see 3 files from each archive: \*.bim, \*.bed, and \*.fam. These are called "PLINK binary files".
Next you will need to download the PLINK software for processing these data: http://zzz.bwh.harvard.edu/plink/index.shtml
Once you have the PLINK software and the 3 files per archive, you will need to combine the two data sets, then perform some standard quality control on the data, extract specific SNPs, and then export to a new file. This requires only two PLINK commands and one additional external file (provided in this directory: "GETSNPS.txt")
The first PLINK command combines the two data sets:
```{bash, eval = FALSE}
./plink
--noweb
--bfile "E:\Data\ADNI\Genetics\ADNI_GO_2_Forward_Bin"
--bmerge
"E:\Data\ADNI\Genetics\ADNI_GO2_GWAS_2nd_orig_BIN.bed"
"E:\Data\ADNI\Genetics\ADNI_GO2_GWAS_2nd_orig_BIN.bim"
"E:\Data\ADNI\Genetics\ADNI_GO2_GWAS_2nd_orig_BIN.fam"
--make-bed
--out "E:\Data\ADNI\Genetics\ADNI2MERGE"
```
The second PLINK command will perform some quality control and extract a smaller data set for you to work with.
```{bash, eval = FALSE}
./plink
--noweb
--bfile "E:\Data\ADNI\Genetics\ADNI2MERGE"
--extract "E:\Data\Publications\PLSCAR\GETSNPS.txt"
--maf 0.05
--mind 0.1
--geno 0.1
--recode
--tab
--out "E:\Data\Publications\PLSCAR\ADNIGo2_CANDIDATE_GENES"
```
Note that the above requires you use the text file GETSNPS.txt available here in this directory. Next, we can use `R` to perform some additional preparation of the data. We start by reading in the smaller genetics set.
```{r eval = FALSE}
## get reduced and preprocessed genetic data
library(data.table)
ped.file <- as.matrix(fread('E:/Data/Publications/PLSCAR/ADNIGo2_CANDIDATE_GENES.ped'))
ped.file <- gsub("[[:space:]]","",ped.file)
ped.ptids <- ped.file[,2]
ped.file <- ped.file[,-c(1:6)]
ped.file <- gsub("00",NA,ped.file)
ped.file <- cbind(ped.ptids,ped.file)
map.file <- as.matrix(fread('E:/Data/Publications/PLSCAR/ADNIGo2_CANDIDATE_GENES.map'))
colnames(ped.file) <- c("PTID",map.file[,2])
## clean up
genetic.data <- ped.file
rm(ped.file)
rm(map.file)
```
The above code reads in two PLINK files: a .ped and a .map. Each contains some important pieces of information for the genetic data. So we read them both in and create `genetic.data` which is are the genotypes for all the SNPs of interest, with the SNP names as column names, and the participant IDs as rownames. We also drop some of the columns used in PLINK format but not required for our own analyses.
## TADPOLE data
You will also require the TADPOLE challenge data set (for more information on the challenge, see here: https://tadpole.grand-challenge.org/ and here: http://adni.loni.usc.edu/tadpole-challenge-dataset-available/). From the main ADNI data page, click "Study data" and then navigate to the "Test data" tab, and download the TADPOLE data
![](./ADNI_TADPOLE.png)
Once downloaded, extract the .zip file. You will see a number of .csv, .txt, and .py files. We only need two of those going forward. Read in the "D1_D2" dictionary and data files
```{r, eval = FALSE}
TADPOLE_D1_D2_Dict <- read.csv('E:/Data/ADNI/Challenges/TADPOLE/TADPOLE_D1_D2_Dict.csv',header=T,stringsAsFactors = F)
TADPOLE_D1_D2 <- read.csv('E:/Data/ADNI/Challenges/TADPOLE/TADPOLE_D1_D2.csv',header=T,stringsAsFactors = F)
```
The next chunks of code are to extract common participants across the two data sources (genetics, TADPOLE), and to reduce the TADPOLE data set to the measures used in the paper.
# Putting it together
First we need to identify common participants.
```{r, eval = FALSE}
overlapping.ids <- intersect(genetic.data[,1],TADPOLE_D1_D2$PTID)
```
Following that, we want to do some reduction of the data and a bit of clean up.
```{r, eval = FALSE}
ADNIGO.init.locs <- which( TADPOLE_D1_D2$ORIGPROT=="ADNIGO" & TADPOLE_D1_D2$VISCODE=="bl" & (TADPOLE_D1_D2$PTID %in% overlapping.ids) )
ADNI2.init.locs <- which( TADPOLE_D1_D2$ORIGPROT=="ADNI2" & TADPOLE_D1_D2$VISCODE=="bl" & (TADPOLE_D1_D2$PTID %in% overlapping.ids) )
TADPOLE_AGO2_BL <- TADPOLE_D1_D2[c(ADNIGO.init.locs,ADNI2.init.locs),]
TADPOLE_AGO2_BL <- TADPOLE_AGO2_BL[-c(which(colSums(is.na(TADPOLE_AGO2_BL))==nrow(TADPOLE_AGO2_BL)))]
TADPOLE_AGO2_BL <- TADPOLE_AGO2_BL[,-c(which((colSums(is.na(TADPOLE_AGO2_BL)) / nrow(TADPOLE_AGO2_BL)) > .1))]
```
Next we focus on just specific measures from the TADPOLE data
```{r, eval = FALSE}
AV45.vars <- TADPOLE_D1_D2_Dict[which(TADPOLE_D1_D2_Dict$TBLNAME=="UCBERKELEYAV45"),"FLDNAME"]
STRUCT.MRI.vars <- c("Ventricles_bl","Hippocampus_bl","WholeBrain_bl","ICV_bl")
KEY.DEMOG.vars <- c("RID","PTID","VISCODE","SITE","COLPROT","ORIGPROT","EXAMDATE","DX_bl","DX","DXCHANGE","AGE","PTGENDER","PTEDUCAT","PTETHCAT","PTRACCAT","APOE4")
CLIN.vars <- c("CDRSB","ADAS11","ADAS13","MMSE","MOCA",paste0(c("CDRSB","ADAS11","ADAS13","MMSE","MOCA"),"_bl"))
BROAD.NIMG.vars <- c("FDG","AV45",paste0(c("FDG","AV45"),"_bl"))
CTX.uptake.vars <- AV45.vars[which(grepl("CTX",AV45.vars) & (!grepl("SIZE",AV45.vars)))]
fin.vars <- c(KEY.DEMOG.vars,CLIN.vars,BROAD.NIMG.vars,CTX.uptake.vars,STRUCT.MRI.vars)
```
Next, we make our final data sets with the appropriate labels (column and row names).
```{r, eval = FALSE}
TADPOLE.fin <- TADPOLE_AGO2_BL[,fin.vars]
rownames(TADPOLE.fin) <- TADPOLE.fin$PTID
rownames(genetic.data) <- genetic.data[,"PTID"]
genetic.data <- genetic.data[rownames(TADPOLE.fin),]
TADPOLE_Dict.fin <- TADPOLE_D1_D2_Dict[which(TADPOLE_D1_D2_Dict$FLDNAME %in% colnames(TADPOLE.fin)),]
```
Finally, save the these matrices in .rda format, which are then opened and used in the RMarkdown for the paper.
```{r, eval = FALSE}
save(TADPOLE.fin,file="E:/Data/Publications/PLSCAR/TADPOLE.fin_2020DEC17.rda")
save(genetic.data,file="E:/Data/Publications/PLSCAR/genetic.data_Data_2020DEC17.rda")
save(TADPOLE_Dict.fin,file="E:/Data/Publications/PLSCAR/TADPOLE_Dict.fin_Data_2020DEC17.rda")
```