This repository houses and documents the code used to generate the results in the study Ritchie SC et al. Integrative analysis of the plasma proteome and polygenic risk of cardiometabolic diseases. bioRxiv, doi: 10.1101/2019.12.14.876474 (https://www.biorxiv.org/content/10.1101/2019.12.14.876474v3).
This code has not been designed to regenerate the results as-is for third parties. It was written to run on a high-performance computing cluster at the University of Cambridge: it includes job submission scripts written specifically for this cluster's setup, which cannot be generalised. The underlying data are also not provided as part of this repo; they must be downloaded separately (see the Underlying Data section below) and stored in the locations given by the hard-coded filepaths in these scripts. After several rounds of revision over multiple years, the code base has expanded to cover many analyses not included in the paper, and the analyses themselves have evolved over time.
All scripts are housed under the src/ directory in this repository. At the top level, scripts are conceptually organised by analysis task, with one job submission script per task. For example, src/01_calculate_all_grs.sh is a job submission script that submits a sequence of jobs to the cluster, with the individual job scripts located in src/01_job_scripts/.
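As an illustration of what "submits a sequence of jobs" can look like under slurm, the sketch below chains jobs so that each waits for the previous one to succeed. It is not taken from this repository: the function name `submit_chain`, the `DRY_RUN` switch, and the `hypothetical_step_*.sh` file names are assumptions made for the example, and the actual submission scripts may order their jobs differently. Only the `sbatch` options `--parsable` and `--dependency=afterok:<jobid>` are standard slurm.

```shell
#!/bin/sh
# Submit job scripts in order, each depending on the success of the previous one.
# Set DRY_RUN=1 to print the sbatch commands instead of submitting them.
# NOTE: submit_chain and DRY_RUN are hypothetical names, not part of this repository.
submit_chain() {
  prev_id=""
  step=0
  for script in "$@"; do
    step=$((step + 1))
    if [ -n "$prev_id" ]; then
      dep="--dependency=afterok:${prev_id}"
    else
      dep=""
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print the command and use a placeholder job ID for the next dependency.
      echo "sbatch --parsable ${dep:+$dep }$script"
      prev_id="JOB${step}"
    else
      # --parsable makes sbatch print the job ID (optionally followed by ";cluster").
      prev_id=$(sbatch --parsable ${dep:+"$dep"} "$script") || return 1
      prev_id=${prev_id%%;*}
    fi
  done
}

# Dry run with placeholder script names (these files do not exist in the repo).
DRY_RUN=1
submit_chain hypothetical_step_a.sh hypothetical_step_b.sh hypothetical_step_c.sh
```

With `DRY_RUN=1` nothing is submitted, so the sketch can be tried on a machine without slurm installed.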
These are designed to be part of a wider pipeline to test associations between an arbitrary number of polygenic scores (not included in this paper) and molecular measurements in INTERVAL from various high-throughput platforms (not included in this paper). Components of this pipeline that were not used in this paper are not included in this repository; hence, while the job scripts are numbered sequentially in run order, some numbered steps are absent from this repo. Some of the job dispatch scripts (e.g. src/05_sensitivity_associations.sh) accept a "view file" as an argument, which contains a list of polygenic scores to analyse. The view file needed for this paper is provided in the views/ folder and lists the five polygenic risk scores analysed.
The majority of results were generated by scripts under src/pubs/cardiometabolic_proteins/review2/ and/or src/pubs/cardiometabolic_proteins/review3/. These act as the best starting point to replicate the results, but these scripts still rely on some of the pipeline code above, particularly for dataset cleaning, PGS level calculation, cis-pQTL calling, and mapping pQTL and GWAS summary statistics for Mendelian randomisation.
The following software and versions were used to run these scripts:
- Scientific Linux release 7.7 (Nitrogen) (HPC operating system)
- slurm version 19.05.5 (HPC queue manager and job submission system)
- GNU bash version 4.2.46(2) (shell environment used to run bash scripts)
- PLINK v1.90b6.10 64-bit (17 Jun 2019) (www.cog-genomics.org/plink/1.9/), aliased as plink1.9 in the scripts.
- PLINK v2.00a2LM AVX2 Intel (24 Jul 2019) (www.cog-genomics.org/plink/2.0/), aliased as plink2 in the scripts.
- R versions 3.6 and 4.0.3, along with R packages:
  - data.table version 1.12.8, 1.13.2
  - foreach version 1.4.4, 1.5.1
  - doMC version 1.3.5, 1.3.7
  - XML version 3.98-1.20, 3.99-0.5
  - biomaRt version 2.40.3, 2.46.0 (Bioconductor package)
  - openxlsx version 4.1.0.1, 4.2.3
  - ggplot2 version 3.3.0, 3.3.2
  - MendelianRandomization version 0.4.1, 0.5.0
  - ggrepel version 0.8.1, 0.8.2
  - ggrastr version 0.1.7, 0.2.1 (github package, https://github.com/VPetukhov/ggrastr)
  - ggnewscale version 0.3.0, 0.4.3
  - RColorBrewer version 1.1-2
  - pheatmap version 1.0.12 (development version, https://github.com/raivokolde/pheatmap)
  - impute version 1.57.0, 1.64.0 (Bioconductor package)
  - WGCNA version 1.68, 1.69
  - RNOmni version 0.7.1, 1.0.0
  - cowplot version 1.0.0, 1.1.0
  - lubridate version 1.7.9.2
  - R.utils version 2.10.1
  - powerMediation version 0.3.2
  - scales version 1.1.1
  - mma version 10.3.2
  - survival version 3.2-7
  - mediation version 4.5.0
  - coloc version 3.2-1
  - medflex version 0.6-7
  - gridExtra version 2.3
  - seriation version 1.2-9
  - httr version 1.4.2
  - jsonlite version 1.7.1
  - xml2 version 1.3.2
- The BGEN software suite (https://www.well.ox.ac.uk/~gav/bgen_format/software.html) including:
  - bgenix version 1.1.4
  - qctool version 2.0.5, aliased as qctool2 in the scripts
- ldstore version 1.1
- SQLite version 3.30.1, aliased as sqlite3 in the scripts.
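Because the scripts call several tools by aliased names, it can help to check that those names resolve before running anything. The snippet below is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper, and it assumes the tools are exposed as executables or symlinks on `PATH` (shell aliases are generally not visible to non-interactive scripts). The names `plink1.9`, `plink2`, `qctool2`, and `sqlite3` come from the list above; `sbatch`, `Rscript`, `bgenix`, and `ldstore` are assumed to be the command names for slurm, R, bgenix, and ldstore.

```shell
#!/bin/sh
# Print each requested command name that cannot be found on PATH.
# NOTE: missing_tools is a hypothetical helper written for this example.
missing_tools() {
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
    fi
  done
}

# Command names the scripts are expected to call (see the assumptions above).
missing=$(missing_tools sbatch plink1.9 plink2 Rscript bgenix qctool2 ldstore sqlite3)

if [ -n "$missing" ]; then
  echo "Not found on PATH:"
  echo "$missing"
else
  echo "All expected commands were found."
fi
```

This only confirms that a command exists, not that its version matches the one listed above.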
For R and R packages, version 3.6 was primarily used for the main pipeline and scripts run under src/pubs/cardiometabolic_proteins/ and src/pubs/cardiometabolic_proteins/review1/, while R version 4.0.3 was used for scripts run under src/pubs/cardiometabolic_proteins/review2/ and src/pubs/cardiometabolic_proteins/review3/. For R packages where two versions are listed, the first is the version used in R version 3.6 and the second is the version used in R 4.0.3.
Inkscape version 0.92.3 was used to lay out and annotate figures from the figure components generated within the R scripts. Microsoft Office Professional Plus 2016 was used to draft the manuscript (Microsoft Word) and curate supplemental tables (Microsoft Excel) on Windows 10 Enterprise edition.
With the exception of electronic hospital records, data used in this study are publicly available or deposited in a public repository. Genetic data, proteomic data, and basic cohort characteristics for the INTERVAL cohort are available via the European Genome-phenome Archive (EGA) with study accession EGAS00001002555 (https://www.ebi.ac.uk/ega/studies/EGAS00001002555). Dataset access is subject to approval by a Data Access Committee: these data are not publicly available as they contain potentially identifying and sensitive patient information. Linked electronic hospital records are currently only available to researchers at the University of Cambridge, UK, but may become more widely available in the future. Contact the Data Access Committee for further details. All other data used in this study are publicly available without restriction. The PGS used in this study are available to download through the Polygenic Score Catalog (https://www.pgscatalog.org/) with accession numbers PGS000727 (atrial fibrillation), PGS000018 (coronary artery disease), PGS000728 (chronic kidney disease), PGS000039 (ischaemic stroke), and PGS000729 (type 2 diabetes). GWAS summary statistics used to generate new PGS in this study are available to download through the GWAS Catalog (https://www.ebi.ac.uk/gwas/) with study accessions GCST008065 (chronic kidney disease), GCST007517 (type 2 diabetes), and GCST006414 (atrial fibrillation). Summary statistics for all statistical tests are available in Supplementary Data 3. Full pQTL summary statistics published by Sun et al. 2018 for all SomaLogic SOMAscan aptamers are available to download from https://www.phpc.cam.ac.uk/ceu/proteins/. A listing of cis-pQTLs mapped for this study is provided in Supplementary Data 4.
GWAS summary statistics used for Mendelian randomisation are available to download through the GWAS Catalog (https://www.ebi.ac.uk/gwas/) with study accessions GCST004787 (coronary artery disease), GCST008065 (chronic kidney disease), GCST006906 (ischaemic stroke), and GCST007518 (type 2 diabetes). The DrugBank database is publicly available to download at https://www.drugbank.ca/releases/latest.
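For the five PGS Catalog accessions listed above, the following sketch prints a candidate download URL for each scoring file. It is an illustration only: `pgs_scoring_url` is a hypothetical helper, and the URL layout is an assumption about how the PGS Catalog FTP site is organised, so it should be checked against https://www.pgscatalog.org/ before use. The download step is left commented out, and the destination paths expected by the scripts in this repository are not reproduced here.

```shell
#!/bin/sh
# Build a candidate download URL for a PGS Catalog scoring file.
# ASSUMPTION: this URL layout is not stated in this README; verify it before use.
pgs_scoring_url() {
  printf 'https://ftp.ebi.ac.uk/pub/databases/spot/pgs/scores/%s/ScoringFiles/%s.txt.gz\n' "$1" "$1"
}

# Accession numbers as listed in this README.
for accession in PGS000727 PGS000018 PGS000728 PGS000039 PGS000729; do
  url=$(pgs_scoring_url "$accession")
  echo "$url"
  # To download, uncomment the next line (requires curl and network access):
  # curl -fsSL -o "${accession}.txt.gz" "$url"
done
```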