Skip to content
This repository has been archived by the owner on Mar 5, 2023. It is now read-only.

RajLabMSSM/ggLocusZoom

Repository files navigation

⚠️ Deprecation warning ⚠️

ggLocusZoom has since been deprecated and is now replaced by echoplot as part of the broader echoverse suite. The source code for ggLocusZoom is being kept here simply for posterity.

ggLocusZoom

Why?

LocusZoom is a very useful and commonly used tool for visualizing genomic signals in GWAS/QTL data. But when you're making many plots for many loci, it makes sense to do this programmatically.

There is a standalone version of LocusZoom for R and Python, but unfortunately it requires downloading 23GB of data to get all of the databases it relies on. Even if you supply your own data (summary stats, LD, etc.), it's not particularly user friendly. Nor can you customize it further once you have it.

So I made a user-friendly version of it in R that allows you to customize it as a ggplot-like object.

Note: my.LocusZoom now offers the ability to upload all your sum stats at once and make plots for each locus, which helps reduce the redundancy of making each plot individually. That said, it currently doesn't let you customize the resulting figures.


How?

The following tutorial can also be downloaded as an Rscript here.

1. Clone this GitHub repo

In command line:
git clone https://github.com/RajLabMSSM/ggLocusZoom.git
cd ggLocusZoom

1.5 Install conda environment

Optional: You can avoid the hassle of installing different dependencies by creating a conda environment from this yaml file:
conda env create -f gglocuszoom.yml
conda activate gglocuszoom

2. Prepare your data

GWAS/QTL summary statistics

  • This file can be comma-separated (.csv/.txt) or tab-separated (.tsv/txt).
  • It can also be compressed (.gz).
  • Required columns in summary stats data (one SNP per row):
    • SNP : RSID (e.g. rs4698412)
    • P : Uncorrected p-value from GWAS (e.g. 2.058e-28)
    • CHR : Chromosome (e.g. 4)
    • POS : Genomic position in terms of bp (e.g. 15737348)

LD matrix

Method A: Supply your own LD matrix
  • An n x n matrix where n is the number of SNPs in your summary stats file. Column and row names must be the RSIDs of the SNPs.
  • Can be .RDS, .csv, .tsv, or .txt format.
  • If using this method, you can skip right to step 3. below.
Method B: Use the LD.UKBibank() function to download the LD matrix
  • If you don't have your own LD matrix, you can download it via the LD.UKBibank() function provided here. It will return the path to where the LD matrix was saved.

  • NOTE: This can take a very long time (~10-15m/locus) because it has to first import a much larger LD matrix (in .npz format) from the Alkes Price lab and then subset it to just the SNPs in your summary stats file. To save space, only the subset that includes your summary stat SNPs will saved saved (as a .RDS file).

    • axel: You can drastically speed up this process by installing the program axel (download instructions here), which uses multithreading to rapidly download a single file. To do this, set download_method="axel" (default) AND download_full_ld=T.
    • wget: If you don't want to install axel, you can still use wget (comes with most computers) to download the .npz/.gz files locally. To do this, set download_method="wget" (default) AND download_full_ld=T.
    • For Mount Sinai employees and affiliates only: That said, you can speed up this process a lot by logging into Chimera (Mount Sinai computing cluster). If you set chimera=T, then LD.UKBibank() will automatically find the UKB LD files that I've pre-downloaded, and imports the one you need.
source("./functions/ggLocusZoom.R")

# Optional:
## If you're using the conda enviroment (step 1.5 above), you'll need to tell R to use it with reticulate
reticulate::use_condaenv("")

LD_path <- LD.UKBiobank(# Specify where the summary stats file is.
                        sumstats_path="./example_data/BST1_Nalls23andMe_2019_subset.txt", 
                         
                        # The folder where you want to save the LD matrix 
                        # (defaults to ./LD)
                        results_path="./LD",
                       
                        # If =T , will overwrite a pre-existing LD file with the same name.
                        force_new_LD=F,
                       
                        # The same of the locus 
                        # (defaults to "_")
                        locus="BST1",
                        
                        # Use pre-downloaded LD files on Chimera computing cluster 
                        ## (Mount Sinai employees and affiliates only). 
                        ## Must be logged onto Chimera and have 
                        ## access to the 'pd-omics' project.
                        chimera=F, 
                         
                        # [** WARNING **]: 
                        ## Only change these defaults if you have plenty of extra storage. 
                        ## Each of these files is ~1-3GB.
                        ## Download and save the full .npz/.gz files.
                        download_full_ld=F,   
                        
                        # You can use either 'wget' or 'axel' to download the files
                        download_method = "axel",
                        
                        # Delete the full .gz/.npz ld files after you're done converting them into .RDS
                        remove_tmps=T)
print(LD_path)

3. Run ggLocusZoom() in R

# Import the functions
source("./functions/ggLocusZoom.R")

# Run 
gglz <- ggLocusZoom(# Specify where the summary stats file is.
                    sumstats_path="./example_data/BST1_Nalls23andMe_2019_subset.txt",
                    
                    # Specify where the pre-computed LD matrix is.
                    LD_path="./example_data/BST1_UKB-LD.RDS",
                    
                    # Is the LD matrix in units of r? (as opposed to r2)
                    LD_units="r",
                    
                    # Show the plot when it's ready?
                    show_plot=T,
                    
                    # Save the plot? (=F if you don't want to)
                    save_plot="./BST1_ggLocusZoom.png",
                    
                    # Saved plot height.
                    height=5, 
                    
                    # Saved plot width.
                    width=10,
                    
                    # Optional plot title.
                    title="",
                    
                    # Draw a vertical line at the position of the lead SNP.
                    leadSNP.line=T,
                    
                    # Plot LD with the leda SNP as a categorical variable 
                    # (very low, low, medium, high, very high) or a continuous variable (0-1).
                    categorical_r2=T,
                    
                     # By default (leadGWAS), it will pick the SNP with the lowest p-value 
                     as the index SNP for which to extract LD.  
                     ## Alternatively, you can provide an RSID contained within the 
                     dataset as the index SNP (e.g. "rs6849244").
                     index_snp="leadGWAS",
                     
                     # Define the limits of the xlim (e.g. c(15239141,16236140)).  
                     ## If NULL (default), then the min/max positions in the sumstats will be used.  
                     xlims = NULL,
                     
                      # The maximum number of transcripts to show per gene in the Gene track  
                      max_transcripts = 3  
                    )

ggLocusZoom_example


Wut?

Example data sources

Dependencies

R

Python

Command Line


Who?

Brian M. Schilder
Raj Lab
Department of Neuroscience
Department of Genetics & Genomic Sciences
Ronald M. Loeb Center for Alzheimer's Disease
Icahn School of Medicine at Mount Sinai