Genome-wide association studies (GWAS) have identified thousands of common trait-associated genetic variants but interpretation of their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and therefore could alter gene expression. However, we currently lack a systematic understanding on how this mechanism contributes to phenotype.
We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predic-tive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hy-potheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs which have been previously characterized and validated.
- Set channells
conda config --add channels defaults conda config --add channels bioconda conda config --add channels conda-forge
-
Install from Bioconda
conda create -n motifraptor_env python=3.6 conda activate motifraptor_env conda install -c bioconda motifraptor
-
Simple test
Activate the conda environment before running the program.
source activate motifraptor_env
Motif Raptor supports two different ways to run it.
(1) Run from the command line (recommended)
MotifRaptor --version
(2) Load as a module
python >>>import MotifRaptor >>>MotifRaptor.__version__
If you see the version number, congratulations!
MotifRaptor --help
usage: MotifRaptor [-h] [--version]
{preprocess,preprocess_ukbb_v3,celltype,snpmotif,snpfeature,motiffilter,motifspecific,snpspecific,snpmotifradar,snpindex,snpscan,set,info}
...
Analyze motifs and SNPs in the dataset.
positional arguments:
{preprocess,preprocess_ukbb_v3,celltype,snpmotif,snpfeature,motiffilter,motifspecific,snpspecific,snpmotifradar,snpindex,snpscan,set,info}
help for subcommand: celltype, snpmotif, snpfeature,
motiffilter, motifspecific, snpspecific
preprocess Pre-process the summary statistics
preprocess_ukbb_v3 Pre-process the summary statistics from UKBB version 3
TSV files
celltype cell type or tissue type analysis help
snpmotif snp motif test help
snpfeature snp feature help
motiffilter motifs filtering help
motifspecific motifs specific analysis help
snpspecific SNP specific analysis help
snpmotifradar SNP motif radar plot help
snpindex index the SNPs (with flanking sequences) help
snpscan scan SNP database (already indexed) help
set Set Path and Global Values
info Get Informationa and Print Global Values
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Database contains essential data for general analysis, including DHS tracks, TF RNA-seq expressions, TF motifs, and TF pre-calucated scores.
Please download the Database.zip from the links shown below. You can choose to download either a testing database or a complete database. The testing database contains all necessary files except that it only includes a handful number of TFs, in order for you to test the tutorial example on a regular machine. The complete database contains a complete list of TFs used in our real-world study, however, you need to have at least 220 G disk space, and we recommend running the programs on a cluster.
You may also choose to download the testing database, and then refer to the section Build complete motif database to compute a full TF list on your own, rather than downloading a big file. In this case, we also recommend running the programs on a cluster.
# A testing database can be downloaded here.
wget https://www.dropbox.com/s/gxeyzgl5m0u55w8/Database.zip
unzip Database.zip
# A complete database can be downloaded here. Make sure you have 220 G disk space.
wget https://www.dropbox.com/s/kp5r82x55tfgawf/Database.zip
unzip Database.zip
Configuration for Motif-Raptor is to set up the absolute paths for general database and motif database.
MotifRaptor set -pn databasedir -pv $PWD/Database/hg19/
MotifRaptor set -pn motifdatabasedir -pv $PWD/Database/hg19/motifdatabase/
Double check the paths are correctly set up.
MotifRaptor info
*For the current database, we only support human genome hg19. The genomic sequence is originally downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.2bit). *
Skip the next section if you are only doing an example tutorial without changing the TF motif set.
You need this section only when you haven't downloaded the full version of the database with large amount of data in motifdatabase (from previous steps); rather, you would like to pay computing hours to calculate from scratch
(1) Download pfm files and change the motif database directory immediately. (Motif database is a folder that contains a set of TF motif files with JASPAR PFM format.)
wget https://www.dropbox.com/s/hx9av7o16efxmus/motifdatabase.zip
unzip motifdatabase.zip
MotifRaptor set -pn motifdatabasedir -pv $PWD/motifdatabase
Motif-Raptor expects in input TF motif matrices in JASPAR PFM format. TF matrices in this format are stored as simple text flat files e.g.:
A [13 13 3 1 54 1 1 1 0 3 2 5 ]
C [13 39 5 53 0 1 50 1 0 37 0 17 ]
G [17 2 37 0 0 52 3 0 53 8 37 12 ]
T [11 0 9 0 0 0 0 52 1 6 15 20 ]
However TF matrices in other formats (e.g. MEME, TRANSFAC ) can be easily converted to this format with a simple text editor or in batch using the excellent Biopython library. This library offers several well documented parser that can be used for this task, see https://biopython.org/docs/1.75/api/Bio.motifs.html for more information.
(2) Download SNP list from 1000 Genome project
wget https://www.dropbox.com/s/9gztf4mdblc44jo/1000G.EUR.QC.plink.simple.vcf
The SNP list VCF file needs to have the first five columns ('CHR','POS','ID','REF','ALT') as follows:
CHR | POS | ID | REF | ALT |
---|---|---|---|---|
1 | 2483961 | rs2258734 | A | G |
(3) Index the SNP list (genome_index is the output folder name which will be also used in next step.) In this step, you will retrieve the flanking sequences centered by each SNP in order to generate indices files using the hg19 genome from the database (currently we only support hg19 as described before, so the vcf file should be also consistent with hg19)
MotifRaptor snpindex -vcf 1000G.EUR.QC.plink.simple.vcf -gi genome_index -p 4
(4) Scan motifs usnig pfm files on this SNP list
MotifRaptor snpscan -gi genome_index -pfm ./motifdatabase/pfmfiles -mo ./motifdatabase/motifscanfiles -p 4
In the output folder, only '.scale' and '.score' files are useful. You may delete intermediate results in those folders.
cd motifdatabase/motifscanfiles/
find ./ -type d -exec rm -rf '{}' \;
Input: GWAS Summary Statistics Run Motif-Raptor from GWAS summary statistics. You may get summary statistics from UKBiobank, published paper, or other resources. These files may provide diffrent information. Please make sure the file contains the following columns. For the score column, Motif-Raptor currently supports pvalue, zscore, or chisquare.
ID | CHR | POS | REF | ALT | SCORE(pvalue, zscore, or chisquare) |
---|---|---|---|---|---|
rs2258734 | 1 | 2483961 | A | G | 0.003 |
Example: Download the original data file from (Okada et al. 2010 Nature), and applying your own p-value cut-offs to define hits and nonhits. By default, p-value cutoff is 5E-8. This data file is ~450M. If your internet is limited, please download the zip file (~100M) and unzip it.
wget https://www.dropbox.com/s/jnmpu63vqnlc0ig/RA_GWASmeta_TransEthnic_v2.txt
#alternative zip file
wget https://www.dropbox.com/s/c194x1z0bhntfbs/RA_GWASmeta_TransEthnic_v2.zip
unzip RA_GWASmeta_TransEthnic_v2.zip
In this file, columns 1,2,3,4,5,9 are ID,CHR,POS,REF,ALT,SCORE as defined above. Here the score is pvalue.
MotifRaptor preprocess -gwas RA_GWASmeta_TransEthnic_v2.txt -cn 1,2,3,4,5,9 -st pvalue -th 5E-8
Output: Information for SNP hits and non-hits. hitSNP_list.txt nonhitSNP_list.txt and hitSNP_list.vcf For the example from (Okada et al. 2010 Nature), you can also download our processed results, if you haven't run on your own.
wget https://www.dropbox.com/s/gpnudp1ba4d2gq3/hitSNP_list.txt
wget https://www.dropbox.com/s/7dfrph1dnrad894/nonhitSNP_list.txt
wget https://www.dropbox.com/s/73seqi42hgodupg/hitSNP_list.vcf
Double Check Output Format: hitSNP_list.txt and nonhitSNP_list.txt are two files with the following format:
ID | CHR | POS |
---|---|---|
rs2258734 | 1 | 2483961 |
hitSNP_list.vcf is the file with the following format (with two more columns of the polymorphism information:
ID | CHR | POS | REF | ALT |
---|---|---|---|---|
rs2258734 | 1 | 2483961 | A | G |
Usage:
usage: MotifRaptor preprocess [-h] [-gwas SUMMARYSTATSFILE]
[-cn COLUMN_NUMBERS] [-st SCORE_TYPE]
[-th SCORE_PVALUE_THRESHOLD]
optional arguments:
-h, --help show this help message and exit
-gwas SUMMARYSTATSFILE, --gwas_summary SUMMARYSTATSFILE
GWAS summary statistic file
-cn COLUMN_NUMBERS, --column_numbers COLUMN_NUMBERS
provide six column numbers for information in such
order: ID,CHR,POS,A1,A2,SCORE eg. 1,2,3,4,5,6
-st SCORE_TYPE, --score_type SCORE_TYPE
Score type in GWAS summary statistic file: pvalue or
zscore or chisquare
-th SCORE_PVALUE_THRESHOLD, --threshold SCORE_PVALUE_THRESHOLD
threads for pvalue - default 5E-8
Bonus function:
Motif-Raptor allows user to preprocess from the UKBB (v3) summary statistics TSV files directly using a much simpler command.
wget https://www.dropbox.com/s/axthfv12j7pbav4/30690_raw.gwas.imputed_v3.both_sexes.tsv
MotifRaptor preprocess_ukbb_v3 -gwas 30690_raw.gwas.imputed_v3.both_sexes.tsv -th 5E-8
Example:
MotifRaptor celltype -vcf hitSNP_list.vcf -sh hitSNP_list.txt -sn nonhitSNP_list.txt -wd step1_out/ -p 3
Input: Three input files from step0. Output: This visualization ranks the associated cell or tissue types by both p-values (bar length) and odd ratio (numbers behind the bar). Original text file is "all_sorted.pvalue"
Optional step:
Motif raptor also provide module interface to run it in your own python code or Jupyter notebook for lightweight task. For example, you can re-plot the figure after running the previous steps.
#optional: plot figure using module CellTypeAnlayzer after running the command
from MotifRaptor.CellTypeAnalyzer import CellTypeAnalysis
CellTypeAnalysis.plotfigure_main("step1_out/testcelltype/all_sorted.pvalue","plot_all_cell_type.pdf")
Usage:
MotifRaptor celltype --help
usage: MotifRaptor celltype [-h] [-vcf SNP_HIT_VCF] [-sh SNP_HIT]
[-snh SNP_NON_HIT] [-wd WORKDIR] [-p THREAD_NUM]
optional arguments:
-h, --help show this help message and exit
-vcf SNP_HIT_VCF, --snp_hit_withseq SNP_HIT_VCF
need header and columns in this text file with
sequence (CHR is only a number): ID CHR POS REF ALT
-sh SNP_HIT, --snp_hit SNP_HIT
need header and columns in this text file (CHR is only
a number): ID CHR POS
-snh SNP_NON_HIT, --snp_non_hit SNP_NON_HIT
need header and columns in this text file (CHR is only
a number): ID CHR POS
-wd WORKDIR, --workdir WORKDIR
Working directory
-p THREAD_NUM, --threads THREAD_NUM
number of threads
#run through the motif scan on all the SNPs
MotifRaptor snpmotif -wd step2_out -c "CD8-positive, alpha-beta T cell" \
-sb step1_out/hitSNP_DHSunion_list.bed -se step1_out/hit_snps_seq_pickle.df.txt -bg genome \
-m all -p 3
Input: The bed file for SNP hits and the sequence information are obtained from Step1 "cell type characterization". But you need to determine the background and the motifs you want to test.
You may use "genome" in "-g" to use genome wide SNPs as the baseline distribution. You may use "all" in "-m" to test all of the Transcription Factors collected in the database. But you may specificy a file test_motif.txt to run a test for only a few Transcription Factors. Each motif should take a line, with the format of "motifID__motifname" which pwm files can be found in the Database. For example:
MA0105.1__NFKB1
MA0518.1__Stat4
Output: Scores for SNP hits are calculated as text files, including a huge background folder. The outcome of this partial step is not viewable. You need to run the next step to get a summazied text and figures.
Usage:
MotifRaptor snpmotif --help
usage: MotifRaptor snpmotif [-h] [-wd WORKDIR] [-c CELL_TYPE]
[-sb HIT_SNP_UNION_BED] [-se HIT_SNP_UNION]
[-bg BG_SNPS] [-m MOTIF_LIST] [-p THREAD_NUM]
optional arguments:
-h, --help show this help message and exit
-wd WORKDIR, --workdir WORKDIR
Working directory
-c CELL_TYPE, --cell_type CELL_TYPE
Cell type or Tissue type ID
-sb HIT_SNP_UNION_BED, --snp_hit_bed HIT_SNP_UNION_BED
hit snps on union bed file, usually from step1
-se HIT_SNP_UNION, --snp_hit_seq HIT_SNP_UNION
hit snps with sequence information, usually from step1
-bg BG_SNPS, --bg_snp BG_SNPS
background snp list file or (genome)
-m MOTIF_LIST, --motifs MOTIF_LIST
motif list file, no header, or (all)
-p THREAD_NUM, --threads THREAD_NUM
number of threads
Example:
#calculate statistics for each transcription factor and apply filtering (i.e. expression and pvalue)
MotifRaptor motiffilter -wd step2_out/motif_result -ms step2_out/motif_result/all_motifs.pvalue
Input: Just specify the working directory from previous step.
Output: The visualization shows the distribution and scoring for the Transcription Factors and narrow down (zoomed in) with the significant/interesting ones.
all motifs
zoomed in motifs
Usage:
MotifRaptor snpfeature -h
usage: MotifRaptor snpfeature [-h] [-wd WORKDIR] [-c CELL_TYPE]
[-cb SNP_BED_FILES]
optional arguments:
-h, --help show this help message and exit
-wd WORKDIR, --workdir WORKDIR
Working directory
-c CELL_TYPE, --cell_type CELL_TYPE
Cell type or Tissue type ID
-cb SNP_BED_FILES, --snp_bed_files SNP_BED_FILES
SNP cell type bed file folder, usually from step1
#calculate SNP wise features (prepare for step3)
MotifRaptor snpfeature -wd step2_out -c "CD8-positive, alpha-beta T cell" -cb step1_out/testcelltype/bedfiles
Input: Just specify the working directory from previous step.
Output: The SNP-wise annotations will be added to the folder.
Usage:
MotifRaptor snpfeature -h
usage: MotifRaptor snpfeature [-h] [-wd WORKDIR] [-c CELL_TYPE]
[-cb SNP_BED_FILES]
optional arguments:
-h, --help show this help message and exit
-wd WORKDIR, --workdir WORKDIR
Working directory
-c CELL_TYPE, --cell_type CELL_TYPE
Cell type or Tissue type ID
-cb SNP_BED_FILES, --snp_bed_files SNP_BED_FILES
SNP cell type bed file folder, usually from step1
Example:
# per motif analysis
MotifRaptor motifspecific \
-wd step3_out \
-sm step2_out/result_new_df_motifs_ENCFF512IML.txt \
-md MA0105.1__NFKB1 \
-bs step2_out/motif_result/background_files/
Input: Just specify the output directory from step2.
Output: Scatter plot for both SNP hits and non-hits for a picked motif.
Usage:
usage: MotifRaptor motifspecific [-h] [-wd WORKDIR] [-sm SNP_MOTIF_FILE]
[-md MOTIF_ID_NAME] [-bs BG_SCORE_FOLDER]
optional arguments:
-h, --help show this help message and exit
-wd WORKDIR, --workdir WORKDIR
Working directory
-sm SNP_MOTIF_FILE, --snp_motif_file SNP_MOTIF_FILE
SNP motif pair-wise list File, usually from step2
-md MOTIF_ID_NAME, --motif_id MOTIF_ID_NAME
motif id with name, in the format of motifID__NAME
-bs BG_SCORE_FOLDER, --bg_score_folder BG_SCORE_FOLDER
background score folder, usually from step2
Example:
# per SNP analysis
MotifRaptor snpspecific \
-wd step3_out \
-sm step2_out/result_new_df_motifs_ENCFF512IML.txt \
-snp rs7528684
Input: Just specify the output directory from step2.
Output: Scatter plot for transcription factors for a picked SNP.
Usage:
usage: MotifRaptor snpspecific [-h] [-wd WORKDIR] [-sm SNP_MOTIF_FILE]
[-snp SNP_ID]
optional arguments:
-h, --help show this help message and exit
-wd WORKDIR, --workdir WORKDIR
Working directory
-sm SNP_MOTIF_FILE, --snp_motif_file SNP_MOTIF_FILE
SNP motif pair-wise list File, usually from step2
-snp SNP_ID, --snp_id SNP_ID
SNP id
Example:
# draw radar plot for instereing motif and SNP events
MotifRaptor snpmotifradar \
-wd step3_out \
-sm step2_out/result_new_df_motifs_ENCFF512IML.txt \
-sf step2_out/SNP_ENCFF512IML_features.txt \
-pid rs7528684:MA0105.1__NFKB1
Input: Just specify the output directory from step2.
Output: Radar plot for the features/scores for interesting SNP-motif events.
Usage:
usage: MotifRaptor snpmotifradar [-h] [-wd WORKDIR] [-sm SNP_MOTIF_FILE]
[-sf SNP_FEATURE_FILE] [-pid SNP_MOTIF_ID]
optional arguments:
-h, --help show this help message and exit
-wd WORKDIR, --workdir WORKDIR
Working directory
-sm SNP_MOTIF_FILE, --snp_motif_file SNP_MOTIF_FILE
SNP motif pair-wise list File, usually from step2
-sf SNP_FEATURE_FILE, --snp_feature_file SNP_FEATURE_FILE
SNP feature file, usually from step2
-pid SNP_MOTIF_ID, --snp_motif_id SNP_MOTIF_ID
SNP motif pair-wise ID
We also provide a Jupyter Notebook to show how to integrate DeepBind (or other external tools) in the step 2 of Motif-Raptor.
All the supporting files can be downloaded from this archive:
https://github.com/pinellolab/MotifRaptor/raw/master/Document/deepbind_example.zip
In summary, Motif-Raptor is a computational toolkit to test the significance of the effects of genetic variants from GWAS analyses on transcription factor binding sites. We believe that its adoption will help the genomic community in prioritizing potential cell type-specific, causal variants from GWAS summary statistics and to generate important hypotheses and insights to the mechanisms of action of genetic variants in complex disease.
Luca Pinello: lpinello at mgh.harvard.edu
Qiuming Yao: yao.ornl at gmail.com