GAUSS: Gene-set Association analysis Using Sparse Signals
GAUSS provides a powerful and computationally effective tool to perform gene-set (pathway) association analysis. The methods tests for any self-contained association between a phenotype and a gene-set and produces a p-value for the association using results from gene-based tests (e.g. SKAT, SKAT-O, SKAT-Common-Rare, Burden test, prediXcan etc.).In addition, if there is a significant association, GAUSS identifies the genes within the gene-set that might be driving the association.
If you use GAUSS, please cite our article now publishd in AJHG: A powerful subset-based method identifies gene set associations and improves interpretation in UK Biobank
You can find a preprint version of the paper on biorXiv.
If you have any questions/issues about the package, please email me at diptavo21@jhu.edu
GAUSS requires two input files
-
Summary gene-based p-value file in flat-text format, without headers. The file should contain at least two columns: Gene-name and the corresponding p-values. An example has been provided: example_gene_pval.txt. The columns do not need to be in a particular order and there can be additional columns as well. Please use Gene-Symbols for the genes.
-
GMT file in flat text format, with headers. This file should be in the format as specified in example_gmt.txt. The first column contains the name of the gene-set, second column contains information on it (possibly URLs) and the thrid column contains the list of genes in the gene-set comma separated.
GAUSS was built using R (v 3.6.0). In general any version of R >= 3.5
should work.
The GAUSS codes depend on several existing R packages: optparse
, data.table
, gPdtest
, POT
, MASS
. Please download and install them prior to running GAUSS.
The repository can be cloned and installed as:
git clone https://github.com/diptavo/GAUSS.git
cd GAUSS/
R CMD INSTALL GAUSS_1.0.tar.gz
If you face trouble in installing the tar zipped file in the above procedure, please download the file from this link
Several example files are provided with the package for the user to verify the formats and for toy-examples.
- GMT files:
example_gmt.txt
andexample_gmt2.txt
- gene-based summary p-value files:
example_gene_pval.txt
Once installed, GAUSS can be run with following commands (assuming the path to GAUSS repository is ~/GAUSS/
)
library(GAUSS)
GAUSS_All(summary_file = "~/GAUSS/example_gene_pval.txt", gene_name = 1, pv_name = 2, output_file = "example_out", gmt = "~/GAUSS/example_gmt.txt", ags = "def",verbose = TRUE,parallel = FALSE)
closeAllConnections()
### This will produce two files: example_out.log and example_out.out in about 0.7 minutes.
GAUSS_All(summary_file = "~/GAUSS/example_gene_pval.txt", gene_name = 1, pv_name = 2, output_file = "example_out", gmt = "~/GAUSS/example_gmt2.txt", ags = "def",verbose = TRUE,parallel = FALSE)
closeAllConnections()
### This will produce two files: example_out.log and example_out.out in about 6 minutes.
The output files will contain the following information:
example_out.out
: Flat text file containing gene-set, GAUSS p-value and selected CS genes in an R readable format usingread.table()
.example_out.log
: Log file containing run-time information for each gene-set and the overall run-time.
To run GAUSS on a subset of gene-sets present in the GMT file, use the following command:
GAUSS_All(summary_file = "~/GAUSS/example_gene_pval.txt", gene_name = 1, pv_name = 2, output_file = "example_out", gmt = "~/GAUSS/example_gmt2.txt", ags = "def",verbose = TRUE,parallel = TRUE,start = 21,stop = 50)
closeAllConnections()
### This will run GAUSS on the 21st gene-set (GO_FOREBRAIN_NEURON_DEVELOPMENT) through 50th gene-set (GO_ACYLGLYCEROL_HOMEOSTASIS)
-
summary_file
: Summary file containing the gene-names and corresponding p-values -
gene_name
: Column number for gene names in summary file -
pv_name
: Column number for p-value in summary file -
output_file
: prefix for output file names -
gmt
: GMT file containing the list of gene-set, one gene-set per line -
verbose
: Print extra output; default = TRUE -
method
: Method/test used to generate the gene-based p-values. The reference Vh correlation will be loaded according to this. Currently the supported options are "SKAT-CR" (for SKAT Common-Rare) or "TWAS.X" (for TWAS on tissue X. e.g: "TWAS.Whole_Blood"). For full list of available reference Vh, seedata/
directory. -
ags
: settings for control arguments for running GAUSS; options are "def" or "prec"; "def" should be used for a quick initial scan and the significant associations can be followed up using "prec". -
parallel
: logical indicating whether parallel jobs are to be created; default = FALSE -
jobs
: number of parallel jobs to be created if parallel = TRUE -
jobfile
: name of the output file with parallel jobs for GAUSS if parallel = TRUE -
start
: starting point of the GMT file if parallel = TRUE -
stop
: stopping point of the GMT file if parallel = TRUE -
is.appx
: GPD approximation invoked. default = TRUE -
gpd.est.res
: Number of resampling iterations to estimate GPD parameters. default = 250 -
pv.null.wt1
: User input reference dataset for calculating Vh. Please refer to the manuscript and the next section for more details.
For a GMT file containing many gene-set definitions, it is much easier to divide it into multiple jobs as:
Rscript ~/GAUSS/utils/run_GAUSS_All.R --summary ~/GAUSS/example_gene_pval.txt --gmtFile ~/GAUSS/example_gmt.txt --out ex1 --pvalue 2 --geneName 1 --parallel TRUE --jobs 10 -f ~/AJ.txt
This will create 10 Rscript
jobs in the file ~/AJ.txt
.
head -2 ~/AJ.txt
Rscript run_GAUSS_All.R --summary /Users/diptavo/GAUSS/example_gene_pval.txt --geneName 1 --pvalue 2 --out ex1_1 --gmtFile /Users/diptavo/GAUSS/example_gmt.txt --verbose TRUE --ags def --parallel FALSE --start 1 --stop 20
Rscript run_GAUSS_All.R --summary /Users/diptavo/GAUSS/example_gene_pval.txt --geneName 1 --pvalue 2 --out ex1_2 --gmtFile /Users/diptavo/GAUSS/example_gmt.txt --verbose TRUE --ags def --parallel FALSE --start 21 --stop 40
You can run each line of this file appropriately in batch mode, HPC etc.
--summary
: Gene-based p-value file in flat-text format without header.--geneName
: the column number in the above file that contains the names of the genes; defualt value is1
.--pvalue
: the column number in the above file that contains the p-values for the genes; defualt value is2
.--out
: the prefix of the output file; default isout
. Two files will be generated:.log
contains some information on running and.out
contains the gene-set, p-value and the corresponding CS genes.--gmtFile
: GMT file containing the list of gene-set, one gene-set per line in the format mentioned above, with a header line.
There are a few other options that can be used:
--verbose
: Outputs some run-time messages in.log
file; default isTRUE
--ags
: A path to a file that contains some control options for GAUSS. Please don't change this. This will be updated in a later version.
For full list of options please run
Rscript ~/GAUSS/utils/run_GAUSS_All.R --help
We performed the association analysis of 1,403 binary phenotypes from UK-Biobank with C2
(Curated pathways) and C5
(GO pathways) from MSigDB v6.2 using GAUSS. The results can be visualized using a PheWeb-like visual server.
You can create your own reference data for conducting GAUSS test if you are not using any of the standard tests included in GAUSS. To create a compatible Vh dataset follow these steps:
-
You need the individual level genotype files and an annotation file. In absence of individual level genotypes, an acceptable set of GWAS variants can be obtained from the 1000 Genomes genotypes obtained via the FUSION or LDSCore packages
wget https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2
-
The annotation file would contain information on the assignment of variants to genes. The format needs to be compatible with the gene-based test software you are going to use.
-
Independently generate a null phenotype from
N(0,1)
without any variant effects. -
Perform gene-based test using the test/software of your choice, annotation files and individual level genotype data and simulated phenotype data.
-
Record the p-values of the genes and convert them to z-values.
-
Repeat this step multiple times (100 or 500 or 1000) and obtain a matrix of z-values with dimensions
RxG
whereR
is the number of repeatations (= 100/500/1000 or others as specified) andG
is the number of genes. Store this matrix and use this as input with optionpv.null.wt1
inGAUSS_All(.)
An example workflow to create TWAS-FUSION reference Vh for a particular tissue:
-
Independently generate a null phenotype from
N(0,1)
without any variant effects. -
Run GWAS with simulated phenotype and variants from 1000 Genomes obtained at the (FUSION package)[https://data.broadinstitute.org/alkesgroup/FUSION/LDREF.tar.bz2] using GWAS pipelines like PLINK, GCTA/fastGWA or EPACTS
-
For a given tissue, run TWAS analysis using the summary statistics from the above GWAS and appropriate tissue-specific weights following the (instructions)[http://gusevlab.org/projects/fusion/#typical-analysis-and-output]
-
Repeat this multiple (
R
) times and store the resultant null z-values in a matrix format as described above.
- 06/13/2021: Bugs in options and parallel job creation fixed
- 01/04/2021: README updated
- 12/16/2020: TWAS-FUSION reference panel added
- 07/06/2020: GAUSS published as a R-package along with detailed documentations and utility function.