GitHub

Assoctool

Maintained by Roby Joehanes (roby.joehanes@gmail.com)

Location: Commons_Development:/assoctool_test/ (will be moved once this app is done)

Tool URL: https://github.com/robbyjo/assoctool

Tags: Statistics

Version: 0.0.1

What does this app do?

Assoctool is a very versatile single marker association analysis app. Current supported models:

Linear fixed or mixed effects model with or without pedigree
Generalized linear fixed or mixed effects model with or without pedigree with any link function
Non-linear fixed or mixed effect model
Robust linear fixed or mixed model
Survival model
Coxme (with or without pedigree)
Generalized Estimating Equation (GEE)
GEE-GLM
Ordinal GEE
Zero Inflated models (Poisson or Negative Binomial)
Generalized Additive Model (GAM) through GAMLSS
Multinomial regression
Truncated regression
Censored regression
Firth's regression
Rare event logistic regression
Ordinal regression
Beta regression
Quantile regression
Proportional Odds regression

Highly customized formula can be specified using R format. Anything that R can do, it can be done through assoctool.

Once the article for this tool is published, I will put it in.

What are typical use cases for this app?

Single marker association analysis (GWAS) or for transcriptomic (TWAS), epigenomic (EWAS), or microbiome.

What data are required for this app to run?

This app requires genotypic file in GDS format or txt / csv and phenotypic file in csv format.

Optional pedigree matrix can be inputted through csv or RDa format to account for family relatedness or cryptic relatedness. Pedigree matrix can also be constructed through kinship data.

Optional annotation file can be inputted for filtering markers

What does this app output?

This app outputs the effect, standard error, T or Z statistics, p-values, partial R^2 (when applicable) for the marker and all the fixed effect covariates. That is, anything that you see in summary(result) will be collected. This app allows specification which covariates to be output.

How does this app work?

This app runs single marker analysis in parallel.

Inputs

COMMON (Required)
`omics_file` file	Omics file. For GWAS, this should be a GDS file. For any others (TWAS / EWAS / microbiome), you can use txt / csv / RDa / RDs format
`pheno_file` file	Phenotypic file in txt / csv format containing all covariates to use in the model. Must contain an ID column (which is by default assumed to be named "ID" --- this default can be changed; see advanced options) to match with the omics file. Note that any columns not named according to R convention will be automatically renamed (e.g., column `age-orig` will be renamed `age.orig`, as per R convention)
`formula` string	The formula for your analysis. By default the marker is called "y" (again, default can be changed in advanced options). Example: `fnbmd ~ y + y*age + age + sex + height` Here is a typical GWAS on femoral neck BMD. The SNP is called y here and this model can account for interaction with age. `y ~ smoking + snp_chrna3 + age + sex + (1\|batch)` Here is a typical TWAS/EWAS analysis, where `y` is the expression / methylation levels. The batch term is specifically cast as random effect
`analysis_type` combo	Specify analysis type (one of GWAS, Epigenome, Transcriptome, Protein, Metabolite, Microbiome, Generic). For GWAS analysis, the app will also compute Minimum Allele Frequency information per variant. In version 0.0.1, analyses other than GWAS have no difference.
`method` combo	Specify which method to use: LM for linear fixed effect model LMM for linear mixed effect model GLM for generalized linear fixed effect model (any family except negative binomial) GLMM for generalized linear mixed effect model (any family) NLS for non-linear fixed effect model NLMM for non-linear mixed effect model PEDIGREEMM for linear mixed effect model with pedigree structure using pedigreemm package KINSHIP1 for linear mixed effect model with pedigree structure using kinship1 package KINSHIP2 for linear mixed effect model with pedigree structure using kinship2 package LOGISTF for Firth's logistic regression RLM for robust linear fixed effect model RLMM for robust linear mixed effect model POLR for proportional odds logistic regression SURVIVAL for fixed effect survival model COXME for fixed or mixed effect survival model with or without pedigree GEE for generalized estimating equation GEEGLM for GEE using GEEpack ORDGEE for Ordinal GEE ZEROINFL for zero inflated models (Poisson / Negative Binomial) GLMNB for generalized linear fixed effect model on Negative Binomial family CENSREG for censored regression TRUNCREG for truncated regression BETAREG for regression on beta-distributed responses QUANTREG for regression on quantile-based responses MLOGIT for multinomial logit regression model RELOGIT for rare event logistic model GAMLSS for generalized additive model ZELIG for any model that you can specify using Zelig package CUSTOM for custom model. In this case, you MUST specify your analysis code at minimum. See advanced options
Common (Optional)
`pedigree_file` file	File containing pedigree information
`pedigree_type` combo	Specify what the pedigree format is: whether it is kinship1, kinship2, or pedigree mm format or sparse / dense matrix. In case of matrix format, the pedigree file is expected to be in rds format. In case of the others (kinship1 or 2 or pedigreemm), they are expected to be in text files (txt / csv) and then you need to specify the ID column of the text file for participant, father, and mother. See advanced options.
`annot_file` file	File containing annotation information. If you want to filter the markers by annotation (e.g., by position), you can do so by loading annotation file
`annot_marker_id` string	The column name in the annotation file containing the ID for the marker (that you can link to your main omics file).
`annot_filter_criteria` string	A string containing an R expression to select which marker to include. Example: `Position >= 1000000 && Position <= 2000000`. This means that markers that lie within 1MB to 2MB position will be included in the analysis. Obviously, you will need to have a column called `Position` in your annotation file to make it work.
`annot_cols` string	If you wish your result file come pre-annotated using the annotation file, you can specify which columns to include. This is a comma-separated list specifying different columns in the annotation file.
Common for GWAS analysis only
`maf_threshold` float	Minimum Allele Frequency (MAF) threshold under which the marker will not be analyzed. By default it is not specified. Applicable only for GWAS analysis. If you want to analyze only markers with MAF>=5%, then you put 0.05 here.
`mac_threshold` integer	Minimum Allele Count (MAC) threshold under which the marker will not be analyzed. By default it is set to 1. Applicable only for GWAS analysis. If you want to analyze only markers with MAC>=5, then you put 5 here.
`chromosome` string	Specify which chromosome the markers are in. Applicable only for GWAS analysis. This information is used to correctly compute MAF. Enter X, Y, or MT for X, Y, or mitochondrial chromosome. Anything else will be treated as autosome. This is not needed if you have a GDS file format because GDS file contains chromosome field built-in (see GDS format section).
`sex` string	Column name in phenotype file that indicates sex. Only applicable for GWAS analyses; ignored for other analysis types. MUST be encoded as M or F to indicate males and females, respectively (again, only for GWAS case). This is only used to compute MAF / MAC.
`impute_genotype` boolean	Whether or not missing genotype data should be imputed or not. If set to true, then missing genotype will be set to MAF
ADVANCED (Essential for pedigree analysis)
`pedigree_id` string	The name of column in pedigree file indicating participant ID. By default this is also called "`ID`". Applicable only for kinship / pedigreemm pedigree format. In kinship / pedigreemm format, pedigree file contains at least three columns, the individual ID, father ID, and mother ID, followed optionally by sex information.
`pedigree_father` string	The name of column in pedigree file indicating the ID of the father of the subject. By default this is called "`father`". Applicable only for kinship / pedigreemm pedigree format.
`pedigree_mother` string	The name of column in pedigree file indicating the ID of the mother of the subject. By default this is called "`mother`". Applicable only for kinship / pedigreemm pedigree format.
`pedigree_sex` string	The name of column in pedigree file indicating the sex of the. By default this is called "`sex`". Applicable only for kinship / pedigreemm pedigree format.
`pedigree_id_col` string	If for some reason the pedigree file is specified using a different set of ID numbers than the one specified in the phenotypic data (a common situation in clinical setting), then you need to provide a linker ID column name in here. In case of Framingham Heart Study cohort, the pedigree is often provided using SabreID, while the genotype data is available in ShareID. The pedigree file contains the `pedigree_id`, `pedigree_father`, and `pedigree_mother` in SabreID, then we add another column ShareID to link to the phenotype file. In this case, we fill in ShareID here.
ADVANCED (For some analysis / models)
`fn_param_list` string	A string containing a list of parameters that you wish to pass to the main analysis function. Example use of this would be to specify the family and link for GLM / GLMM type analysis. If you want to do weighted binomial GLM with logit link, you would put in the following text: `family=binomial(link="logit"), weights=pdata[, "Weight"]` The pdata is the object in which phenotype data is stored. Certainly, for this example, you will need to make sure that the phenotype file contains a column called `Weight`. This string will be passed to the glm function (along with formula and the pdata).
`tx_fun` string	Transformation function, specified in R format. In EWAS / TWAS / Microbiome analysis, sometimes it is necessary to transform the marker due to non conformance to normality. One example of such transformation is the inverse normal transform (INT), which can be specified as: `function(x) qqnorm(as.numeric(x), plot.it=FALSE)$x` Or, Z-transform, which can be specified as: `function(x) scale(as.numeric(x), center=TRUE, scale=TRUE)`
`id_col` string	The name of column in phenotype file indicating subject ID. By default this is called "`ID`". If you name it something else (e.g., SabreID), then you should specify it here
`omics_var_name` string	By default, the omics will be called "`y`" in formula. If you wish to call the omics something else in the formula (for clarity), then you specify it here
`compute_fdr` boolean	If set to true, then Benjamini & Hochberg's False Discovery Rate (FDR) will be computed for each p-value column.
`result_var_name` string	A string containing a list of terms you wish to output into the result file. For example, you are running the following model: `fnbmd ~ y + yage + age + sex + height` If `result_var_name` is left blank, then assoctool will output the statistics for y (i.e., the marker), yage, age, sex, and height (i.e., everything). Suppose you are interested in the statistics of the marker and its interaction with age. Then, you would fill in "`y,y:age`" here.
`factors_list` string	A comma separated list of column names in phenotype file that should be treated as factors, rather than integers. By default, phenotype file will be loaded as data.frame. R usually guesses the data type for each column. If a column, say batch number, is entirely integer, then R will treat it as numeric. In regression equation, treating batch as numeric variable will be wrong. To override that behavior, you specify which columns should be treated as factors. In terms of R code, basically assoctool will iterate the list and apply `as.factor` function to each column.
ADVANCED (For custom GDS format)
`gds_var_id` string	Field name indicating variant ID. By default it is "`variant.id`". If you have a GDS file not prepared by seqVCF2GDS (of seqArray R package), then the resulting GDS file may be structured differently and thus you need to specify which field contains variant ID name.
`gds_sample_id` string	Field name indicating sample ID. By default it is "`sample.id`".
`gds_chrom_id` string	Field name indicating chromosome name for each variant. Default to "`chromosome`"
`gds_pos_id` string	Field name indicating position of each variant. Default to "`position`"
`gds_allele_id` string	Field name indicating the alleles for each variant. Default to "`allele`"
ADVANCED (For custom analysis model not covered above)
`analysis_code` file	If you choose a CUSTOM analysis method, you will need to upload your own R code in here. This app expects the code to contain a function called `doOne`, with two parameters: `i` and `mdata`. The variable `i` indicates the index of the marker in `mdata` matrix. This function must return the statistics for marker `i`. For an example, you can view assoctool's `lm.R` file.
preload_code file	If you have a custom data format or require a custom library, you can put it here. This code will be loaded right BEFORE the any of the data (main, phenotype, pedigree, annotation) are loaded and prepared. If you do choose to load the data yourself, here is the convention for assoctool: Omics data is loaded into mdata. Phenotype data is loaded into pdata, assumed a data frame. Annotation data is loaded into annot_data, assumed a data frame. Pedigree data is loaded into ped object. Variables mdata and ped can be of any object AS LONG AS the analysis code understands them. Assoctool's default for mdata is matrix object, and ped can be either matrix or kinship/pedigreemm object. If you custom-load the data, you can then skip assoctool's default loading data routine by setting appropriate flag: `opt$skip_loading_omics` --> If set to `TRUE`, assoctool will skip loading omics file. `opt$skip_loading_phenotype` --> If set to `TRUE`, assoctool will skip loading phenotype file. `opt$skip_loading_pedigree` --> If set to `TRUE`, assoctool will skip loading pedigree file. `opt$skip_loading_annotation` --> If set to `TRUE`, assoctool will skip loading annotation file.
`prologue_code` file	If you have a custom library to load or prepare the dataset differently, this is the place. The R code specified here will be run AFTER the loading of all datasets, but BEFORE any analysis run. An example use is to define the get function in assoctool, which is defined as follows: get <- function(mat, i) txFun(as.numeric(mat[i, ])); This function extracts marker i from omics data mat, assuming that mat is a matrix and that the markers are stored row wise. If you have different arrangement, this is where to override it. Another use of a prologue code is to define ID matching between omics data and phenotype data. By default, assoctool will attempt to match by ID as specified by `id_col`. However, this process is not foolproof and that you may have duplicate IDs that have to be resolved differently (such as ID joining of sort in repeat measures data, for example). If you define the data matching differently, then you will need to set `opt$skip_matching` to TRUE.
`epilogue_code` file	If filled, this is an R code that will be executed AFTER analysis is run. A good use case for this is to pare down unused columns, or to use your favorite False Discovery Rate routine on certain p-values.
ADVANCED
`compress_output` string	Set to none if you do not wish to compress the output. You can choose GZIP, BZ2, or XZ compression if you do.
`save_as_binary` boolean	Save the result file as a binary file (RDS format) that you can later load in R using `readRDS` function. If you choose to save as binary, please disable the compression option above because R automatically compress the binary file.
`from` integer; to integer	If you decide to only analyze a subset of markers, you can specify that here. This is useful to distribute the computational load of one file to several servers. Example: Server 1 compute markers 1 to 100000 --> `from=1, to=100000`; Server 2 compute markers 100001 to 200000, etc.
`block_size` integer	Assoctool tries to load the omics data file per block because the file tends to be very big and do not fit in RAM simultaneously. This parameter specifies how many markers assoctool may load at a time. By default it is set to 2500. If your run returns with an out of memory error, try reducing this block size.
`num_cores` integer	The number of cores that you want to use for computation. Defaults to the maximum number available in that server.
`load_all` boolean	If set to true, then assoctool will attempt to load the entirety of omics data to memory. Useful if the data file is relatively small. If this option is set to true, then `block_size` will be ignored.
`debug` integer	Set 0 to disable debug. Higher number increases verbosity of debug message.

Outputs

output_file string The desired name for your output file. This output will contain all the statistics of all markers desired.

To run this app

Click on Projects and select a project in which you wish to run this app. Then click the Start Analysis button and select this app.

To run this app from the command line

Example use case:

$ dx run Commons_Development:/assoctool_test/assoctool --watch --yes \
--destination Commons_Development:/assoctool_test/ --instance-type mem3_ssd1_x32 \
-iomics_file=cdbg:/Genotypes/freeze.4/GDS/freeze4.chr21.pass.gtonly.minDP10.genotypes.gds \
-ipheno_file=Commons_Development:/assoctool_test/dummy-pheno.txt \
-isex=Sex -ichromosome=autosome -iblock_size=2000 \
-ioutput_file=chr21-dummy-results.txt.bz2 -icompress_output=BZ2 -idebug=1 \
-iid_col="ID" -iformula="Pheno1~SNP+Sex+Age+Pheno2" \
-imethod="LM" -iomics_var_name="SNP" -iresult_var_name="SNP,Age"

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
resources/data/assoctool		resources/data/assoctool
src		src
.project		.project
LICENSE		LICENSE
README.md		README.md
dxapp.json		dxapp.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Assoctool

What does this app do?

What are typical use cases for this app?

What data are required for this app to run?

What does this app output?

How does this app work?

To run this app

To run this app from the command line

About

Releases

Packages

Languages

License

robbyjo/assoctool

Folders and files

Latest commit

History

Repository files navigation

Assoctool

What does this app do?

What are typical use cases for this app?

What data are required for this app to run?

What does this app output?

How does this app work?

To run this app

To run this app from the command line

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages