The genipe module (standing for GENome-wide Imputation PipelinE) includes a
script (named genipe-launcher) that automatically runs a genome-wide
imputation pipeline using Plink, shapeit and impute2.
If you use genipe in any published work, please cite the paper describing the
tool:
Lemieux Perreault LP, Legault MA, Asselin G, Dubé MP: genipe: an automated genome-wide imputation pipeline with automatic reporting and statistical tools. Bioinformatics 2016, 32 (23): 3661-3663 (DOI:10.1093/bioinformatics/btw487).
Full documentation is available at http://pgxcentre.github.io/genipe/.
We recommend installing the package in a Python 3.4 (or later) virtual
environment. There are two ways to install it: pip or conda.
# Using pip
pip install genipe
# Using conda
conda install genipe -c http://statgen.org/wp-content/uploads/Softwares/genipe
The installation process should install all required dependencies to run the main imputation pipeline. Optional dependencies can also be installed manually in order to perform statistical analysis and data management (see below).
The complete installation procedure is available in the documentation.
The tool requires a standard Python 3.4 (or later) installation with the following modules:
numpy version 1.11.3 or later
Jinja2 version 2.9 or later
pandas version 0.19.2 or later
setuptools version 12.0.5 or later
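The minimum versions listed above can be checked from Python before running the pipeline. The following is a minimal sketch, not part of genipe: the helper names (version_tuple, check_requirements) are hypothetical, and it relies on importlib.metadata, which is only available in the standard library from Python 3.8 onward.

```python
from importlib import metadata  # standard library since Python 3.8

# Minimum versions taken from the requirement list above.
MINIMUM_VERSIONS = {
    "numpy": "1.11.3",
    "Jinja2": "2.9",
    "pandas": "0.19.2",
    "setuptools": "12.0.5",
}


def version_tuple(version):
    """Convert the leading numeric part of a version string to a tuple."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return {name: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            report[name] = (None, False)
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_requirements().items():
        status = "OK" if ok else "missing or too old"
        print("{}: {} ({})".format(name, installed, status))
```

The comparison only looks at the numeric release components, so pre-release suffixes such as "rc1" are ignored. This is a simplification that is adequate for a quick sanity check.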
The tool requires the binaries for Plink, shapeit and impute2.
In order to perform data management and statistical analysis (linear, logistic
and Cox's regressions), genipe requires the following Python modules:
Matplotlib
scipy
patsy
statsmodels
lifelines
Biopython
pyfaidx
drmaa
pyplink
Note that statsmodels version 0.6 (specifically its MixedLM analysis) is not
compatible with numpy version 1.12 or later.
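Since these modules are optional, it can be useful to see which ones are present before attempting a statistical analysis. The sketch below is an illustration only: find_missing is a hypothetical helper, and the import names are assumed from each project's usual convention (Biopython, for instance, is imported as "Bio").

```python
from importlib.util import find_spec

# Display name -> import name. The import names are assumptions based on
# each project's usual convention, not something genipe exposes.
OPTIONAL_MODULES = {
    "Matplotlib": "matplotlib",
    "scipy": "scipy",
    "patsy": "patsy",
    "statsmodels": "statsmodels",
    "lifelines": "lifelines",
    "Biopython": "Bio",
    "pyfaidx": "pyfaidx",
    "drmaa": "drmaa",
    "pyplink": "pyplink",
}


def find_missing(modules=OPTIONAL_MODULES):
    """Return the sorted display names of modules that cannot be located."""
    return sorted(
        name for name, import_name in modules.items()
        if find_spec(import_name) is None
    )


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing optional modules: " + ", ".join(missing))
    else:
        print("All optional modules were found.")
```

find_spec only locates a module without importing it, so the check stays fast and does not trigger import-time side effects (such as drmaa looking for its shared library).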
Finally, the tool requires a LaTeX installation to compile the automatically generated report in PDF format.
Basic tests are implemented for the script that merges impute2 files resulting from different segments of the same chromosome, along with some utility functions of the main package. To test the package (once installed), launch Python and execute the following commands:
>>> import genipe
>>> genipe.test()
Some functionalities are difficult to test, since they mostly rely on external tools (i.e. Plink, shapeit and impute2) to perform the analysis, and it is assumed that these tools were properly tested by their authors.
We will try to add further tests in the future.
The following options are available when launching a genome-wide imputation analysis.
$ genipe-launcher --help
usage: genipe-launcher [-h] [-v] [--debug] [--thread THREAD] --bfile PREFIX
[--reference FILE] [--chrom CHROM [CHROM ...]]
[--output-dir DIR] [--bgzip] [--use-drmaa]
[--drmaa-config FILE] [--preamble FILE]
[--shapeit-bin BINARY] [--shapeit-thread INT]
[--shapeit-extra OPTIONS] [--plink-bin BINARY]
[--hap-template TEMPLATE] [--legend-template TEMPLATE]
[--map-template TEMPLATE] --sample-file FILE
[--hap-nonPAR FILE] [--hap-PAR1 FILE] [--hap-PAR2 FILE]
[--legend-nonPAR FILE] [--legend-PAR1 FILE]
[--legend-PAR2 FILE] [--map-nonPAR FILE]
[--map-PAR1 FILE] [--map-PAR2 FILE]
[--impute2-bin BINARY] [--segment-length BP]
[--filtering-rules RULE [RULE ...]]
[--impute2-extra OPTIONS] [--probability FLOAT]
[--completion FLOAT] [--info FLOAT]
[--report-number NB] [--report-title TITLE]
[--report-author AUTHOR]
[--report-background BACKGROUND]
Execute the genome-wide imputation pipeline. This script is part of the
'genipe' package, version 1.4.2.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
--debug set the logging level to debug
--thread THREAD number of threads [1]
Input Options:
--bfile PREFIX The prefix of the binary pedfiles (input data).
--reference FILE The human reference to perform an initial strand check
(useful for genotyped markers not in the IMPUTE2
reference files) (optional).
Output Options:
--chrom CHROM [CHROM ...]
The chromosomes to process. It is possible to write
'autosomes' to process all the autosomes (from
chromosome 1 to 22, inclusively).
--output-dir DIR The name of the output directory. [genipe]
--bgzip Use bgzip to compress the impute2 files.
HPC Options:
--use-drmaa Launch tasks using DRMAA.
--drmaa-config FILE The configuration file for tasks (use this option when
launching tasks using DRMAA). This file should
describe the walltime and the number of
nodes/processors to use for each task.
--preamble FILE This option should be used when using DRMAA on a HPC
to load required module and set environment variables.
The content of the file will be added between the
'shebang' line and the tool command.
SHAPEIT Options:
--shapeit-bin BINARY The SHAPEIT binary if it's not in the path.
--shapeit-thread INT The number of thread for phasing. [1]
--shapeit-extra OPTIONS
SHAPEIT extra parameters. Put extra parameters between
single or normal quotes (e.g. --shapeit-extra '--
states 100 --window 2').
Plink Options:
--plink-bin BINARY The Plink binary if it's not in the path.
IMPUTE2 Autosomal Reference:
--hap-template TEMPLATE
The template for IMPUTE2's haplotype files (replace
the chromosome number by '{chrom}', e.g.
'1000GP_Phase3_chr{chrom}.hap.gz').
--legend-template TEMPLATE
The template for IMPUTE2's legend files (replace the
chromosome number by '{chrom}', e.g.
'1000GP_Phase3_chr{chrom}.legend.gz').
--map-template TEMPLATE
The template for IMPUTE2's map files (replace the
chromosome number by '{chrom}', e.g.
'genetic_map_chr{chrom}_combined_b37.txt').
--sample-file FILE The name of IMPUTE2's sample file.
IMPUTE2 Chromosome X Reference:
--hap-nonPAR FILE The IMPUTE2's haplotype file for the non-
pseudoautosomal region of chromosome 23.
--hap-PAR1 FILE The IMPUTE2's haplotype file for the first
pseudoautosomal region of chromosome 23.
--hap-PAR2 FILE The IMPUTE2's haplotype file for the second
pseudoautosomal region of chromosome 23.
--legend-nonPAR FILE The IMPUTE2's legend file for the non-pseudoautosomal
region of chromosome 23.
--legend-PAR1 FILE The IMPUTE2's legend file for the first
pseudoautosomal region of chromosome 23.
--legend-PAR2 FILE The IMPUTE2's legend file for the second
pseudoautosomal region of chromosome 23.
--map-nonPAR FILE The IMPUTE2's map file for the non-pseudoautosomal
region of chromosome 23.
--map-PAR1 FILE The IMPUTE2's map file for the first pseudoautosomal
region of chromosome 23.
--map-PAR2 FILE The IMPUTE2's map file for the second pseudoautosomal
region of chromosome 23.
IMPUTE2 Options:
--impute2-bin BINARY The IMPUTE2 binary if it's not in the path.
--segment-length BP The length of a single segment for imputation. [5e+06]
--filtering-rules RULE [RULE ...]
IMPUTE2 filtering rules (optional).
--impute2-extra OPTIONS
IMPUTE2 extra parameters. Put the extra parameters
between single or normal quotes (e.g. --impute2-extra
'-buffer 250 -Ne 20000').
IMPUTE2 Merger Options:
--probability FLOAT The probability threshold for no calls. [<0.9]
--completion FLOAT The completion rate threshold for site exclusion.
[<0.98]
--info FLOAT The measure of the observed statistical information
associated with the allele frequency estimate
threshold for site exclusion. [<0.00]
Automatic Report Options:
--report-number NB The report number. [genipe automatic report]
--report-title TITLE The report title. [genipe: Automatic genome-wide
imputation]
--report-author AUTHOR
The report author. [Automatically generated by genipe]
--report-background BACKGROUND
The report background section (can either be a string
or a file containing the background. [General
background]
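As a rough illustration of how the options listed above fit together, the sketch below assembles a genipe-launcher command line and shows how a '{chrom}' template expands to one file name per chromosome. The helper names (expand_template, build_launcher_command) and the input names ("my_study", "reference.sample") are hypothetical; the template strings are the examples given in the help text. The command is only built and printed, never executed.

```python
import shlex


def expand_template(template, chromosomes):
    """Replace '{chrom}' in a reference template for each chromosome."""
    return [template.format(chrom=chrom) for chrom in chromosomes]


def build_launcher_command(bfile, hap_template, legend_template,
                           map_template, sample_file,
                           chromosomes=("autosomes",), output_dir="genipe"):
    """Build the argument list for genipe-launcher (without running it)."""
    command = ["genipe-launcher", "--bfile", bfile, "--chrom"]
    command.extend(str(chrom) for chrom in chromosomes)
    command.extend([
        "--output-dir", output_dir,
        "--hap-template", hap_template,
        "--legend-template", legend_template,
        "--map-template", map_template,
        "--sample-file", sample_file,
    ])
    return command


if __name__ == "__main__":
    hap = "1000GP_Phase3_chr{chrom}.hap.gz"

    # File names that the template stands for on chromosomes 1 and 2.
    print(expand_template(hap, [1, 2]))

    command = build_launcher_command(
        bfile="my_study",                # hypothetical Plink prefix
        hap_template=hap,
        legend_template="1000GP_Phase3_chr{chrom}.legend.gz",
        map_template="genetic_map_chr{chrom}_combined_b37.txt",
        sample_file="reference.sample",  # hypothetical sample file name
    )
    # Quote each argument so the line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the command as a list of arguments (rather than a single string) avoids quoting mistakes if it is later passed to subprocess.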
The documentation (http://pgxcentre.github.io/genipe/) provides a quick and easy tutorial for the complete pipeline.
The pipeline provides a report containing relevant imputation statistics. This
report is located in the report directory. It can be compiled into a PDF file
using the following make command:
$ make && make clean
...
This report uses LaTeX and the *.tex file can be modified to add
project-specific information.
Once the genome-wide imputation analysis is performed and the quality metrics
provided by the automatic report have been reviewed, it is possible to
perform different statistical analyses (e.g. linear or logistic regression,
or survival analysis using Cox's proportional hazard model) using the provided
script named imputed-stats.
$ imputed-stats --help
usage: imputed-stats [-h] [-v] {cox,linear,logistic,mixedlm,skat} ...
Performs statistical analysis on imputed data (either SKAT analysis, or
linear, logistic or survival regression). This script is part of the 'genipe'
package, version 1.4.2.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
Statistical Analysis Type:
The type of statistical analysis to be performed on the imputed data.
{cox,linear,logistic,mixedlm,skat}
cox Cox's proportional hazard model (survival regression).
linear Linear regression (ordinary least squares).
logistic Logistic regression (GLM with binomial distribution).
mixedlm Linear mixed effect model (random intercept).
skat SKAT analysis.
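The analyses listed above work on imputed genotype probabilities rather than hard genotype calls. One common way to use such data in a regression is to convert the three probabilities of each sample (AA, AB, BB, as written by IMPUTE2 after five leading columns) into an allele dosage. The sketch below illustrates that idea only; it is not genipe's own implementation, the function names are hypothetical, and the default threshold simply mirrors the 0.9 probability threshold shown in the genipe-launcher help.

```python
def parse_impute2_line(line, prob_threshold=0.9):
    """Return (marker_name, dosages) for one IMPUTE2 genotype line.

    Each dosage is the expected count of the second allele (B), or None
    when the best genotype probability is below the threshold (no call).
    """
    fields = line.split()
    name = fields[1]
    probs = [float(value) for value in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three probabilities per sample")

    dosages = []
    for i in range(0, len(probs), 3):
        p_aa, p_ab, p_bb = probs[i:i + 3]
        if max(p_aa, p_ab, p_bb) < prob_threshold:
            dosages.append(None)  # too uncertain: treated as a no call
        else:
            dosages.append(p_ab + 2 * p_bb)
    return name, dosages


def completion_rate(dosages):
    """Fraction of samples with a call (dosage that is not None)."""
    return sum(d is not None for d in dosages) / len(dosages)


if __name__ == "__main__":
    # Made-up line with four samples, for illustration only.
    line = "1 rs123 1000 A G 1 0 0 0 1 0 0.5 0.5 0 0 0 1"
    name, dosages = parse_impute2_line(line)
    print(name, dosages, completion_rate(dosages))
```

In this made-up line, the third sample has no genotype with a probability of at least 0.9, so it is reported as a no call and lowers the completion rate of the marker.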
See the tutorials for more information: http://pgxcentre.github.io/genipe/tutorials.html
This project was initiated at the Beaulieu-Saucier Pharmacogenomics Centre of the Montreal Heart Institute. The aim was to speed up (and automate) the imputation process for the whole genome.