1.4.1 release
sigven committed Dec 7, 2020
1 parent db93d30 commit ffa9a5c
Showing 8 changed files with 74 additions and 47 deletions.
40 changes: 21 additions & 19 deletions README.md
@@ -10,12 +10,15 @@

### Overview

The germline variant annotator (*gvanno*) is a simple software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, but it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
The germline variant annotator (*gvanno*) is a simple software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.

*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record.
*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
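
The multisample behaviour described in the added sentence can be pictured with a small sketch. It is not gvanno's code: the record layout and the column names (`VAR_ID`, `VCF_SAMPLE_ID`, `GENOTYPE`) are hypothetical, and it assumes every sample gets a row for every variant, which the README does not spell out.

```python
from typing import Dict, Iterator, List


def rows_per_sample_variant(variants: List[dict], samples: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one flat record per (variant, sample) pair."""
    for var in variants:
        # Genotypes are assumed to be listed in the same order as the samples.
        for sample, genotype in zip(samples, var["genotypes"]):
            yield {
                "VAR_ID": var["id"],          # hypothetical column name
                "VCF_SAMPLE_ID": sample,      # hypothetical column name
                "GENOTYPE": genotype,         # hypothetical column name
            }


# Two variants genotyped in two samples (made-up illustration data).
samples = ["S1", "S2"]
variants = [
    {"id": "1_12345_A_G", "genotypes": ["0/1", "1/1"]},
    {"id": "2_67890_C_T", "genotypes": ["0/0", "0/1"]},
]

rows = list(rows_per_sample_variant(variants, samples))
for row in rows:
    print("\t".join(row.values()))
```

With two variants and two samples the sketch emits four rows, so the row count of such a TSV scales with the number of samples rather than with the number of variant records alone.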

### News

* December 7th 2020 - **1.4.1 release**
  * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
  * Software update (VEP 102)
  * Dropped DisGeNET annotations (Open Targets serves a similar purpose)
* September 29th 2020 - **1.4.0 release**
  * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
  * Software updates (VEP 101)
@@ -36,17 +39,16 @@ The germline variant annotator (*gvanno*) is a simple software package intended

### Annotation resources

* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v101 (GENCODE v35/v19 as the gene reference dataset)
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v102 (GENCODE v36/v19 as the gene reference dataset)
* [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020)
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (August 2020)
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_09, September 2020)
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_04, August 2020)
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (December 2020)
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_11, November 2020)
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_06, December 2020)
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (September 9th 2020)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (December 2nd 2020)


### Getting started
@@ -80,15 +82,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha

#### STEP 2: Download *gvanno* and data bundle

1. Download and unpack the [latest software release (1.4.0)](https://github.com/sigven/gvanno/releases/tag/v1.4.0)
1. Download and unpack the [latest software release (1.4.1)](https://github.com/sigven/gvanno/releases/tag/v1.4.1)
2. Download and unpack the assembly-specific data bundle in the gvanno directory
* [grch37 data bundle](https://drive.google.com/file/d/1VnABjA3ZCJLlQxhQKcIGaC17MD0kItVd) (approx. 16 GB)
* [grch38 data bundle](https://drive.google.com/file/d/13fbKtAFzcUGDnPfruzgK43PvAKiFc8XL/) (approx. 17 GB)
* [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20201206.tgz) (approx. 16 GB)
* [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20201206.tgz) (approx. 17 GB)
* *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`

A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
3. Pull the [gvanno Docker image (1.4.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx. 1.9 GB):
* `docker pull sigven/gvanno:1.4.0` (gvanno annotation engine)
3. Pull the [gvanno Docker image (1.4.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx. 2.3 GB):
* `docker pull sigven/gvanno:1.4.1` (gvanno annotation engine)
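
For environments where the `gzip -dc ... | tar xvf -` pipeline from the unpacking step is not convenient, roughly the same extraction can be sketched with Python's standard `tarfile` module. The helper name `unpack_bundle` and the demo file names are made up for illustration; the demo builds a tiny stand-in archive rather than touching a real data bundle.

```python
import tarfile
import tempfile
from pathlib import Path


def unpack_bundle(archive: Path, dest: Path) -> list:
    """Extract a gzip-compressed tar archive into dest and return its member names."""
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        # Only unpack archives obtained from a source you trust.
        tar.extractall(path=dest)
    return names


# Demo on a tiny stand-in archive (the real bundle is far larger).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "placeholder.txt").write_text("placeholder\n")

    archive = root / "demo.databundle.tgz"  # hypothetical file name
    with tarfile.open(archive, mode="w:gz") as tar:
        tar.add(root / "data", arcname="data")

    dest = root / "gvanno-demo"  # stands in for the gvanno software folder
    dest.mkdir()
    unpacked = unpack_bundle(archive, dest)
    data_dir_created = (dest / "data").is_dir()
```

As in the README, the point of the step is that a _data/_ folder ends up inside the software folder; `data_dir_created` checks exactly that for the stand-in archive.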

#### STEP 3: Input preprocessing

@@ -117,7 +119,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
--query_vcf QUERY_VCF
VCF input file with germline query variants (SNVs/InDels).
--gvanno_dir GVANNO_DIR
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.0
Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.1
--output_dir OUTPUT_DIR
Output directory
--genome_assembly {grch37,grch38}
@@ -149,10 +151,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt

The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:

python ~/gvanno-1.4.0/gvanno.py
--query_vcf ~/gvanno-1.4.0/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.4.0
--output_dir ~/gvanno-1.4.0
python ~/gvanno-1.4.1/gvanno.py
--query_vcf ~/gvanno-1.4.1/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.4.1
--output_dir ~/gvanno-1.4.1
--sample_id example
--genome_assembly grch37
--container docker
8 changes: 4 additions & 4 deletions gvanno.py
@@ -12,10 +12,10 @@
import toml
from argparse import RawTextHelpFormatter

GVANNO_VERSION = '1.4.0'
DB_VERSION = 'GVANNO_DB_VERSION = 20200928'
VEP_VERSION = '101'
GENCODE_VERSION = '35'
GVANNO_VERSION = '1.4.1'
DB_VERSION = 'GVANNO_DB_VERSION = 20201206'
VEP_VERSION = '102'
GENCODE_VERSION = '36'
VEP_ASSEMBLY = "GRCh38"
DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)
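
The hunk does not show how `DB_VERSION` is consumed. One plausible use, sketched here purely as an assumption, is to look for that exact string in a text file shipped with the data bundle, so that a 1.4.0 bundle is rejected by the 1.4.1 script. The function `bundle_matches` and the release-notes contents are hypothetical.

```python
GVANNO_VERSION = '1.4.1'
DB_VERSION = 'GVANNO_DB_VERSION = 20201206'
DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)


def bundle_matches(release_notes: str, db_version: str = DB_VERSION) -> bool:
    """Return True if any line of the release-notes text carries the expected version string.

    Assumption: the bundle ships a text file containing a line such as
    'GVANNO_DB_VERSION = 20201206'; the diff itself does not confirm this.
    """
    return any(db_version in line for line in release_notes.splitlines())


# Hypothetical release-notes contents for the previous and current bundles.
old_notes = "##\nGVANNO_DB_VERSION = 20200928\n"
new_notes = "##\nGVANNO_DB_VERSION = 20201206\n"

print(DOCKER_IMAGE_VERSION)
print(bundle_matches(old_notes), bundle_matches(new_notes))
```

Deriving the image tag from `GVANNO_VERSION`, as the original constant does, means a release only has to bump the version in one place for the `docker pull` target to follow.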

24 changes: 14 additions & 10 deletions src/Dockerfile
@@ -201,7 +201,7 @@ WORKDIR /

ENV PACKAGE_BIO="libhts2 bedtools"
ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev"
ENV PYTHON_MODULES="numpy cython scipy pandas cyvcf2 toml"
ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9 toml==0.10.1"
RUN apt-get update \
&& apt-get install -y --no-install-recommends \
nano ed locales vim-tiny fonts-texgyre \
@@ -243,12 +243,16 @@ RUN apt-get update \
USER root
WORKDIR /

RUN git clone https://github.com/atks/vt.git
WORKDIR vt
RUN make
RUN make test
RUN cp vt /usr/local/bin
RUN export PATH=/usr/local/bin:$PATH
## vt - variant tool set - use conda version
## primary use in PCGR/CPSR: decomposition of multiallelic variants in a VCF file
RUN wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh \
&& chmod 0755 miniconda.sh
RUN ["/bin/bash", "-c", "/miniconda.sh -b -p /conda"]
RUN rm miniconda.sh

# update conda & install vt
RUN /conda/bin/conda update conda
RUN /conda/bin/conda install -c bioconda vt

## Clean Up
RUN apt-get clean autoclean
@@ -268,9 +272,9 @@ WORKDIR /
RUN rm -rf $HOME/src/ensembl-vep/t/
RUN rm -f $HOME/src/v335_base.tar.gz
RUN rm -f $HOME/src/release-1-6-924.zip
RUN rm -rf /vt
RUN rm -rf /samtools-1.9.tar.bz2
RUN rm -rf /samtools-1.10.tar.bz2
RUN rm -f /conda/bin/python

ADD gvanno.tgz /
ENV PATH=$PATH:/gvanno
ENV PATH=$PATH:/conda/bin:/gvanno
ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH}
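
The `PYTHON_MODULES` change in this Dockerfile pins every module to an exact version. A quick way to confirm that a built environment matches such pins is sketched below; it is not part of the repository, the helper names are invented, and the `installed` lookup relies on `importlib.metadata`, which needs Python 3.8 or newer.

```python
from importlib import metadata

# Copied from the updated ENV line in the Dockerfile hunk.
PYTHON_MODULES = "numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9 toml==0.10.1"


def parse_pins(spec: str) -> dict:
    """Turn 'name==version name==version ...' into {name: version}."""
    pins = {}
    for item in spec.split():
        name, sep, version = item.partition("==")
        if not sep:
            raise ValueError(f"unpinned requirement: {item}")
        pins[name] = version
    return pins


def installed(name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins: dict, lookup=installed) -> dict:
    """Map each module whose installed version differs to (wanted, found)."""
    found = {name: lookup(name) for name in pins}
    return {name: (want, found[name]) for name, want in pins.items() if found[name] != want}


pins = parse_pins(PYTHON_MODULES)
print(pins)
```

Run inside the container, `find_mismatches(pins)` would return an empty dict if the image honours the pins; the `lookup` parameter exists only so the comparison can be exercised without the packages installed.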
2 changes: 1 addition & 1 deletion src/buildDocker.sh
@@ -4,5 +4,5 @@ cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/
tar czvfh gvanno.tgz gvanno/
echo "Build the Docker Image"
TAG=`date "+%Y%m%d"`
docker build -t sigven/gvanno:$TAG --rm=true .
docker build --no-cache -t sigven/gvanno:$TAG --rm=true .

Binary file modified src/gvanno.tgz
Binary file not shown.
41 changes: 30 additions & 11 deletions src/gvanno/gvanno_summarise.py
@@ -50,17 +50,36 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
w = Writer(out_vcf, vcf)
current_chrom = None
num_chromosome_records_processed = 0
gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0, 'ENSEMBL_GENE_ID':1, 'ENSEMBL_PROTEIN_ID':2,
'SYMBOL':3, 'SYMBOL_ENTREZ':4,'ENTREZ_ID':5, 'UNIPROT_ID':6, 'APPRIS':7,
'UNIPROT_ACC':8,'REFSEQ_MRNA':9, 'CORUM_ID':10,'TUMOR_SUPPRESSOR':11,
'TUMOR_SUPPRESSOR_EVIDENCE':12, 'ONCOGENE':13, 'ONCOGENE_EVIDENCE':14,'DISGENET_CUI':15,
'MIM_PHENOTYPE_ID':16, 'OPENTARGETS_DISEASE_ASSOCS':17,
'OPENTARGETS_TRACTABILITY_COMPOUND':18, 'OPENTARGETS_TRACTABILITY_ANTIBODY':19,
'PROB_HAPLOINSUFFICIENCY': 20,'PROB_EXAC_LOF_INTOLERANT':21,'PROB_EXAC_LOF_INTOLERANT_HOM':22,
'PROB_EXAC_LOF_TOLERANT_NULL':23,'PROB_EXAC_NONTCGA_LOF_INTOLERANT':24,
'PROB_EXAC_NONTCGA_LOF_INTOLERANT_HOM':25, 'PROB_EXAC_NONTCGA_LOF_TOLERANT_NULL': 26,
'PROB_GNOMAD_LOF_INTOLERANT':27, 'PROB_GNOMAD_LOF_INTOLERANT_HOM': 28, 'PROB_GNOMAD_LOF_TOLERANT_NULL':29,
'ESSENTIAL_GENE_CRISPR': 30, 'ESSENTIAL_GENE_CRISPR2': 31}
gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0,
'ENSEMBL_GENE_ID':1,
'ENSEMBL_PROTEIN_ID':2,
'SYMBOL':3,
'SYMBOL_ENTREZ':4,
'ENTREZ_ID':5,
'UNIPROT_ID':6,
'UNIPROT_ACC':7,
'REFSEQ_MRNA':8,
'CORUM_ID':9,
'TUMOR_SUPPRESSOR':10,
'TUMOR_SUPPRESSOR_EVIDENCE':11,
'ONCOGENE':12,
'ONCOGENE_EVIDENCE':13,
'MIM_PHENOTYPE_ID':14,
'OPENTARGETS_DISEASE_ASSOCS':15,
'OPENTARGETS_TRACTABILITY_COMPOUND':16,
'OPENTARGETS_TRACTABILITY_ANTIBODY':17,
'PROB_HAPLOINSUFFICIENCY': 18,
'PROB_EXAC_LOF_INTOLERANT':19,
'PROB_EXAC_LOF_INTOLERANT_HOM':20,
'PROB_EXAC_LOF_TOLERANT_NULL':21,
'PROB_EXAC_NONTCGA_LOF_INTOLERANT':22,
'PROB_EXAC_NONTCGA_LOF_INTOLERANT_HOM':23,
'PROB_EXAC_NONTCGA_LOF_TOLERANT_NULL': 24,
'PROB_GNOMAD_LOF_INTOLERANT':25,
'PROB_GNOMAD_LOF_INTOLERANT_HOM': 26,
'PROB_GNOMAD_LOF_TOLERANT_NULL':27,
'ESSENTIAL_GENE_CRISPR': 28,
'ESSENTIAL_GENE_CRISPR2': 29}

vcf_info_element_types = {}
for e in vcf.header_iter():
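
In the `gvanno_xref_map` change above, dropping `APPRIS` and `DISGENET_CUI` forced every later index to be renumbered by hand. An alternative, shown only as a sketch and not as the author's approach, is to keep the field names in an ordered list and derive the indices with `enumerate`. The list name `GVANNO_XREF_FIELDS` is hypothetical; the field names and their order are copied from the new version of the map.

```python
# Field order copied from the updated gvanno_xref_map in the hunk above.
GVANNO_XREF_FIELDS = [
    'ENSEMBL_TRANSCRIPT_ID', 'ENSEMBL_GENE_ID', 'ENSEMBL_PROTEIN_ID',
    'SYMBOL', 'SYMBOL_ENTREZ', 'ENTREZ_ID', 'UNIPROT_ID', 'UNIPROT_ACC',
    'REFSEQ_MRNA', 'CORUM_ID', 'TUMOR_SUPPRESSOR', 'TUMOR_SUPPRESSOR_EVIDENCE',
    'ONCOGENE', 'ONCOGENE_EVIDENCE', 'MIM_PHENOTYPE_ID',
    'OPENTARGETS_DISEASE_ASSOCS', 'OPENTARGETS_TRACTABILITY_COMPOUND',
    'OPENTARGETS_TRACTABILITY_ANTIBODY', 'PROB_HAPLOINSUFFICIENCY',
    'PROB_EXAC_LOF_INTOLERANT', 'PROB_EXAC_LOF_INTOLERANT_HOM',
    'PROB_EXAC_LOF_TOLERANT_NULL', 'PROB_EXAC_NONTCGA_LOF_INTOLERANT',
    'PROB_EXAC_NONTCGA_LOF_INTOLERANT_HOM', 'PROB_EXAC_NONTCGA_LOF_TOLERANT_NULL',
    'PROB_GNOMAD_LOF_INTOLERANT', 'PROB_GNOMAD_LOF_INTOLERANT_HOM',
    'PROB_GNOMAD_LOF_TOLERANT_NULL', 'ESSENTIAL_GENE_CRISPR',
    'ESSENTIAL_GENE_CRISPR2',
]

# Derive the name -> column index map instead of typing the numbers.
gvanno_xref_map = {name: idx for idx, name in enumerate(GVANNO_XREF_FIELDS)}

print(gvanno_xref_map['UNIPROT_ACC'], gvanno_xref_map['ESSENTIAL_GENE_CRISPR2'])
```

With this layout, removing or adding a field is a one-line change to the list, and the indices cannot drift out of step with the column order of the underlying cross-reference data.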
6 changes: 4 additions & 2 deletions src/gvanno/lib/annoutils.py
@@ -372,10 +372,12 @@ def map_variant_effect_predictors(rec, algorithms):
rec.INFO['PRIMATEAI_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('list_s2:'):
rec.INFO['LIST_S2_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('gerp_rs:'):
rec.INFO['GERP_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('bayesdel_addaf:'):
rec.INFO['BAYESDEL_ADDAF_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('clinpred:'):
rec.INFO['CLINPRED_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('aloft:'):
rec.INFO['ALOFTPRED_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('splice_site_rf:'):
rec.INFO['SPLICE_SITE_RF_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('splice_site_ada:'):
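
The chain of `startswith` checks in this hunk grows by two lines for every new dbNSFP predictor. A table-driven variant is sketched below as one possible refactoring, not as the code in the repository. The prefix/tag pairs are the ones fully visible in the hunk (whether or not each survives the commit); the function name `map_predictions` is invented, and a plain dict stands in for `rec.INFO`.

```python
# Prefix -> INFO tag pairs taken from the lines visible in the hunk above.
PREFIX_TO_INFO_TAG = {
    'list_s2': 'LIST_S2_DBNSFP',
    'gerp_rs': 'GERP_DBNSFP',
    'bayesdel_addaf': 'BAYESDEL_ADDAF_DBNSFP',
    'clinpred': 'CLINPRED_DBNSFP',
    'aloft': 'ALOFTPRED_DBNSFP',
    'splice_site_rf': 'SPLICE_SITE_RF_DBNSFP',
}


def map_predictions(algo_preds, info):
    """Copy '<prefix>:<value>' predictions into info under their INFO tags."""
    for algo_pred in algo_preds:
        prefix, sep, rest = algo_pred.partition(':')
        tag = PREFIX_TO_INFO_TAG.get(prefix)
        if sep and tag is not None:
            # Same field as algo_pred.split(':')[1] in the original code.
            info[tag] = rest.split(':')[0]


info = {}  # stands in for rec.INFO
map_predictions(['clinpred:D', 'aloft:R:extra', 'unknown:x'], info)
print(info)
```

Adding or retiring a predictor then means editing one dictionary entry, and unknown prefixes are ignored just as they are by the `if` chain.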
Binary file modified src/loftee_1.0.3.tgz
Binary file not shown.
