1.4.1 release

sigven · Dec 7, 2020 · ffa9a5c
1 parent db93d30
Showing 8 changed files with 74 additions and 47 deletions.
diff --git a/README.md b/README.md
@@ -10,12 +10,15 @@
 
 ### Overview
 
-The germline variant annotator (*gvanno*) is a simple software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, but it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
+The germline variant annotator (*gvanno*) is a simple software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
 
-*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.
+*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
 
 ### News
-
+* December 7th 2020 - **1.4.1 release**
+ * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
+ * Software update (VEP 102)
+ * Skipped DisGeNET annotations (Open Targets serves a similar purpose)
 * September 29th 2020 - **1.4.0 release**
  * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
  * Software updates (VEP 101)
@@ -36,17 +39,16 @@ The germline variant annotator (*gvanno*) is a simple software package intended
 
 ### Annotation resources
 
-* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v101 (GENCODE v35/v19 as the gene reference dataset)
+* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v102 (GENCODE v36/v19 as the gene reference dataset)
 * [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020)
 * [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
 * [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
 * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (August 2020)
-* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
-* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_09, September 2020)
-* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_04, August 2020)
+* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (December 2020)
+* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_11, November 2020)
+* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_06, December 2020)
 * [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
-* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (September 9th 2020)
+* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (December 2nd 2020)
 
 
 ### Getting started
@@ -80,15 +82,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. Download and unpack the [latest software release (1.4.0)](https://github.com/sigven/gvanno/releases/tag/v1.4.0)
+1. Download and unpack the [latest software release (1.4.1)](https://github.com/sigven/gvanno/releases/tag/v1.4.1)
 2. Download and unpack the assembly-specific data bundle in the gvanno directory
- * [grch37 data bundle](https://drive.google.com/file/d/1VnABjA3ZCJLlQxhQKcIGaC17MD0kItVd) (approx 16Gb)
- * [grch38 data bundle](https://drive.google.com/file/d/13fbKtAFzcUGDnPfruzgK43PvAKiFc8XL/) (approx 17Gb)
+ * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20201206.tgz) (approx 16Gb)
+ * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20201206.tgz) (approx 17Gb)
  * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
  A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (1.4.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 1.9Gb):
- * `docker pull sigven/gvanno:1.4.0` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (1.4.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.3Gb):
+ * `docker pull sigven/gvanno:1.4.1` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing
 
@@ -117,7 +119,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
  --query_vcf QUERY_VCF
   VCF input file with germline query variants (SNVs/InDels).
  --gvanno_dir GVANNO_DIR
-  Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.0
+  Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.1
  --output_dir OUTPUT_DIR
   Output directory
  --genome_assembly {grch37,grch38}
@@ -149,10 +151,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
 
- python ~/gvanno-1.4.0/gvanno.py
- --query_vcf ~/gvanno-1.4.0/examples/example.grch37.vcf.gz
- --gvanno_dir ~/gvanno-1.4.0
- --output_dir ~/gvanno-1.4.0
+ python ~/gvanno-1.4.1/gvanno.py \
+ --query_vcf ~/gvanno-1.4.1/examples/example.grch37.vcf.gz \
+ --gvanno_dir ~/gvanno-1.4.1 \
+ --output_dir ~/gvanno-1.4.1 \
  --sample_id example \
  --genome_assembly grch37 \
  --container docker

diff --git a/gvanno.py b/gvanno.py
@@ -12,10 +12,10 @@
 import toml
 from argparse import RawTextHelpFormatter
 
-GVANNO_VERSION = '1.4.0'
-DB_VERSION = 'GVANNO_DB_VERSION = 20200928'
-VEP_VERSION = '101'
-GENCODE_VERSION = '35'
+GVANNO_VERSION = '1.4.1'
+DB_VERSION = 'GVANNO_DB_VERSION = 20201206'
+VEP_VERSION = '102'
+GENCODE_VERSION = '36'
 VEP_ASSEMBLY = "GRCh38"
 DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)
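
The date embedded in `DB_VERSION` (20201206) matches the date in the data bundle file names that the README change in this commit points to. A minimal sketch of how that agreement could be checked is given below. The helper names `db_version_date` and `bundle_date` are hypothetical and not part of gvanno, and the file name pattern is inferred only from the two bundle URLs in this commit.

```python
import re

# Values copied from this commit; the helper names below are hypothetical.
DB_VERSION = 'GVANNO_DB_VERSION = 20201206'
BUNDLE_FILES = [
    'gvanno.databundle.grch37.20201206.tgz',
    'gvanno.databundle.grch38.20201206.tgz',
]


def db_version_date(db_version):
    """Return the YYYYMMDD part of a 'GVANNO_DB_VERSION = YYYYMMDD' string."""
    match = re.fullmatch(r'GVANNO_DB_VERSION\s*=\s*(\d{8})', db_version.strip())
    if match is None:
        raise ValueError('unexpected DB_VERSION format: ' + repr(db_version))
    return match.group(1)


def bundle_date(filename):
    """Return the YYYYMMDD part of a data bundle file name."""
    match = re.fullmatch(r'gvanno\.databundle\.grch3[78]\.(\d{8})\.tgz', filename)
    if match is None:
        raise ValueError('unexpected bundle file name: ' + repr(filename))
    return match.group(1)


# Bundle files whose date differs from the one declared in DB_VERSION.
mismatches = [f for f in BUNDLE_FILES
              if bundle_date(f) != db_version_date(DB_VERSION)]
print(mismatches)
```

An empty list means the declared database version and the bundle file names agree.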
 

diff --git a/src/Dockerfile b/src/Dockerfile
@@ -201,7 +201,7 @@ WORKDIR /
 
 ENV PACKAGE_BIO="libhts2 bedtools"
 ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev"
-ENV PYTHON_MODULES="numpy cython scipy pandas cyvcf2 toml"
+ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9 toml==0.10.1"
 RUN apt-get update \
  && apt-get install -y --no-install-recommends \
  nano ed locales vim-tiny fonts-texgyre \
@@ -243,12 +243,16 @@ RUN apt-get update \
 USER root
 WORKDIR /
 
-RUN git clone https://github.com/atks/vt.git
-WORKDIR vt
-RUN make
-RUN make test
-RUN cp vt /usr/local/bin
-RUN export PATH=/usr/local/bin:$PATH
+## vt - variant tool set - use conda version
+## primary use in PCGR/CPSR: decomposition of multiallelic variants in a VCF file
+RUN wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh \
+ && chmod 0755 miniconda.sh
+RUN ["/bin/bash", "-c", "/miniconda.sh -b -p /conda"]
+RUN rm miniconda.sh
+
+# update conda & install vt
+RUN /conda/bin/conda update conda
+RUN /conda/bin/conda install -c bioconda vt
 
 ## Clean Up
 RUN apt-get clean autoclean
@@ -268,9 +272,9 @@ WORKDIR /
 RUN rm -rf $HOME/src/ensembl-vep/t/
 RUN rm -f $HOME/src/v335_base.tar.gz
 RUN rm -f $HOME/src/release-1-6-924.zip
-RUN rm -rf /vt
-RUN rm -rf /samtools-1.9.tar.bz2
+RUN rm -rf /samtools-1.10.tar.bz2
+RUN rm -f /conda/bin/python
 
 ADD gvanno.tgz /
-ENV PATH=$PATH:/gvanno
+ENV PATH=$PATH:/conda/bin:/gvanno
 ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH}
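
The `PYTHON_MODULES` change in this file replaces bare package names with exact `==` pins. As a rough illustration, the sketch below lists any entries in such a string that lack an exact pin. The function name `unpinned` is hypothetical, and the two strings are copied from the removed and added `ENV` lines.

```python
# Requirement strings copied from the old and new ENV PYTHON_MODULES lines.
OLD_MODULES = "numpy cython scipy pandas cyvcf2 toml"
NEW_MODULES = ("numpy==1.19.2 cython==0.29.21 scipy==1.5.3 "
               "pandas==1.1.3 cyvcf2==0.20.9 toml==0.10.1")


def unpinned(modules):
    """Return the entries that do not carry an exact 'name==version' pin."""
    result = []
    for entry in modules.split():
        name, sep, version = entry.partition('==')
        if not (name and sep and version):
            result.append(entry)
    return result


print(unpinned(OLD_MODULES))  # every entry is reported
print(unpinned(NEW_MODULES))  # nothing is reported
```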
diff --git a/src/buildDocker.sh b/src/buildDocker.sh
@@ -4,5 +4,5 @@ cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/
 tar czvfh gvanno.tgz gvanno/
 echo "Build the Docker Image"
 TAG=`date "+%Y%m%d"`
-docker build -t sigven/gvanno:$TAG --rm=true .
+docker build --no-cache -t sigven/gvanno:$TAG --rm=true .
 
diff --git a/src/gvanno.tgz b/src/gvanno.tgz
diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py
@@ -50,17 +50,36 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
  w = Writer(out_vcf, vcf)
  current_chrom = None
  num_chromosome_records_processed = 0
- gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0, 'ENSEMBL_GENE_ID':1, 'ENSEMBL_PROTEIN_ID':2, 
- 'SYMBOL':3, 'SYMBOL_ENTREZ':4,'ENTREZ_ID':5, 'UNIPROT_ID':6, 'APPRIS':7,
- 'UNIPROT_ACC':8,'REFSEQ_MRNA':9, 'CORUM_ID':10,'TUMOR_SUPPRESSOR':11,
- 'TUMOR_SUPPRESSOR_EVIDENCE':12, 'ONCOGENE':13, 'ONCOGENE_EVIDENCE':14,'DISGENET_CUI':15,
- 'MIM_PHENOTYPE_ID':16, 'OPENTARGETS_DISEASE_ASSOCS':17,
- 'OPENTARGETS_TRACTABILITY_COMPOUND':18, 'OPENTARGETS_TRACTABILITY_ANTIBODY':19,
- 'PROB_HAPLOINSUFFICIENCY': 20,'PROB_EXAC_LOF_INTOLERANT':21,'PROB_EXAC_LOF_INTOLERANT_HOM':22,
- 'PROB_EXAC_LOF_TOLERANT_NULL':23,'PROB_EXAC_NONTCGA_LOF_INTOLERANT':24,
- 'PROB_EXAC_NONTCGA_LOF_INTOLERANT_HOM':25, 'PROB_EXAC_NONTCGA_LOF_TOLERANT_NULL': 26,
- 'PROB_GNOMAD_LOF_INTOLERANT':27, 'PROB_GNOMAD_LOF_INTOLERANT_HOM': 28, 'PROB_GNOMAD_LOF_TOLERANT_NULL':29,
- 'ESSENTIAL_GENE_CRISPR': 30, 'ESSENTIAL_GENE_CRISPR2': 31}
+ gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0, 
+ 'ENSEMBL_GENE_ID':1, 
+ 'ENSEMBL_PROTEIN_ID':2, 
+ 'SYMBOL':3, 
+ 'SYMBOL_ENTREZ':4,
+ 'ENTREZ_ID':5, 
+ 'UNIPROT_ID':6, 
+ 'UNIPROT_ACC':7,
+ 'REFSEQ_MRNA':8, 
+ 'CORUM_ID':9,
+ 'TUMOR_SUPPRESSOR':10,
+ 'TUMOR_SUPPRESSOR_EVIDENCE':11, 
+ 'ONCOGENE':12, 
+ 'ONCOGENE_EVIDENCE':13,
+ 'MIM_PHENOTYPE_ID':14, 
+ 'OPENTARGETS_DISEASE_ASSOCS':15,
+ 'OPENTARGETS_TRACTABILITY_COMPOUND':16, 
+ 'OPENTARGETS_TRACTABILITY_ANTIBODY':17,
+ 'PROB_HAPLOINSUFFICIENCY': 18,
+ 'PROB_EXAC_LOF_INTOLERANT':19,
+ 'PROB_EXAC_LOF_INTOLERANT_HOM':20,
+ 'PROB_EXAC_LOF_TOLERANT_NULL':21,
+ 'PROB_EXAC_NONTCGA_LOF_INTOLERANT':22,
+ 'PROB_EXAC_NONTCGA_LOF_INTOLERANT_HOM':23, 
+ 'PROB_EXAC_NONTCGA_LOF_TOLERANT_NULL': 24,
+ 'PROB_GNOMAD_LOF_INTOLERANT':25, 
+ 'PROB_GNOMAD_LOF_INTOLERANT_HOM': 26, 
+ 'PROB_GNOMAD_LOF_TOLERANT_NULL':27,
+ 'ESSENTIAL_GENE_CRISPR': 28, 
+ 'ESSENTIAL_GENE_CRISPR2': 29}
 
  vcf_info_element_types = {}
  for e in vcf.header_iter():
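
Dropping `APPRIS` and `DISGENET_CUI` from `gvanno_xref_map` means every later index has to be renumbered by hand. One possible alternative, shown here only as a sketch and not as what the commit does, is to keep an ordered list of column names and derive the indices with `enumerate`. The names `OLD_XREF_COLUMNS` and `build_xref_map` are hypothetical; the column order is taken from the removed lines of this hunk.

```python
# Column order of the old mapping (the removed lines of this hunk).
OLD_XREF_COLUMNS = [
    'ENSEMBL_TRANSCRIPT_ID', 'ENSEMBL_GENE_ID', 'ENSEMBL_PROTEIN_ID',
    'SYMBOL', 'SYMBOL_ENTREZ', 'ENTREZ_ID', 'UNIPROT_ID', 'APPRIS',
    'UNIPROT_ACC', 'REFSEQ_MRNA', 'CORUM_ID', 'TUMOR_SUPPRESSOR',
    'TUMOR_SUPPRESSOR_EVIDENCE', 'ONCOGENE', 'ONCOGENE_EVIDENCE',
    'DISGENET_CUI', 'MIM_PHENOTYPE_ID', 'OPENTARGETS_DISEASE_ASSOCS',
    'OPENTARGETS_TRACTABILITY_COMPOUND', 'OPENTARGETS_TRACTABILITY_ANTIBODY',
    'PROB_HAPLOINSUFFICIENCY', 'PROB_EXAC_LOF_INTOLERANT',
    'PROB_EXAC_LOF_INTOLERANT_HOM', 'PROB_EXAC_LOF_TOLERANT_NULL',
    'PROB_EXAC_NONTCGA_LOF_INTOLERANT', 'PROB_EXAC_NONTCGA_LOF_INTOLERANT_HOM',
    'PROB_EXAC_NONTCGA_LOF_TOLERANT_NULL', 'PROB_GNOMAD_LOF_INTOLERANT',
    'PROB_GNOMAD_LOF_INTOLERANT_HOM', 'PROB_GNOMAD_LOF_TOLERANT_NULL',
    'ESSENTIAL_GENE_CRISPR', 'ESSENTIAL_GENE_CRISPR2',
]


def build_xref_map(columns, dropped=()):
    """Map each remaining column name to its 0-based position."""
    kept = [c for c in columns if c not in dropped]
    return {name: index for index, name in enumerate(kept)}


# Removing the two columns reproduces the indices added in this hunk.
gvanno_xref_map = build_xref_map(OLD_XREF_COLUMNS,
                                 dropped=('APPRIS', 'DISGENET_CUI'))
print(gvanno_xref_map['UNIPROT_ACC'], gvanno_xref_map['MIM_PHENOTYPE_ID'])
```

With this layout, removing or adding a column is a one-line change and the indices cannot drift out of sequence.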

diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py
@@ -372,10 +372,12 @@ def map_variant_effect_predictors(rec, algorithms):
         rec.INFO['PRIMATEAI_DBNSFP'] = str(algo_pred.split(':')[1])
     if algo_pred.startswith('list_s2:'):
         rec.INFO['LIST_S2_DBNSFP'] = str(algo_pred.split(':')[1])
+    if algo_pred.startswith('gerp_rs:'):
+        rec.INFO['GERP_DBNSFP'] = str(algo_pred.split(':')[1])
     if algo_pred.startswith('bayesdel_addaf:'):
         rec.INFO['BAYESDEL_ADDAF_DBNSFP'] = str(algo_pred.split(':')[1])
-    if algo_pred.startswith('clinpred:'):
-        rec.INFO['CLINPRED_DBNSFP'] = str(algo_pred.split(':')[1])
+    if algo_pred.startswith('aloft:'):
+        rec.INFO['ALOFTPRED_DBNSFP'] = str(algo_pred.split(':')[1])
     if algo_pred.startswith('splice_site_rf:'):
         rec.INFO['SPLICE_SITE_RF_DBNSFP'] = str(algo_pred.split(':')[1])
     if algo_pred.startswith('splice_site_ada:'):
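
The chain of `startswith` checks in `map_variant_effect_predictors` pairs a prediction prefix with an INFO tag. A table-driven variant of the same idea is sketched below, under clearly stated assumptions: a plain `dict` stands in for `rec.INFO`, the function name `map_predictions` is hypothetical, the input strings are made-up examples, and the table contains only the prefix/tag pairs fully visible in this hunk (the real function handles more algorithms).

```python
# Prefix -> INFO tag pairs taken from the lines visible in this hunk only.
DBNSFP_PREFIX_TO_TAG = {
    'list_s2': 'LIST_S2_DBNSFP',
    'gerp_rs': 'GERP_DBNSFP',
    'bayesdel_addaf': 'BAYESDEL_ADDAF_DBNSFP',
    'aloft': 'ALOFTPRED_DBNSFP',
    'splice_site_rf': 'SPLICE_SITE_RF_DBNSFP',
}


def map_predictions(info, algo_preds):
    """Copy 'prefix:value' predictions into info under the matching tag.

    info is a plain dict standing in for rec.INFO (an assumption of this
    sketch); entries with an unknown prefix or no ':' are ignored.
    """
    for algo_pred in algo_preds:
        prefix, sep, _ = algo_pred.partition(':')
        tag = DBNSFP_PREFIX_TO_TAG.get(prefix)
        if sep and tag is not None:
            # Same extraction as in the diff: second ':'-separated field.
            info[tag] = str(algo_pred.split(':')[1])
    return info


# Hypothetical input strings, for illustration only.
print(map_predictions({}, ['gerp_rs:5.2', 'aloft:D', 'unknown:1']))
```

Swapping `clinpred` for `aloft`, as this commit does, then becomes a change to one table entry.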

diff --git a/src/loftee_1.0.3.tgz b/src/loftee_1.0.3.tgz