Fixes in mtVariantCaller.py to avoid issues with multiallelic positions carrying homoplasmic insertions or deletions. Variant calling was failing, either not generating a VCF or writing None in place of the expected genotypes in the GT field of the VCF.
This fix should solve issue #84.
Update to MToolBox v.1.2
- A bug in consensus fasta sequence generation has been fixed. The bug caused heteroplasmic variants to be included in the consensus fasta sequence, generating many IUPAC ambiguity codes and incorrect haplogroup predictions.
- Fix to the heteroplasmy calculation of insertions and deletions, which previously returned underestimated values.
- Changes in mapExome.py to remove reads in which the number of soft-clipped bases is > 1/3 of the read length. This removes reads that show unique best alignments to mtDNA but still carry a small non-aligned portion that can putatively map to nuclear homologous (NumtS) sequences.
- Strand-specific read depth has been added as a new feature to the VCF, under the new field SDP. This field reports the number of forward and reverse reads supporting the alternative allele, expressed as F;R, where F is the number of forward reads and R is the number of reverse reads.
- Additional options can now be specified in the configuration file (default values are used otherwise).
  Options to define the consensus fasta:
  - hf_min = float [DEFAULT = 0.2] is the lower bound of the heteroplasmy range for ALT alleles
  - hf_max = float [DEFAULT = 0.8] is the upper bound of the heteroplasmy range for ALT alleles

  If there is one ALT allele with HF > hf_max, this allele will be reported in the consensus. If all ALT alleles have hf_min <= HF <= hf_max, an IUPAC ambiguity code will be reported in the consensus. If there is no ALT allele with HF >= hf_min (i.e. every ALT allele has HF < hf_min), the REF allele will be reported in the consensus.

  Other options:
  - minrd = integer [DEFAULT = 5] is the minimum read depth to call an ALT allele in the VCF file
  - minqual = integer [DEFAULT = 25] is the minimum per-base quality score required to consider base calling variants in a certain mtDNA position
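
The consensus rule described for hf_min and hf_max can be sketched in Python as follows. This is a minimal illustration, not the code used by MToolBox: the function name consensus_base and its dict input are hypothetical, only single-nucleotide alleles are handled, and building the ambiguity code from the REF allele plus every ALT allele with HF >= hf_min is an assumption (the changelog does not say how positions with a mix of in-range and below-range ALT alleles are treated).

```python
# IUPAC ambiguity codes for sets of two or more nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus_base(ref, alt_hf, hf_min=0.2, hf_max=0.8):
    """Pick the consensus symbol for one position (hypothetical helper).

    ref    -- reference base, e.g. "A"
    alt_hf -- dict mapping each ALT base to its heteroplasmy fraction (HF)
    """
    # One ALT allele above hf_max: report that allele.
    above = [alt for alt, hf in alt_hf.items() if hf > hf_max]
    if above:
        return above[0]
    # ALT alleles inside the heteroplasmy range: report an ambiguity code.
    # Assumption: the code covers REF plus all ALT alleles with HF >= hf_min.
    in_range = [alt for alt, hf in alt_hf.items() if hf >= hf_min]
    if in_range:
        return IUPAC[frozenset([ref] + in_range)]
    # No ALT allele reaches hf_min: report the reference base.
    return ref
```

With the default thresholds, an ALT allele G at HF 0.95 on a reference A yields G, the same allele at HF 0.5 yields the ambiguity code R, and at HF 0.05 the reference base A is kept.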
Fix in mtVariantCaller.py to solve issues #64 and #72.
Fix in MToolBox.sh to handle multiple haplogroup predictions. This solves issue #73.
Tag release of MToolBox.v.1.1
A bug in the parsing of the Phylotree build 17 tree has been fixed. The fix replaces the haplogroups.txt and phylotree_r17.pickle files. The bug was causing wrong predictions, mostly for X haplogroups. Please update your MToolBox installation with git pull.
A bug reported in GitHub issues #64 and #56 has been fixed. The bug affected the parsing of insertions/deletions performed by the mtVariantCaller.py script.
A bug in the MToolBox.sh script has been identified and fixed. The pipeline was failing when input folders with "." in their name were used, because it pointed to an incorrect input directory path.
Update to MToolBox v.1.1
- Sitevar nucleotide variability reported in the annotation.csv file has been updated based on 30,806 healthy complete genomes stored in the HmtDB database. Values of nucleotide variability might have changed slightly for each mt position. The positions with the biggest changes compared to the previous sitevar calculation are:
  - 3328
  - 10956
  - 14598
  - 16135

  Please note that the sorting order of variant alleles in annotation.csv can change slightly as a result of this nucleotide variability update.
- vcf_name is a new option now supported by MToolBox that can be specified in the configuration to assign a name to the VCF file generated by the pipeline. If not specified, a file named unknown_sample_name.vcf will be created instead.
Changes in the MToolBox install.sh to allow installation on Mac OS X. To install on Mac OS, please run:
./install.sh -o
Update to Phylotree build 17 (http://www.phylotree.org/tree/index.htm).
Update of the patho_table.txt file used for functional annotation, with nucleotide variability values calculated on the human genomes available in HmtDB as of July 2017. Null values have also been replaced with NA to improve the readability of the file.
An error in the MToolBox.sh script has been fixed. The pipeline was failing in case of multiple annotation.csv files.
A new test has been added to the test directory. Old simulated files have been moved to the sim_data folder.
An error in assembleMTgenome.py has been fixed. The pipeline was failing when a deletion present in the pileup file was already included in a gap by the mtDNA assembly process.
The default zlib version has been changed to 1.2.11.
Upload of new files used by GATK IndelRealigner based on rCRS, with changes in the reference name. chrRCRS was changed to chrM for all the following files in the MToolBox/data directory:
- MITOMAP_HMTDB_known_indels_chrM.vcf
- chrM.fa
- chrM.fa.fai
- chrM.dict
- intervals_file_chrM.list
GenomeAnalysisTK.jar was removed from the MToolBox/ext_tools directory as it was an outdated version of the tool.
Users who would like to run GATK IndelRealigner are now asked to download a newer version of GATK and place it in the MToolBox/ext_tools folder:
cp GenomeAnalysisTK.jar /path/to/MToolBox/MToolBox/ext_tools/
To run GATK IndelRealigner, users have to specify UseIndelRealigner=true in the config.sh file used to run MToolBox.
Update to MToolBox v.1.0. The full installation of the pipeline is now possible by running the install.sh script provided here: https://github.com/mitoNGS/MToolBox/blob/MToolBox_devel/install.sh. This script installs all the MToolBox dependencies and creates a setup.sh file in the MToolBox directory, containing all the paths to the executables, GSNAP databases and references needed by MToolBox; setup.sh is sourced by the MToolBox.sh file. Users only need to fill in the config.sh file, which is the only mandatory argument required by MToolBox.sh. Users must assign a value to the mandatory options in the config.sh file and can optionally change other MToolBox arguments. For the full list of MToolBox arguments that can be specified through the config file, please have a look at the test_config.sh file provided in the MToolBox GitHub repository (https://github.com/mitoNGS/MToolBox/blob/master/test_rCRS_config.sh). However, mapExome.py and assembleMTgenome.py options can still be changed on the MToolBox command line, using the -m and -a options, respectively.
Update to MToolBox version 0.3.2 with the following change:
- A bug in patho_table.txt has been fixed. 313 new stop-gain mutations and 6 new missense variants are now included.
New fields added to the annotation.csv output file:
- tRNA annotation: specific information regarding mitochondrial tRNA genes (position in tRNA; tRNA type; cloverleaf secondary region; mature nucleotide; involvement of the specific position in tRNA folding).
- RNA predictions: score added for 49 variants in rRNA genes (Smith PM et al, 2014, PMID:24092330) and 207 variants in tRNA genes (Yarham JW et al, 2011, PMID:21882289; Blakely EL et al, 2013, PMID:23696415). Scores were retrieved from the literature and placed on a scale from 0 to 1. Thresholds: 0.51 for rRNAs and 0.31 for tRNAs. Scores under the threshold are reported as low pathogenicity.
- ClinVar: ClinVar annotation of associated disease(s) (January 21, 2015 update)
- PhastCons20Way: PhastCons conservation score calculated on 20 vertebrates using hg38+rCRS as reference sequence
- PhyloP20Way: PhyloP conservation score calculated on 20 vertebrates using hg38+rCRS as reference sequence
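
As an illustration of how the RNA prediction thresholds above can be applied, here is a minimal Python sketch. The function name rna_prediction and the gene-type keys are hypothetical, and labelling scores at or above the threshold as "High pathogenicity" is an assumption: the changelog only states that scores under the threshold correspond to low pathogenicity.

```python
# Thresholds stated in the changelog for scores on a 0-1 scale.
RNA_THRESHOLDS = {"rRNA": 0.51, "tRNA": 0.31}


def rna_prediction(gene_type, score):
    """Classify an RNA variant score (hypothetical helper).

    gene_type -- "rRNA" or "tRNA"
    score     -- pathogenicity score between 0 and 1
    """
    threshold = RNA_THRESHOLDS[gene_type]
    if score < threshold:
        return "Low pathogenicity"
    # Assumption: anything not under the threshold is labelled high.
    return "High pathogenicity"
```

For example, a tRNA variant with a score of 0.2 is classified as low pathogenicity, while the same score cut-off does not apply to rRNA variants, which use 0.51.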
Fields updated in the annotation.csv output file:
- Nt variability: SiteVar variability value calculated on 22,691 complete healthy genomes in HmtDB database (May 2015 update)
- Aa variability: MitVarProt variability value calculated on 22,691 complete healthy genomes in HmtDB database (May 2015 update)
- Mitomap associated disease(s), Mitomap Homoplasmy, Mitomap Heteroplasmy: July 20, 2015 update
- Mitomap somatic mutations, SM Homoplasmy, SM Heteroplasmy and Mitomap associated disease(s) only RNA mutations: July 29, 2015 update
- dbSNP ID: release 144, May 26, 2015
- OMIM link: August 4, 2015 update
Further details for this update can be found at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5123245/.
Update to MToolBox version 0.3.1 with the following change:
- A bug in the MToolBox.sh script has been fixed. Empty paired-end fastq files generated during SAM/BAM to fastq conversion are now removed.
Update to MToolBox version 0.3 with the following new options and changes:
- fastq.gz is a further possible input format file. Installation of zlib libraries is therefore required.
- users can specify the path of the working directories using the -p (path to input folder) and -o (path to output folder) options.
- users can specify a list of files to be used as input through the -l option. It accepts a text file containing one sample name per line. This list should be named "list.txt" and placed in the input folder. Alternatively, users can provide comma-separated names with the same option. The filename extension must be included in the list (e.g. mysample.sam or mysample.R1.fastq).
- users can use the -X option to extract from a BAM file the mitochondrial reads mapped onto a mitochondrial reference sequence. This option can be useful when using Whole Genome or Exome sequencing BAM files containing a huge amount of nuclear reads. The option works only with the BAM format.
- new fields added to the annotation.csv output file:
  - Disease Score: the "% Disease", "% Neutral" and "% Unclassified" fields have been replaced with an overall Disease Score, generated as a weighted average of pathogenicity prediction scores for non-synonymous variants, derived from a training dataset of 53 non-synonymous variants selected among mutations associated with mitochondrial diseases or cancer. Weights were calculated by taking into account the correct predictions and the best probability of predicting a truly pathogenic mtDNA variant for each of the pathogenicity prediction algorithms currently implemented in MToolBox.
  - MutPred Pred: MutPred prediction (Low pathogenicity, High pathogenicity).
  - dbSNP ID: Variant ID in dbSNP.
- for users' convenience, new scripts that help generate reports about the annotated and prioritized variants have been added to the suite of tools provided by MToolBox:
  - prioritization.py, which generates the prioritized_variants.txt file, reporting annotation only for the prioritized variants of each sample analyzed, defined as variants recognized by the three reference sequences (rCRS, RSRS and MHCS), sorted by increasing variability.
  - summary.py, which generates the summary.txt file, reporting statistics about the coverage of reconstructed mitochondrial genomes, number of homoplasmic and heteroplasmic variants (for NGS data), haplogroup prediction and number of prioritized variants.
- An error occurred during the generation of the circularized mitochondrial chromosome used for the gsnap db generation. The hg19RCRS/hg19RSRS and chrRCRS/chrRSRS gsnap indexed databases have been replaced with those built from the linearized mitochondrial chromosome. We apologize to MToolBox users for the inconvenience.
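
The prioritization criterion described for prioritization.py (variants recognized against all three reference sequences, sorted by increasing variability) can be sketched as follows. This is only an illustration under stated assumptions: the function name prioritize and the dict keys "rCRS", "RSRS", "MHCS" and "nt_variability" are hypothetical and do not reflect the actual column names or code of the script.

```python
def prioritize(variants):
    """Keep variants seen against all three references, least variable first.

    variants -- list of dicts with hypothetical keys:
                "name" (str), "rCRS"/"RSRS"/"MHCS" (bool, variant recognized
                against that reference) and "nt_variability" (float)
    """
    # Keep only variants recognized by rCRS, RSRS and MHCS together.
    kept = [v for v in variants if v["rCRS"] and v["RSRS"] and v["MHCS"]]
    # Sort by increasing nucleotide variability.
    return sorted(kept, key=lambda v: v["nt_variability"])
```

A variant flagged against only one or two of the references is dropped, and the remaining variants are listed starting from the least variable site.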
Update to MToolBox version 0.2.2:
- an error encountered during the analysis of hard clipping mapped reads has been fixed in the mtVariantCaller.py
- the mtVariantCaller.py has been improved to better manage sites with multiple alleles.
- hidden files included in the package and generating problems with the mt-classifier.py have been eliminated.
Update to MToolBox version 0.2.1:
- an error encountered during the sam to fastq extraction has been fixed in the MToolBox.sh file.
- RCRS, hg19RCRS, RSRS and hg19RSRS gmap indexed databases have been regenerated using the -c option for circularized chromosomes.
- update to Phylotree Build 16 for haplogroup prediction.
- an error fixed in the -t parameter of assembleMTgenome.
- an error encountered during the bam to fastq extraction has been fixed. Empty unpaired fastq files are now removed.
- changed the method for estimation of the heteroplasmy confidence interval (CI). For sites with coverage depth <= 40, the heteroplasmy CI is estimated with the Wilson score interval; for larger coverage depth values, the Agresti-Coull interval is used.
- added the possibility to use fasta inputs to perform haplogroup prediction and functional annotation.
- added the possibility to use the revised Cambridge Reference Sequence (rCRS) as reference sequence for read mapping. By using rCRS as reference sequence, the VCF output will be rCRS-based.
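
The switch between the two confidence interval estimators mentioned above (Wilson score interval for coverage depth <= 40, Agresti-Coull interval otherwise) can be sketched in Python using the standard textbook formulas. This is a minimal sketch, not the MToolBox implementation: the function names are hypothetical, and the 95% confidence level (z = 1.96) and the clamping of the bounds to [0, 1] are assumptions, since the changelog does not state them.

```python
import math


def wilson(x, n, z=1.96):
    """Wilson score interval for x successes out of n trials."""
    p = x / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def agresti_coull(x, n, z=1.96):
    """Agresti-Coull interval for x successes out of n trials."""
    n_adj = n + z ** 2
    p_adj = (x + z ** 2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def heteroplasmy_ci(alt_reads, depth, z=1.96):
    """Heteroplasmy CI: Wilson for depth <= 40, Agresti-Coull above.

    alt_reads -- reads supporting the ALT allele
    depth     -- total coverage depth at the site (must be > 0)
    """
    if depth <= 40:
        return wilson(alt_reads, depth, z)
    return agresti_coull(alt_reads, depth, z)
```

For instance, heteroplasmy_ci(10, 40) uses the Wilson interval around an observed heteroplasmy of 0.25, while heteroplasmy_ci(50, 100) uses the Agresti-Coull interval around 0.5.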