vcf-annotator uses the reference GenBank file to add more details to the variant calls in a VCF.
Using a reference GenBank file, vcf-annotator adds biological annotations to variants in a VCF file. A full list of annotations is described below; these include amino acid changes, gene information, synonymous vs. nonsynonymous status, and locus tag information, among many more.
For each mutation, if applicable, the following annotations are added to the INFO column of the VCF.
Annotation | Description |
---|---|
RefCodon | Reference codon |
AltCodon | Alternate codon |
RefAminoAcid | Reference amino acid |
AltAminoAcid | Alternate amino acid |
CodonPosition | Codon position in the gene |
SNPCodonPosition | SNP position in the codon |
AminoAcidChange | Amino acid change |
IsSynonymous | 0:nonsynonymous, 1:synonymous, 9:N/A or Unknown |
IsTransition | 0:transversion, 1:transition, 9:N/A or Unknown |
IsGenic | 0:intergenic, 1:genic |
IsPseudo | 0:not pseudo, 1:pseudo gene |
LocusTag | Locus tag associated with gene |
Gene | Name of gene |
Note | Note associated with gene |
Inference | Inference of feature. |
Product | Description of gene |
ProteinID | Protein ID of gene |
Comments | Example: Negative strand: T->C |
VariantType | Indel, SNP, Ambiguous_SNP |
FeatureType | The feature type of variant. |
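To make the `IsTransition` and `IsSynonymous` codes in the table concrete, here is a minimal Python sketch of how such values could be derived. It is not taken from vcf-annotator's source: the function names `is_transition` and `is_synonymous` are hypothetical, the codon table is the standard genetic code (amino acid assignments are the same in the bacterial table), and returning 9 for identical or non-ACGT bases is an assumption about what "N/A or Unknown" covers.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; third position varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return 1 for a transition, 0 for a transversion, 9 if undetermined."""
    ref, alt = ref.upper(), alt.upper()
    known = PURINES | PYRIMIDINES
    # Assumption: identical or ambiguous bases are reported as 9 (N/A or Unknown).
    if ref not in known or alt not in known or ref == alt:
        return 9
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return 1 if same_class else 0


def is_synonymous(ref_codon, alt_codon):
    """Return 1 if both codons encode the same amino acid, 0 if not, 9 if unknown."""
    ref_aa = CODON_TABLE.get(ref_codon.upper())
    alt_aa = CODON_TABLE.get(alt_codon.upper())
    # Codons containing ambiguous bases are missing from the table.
    if ref_aa is None or alt_aa is None:
        return 9
    return 1 if ref_aa == alt_aa else 0
```

For example, `is_transition("A", "G")` yields 1 (purine to purine), while `is_synonymous("ATG", "ACG")` yields 0 because the codons encode M and T respectively.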
vcf-annotator is available from BioConda
conda install -c bioconda vcf-annotator
git clone git@github.com:rpetit3/vcf-annotator.git
cd vcf-annotator
pip3 install -r requirements.txt
python3 vcf-annotator.py YOUR_VCF.vcf REFERENCE.gb
Nothing much else to it, just a simple script to read in a VCF and GenBank file and output an annotated VCF. Feel free to drop it in your $PATH somewhere!
vcf-annotator requires an uncompressed VCF file and the corresponding reference GenBank file. It then outputs the annotated variants, by default to STDOUT, but this can be changed at runtime with the --output option.
python3 vcf-annotator.py
usage: vcf-annotator.py [-h] [--output STRING] [--version]
VCF_FILE GENBANK_FILE
Annotate variants from a VCF file using the reference genome's GenBank file.
positional arguments:
VCF_FILE VCF file of variants
GENBANK_FILE GenBank file of the reference genome.
optional arguments:
-h, --help show this help message and exit
--output STRING File to write VCF output to (Default STDOUT).
--version show program's version number and exit
python3 vcf-annotator.py --version
vcf-annotator.py 0.5
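If you would rather call the script from Python than from a shell, the usage above can be wrapped roughly as follows. This is only a sketch: `build_command` and `annotate` are hypothetical helper names, and it assumes `vcf-annotator.py` sits in the current directory. Only the arguments shown in the usage message (`VCF_FILE`, `GENBANK_FILE`, `--output`) are used.

```python
import subprocess
import sys


def build_command(vcf_file, genbank_file, output=None, script="vcf-annotator.py"):
    """Build the argument list for one vcf-annotator.py run."""
    cmd = [sys.executable, script]
    if output is not None:
        # Without --output the annotated VCF goes to STDOUT.
        cmd += ["--output", output]
    cmd += [vcf_file, genbank_file]
    return cmd


def annotate(vcf_file, genbank_file, output=None, script="vcf-annotator.py"):
    """Run the script and return whatever it printed to STDOUT."""
    result = subprocess.run(
        build_command(vcf_file, genbank_file, output, script),
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return result.stdout

# Example call (assumes the files exist):
#   annotated = annotate("YOUR_VCF.vcf", "REFERENCE.gb")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.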
A VCF and GenBank file are included in the example-data directory. You can use these two files to verify the script is working properly.
python3 vcf-annotator.py example-data/example.vcf example-data/example.gb
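Once a VCF has been annotated, the added INFO fields can be read back with a few lines of Python. The sketch below relies only on the general VCF layout (tab-separated columns, INFO in the eighth column, `key=value` pairs separated by semicolons). The helper names `parse_info` and `iter_records` are hypothetical, the sample record is made up for illustration, and it assumes annotation values do not themselves contain semicolons.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without a value map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        if not entry:
            continue
        # Split on the first "=" only, so values may contain "=".
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def iter_records(lines):
    """Yield (chrom, pos, ref, alt, info_dict) for each data line of a VCF."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        yield cols[0], int(cols[1]), cols[3], cols[4], parse_info(cols[7])


# Made-up record for illustration only; real output will differ.
sample = "chr\t100\t.\tA\tG\t50\tPASS\tIsGenic=1;IsSynonymous=0;Gene=abcD\n"

for chrom, pos, ref, alt, info in iter_records([sample]):
    if info.get("IsGenic") == "1" and info.get("IsSynonymous") == "0":
        print(chrom, pos, ref, alt, info.get("Gene"))
```

To process a real file, pass an open file handle instead of the list, for example `iter_records(open("annotated.vcf"))`, where `annotated.vcf` is whatever path you gave to `--output`.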
This script has been developed only for microbial variant analysis. I've only tested it on VCF files output from GATK, but I would assume that, as long as the VCF format is followed, other VCF files should work as well. Currently, for a ~3 Mb genome with ~20k mutations, it takes about 10 s to annotate the VCF file. Based on this information, I'm not sure how well it would work on larger genomes (if it would even work at all!).