A tool to create a draft genome file from a GATK VCF file.
Download via GitHub Releases or via Bioconda.
Authors: Alexander Herbig (herbig@shh.mpg.de, v0.84), Alexander Peltzer (v0.90+).
For questions regarding the tool, contact Alexander Peltzer (peltzer@shh.mpg.de) or open a ticket on GitHub.
Run the tool with -h to display the help. This generates the following help message:
Option "-draft" is required
-draft VAL : draft contains Ns where no call can be made. RefMod contains reference calls instead at
these positions.
-draftname DRAFT_SEQ_NAME : Name of the draft sequence.
-h : Display this help information and exit. (default: true)
-in VAL : input VCF file
-minc MIN_COVERAGE_FOR_SNP : Minimum coverage / reads confirming the call.
-minfreq MIN_SNP_FREQUENCY : Minimum fraction of reads supporting the called nucleotide.
-minq MIN_QUAL_SCORE : Minimum quality score. For UG: Phred scaled quality score. For HC genome quality score.
-ref VAL : reference genome in FastA format
-refMod VAL : More precise uncertainty encoding. N: Not covered or ambiguous. R: Low coverage but looks
like Ref. a,c,t,g (lower case): Low coverage but looks like SNP.
-uncertain VAL : Special 1234 encoded FastA output.
Example: java -jar VCF2Genome.jar -draft VAL -draftname DRAFT_SEQ_NAME -in VAL -minc MIN_COVERAGE_FOR_SNP -minfreq MIN_SNP_FREQUENCY -minq MIN_QUAL_SCORE -ref VAL -refMod VAL -uncertain VAL
A typical invocation looks like this:
java -jar VCF2Genome.jar -draft my_output_genome.fasta -draftname "My_Fancy_Genome_Name" -in my_input.vcf -minc 5 -minfreq 0.8 -minq 30 -ref myreference_genome.fasta -refMod output.refMod -uncertain 1234_output.fasta
-draft : Name of the output file to which the FastA genome sequence is written. Contains Ns where no call can be made.
-draftname : Name of the draft sequence inside the FastA file (the header of the FastA entry that is created).
-in : Name of the input VCF file in VCF4.0/4.1 format.
-minc : Minimum coverage, i.e. the number of reads required to confirm the call.
-minq : Minimum quality threshold used for filtering the calls.
-minfreq : Minimum fraction of reads supporting the called nucleotide.
-ref : Reference genome in FastA format.
-refMod : Path to the refMod format output file. This uses a more detailed encoding than simply placing N at unclear positions, which is useful, for example, for further investigation of individual sites. N: Not covered or ambiguous. R: Low coverage but looks like a reference call. a,c,t,g (lower case): Low coverage but looks like a SNP.
-uncertain : Path to the uncertainty-encoded output file in a special 1234 format used by some downstream tools.
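To get a quick impression of how much of a genome fell into each refMod category, the symbols in the refMod output can be tallied. The following is a minimal sketch, not part of VCF2Genome: the function name refmod_summary is hypothetical, and it assumes that the refMod file is FastA-like, with header lines starting with ">", which is not confirmed above.

```shell
# Hypothetical helper, not shipped with VCF2Genome.
# Tallies the refMod symbols described above in a refMod output file.
# Assumption: header lines start with ">" (FastA-style).
refmod_summary() {
  awk '
    /^>/ { next }                          # skip assumed FastA header lines
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        ch = substr($0, i, 1)
        if (ch == "N") unknown++           # not covered or ambiguous
        else if (ch == "R") lowref++       # low coverage, looks like reference
        else if (ch ~ /[acgt]/) lowsnp++   # low coverage, looks like SNP
        else other++                       # any other character
      }
    }
    END {
      printf "N\t%d\nR\t%d\nacgt\t%d\nother\t%d\n", unknown, lowref, lowsnp, other
    }
  ' "$1"
}
```

Called as refmod_summary output.refMod, it prints one tab-separated line per category.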
Note that this tool was written a couple of years ago for reconstructing genomes from GATK UnifiedGenotyper VCF output files. It may work with other genotypers that produce the same kind of VCF4.0/VCF4.1 format, but might not work well with data from, for example, GATK HaplotypeCaller. The tool requires an EMIT_ALL_SITES-compatible VCF input file.
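An all-sites VCF also contains records for positions without a variant, which the VCF format marks with "." in the ALT column. The sketch below is a rough sanity check under that assumption, not something VCF2Genome performs itself; the function name vcf_site_counts is hypothetical, and it expects an uncompressed, tab-separated VCF.

```shell
# Hypothetical helper, not shipped with VCF2Genome.
# Counts reference-only records (ALT is ".") and variant records in a VCF.
vcf_site_counts() {
  awk -F'\t' '
    /^#/ { next }                  # skip meta-information and header lines
    $5 == "." { ref++; next }      # no alternate allele: reference site
    { var++ }                      # everything else counts as a variant site
    END { printf "reference_sites\t%d\nvariant_sites\t%d\n", ref, var }
  ' "$1"
}
```

If vcf_site_counts my_input.vcf reports zero reference sites, the file was most likely written in a variants-only mode and is probably not suitable as input.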
This tool is currently unable to handle indels properly due to the index handling procedure in the software itself. SNPs are fine.
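Because indels are not handled properly, it can help to check whether an input VCF contains any before running the tool. The following sketch is an illustration only and not part of VCF2Genome: the function name vcf_count_indels is hypothetical, it expects an uncompressed, tab-separated VCF, and it treats a record as an indel when any ALT allele differs in length from REF (ignoring ".", "*" and symbolic alleles in angle brackets).

```shell
# Hypothetical helper, not shipped with VCF2Genome.
# Prints the number of VCF records with at least one ALT allele whose
# length differs from REF, i.e. records that look like indels.
vcf_count_indels() {
  awk -F'\t' '
    /^#/ { next }                                   # skip header lines
    {
      n = split($5, alts, ",")                      # ALT may list several alleles
      for (i = 1; i <= n; i++) {
        a = alts[i]
        if (a == "." || a == "*" || a ~ /^</) continue   # not a sequence allele
        if (length(a) != length($4)) { count++; break }  # length change: indel
      }
    }
    END { print count + 0 }
  ' "$1"
}
```

A result greater than zero from vcf_count_indels my_input.vcf means the draft genome should be interpreted with care around those positions.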