This repository has been archived by the owner on Jan 31, 2020. It is now read-only.
Speed improvements, especially for sorry genomes #92
pindel2vcf runs very slowly on plant genomes that aren't in the very best of shape; the current version can take many days to process pindel output. The changes in this pull request let us process results in a reasonable amount of time.
Changes in this pull request are:
Use the .fai FASTA index to avoid parsing the entire reference file multiple times; previously it was parsed at least once, plus once per contig in the results. Pindel currently requires the .fai file, so I believe it's reasonable to assume it exists.
Index the first occurrence of each chromosome in each pindel result file (_D, _INT, etc.) during the first-pass scan in GetSampleNamesAndChromosomeNames, then use that index to avoid re-parsing entire pindel output files for every new contig.
Limit calls to isSVSummarizingLine by first checking whether the line starts with a digit.
Use std::getline instead of reading character by character. I've tested std::getline with a FASTA sequence of up to 400mb on a single line and it has no issues. I'm guessing the version note about getline having issues referred to std::istream::getline, which needs buffer management.
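As a rough illustration of the first and last items, here is a minimal sketch of seeking straight to one contig through a samtools-style .fai index and reading its bases with std::getline. The names FaiEntry, loadFaiIndex and fetchContig are hypothetical and are not the functions used in this pull request; the sketch assumes the usual .fai column order (name, length, offset, linebases, linewidth) and only uses the first three columns.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical record holding the .fai columns this sketch needs.
struct FaiEntry {
    std::uint64_t length = 0;  // number of bases in the contig
    std::uint64_t offset = 0;  // byte offset of the contig's first base
};

// Parse tab-separated .fai lines: name, length, offset, linebases, linewidth.
// Lines that do not parse are skipped.
std::map<std::string, FaiEntry> loadFaiIndex(std::istream& fai) {
    std::map<std::string, FaiEntry> index;
    std::string line;
    while (std::getline(fai, line)) {
        std::istringstream fields(line);
        std::string name;
        FaiEntry entry;
        if (std::getline(fields, name, '\t') &&
            (fields >> entry.length >> entry.offset)) {
            index[name] = entry;
        }
    }
    return index;
}

// Jump to one contig instead of scanning the whole reference, then collect
// its bases line by line until the indexed length has been read.
std::string fetchContig(std::istream& fasta, const FaiEntry& entry) {
    std::string sequence;
    sequence.reserve(entry.length);
    fasta.clear();  // reset EOF/fail bits left over from an earlier read
    fasta.seekg(static_cast<std::streamoff>(entry.offset));
    std::string line;
    while (sequence.size() < entry.length && std::getline(fasta, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();  // tolerate Windows line endings
        }
        sequence += line;
    }
    if (sequence.size() > entry.length) {
        sequence.resize(entry.length);  // guard against a stale index
    }
    return sequence;
}
```

With an index loaded once, each contig named in the pindel output costs one seek plus a read of that contig only, rather than another pass over the whole reference. Because std::getline grows the std::string as needed, a sequence stored on a single very long line needs no manual buffer handling.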
Timing:
kitaake - 12 chromosomes followed by 1300 scaffolds ( ~400mb )
v 0.6.3 56 minutes
v 0.6.0 5 minutes
v this 30 seconds
nipponbare - 12 chromosomes and 2 organelles ( ~400mb )
v 0.6.3 241 seconds
v 0.6.0 55 seconds
v this 50 seconds
panicum - 9 chromosomes followed by 8400 scaffolds ( ~550 mb )
result files pre-grepped for ChrID lines
v 0.6.3 killed after 3 days. Estimate over a month.
v 0.6.0 22 hours 46 minutes
v this 41 minutes
clostridium - 1 contig, 3.5mb
v 0.6.3 2 seconds
v 0.6.0 1 second
v this 1 second