Speed improvements, especially for sorry genomes #92

joelmartin · 2018-05-30T04:22:42Z

pindel2vcf runs very slowly on plant genomes that aren't in the very best of shape. The current version can take many days to process pindel output. The changes in this pull request let us process results in a reasonable amount of time.

Changes in this pull request are:
Use the fai FASTA index to avoid parsing the entire reference file multiple times. Previously it was parsed at least once, plus once per contig in the results. The fai file is currently required by pindel, so I believe it's reasonable to assume it exists.
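As a rough illustration of the fai lookup (a sketch, not the code in this PR), the snippet below assumes the usual samtools-style index columns: name, length, offset, bases per line, bytes per line. The names `FaiEntry`, `loadFai`, and `fetchContig` are hypothetical and exist only for this example.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// One record of a .fai index (hypothetical struct, for illustration only).
struct FaiEntry {
    std::uint64_t length = 0;     // number of bases in the contig
    std::uint64_t offset = 0;     // byte offset of the contig's first base
    std::uint64_t lineBases = 0;  // bases per FASTA line
    std::uint64_t lineWidth = 0;  // bytes per FASTA line, newline included
};

// Parse tab-separated index lines: NAME LENGTH OFFSET LINEBASES LINEWIDTH.
// Lines that do not have all five fields are skipped.
std::map<std::string, FaiEntry> loadFai(std::istream& fai) {
    std::map<std::string, FaiEntry> index;
    std::string line;
    while (std::getline(fai, line)) {
        std::istringstream fields(line);
        std::string name;
        FaiEntry e;
        if (std::getline(fields, name, '\t') &&
            (fields >> e.length >> e.offset >> e.lineBases >> e.lineWidth)) {
            index[name] = e;
        }
    }
    return index;
}

// Seek directly to one contig and read its bases line by line,
// so the rest of the reference is never scanned.
std::string fetchContig(std::istream& fasta, const FaiEntry& e) {
    std::string seq;
    seq.reserve(e.length);
    fasta.clear();  // reset any EOF state left by an earlier read
    fasta.seekg(static_cast<std::streamoff>(e.offset));
    std::string line;
    while (seq.size() < e.length && std::getline(fasta, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        seq += line;
    }
    if (seq.size() > e.length) seq.resize(e.length);
    return seq;
}
```

With the index loaded once, each contig costs one seek plus a read of that contig only, however many scaffolds the reference has.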

Index the first occurrence of each chromosome in each pindel result file (_D, _INT, etc.) during the first-pass scan in GetSampleNamesAndChromosomeNames, then use that index to avoid reparsing the entire pindel output files on every new contig.
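The first-occurrence index can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the PR's implementation: it treats any line that starts with a digit and contains a `ChrID <name>` token pair as a record header, and the function names `chromosomeOf` and `indexFirstOccurrences` are hypothetical.

```cpp
#include <cctype>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Return the whitespace-delimited token that follows "ChrID", or "" if absent.
std::string chromosomeOf(const std::string& line) {
    std::istringstream tokens(line);
    std::string tok;
    while (tokens >> tok) {
        if (tok == "ChrID") {
            std::string name;
            tokens >> name;
            return name;
        }
    }
    return "";
}

// Single pass over a result file: remember the byte offset of the first
// header line seen for each chromosome.
std::map<std::string, std::streampos> indexFirstOccurrences(std::istream& in) {
    std::map<std::string, std::streampos> first;
    std::string line;
    while (true) {
        std::streampos start = in.tellg();  // offset of the line about to be read
        if (!std::getline(in, line)) break;
        if (line.empty() || !std::isdigit(static_cast<unsigned char>(line[0]))) continue;
        std::string chrom = chromosomeOf(line);
        if (!chrom.empty()) first.emplace(chrom, start);  // emplace keeps the earliest offset
    }
    return first;
}
```

A later pass for a given contig can then call `clear()` and `seekg()` with the stored offset instead of rereading the file from the top.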

Limit calls to isSVSummarizingLine by first checking whether the line starts with a digit.

Use std::getline instead of reading char by char. I've tested std::getline with FASTA sequence up to 400mb on a single line, and it has no issues. I'm guessing the version note about getline having issues referred to std::istream::getline, which needs buffer management.

Timing:
kitaake - 12 chromosomes followed by 1300 scaffolds ( ~400mb )
v 0.6.3 56 minutes
v 0.6.0 5 minutes
v this 30 seconds

nipponbare - 12 chromosomes and 2 organelles ( ~400mb )
v 0.6.3 241 seconds
v 0.6.0 55 seconds
v this 50 seconds

panicum - 9 chromosomes followed by 8400 scaffolds ( ~550 mb )
result files pre-grepped for ChrID lines
v 0.6.3 killed after 3 days. Estimate over a month.
v 0.6.0 22 hours 46 minutes
v this 41 minutes

clostridium - 1 contig, 3.5mb
v 0.6.3 2 seconds
v 0.6.0 1 second
v this 1 second

joelmartin added 3 commits May 24, 2018 18:37

getline for get

42329c6

avoid isSVSummarizingLine() if line could not be one

4c01c39

use fai for fasta index, index input during first pass

f82bce5
