This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Speed improvements, especially for sorry genomes #92

Open
wants to merge 3 commits into
base: master

joelmartin

pindel2vcf runs very slowly on plant genomes that aren't in the very best of shape; the current version can take many days to process pindel output. The changes in this pull request let us process results in a reasonable amount of time.

Changes in this pull request are:
Use the .fai FASTA index to avoid parsing the entire reference file multiple times; previously it was parsed at least once, plus once per contig in the results. pindel currently requires the .fai file, so I believe it's reasonable to assume it exists.

Index the first occurrence of each chromosome in each pindel result file (_D, _INT, etc.) during the first-pass scan in GetSampleNamesAndChromosomeNames, then use that index to avoid reparsing the entire pindel output files for every new contig.

Limit calls to isSVSummarizingLine by first checking whether the line starts with a digit.

Use std::getline instead of reading character by character. I've tested std::getline with a FASTA sequence of up to 400 mb on a single line, and it has no issues. I'm guessing the version note about getline having issues referred to std::istream::getline, which needs buffer management.
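
To illustrate the first change in the list above, here is a minimal sketch of how a .fai index lets a reader jump straight to one contig instead of scanning the whole reference. It is not the code from this pull request. The names FaiEntry, FaiIndex, loadFai and fetchContig are hypothetical, and the only assumption about the index is the standard samtools faidx layout (tab-separated name, length, offset, bases per line, bytes per line).

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

// One record of a samtools-style .fai index (hypothetical struct name).
struct FaiEntry {
    std::uint64_t length = 0;     // number of bases in the contig
    std::uint64_t offset = 0;     // byte offset of the contig's first base
    std::uint64_t lineBases = 0;  // bases per sequence line
    std::uint64_t lineWidth = 0;  // bytes per sequence line, newline included
};

using FaiIndex = std::unordered_map<std::string, FaiEntry>;

// Parse the .fai text once into a name -> entry map.
FaiIndex loadFai(std::istream& fai) {
    FaiIndex index;
    std::string line;
    while (std::getline(fai, line)) {
        if (line.empty()) continue;
        std::istringstream fields(line);
        std::string name;
        FaiEntry e;
        if (!std::getline(fields, name, '\t') ||
            !(fields >> e.length >> e.offset >> e.lineBases >> e.lineWidth)) {
            throw std::runtime_error("malformed .fai line: " + line);
        }
        index[name] = e;
    }
    return index;
}

// Read a single contig by seeking to its first base, so the cost depends on
// the size of that contig and not on where it sits in the reference file.
std::string fetchContig(std::istream& fasta, const FaiEntry& e) {
    fasta.clear();  // reset eof/fail bits left over from an earlier read
    fasta.seekg(static_cast<std::streamoff>(e.offset));
    std::string seq;
    seq.reserve(e.length);
    std::string line;
    while (seq.size() < e.length && std::getline(fasta, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();
        // Append at most the number of bases still missing.
        seq.append(line, 0, e.length - seq.size());
    }
    if (seq.size() != e.length) {
        throw std::runtime_error("FASTA is shorter than the index claims");
    }
    return seq;
}
```

With an index loaded once, each contig named in the pindel output can be fetched with a single seek, in any order. The same seek-by-recorded-offset idea would apply to the second change (remembering where each chromosome first appears in a result file), though that is not shown here.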

Timing:
kitaake - 12 chromosomes followed by 1300 scaffolds (~400 mb)
v0.6.3: 56 minutes
v0.6.0: 5 minutes
this PR: 30 seconds

nipponbare - 12 chromosomes and 2 organelles (~400 mb)
v0.6.3: 241 seconds
v0.6.0: 55 seconds
this PR: 50 seconds

panicum - 9 chromosomes followed by 8400 scaffolds (~550 mb)
(result files pre-grepped for ChrID lines)
v0.6.3: killed after 3 days; estimated to take over a month
v0.6.0: 22 hours 46 minutes
this PR: 41 minutes

clostridium - 1 contig (3.5 mb)
v0.6.3: 2 seconds
v0.6.0: 1 second
this PR: 1 second
