-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Potential performance regression in HaplotypeCaller since 4.1.5.0 #6567
Comments
@davidbenjamin Your thoughts on this? |
@droazen My mind immediately jumped to the assembly/genotyping windows PR, #6358, but that was in 4.1.5 and in any case I profiled that thoroughly. Spending more time in pairHMM could only mean more haplotypes or longer haplotypes and I'm having trouble seeing which of my PRs since 4.1.5 would do that. However, it doesn't look like any one else's PRs could be responsible. @jamesemery is the regression in 4.1.6 or just 4.1.7? Can you binary search for the commit where the runtime jumps? |
I ran a git bisect with this quick and dirty runtime experiment. This is the branch that popped out. Its worth noting that there was a fairly wide variance for results but even so this branch seemed perhaps moderately faster than master with runtimes in the range of:
|
I have not had time to do any profiling, but I have looked at a lot of commits. I think it's likely that my recent changes cause some haplotypes with leading indels to be kept when previously they may have been dropped. It's hard to believe that this could cause a 10-20% slowdown via a commensurate increase in the number of haplotypes assembled. However, haplotypes with leading indels would have a disproportionate pair HMM cost since they would spoil caching of the read-haplotype pair HMM matrix at the very beginning of the matrix. That is, in addition to being particularly expensive haplotypes because they would diverge from the previous haplotype at the first position and therefore not benefit from caching at all, they would also completely destroy whatever caching the previous haplotype would have gotten. We ought to think about haplotypes that start or end with indels. It seems to me that they are bad news and very likely artifacts of assembly windows and/or reads that end in the middle of an STR. I would worry about discarding them outright, because what if all the real variation is attached to haplotypes like this. Therefore, I think the best thing to do is to choose assembly windows more carefully and increase or decrease padding to avoid ending in an STR. Avoiding assembly windows that end in STRs is a wise thing to do regardless, so how about I make a branch for that and we can see if the performance regression goes away? |
In doing some performance evaluation work for some other HaplotypeCaller work I have noticed that there is apparently a performance regression on the order of perhaps 10-20% of runtime. Running locally I find that running over the same section of a WGS chromosome 15 on the current master 78a9ecd vs the release 4.1.5.0 i get the following results:
Master:
real 12m19.765s
user 13m49.276s
sys 0m8.571s
4.1.5.0:
real 9m50.558s
user 11m11.924s
sys 0m10.193s
Doing some very cursory digging it would appear that the culprit is in the HMM adjacent code being slowed down. (Note the relative runtime of HMM vs SW)
Master:
4.1.5.0:
The text was updated successfully, but these errors were encountered: