
Potential performance regression in HaplotypeCaller since 4.1.5.0 #6567

Open
jamesemery opened this issue Apr 23, 2020 · 4 comments

@jamesemery
Collaborator

While doing performance evaluation for some other HaplotypeCaller work, I noticed what appears to be a performance regression on the order of 10-20% of runtime. Running locally over the same section of a WGS chromosome 15 on current master (78a9ecd) vs. the 4.1.5.0 release, I get the following results:

Master:
real 12m19.765s
user 13m49.276s
sys 0m8.571s
4.1.5.0:
real 9m50.558s
user 11m11.924s
sys 0m10.193s

Doing some very cursory digging, the culprit appears to be a slowdown in the HMM-adjacent code. (Note the relative runtime of the HMM vs. SW in the profiles below.)
Master:
[Profiler screenshot: master — HMM vs. SW runtime breakdown]

4.1.5.0:
[Profiler screenshot: 4.1.5.0 — HMM vs. SW runtime breakdown]

@droazen droazen added this to the GATK-Priority-Backlog milestone Apr 23, 2020
@droazen
Contributor

droazen commented Apr 23, 2020

@davidbenjamin Your thoughts on this?

@davidbenjamin
Contributor

@droazen My mind immediately jumped to the assembly/genotyping windows PR, #6358, but that was in 4.1.5, and in any case I profiled it thoroughly. Spending more time in the pair HMM could only mean more haplotypes or longer haplotypes, and I'm having trouble seeing which of my PRs since 4.1.5 would do that. It doesn't look like anyone else's PRs could be responsible either.

@jamesemery is the regression in 4.1.6 or just 4.1.7? Can you binary search for the commit where the runtime jumps?

@jamesemery
Collaborator Author

jamesemery commented Apr 24, 2020

I ran a git bisect with this quick-and-dirty runtime experiment, and this is the commit that popped out. It's worth noting that there was fairly wide variance in the results, but even so this commit seemed moderately faster than master, with runtimes in the range of:
real 10m55.401s
user 12m6.348s
sys 0m8.147s
as opposed to the 12-minute range from my first trials. I would have to dig further to determine whether another branch added to this effect.

Author: David Benjamin <davidben@broadinstitute.org>
Date:   Mon Mar 23 13:48:06 2020 -0400

    Fix edge case when haplotypes have leading insertion after trimming (#6518)

:040000 040000 82b682988b62ac4b49e0b123912d488a32045e9e a1368ea7f1837168745853739877253369d73fe8 M src

@davidbenjamin
Contributor

I have not had time to do any profiling, but I have looked at a lot of commits. I think it's likely that my recent changes cause some haplotypes with leading indels to be kept when previously they may have been dropped. It's hard to believe that this alone could cause a 10-20% slowdown via a commensurate increase in the number of haplotypes assembled. However, haplotypes with leading indels would have a disproportionately large pair HMM cost because they spoil caching of the read-haplotype pair HMM matrix at the very beginning of the matrix. That is, in addition to being particularly expensive haplotypes themselves, because they diverge from the previous haplotype at the first position and therefore get no benefit from caching, they also destroy whatever caching benefit the next haplotype would otherwise have gotten.
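
For illustration, here is a rough sketch of the kind of prefix caching being described (not GATK's actual PairHMM code; the class and method names are hypothetical): matrix columns computed for the previously evaluated haplotype can be reused up to the first base at which the current haplotype differs, so a haplotype whose first base already differs (e.g. because of a leading insertion) gets a reuse length of zero, and so does whichever haplotype is evaluated next against it.

```java
import java.util.List;

// Hypothetical sketch of prefix-based pair HMM caching (not GATK's actual implementation).
final class PrefixCachedPairHmmSketch {

    /** Index of the first base at which the previous and current haplotypes differ. */
    static int firstDifferingPosition(final byte[] previous, final byte[] current) {
        final int limit = Math.min(previous.length, current.length);
        int i = 0;
        while (i < limit && previous[i] == current[i]) {
            i++;
        }
        return i;
    }

    /**
     * Evaluate haplotypes in order against one read, reusing matrix columns
     * for the prefix shared with the previously evaluated haplotype.
     */
    double[] likelihoods(final byte[] read, final List<byte[]> haplotypes) {
        final double[] result = new double[haplotypes.size()];
        byte[] previous = null;
        for (int h = 0; h < haplotypes.size(); h++) {
            final byte[] current = haplotypes.get(h);
            // Columns [0, startColumn) are identical to the previous haplotype's matrix and can be
            // reused. A leading indel makes startColumn == 0, so nothing is reused here, and the
            // next haplotype will also share no prefix with this one.
            final int startColumn = previous == null ? 0 : firstDifferingPosition(previous, current);
            result[h] = fillMatrixFromColumn(read, current, startColumn);
            previous = current;
        }
        return result;
    }

    private double fillMatrixFromColumn(final byte[] read, final byte[] haplotype, final int startColumn) {
        // The actual HMM recursion over the remaining columns would go here; omitted for brevity.
        return 0.0;
    }
}
```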

We ought to think about haplotypes that start or end with indels. It seems to me that they are bad news and very likely artifacts of assembly windows and/or reads that end in the middle of an STR. I would worry about discarding them outright, though, because all of the real variation might be attached to haplotypes like this. Therefore, I think the best thing to do is to choose assembly windows more carefully, increasing or decreasing padding to avoid ending in an STR.

Avoiding assembly windows that end in STRs is a wise thing to do regardless, so how about I make a branch for that and we can see if the performance regression goes away?
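
As a rough illustration of the padding idea (purely hypothetical names and thresholds, not a proposal for the actual branch), one could check whether a window boundary falls at the end of a short tandem repeat run and, if so, push the boundary forward until it no longer does:

```java
// Hypothetical sketch: extend an assembly window boundary so it does not end inside an STR.
final class StrAwarePaddingSketch {

    /**
     * Returns the repeat period (1..maxPeriod) if the bases immediately before {@code end}
     * (exclusive) form a tandem repeat of at least {@code minRepeats} copies, or 0 otherwise.
     */
    static int repeatPeriodEndingAt(final byte[] reference, final int end, final int maxPeriod, final int minRepeats) {
        for (int period = 1; period <= maxPeriod; period++) {
            final int span = period * minRepeats;
            if (end < span) {
                continue;
            }
            boolean repeats = true;
            // Each base in the last (span - period) positions must match the base one period earlier.
            for (int i = end - span + period; i < end && repeats; i++) {
                repeats = reference[i] == reference[i - period];
            }
            if (repeats) {
                return period;
            }
        }
        return 0;
    }

    /** Simplistic padding adjustment: advance the window end one base at a time past any repeat run. */
    static int adjustWindowEnd(final byte[] reference, int end, final int maxPeriod, final int minRepeats) {
        while (end < reference.length && repeatPeriodEndingAt(reference, end, maxPeriod, minRepeats) > 0) {
            end++;
        }
        return end;
    }
}
```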

@droazen droazen removed this from the GATK-Priority-Backlog milestone Jun 22, 2020