
Potential performance regression in HaplotypeCaller since 4.1.5.0 #6567

Open
jamesemery opened this issue Apr 23, 2020 · 4 comments

@jamesemery
Collaborator

While doing performance evaluation for some other HaplotypeCaller work, I noticed what appears to be a performance regression on the order of 10-20% of runtime. Running locally over the same section of a WGS chromosome 15 on current master (78a9ecd) vs. the 4.1.5.0 release, I get the following results:

Master:
real 12m19.765s
user 13m49.276s
sys 0m8.571s
4.1.5.0:
real 9m50.558s
user 11m11.924s
sys 0m10.193s

Doing some very cursory digging, the culprit appears to be a slowdown in the HMM-adjacent code. (Note the relative runtime of the HMM vs. SW in the profiles below.)
Master:
[Profiler screenshot: master — HMM vs. SW runtime breakdown]

4.1.5.0:
[Profiler screenshot: 4.1.5.0 — HMM vs. SW runtime breakdown]

@droazen droazen added this to the GATK-Priority-Backlog milestone Apr 23, 2020
@droazen
Contributor

droazen commented Apr 23, 2020

@davidbenjamin Your thoughts on this?

@davidbenjamin
Contributor

@droazen My mind immediately jumped to the assembly/genotyping windows PR, #6358, but that was in 4.1.5, and in any case I profiled it thoroughly. Spending more time in the pair HMM could only mean more haplotypes or longer haplotypes, and I'm having trouble seeing which of my PRs since 4.1.5 would do that. It doesn't look like anyone else's PRs could be responsible either.

@jamesemery is the regression in 4.1.6 or just 4.1.7? Can you binary search for the commit where the runtime jumps?

@jamesemery
Collaborator Author

jamesemery commented Apr 24, 2020

I ran a git bisect with this quick-and-dirty runtime experiment, and this is the commit that popped out. It's worth noting that there was fairly wide variance in the results, but even so this commit seemed moderately faster than master, with runtimes in the range of:
real 10m55.401s
user 12m6.348s
sys 0m8.147s
as opposed to the 12-minute range from my first trials. I would have to dig further to determine whether another branch added to this effect.

Author: David Benjamin <davidben@broadinstitute.org>
Date:   Mon Mar 23 13:48:06 2020 -0400

    Fix edge case when haplotypes have leading insertion after trimming (#6518)

:040000 040000 82b682988b62ac4b49e0b123912d488a32045e9e a1368ea7f1837168745853739877253369d73fe8 M src

@davidbenjamin
Contributor

I have not had time to do any profiling, but I have looked at a lot of commits. I think it's likely that my recent changes cause some haplotypes with leading indels to be kept when previously they may have been dropped. It's hard to believe that this alone could cause a 10-20% slowdown via a commensurate increase in the number of haplotypes assembled. However, haplotypes with leading indels would have a disproportionately large pair HMM cost because they spoil caching of the read-haplotype pair HMM matrix at the very beginning of the matrix. That is, in addition to being particularly expensive haplotypes themselves, because they diverge from the previous haplotype at the first position and therefore get no benefit from caching, they also destroy whatever caching benefit the next haplotype would otherwise have gotten.
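
For illustration, here is a rough sketch of the kind of prefix caching being described (not GATK's actual PairHMM code; the class and method names are hypothetical): matrix columns computed for the previously evaluated haplotype can be reused up to the first base at which the current haplotype differs, so a haplotype whose first base already differs (e.g. because of a leading insertion) gets a reuse length of zero, and so does whichever haplotype is evaluated next against it.

```java
import java.util.List;

// Hypothetical sketch of prefix-based pair HMM caching (not GATK's actual implementation).
final class PrefixCachedPairHmmSketch {

    /** Index of the first base at which the previous and current haplotypes differ. */
    static int firstDifferingPosition(final byte[] previous, final byte[] current) {
        final int limit = Math.min(previous.length, current.length);
        int i = 0;
        while (i < limit && previous[i] == current[i]) {
            i++;
        }
        return i;
    }

    /**
     * Evaluate haplotypes in order against one read, reusing matrix columns
     * for the prefix shared with the previously evaluated haplotype.
     */
    double[] likelihoods(final byte[] read, final List<byte[]> haplotypes) {
        final double[] result = new double[haplotypes.size()];
        byte[] previous = null;
        for (int h = 0; h < haplotypes.size(); h++) {
            final byte[] current = haplotypes.get(h);
            // Columns [0, startColumn) are identical to the previous haplotype's matrix and can be
            // reused. A leading indel makes startColumn == 0, so nothing is reused here, and the
            // next haplotype will also share no prefix with this one.
            final int startColumn = previous == null ? 0 : firstDifferingPosition(previous, current);
            result[h] = fillMatrixFromColumn(read, current, startColumn);
            previous = current;
        }
        return result;
    }

    private double fillMatrixFromColumn(final byte[] read, final byte[] haplotype, final int startColumn) {
        // The actual HMM recursion over the remaining columns would go here; omitted for brevity.
        return 0.0;
    }
}
```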

We ought to think about haplotypes that start or end with indels. It seems to me that they are bad news and very likely artifacts of assembly windows and/or reads that end in the middle of an STR. I would worry about discarding them outright, though, because all of the real variation might be attached to haplotypes like this. Therefore, I think the best thing to do is to choose assembly windows more carefully, increasing or decreasing padding to avoid ending in an STR.

Avoiding assembly windows that end in STRs is a wise thing to do regardless, so how about I make a branch for that and we can see if the performance regression goes away?
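
As a rough illustration of the padding idea (purely hypothetical names and thresholds, not a proposal for the actual branch), one could check whether a window boundary falls at the end of a short tandem repeat run and, if so, push the boundary forward until it no longer does:

```java
// Hypothetical sketch: extend an assembly window boundary so it does not end inside an STR.
final class StrAwarePaddingSketch {

    /**
     * Returns the repeat period (1..maxPeriod) if the bases immediately before {@code end}
     * (exclusive) form a tandem repeat of at least {@code minRepeats} copies, or 0 otherwise.
     */
    static int repeatPeriodEndingAt(final byte[] reference, final int end, final int maxPeriod, final int minRepeats) {
        for (int period = 1; period <= maxPeriod; period++) {
            final int span = period * minRepeats;
            if (end < span) {
                continue;
            }
            boolean repeats = true;
            // Each base in the last (span - period) positions must match the base one period earlier.
            for (int i = end - span + period; i < end && repeats; i++) {
                repeats = reference[i] == reference[i - period];
            }
            if (repeats) {
                return period;
            }
        }
        return 0;
    }

    /** Simplistic padding adjustment: advance the window end one base at a time past any repeat run. */
    static int adjustWindowEnd(final byte[] reference, int end, final int maxPeriod, final int minRepeats) {
        while (end < reference.length && repeatPeriodEndingAt(reference, end, maxPeriod, minRepeats) > 0) {
            end++;
        }
        return end;
    }
}
```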

@droazen droazen removed this from the GATK-Priority-Backlog milestone Jun 22, 2020