-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adjusted logic for filtering zero-coverage samples and intervals in CreateReadCountPanelOfNormals. #6624
Conversation
0192a6d
to
b079c79
Compare
b079c79
to
a8b9577
Compare
@fleharty forgot about this short PR. Might be nice to get it in before release? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samuelklee Although a small change, can you add a simple test that passes with this change, and fails on previous code?
.filter(intervalIndex -> !filterIntervals[intervalIndex] && readCounts.getEntry(sampleIndex, intervalIndex) == 0.) | ||
.count(); | ||
if (numZerosInSample > maxZerosInSample) { | ||
if (numZerosInSample / numPassingIntervals >= maximumZerosInSamplePercentage / 100.) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused, why is numZerosInSample a double rather than an int?
If you need it to be a double so that the fraction is a double, why not cast at the point of computing the fraction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In both cases, the cast happens in the next line and the variable is not used elsewhere, so I'm OK keeping it like this. Certainly it's valid to represent an integer with a double, at least here...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samuelklee It's fine the way it is.
a8b9577
to
7c75993
Compare
Hmm, thanks for suggesting the addition of a regression test @fleharty. This caused me to realize that I actually missed another gap in the previous filtering logic that might have yielded NaNs (resulting from division by zero interval medians) in this particular edge case, which actually takes effect before the rounding error I originally fixed. However, because of how HDF5 writes NaN values as 0, this apparently doesn't lead to any catastrophic failures. We should definitely check that behavior is reasonable in this case (i.e., when interval medians are zero); I've filed #6878. In the end, I added a regression test that only passes with the changes to address the rounding error. This was a bit of a pain because we use simulated data in the tests that cover this code, and the filters are applied in sequential order only on those elements that passed the previous filter. Note that there are many other possible filtering combinations that would be impractical to test. I think that all of this filtering logic was ported over from GATK CNV (I only rewrote the code to perform the filtering in-place to improve memory usage), and I'm not sure that all edge-case behavior was well defined by the original logic (which probably implicitly assumed typical, well formed data, i.e., using more than one sample, without too many uncovered intervals). Fortunately, these edge-case usages (i.e., using a single sample to build the PoN, mistakenly including too many uncovered intervals, and/or disabling various filters) are probably not too common. |
Hmm, actually, let me take a second look at this. I think I got a bit confused looking back at the original forum post by the fact that two different users were posting about slightly different scenarios. I'll do some more testing of edge cases just to make sure I'm not missing anything. Apologies, but it's been a while since I opened this, so I need to refresh my memory! EDIT: OK, I think I understand things now and edited the previous comment to remove confusing/misleading remarks. I'm OK with merging this for this release if you are, and we can look at the NaN issue later---doesn't seem to have caused any serious issues thus far... |
2f78de3
to
766a160
Compare
766a160
to
240cc31
Compare
Fixed up a few more minor things, back to you, @fleharty! |
…reateReadCountPanelOfNormals.
1839075
to
bc0e296
Compare
bc0e296
to
976ddf2
Compare
Oops, forgot to set some random seeds and got failures on Travis that were passing locally. Think it should be OK now. Good looking out Travis RNGs, you da real MVPs. |
6c438e0
to
88d9fcb
Compare
@fleharty can we close this out? |
@fleharty just a reminder about this PR, let's try to get it in before next release. |
I think this is great, 👍 |
@fleharty thanks! I think I need an explicit approval to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't realize that didn't realize I didn't approve.
Just a minor fix, but could conceivably change results by keeping/dropping samples/intervals on the edge of the filter. See discussion in https://gatk.broadinstitute.org/hc/en-us/community/posts/360057785591-Error-while-running-CreateReadCountPanelOfNormals