Adjusted logic for filtering zero-coverage samples and intervals in CreateReadCountPanelOfNormals. #6624

Merged
merged 3 commits into from
Nov 13, 2020

Conversation

samuelklee
Contributor

@samuelklee samuelklee commented May 28, 2020

Just a minor fix, but could conceivably change results by keeping/dropping samples/intervals on the edge of the filter. See discussion in https://gatk.broadinstitute.org/hc/en-us/community/posts/360057785591-Error-while-running-CreateReadCountPanelOfNormals
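To make "on the edge of the filter" concrete, here is a minimal sketch of how a count-based threshold and a fraction-based threshold can disagree. The previous threshold computation is not shown in this thread, so the rounded-up count rule below is a hypothetical stand-in, and the class and method names are invented for illustration; this is not the GATK code.

```java
// Illustrative sketch only. The count-based rule is an ASSUMPTION, not the GATK implementation.
final class ThresholdEdgeSketch {

    // Hypothetical count-based rule: round the allowed number of zeros up to an
    // integer, then flag only when strictly more zeros than that are observed.
    static boolean flaggedByRoundedCount(final int numZeros, final int numTotal, final double maxPercentage) {
        final int maxZeros = (int) Math.ceil(numTotal * maxPercentage / 100.);
        return numZeros > maxZeros;
    }

    // Fraction-based rule in the shape of the comparison quoted in the review below.
    static boolean flaggedByFraction(final int numZeros, final int numTotal, final double maxPercentage) {
        return (double) numZeros / numTotal >= maxPercentage / 100.;
    }

    public static void main(final String[] args) {
        // A single element that is zero, with a 5% threshold:
        // ceil(1 * 0.05) = 1, and 1 > 1 is false, so the count rule keeps it,
        // while the fraction rule sees 1.0 >= 0.05 and flags it.
        System.out.println(flaggedByRoundedCount(1, 1, 5.));
        System.out.println(flaggedByFraction(1, 1, 5.));
    }
}
```

Under these assumptions the two rules only differ at the boundary, which matches the expectation that results change only for samples or intervals sitting right at the threshold.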

@gatk-bot

gatk-bot commented May 28, 2020

Travis reported job failures from build 30435
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30435.11 logs
integration openjdk8 30435.2 logs

@samuelklee samuelklee force-pushed the sl_fix_read_count_pon_filters branch from 0192a6d to b079c79 Compare June 14, 2020 15:08
@gatk-bot

Travis reported job failures from build 30624
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30624.11 logs

@samuelklee samuelklee force-pushed the sl_fix_read_count_pon_filters branch from b079c79 to a8b9577 Compare October 7, 2020 17:10
@samuelklee samuelklee requested a review from fleharty October 7, 2020 17:10
@samuelklee
Contributor Author

@fleharty forgot about this short PR. Might be nice to get it in before release?

Contributor

@fleharty fleharty left a comment

@samuelklee Although a small change, can you add a simple test that passes with this change, and fails on previous code?

          .filter(intervalIndex -> !filterIntervals[intervalIndex] && readCounts.getEntry(sampleIndex, intervalIndex) == 0.)
          .count();
- if (numZerosInSample > maxZerosInSample) {
+ if (numZerosInSample / numPassingIntervals >= maximumZerosInSamplePercentage / 100.) {
Contributor

I'm a little confused, why is numZerosInSample a double rather than an int?
If you need it to be a double so that the fraction is a double, why not cast at the point of computing the fraction?

Contributor Author

In both cases, the cast happens in the next line and the variable is not used elsewhere, so I'm OK keeping it like this. Certainly it's valid to represent an integer with a double, at least here...?

Contributor

@samuelklee It's fine the way it is.
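
For readers following the int-versus-double question, a small sketch of why one operand of the fraction has to be floating point, and of the two equivalent ways to get there (a double-typed count, or a cast at the point of division). The class and method names are hypothetical and the values are made up; only the shape of the comparison comes from the snippet under review.

```java
// Illustrative sketch only; names are hypothetical, not taken from the GATK source.
final class ZeroFractionSketch {

    // Both operands are ints, so the division truncates: 3 / 10 == 0.
    // A sample with 30% zeros is therefore NOT flagged at a 5% threshold.
    static boolean exceedsWithIntDivision(final int numZeros, final int numPassing, final double maxPercentage) {
        return numZeros / numPassing >= maxPercentage / 100.;
    }

    // Option 1: hold the count as a double, so the division is floating point.
    static boolean exceedsWithDoubleCount(final double numZeros, final int numPassing, final double maxPercentage) {
        return numZeros / numPassing >= maxPercentage / 100.;
    }

    // Option 2: keep the count as an int and cast where the fraction is computed.
    static boolean exceedsWithCast(final int numZeros, final int numPassing, final double maxPercentage) {
        return (double) numZeros / numPassing >= maxPercentage / 100.;
    }

    public static void main(final String[] args) {
        System.out.println(exceedsWithIntDivision(3, 10, 5.));
        System.out.println(exceedsWithDoubleCount(3., 10, 5.));
        System.out.println(exceedsWithCast(3, 10, 5.));
    }
}
```

Options 1 and 2 give the same result for counts in this range, which is consistent with the conclusion above that either form is acceptable.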

@samuelklee samuelklee force-pushed the sl_fix_read_count_pon_filters branch from a8b9577 to 7c75993 Compare October 8, 2020 21:29
@samuelklee
Contributor Author

samuelklee commented Oct 8, 2020

Hmm, thanks for suggesting the addition of a regression test, @fleharty. This made me realize that I had missed another gap in the previous filtering logic: in this particular edge case it might have yielded NaNs (from dividing by interval medians that are zero), and it takes effect before the rounding error I originally fixed.

However, because of how HDF5 writes NaN values as 0, this apparently doesn't lead to any catastrophic failures. We should definitely check that behavior is reasonable in this case (i.e., when interval medians are zero); I've filed #6878.
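
As a rough illustration of the zero-median case, the sketch below divides one interval's counts by their median using plain Java doubles. The helper names are hypothetical and this is not the GATK standardization code; it only shows what IEEE 754 division produces when the median is zero.

```java
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical, not taken from the GATK source.
final class ZeroMedianSketch {

    // Median of a small array, computed on a sorted copy.
    static double median(final double[] values) {
        final double[] sorted = values.clone();
        Arrays.sort(sorted);
        final int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    }

    // Divides every count for one interval by that interval's median, with no guard.
    static double[] divideByMedian(final double[] countsForInterval) {
        final double m = median(countsForInterval);
        return Arrays.stream(countsForInterval).map(c -> c / m).toArray();
    }

    public static void main(final String[] args) {
        // Two of three samples have zero coverage, so the median is 0:
        // 0 / 0 gives NaN and 4 / 0 gives positive infinity.
        System.out.println(Arrays.toString(divideByMedian(new double[] {0., 0., 4.})));
    }
}
```

Neither value throws an exception in Java, which is one reason a gap like this can go unnoticed until the output is inspected.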

In the end, I added a regression test that only passes with the changes to address the rounding error. This was a bit of a pain because we use simulated data in the tests that cover this code, and the filters are applied in sequential order only on those elements that passed the previous filter. Note that there are many other possible filtering combinations that would be impractical to test.
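
To show why sequential filtering makes targeted tests awkward, here is a minimal two-stage sketch with boolean masks. The first stage (dropping intervals that are zero in every sample) is a hypothetical stand-in chosen for brevity, and all names are invented; only the second-stage comparison mirrors the line under review.

```java
import java.util.stream.IntStream;

// Illustrative sketch only; the stage-1 filter and all names are ASSUMPTIONS, not the GATK code.
final class SequentialFilterSketch {

    // counts[sample][interval]; returns a mask with true for samples to drop.
    static boolean[] flagSamples(final double[][] counts, final double maxZerosInSamplePercentage) {
        final int numSamples = counts.length;
        final int numIntervals = counts[0].length;
        final boolean[] filterIntervals = new boolean[numIntervals];
        final boolean[] filterSamples = new boolean[numSamples];

        // Stage 1 (hypothetical): flag intervals that are zero in every sample.
        for (int j = 0; j < numIntervals; j++) {
            final int interval = j;
            filterIntervals[j] = IntStream.range(0, numSamples).allMatch(i -> counts[i][interval] == 0.);
        }

        // Stage 2 only looks at intervals that survived stage 1.
        final long numPassingIntervals = IntStream.range(0, numIntervals).filter(j -> !filterIntervals[j]).count();
        if (numPassingIntervals == 0) {
            return filterSamples; // nothing left to compare against; a choice made for this sketch
        }
        for (int i = 0; i < numSamples; i++) {
            final int sample = i;
            final double numZerosInSample = IntStream.range(0, numIntervals)
                    .filter(j -> !filterIntervals[j] && counts[sample][j] == 0.)
                    .count();
            filterSamples[i] = numZerosInSample / numPassingIntervals >= maxZerosInSamplePercentage / 100.;
        }
        return filterSamples;
    }

    public static void main(final String[] args) {
        final double[][] counts = {{0., 5., 0., 7.}, {0., 3., 4., 6.}};
        // Interval 0 is dropped in stage 1, so sample 0 has 1 zero out of 3 passing intervals.
        System.out.println(java.util.Arrays.toString(flagSamples(counts, 40.)));
        System.out.println(java.util.Arrays.toString(flagSamples(counts, 30.)));
    }
}
```

In this example sample 0 would have a zero fraction of 2/4 without the first stage but 1/3 with it, so whether it crosses a 40% threshold depends on what the earlier filter removed. Test data has to be constructed with every preceding stage in mind.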

I think that all of this filtering logic was ported over from GATK CNV (I only rewrote the code to perform the filtering in-place to improve memory usage), and I'm not sure that all edge-case behavior was well defined by the original logic, which probably implicitly assumed typical, well-formed data (i.e., more than one sample, without too many uncovered intervals). Fortunately, these edge-case usages (i.e., using a single sample to build the PoN, mistakenly including too many uncovered intervals, and/or disabling various filters) are probably not too common.

@samuelklee
Contributor Author

samuelklee commented Oct 8, 2020

Hmm, actually, let me take a second look at this. I think I got a bit confused looking back at the original forum post by the fact that two different users were posting about slightly different scenarios. I'll do some more testing of edge cases just to make sure I'm not missing anything.

Apologies, but it's been a while since I opened this, so I need to refresh my memory!

EDIT: OK, I think I understand things now and edited the previous comment to remove confusing/misleading remarks. I'm OK with merging this for this release if you are, and we can look at the NaN issue later; it doesn't seem to have caused any serious issues thus far...

@samuelklee
Contributor Author

Fixed up a few more minor things, back to you, @fleharty!

@broadinstitute broadinstitute deleted a comment from gatk-bot Oct 9, 2020
@broadinstitute broadinstitute deleted a comment from gatk-bot Oct 9, 2020
@broadinstitute broadinstitute deleted a comment from gatk-bot Oct 9, 2020
@samuelklee samuelklee force-pushed the sl_fix_read_count_pon_filters branch 2 times, most recently from 1839075 to bc0e296 Compare October 9, 2020 13:37
@samuelklee samuelklee force-pushed the sl_fix_read_count_pon_filters branch from bc0e296 to 976ddf2 Compare October 9, 2020 13:39
@gatk-bot

gatk-bot commented Oct 9, 2020

Travis reported job failures from build 31727
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk8 31727.2 logs

@broadinstitute broadinstitute deleted a comment from gatk-bot Oct 9, 2020
@broadinstitute broadinstitute deleted a comment from gatk-bot Oct 9, 2020
@broadinstitute broadinstitute deleted a comment from gatk-bot Oct 9, 2020
@broadinstitute broadinstitute deleted a comment from gatk-bot Oct 9, 2020
@samuelklee
Contributor Author

Oops, forgot to set some random seeds and got failures on Travis that were passing locally. Think it should be OK now. Good looking out Travis RNGs, you da real MVPs.
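
For context on the seed fix, a minimal sketch of seeding simulated test data so that it is identical on every machine. It uses `java.util.Random` as a stand-in; the generator, the seed value, and the method name are assumptions for illustration and are not taken from the GATK tests.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; the generator and names are ASSUMPTIONS, not the GATK test code.
final class SeededSimulationSketch {

    // Simulates n integer-valued "counts" in [0, 100) from a fixed seed.
    static double[] simulateCounts(final long seed, final int n) {
        final Random rng = new Random(seed);
        return rng.doubles(n).map(u -> Math.floor(100. * u)).toArray();
    }

    public static void main(final String[] args) {
        // The same seed reproduces the same data, so a threshold test cannot flip between runs.
        System.out.println(Arrays.equals(simulateCounts(42L, 5), simulateCounts(42L, 5)));
    }
}
```

With an unseeded generator, simulated data that happens to land near a filter threshold can pass locally and fail on CI, which is the kind of failure described above.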

@samuelklee samuelklee force-pushed the sl_fix_read_count_pon_filters branch from 6c438e0 to 88d9fcb Compare October 9, 2020 15:39
@samuelklee
Contributor Author

@fleharty can we close this out?

@samuelklee
Contributor Author

@fleharty just a reminder about this PR, let's try to get it in before next release.

@fleharty
Contributor

I think this is great, 👍

@samuelklee
Contributor Author

@fleharty thanks! I think I need an explicit approval to merge.

Contributor

@fleharty fleharty left a comment

Sorry, I didn't realize that I hadn't approved.

@samuelklee samuelklee merged commit 46ddda2 into master Nov 13, 2020
@samuelklee samuelklee deleted the sl_fix_read_count_pon_filters branch November 13, 2020 22:07