Adjusted logic for filtering zero-coverage samples and intervals in CreateReadCountPanelOfNormals. #6624

samuelklee · 2020-05-28T20:23:35Z

Just a minor fix, but could conceivably change results by keeping/dropping samples/intervals on the edge of the filter. See discussion in https://gatk.broadinstitute.org/hc/en-us/community/posts/360057785591-Error-while-running-CreateReadCountPanelOfNormals
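
To make "on the edge of the filter" concrete, here is a minimal sketch contrasting the two comparisons that appear in the diff reviewed in this PR: a strict comparison of a zero count against a precomputed maximum, and a comparison of the zero fraction against the percentage threshold. The thread does not show how `maxZerosInSample` was computed, so the sketch assumes it was a rounded-up count; the class and method names are hypothetical and are not taken from the GATK code.

```java
final class ZeroFractionBoundarySketch {
    // ASSUMPTION: the old threshold was a rounded-up count; the thread does not show its derivation.
    static int assumedMaxZeros(final int numPassingIntervals, final double percentage) {
        return (int) Math.ceil(numPassingIntervals * percentage / 100.);
    }

    // Old-style check from the diff: strict comparison of counts.
    static boolean filteredByCount(final int numZeros, final int numPassingIntervals, final double percentage) {
        return numZeros > assumedMaxZeros(numPassingIntervals, percentage);
    }

    // New-style check from the diff: compare the zero fraction against the percentage.
    static boolean filteredByFraction(final int numZeros, final int numPassingIntervals, final double percentage) {
        return (double) numZeros / numPassingIntervals >= percentage / 100.;
    }

    public static void main(final String[] args) {
        // 2 of 3 intervals are zero (about 66.7%) with a 50% threshold.
        System.out.println(filteredByCount(2, 3, 50.));    // false under the assumed rounding
        System.out.println(filteredByFraction(2, 3, 50.)); // true
    }
}
```

Under that assumption, a sample that is entirely zero with a threshold of 100 would also be kept by the count comparison and dropped by the fraction comparison.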

gatk-bot · 2020-05-28T21:07:50Z

Travis reported job failures from build 30435
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	30435.11	logs
integration	openjdk8	30435.2	logs

gatk-bot · 2020-06-14T15:44:16Z

Travis reported job failures from build 30624
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk11	30624.11	logs

samuelklee · 2020-10-07T17:10:41Z

@fleharty forgot about this short PR. Might be nice to get it in before release?

fleharty

@samuelklee Although a small change, can you add a simple test that passes with this change, and fails on previous code?

fleharty · 2020-10-08T18:50:09Z

src/main/java/org/broadinstitute/hellbender/tools/copynumber/denoising/SVDDenoisingUtils.java

                                .filter(intervalIndex -> !filterIntervals[intervalIndex] && readCounts.getEntry(sampleIndex, intervalIndex) == 0.)
                                .count();
-                        if (numZerosInSample > maxZerosInSample) {
+                        if (numZerosInSample / numPassingIntervals >= maximumZerosInSamplePercentage / 100.) {


I'm a little confused, why is numZerosInSample a double rather than an int?
If you need it to be a double so that the fraction is a double, why not cast at the point of computing the fraction?

In both cases, the cast happens in the next line and the variable is not used elsewhere, so I'm OK keeping it like this. Certainly it's valid to represent an integer with a double, at least here...?

@samuelklee It's fine the way it is.
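
For readers following the cast question above, a small sketch of why some conversion to `double` is needed before dividing. A stream's `count()` returns a `long`, and dividing it by an `int` truncates. The class and method names here are hypothetical; only the idea of storing the count as a `double` comes from the diff.

```java
final class FractionCastSketch {
    // Integer (long) division truncates, so any proper fraction becomes 0.
    static double truncated(final long numZeros, final int numTotal) {
        return numZeros / numTotal;
    }

    // Option suggested in review: cast at the point of computing the fraction.
    static double castAtDivision(final long numZeros, final int numTotal) {
        return (double) numZeros / numTotal;
    }

    // Option kept in the PR: hold the count in a double, then divide.
    static double countAsDouble(final long numZeros, final int numTotal) {
        final double numZerosAsDouble = (double) numZeros;
        return numZerosAsDouble / numTotal;
    }

    public static void main(final String[] args) {
        System.out.println(truncated(2L, 3));      // 0.0
        System.out.println(castAtDivision(2L, 3)); // about 0.667
        System.out.println(countAsDouble(2L, 3));  // same value as the line above
    }
}
```

The last two variants give the same result, which is consistent with the reviewers' conclusion that either placement of the cast is fine here.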

samuelklee · 2020-10-08T21:54:26Z

Hmm, thanks for suggesting the addition of a regression test @fleharty. This caused me to realize that I actually missed another gap in the previous filtering logic that might have yielded NaNs (resulting from division by zero interval medians) in this particular edge case, which actually takes effect before the rounding error I originally fixed.

However, because of how HDF5 writes NaN values as 0, this apparently doesn't lead to any catastrophic failures. We should definitely check that behavior is reasonable in this case (i.e., when interval medians are zero); I've filed #6878.
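
As a rough illustration of the NaN edge case described above, the following sketch divides a column of fractional coverages by its median. It is not the GATK implementation; the class and method names are hypothetical, and the input is assumed to be non-empty. It only shows what IEEE 754 floating-point division does when the median is zero.

```java
import java.util.Arrays;

final class ZeroMedianSketch {
    // Median of a non-empty array (ASSUMPTION: callers never pass an empty array).
    static double median(final double[] values) {
        final double[] sorted = values.clone();
        Arrays.sort(sorted);
        final int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.;
    }

    // Divides every value by the median, with no guard against a zero median.
    static double[] divideByMedian(final double[] values) {
        final double median = median(values);
        return Arrays.stream(values).map(v -> v / median).toArray();
    }

    public static void main(final String[] args) {
        // Median is 0, so 0/0 gives NaN and 0.2/0 gives Infinity.
        System.out.println(Arrays.toString(divideByMedian(new double[] {0., 0., 0.2})));
    }
}
```

A guard that filters or skips intervals whose median is zero before dividing would avoid both values; whether and where such a guard belongs is the question tracked in #6878.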

In the end, I added a regression test that only passes with the changes to address the rounding error. This was a bit of a pain because we use simulated data in the tests that cover this code, and the filters are applied in sequential order only on those elements that passed the previous filter. Note that there are many other possible filtering combinations that would be impractical to test.
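
The point about sequential filtering can be sketched with boolean masks: a later filter only looks at elements that an earlier filter left unmarked, so the order changes the outcome. This is a simplified, hypothetical two-step version (an interval filter followed by a sample filter), not the actual sequence of filters in the tool; all names and thresholds are assumptions.

```java
final class SequentialFilterSketch {
    // Marks intervals (columns) whose fraction of zeros across all samples is >= maxZeroFraction.
    static boolean[] filterIntervals(final double[][] counts, final double maxZeroFraction) {
        final int numSamples = counts.length;
        final int numIntervals = counts[0].length;
        final boolean[] filtered = new boolean[numIntervals];
        for (int j = 0; j < numIntervals; j++) {
            int numZeros = 0;
            for (int i = 0; i < numSamples; i++) {
                if (counts[i][j] == 0.) {
                    numZeros++;
                }
            }
            filtered[j] = (double) numZeros / numSamples >= maxZeroFraction;
        }
        return filtered;
    }

    // Marks samples (rows) whose fraction of zeros among the still-passing intervals is >= maxZeroFraction.
    static boolean[] filterSamples(final double[][] counts, final boolean[] intervalFiltered,
                                   final double maxZeroFraction) {
        final boolean[] filtered = new boolean[counts.length];
        int numPassing = 0;
        for (final boolean f : intervalFiltered) {
            if (!f) {
                numPassing++;
            }
        }
        if (numPassing == 0) {
            return filtered; // nothing left to evaluate; leave all samples unmarked
        }
        for (int i = 0; i < counts.length; i++) {
            int numZeros = 0;
            for (int j = 0; j < intervalFiltered.length; j++) {
                if (!intervalFiltered[j] && counts[i][j] == 0.) {
                    numZeros++;
                }
            }
            filtered[i] = (double) numZeros / numPassing >= maxZeroFraction;
        }
        return filtered;
    }
}
```

With the matrix `{{0, 5, 6, 7}, {0, 4, 0, 8}, {0, 3, 2, 9}}`, an interval threshold of 1.0 marks only the first interval. The second sample then has 1 zero in 3 passing intervals and survives a sample threshold of 0.5, whereas without the interval mask it would have 2 zeros in 4 intervals and be dropped. Interactions like this are why a regression test for a single edge case needs carefully constructed data.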

I think that all of this filtering logic was ported over from GATK CNV (I only rewrote the code to perform the filtering in-place to improve memory usage), and I'm not sure that all edge-case behavior was well defined by the original logic (which probably implicitly assumed typical, well-formed data, i.e., using more than one sample, without too many uncovered intervals). Fortunately, these edge-case usages (i.e., using a single sample to build the PoN, mistakenly including too many uncovered intervals, and/or disabling various filters) are probably not too common.

samuelklee · 2020-10-08T23:07:09Z

Hmm, actually, let me take a second look at this. I think I got a bit confused looking back at the original forum post by the fact that two different users were posting about slightly different scenarios. I'll do some more testing of edge cases just to make sure I'm not missing anything.

Apologies, but it's been a while since I opened this, so I need to refresh my memory!

EDIT: OK, I think I understand things now and edited the previous comment to remove confusing/misleading remarks. I'm OK with merging this for this release if you are, and we can look at the NaN issue later; it doesn't seem to have caused any serious issues thus far...

samuelklee · 2020-10-09T02:36:08Z

Fixed up a few more minor things, back to you, @fleharty!


gatk-bot · 2020-10-09T15:02:18Z

Travis reported job failures from build 31727
Failures in the following jobs:

Test Type	JDK	Job ID	Logs
integration	openjdk8	31727.2	logs

samuelklee · 2020-10-09T15:39:10Z

Oops, forgot to set some random seeds and got failures on Travis that were passing locally. Think it should be OK now. Good looking out Travis RNGs, you da real MVPs.
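
For context on why unseeded shuffles can pass locally and fail in CI, here is a minimal sketch of a reproducible shuffle using standard JDK calls. The class name, method name, and seed value are hypothetical and are not taken from the GATK tests.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class SeededShuffleSketch {
    // Returns the indices 0..n-1 in an order fully determined by the seed.
    static List<Integer> shuffledIndices(final int n, final long seed) {
        final List<Integer> indices = IntStream.range(0, n)
                .boxed()
                .collect(Collectors.toCollection(ArrayList::new));
        Collections.shuffle(indices, new Random(seed));
        return indices;
    }

    public static void main(final String[] args) {
        // The same seed gives the same order on every run.
        System.out.println(shuffledIndices(10, 42L).equals(shuffledIndices(10, 42L))); // true
    }
}
```

Calling `Collections.shuffle(list)` without a `Random` argument uses a default source of randomness, so a test whose outcome depends on the resulting order may behave differently from run to run.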

samuelklee · 2020-10-23T11:55:22Z

@fleharty can we close this out?

samuelklee · 2020-11-13T15:10:31Z

@fleharty just a reminder about this PR, let's try to get it in before next release.

fleharty · 2020-11-13T19:34:17Z

I think this is great, 👍

samuelklee · 2020-11-13T19:45:45Z

@fleharty thanks! I think I need an explicit approval to merge.

fleharty

Sorry, I didn't realize that I didn't approve.

samuelklee force-pushed the sl_fix_read_count_pon_filters branch from 0192a6d to b079c79 Compare June 14, 2020 15:08

samuelklee force-pushed the sl_fix_read_count_pon_filters branch from b079c79 to a8b9577 Compare October 7, 2020 17:10

samuelklee requested a review from fleharty October 7, 2020 17:10

droazen assigned fleharty Oct 8, 2020

fleharty requested changes Oct 8, 2020

samuelklee force-pushed the sl_fix_read_count_pon_filters branch from a8b9577 to 7c75993 Compare October 8, 2020 21:29

samuelklee force-pushed the sl_fix_read_count_pon_filters branch 5 times, most recently from 2f78de3 to 766a160 Compare October 9, 2020 02:12

samuelklee mentioned this pull request Oct 9, 2020

Double check effect of zero fractional-coverage interval medians in somatic CNV denoising. #6878

Open

samuelklee force-pushed the sl_fix_read_count_pon_filters branch from 766a160 to 240cc31 Compare October 9, 2020 02:32

broadinstitute deleted a comment from gatk-bot Oct 9, 2020

Adjusted logic for filtering zero-coverage samples and intervals in CreateReadCountPanelOfNormals.

d2f5a1f

samuelklee force-pushed the sl_fix_read_count_pon_filters branch 2 times, most recently from 1839075 to bc0e296 Compare October 9, 2020 13:37

Added regression tests and logging.

976ddf2

samuelklee force-pushed the sl_fix_read_count_pon_filters branch from bc0e296 to 976ddf2 Compare October 9, 2020 13:39

broadinstitute deleted a comment from gatk-bot Oct 9, 2020

Set random seeds for shuffles.

88d9fcb

samuelklee force-pushed the sl_fix_read_count_pon_filters branch from 6c438e0 to 88d9fcb Compare October 9, 2020 15:39

fleharty approved these changes Nov 13, 2020

samuelklee merged commit 46ddda2 into master Nov 13, 2020

samuelklee deleted the sl_fix_read_count_pon_filters branch November 13, 2020 22:07
