
Added a second layer of deconvolution for pairs that were causing problems in MarkDuplicatesSpark #4878

Merged: 8 commits, Jun 14, 2018

Conversation

jamesemery (Collaborator):

Here is the non-string key solution.

jamesemery requested a review from lbergelson on June 11, 2018 at 17:13
codecov-io commented Jun 11, 2018:

Codecov Report

Merging #4878 into master will increase coverage by 0.082%.
The diff coverage is 90.217%.

@@               Coverage Diff               @@
##              master     #4878       +/-   ##
===============================================
+ Coverage     80.425%   80.507%   +0.082%     
- Complexity     17821     18020      +199     
===============================================
  Files           1089      1090        +1     
  Lines          64159     64937      +778     
  Branches       10344     10510      +166     
===============================================
+ Hits           51600     52279      +679     
- Misses          8498      8569       +71     
- Partials        4061      4089       +28
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...s/read/markduplicates/sparkrecords/PairedEnds.java | 100% <ø> (ø) | 1 <0> (ø) ⬇️ |
| ...roadinstitute/hellbender/utils/read/ReadUtils.java | 80% <100%> (+0.142%) | 202 <3> (+3) ⬆️ |
| ...tools/spark/validation/CompareDuplicatesSpark.java | 84.946% <100%> (+0.502%) | 24 <3> (ø) ⬇️ |
| ...itute/hellbender/engine/spark/GATKRegistrator.java | 100% <100%> (ø) | 3 <0> (ø) ⬇️ |
| ...icates/sparkrecords/MarkDuplicatesSparkRecord.java | 100% <100%> (ø) | 7 <3> (ø) ⬇️ |
| ...ils/read/markduplicates/sparkrecords/Fragment.java | 93.75% <100%> (-6.25%) | 7 <1> (-4) |
| ...r/utils/read/markduplicates/sparkrecords/Pair.java | 97.297% <100%> (+0.631%) | 27 <1> (-2) ⬇️ |
| .../read/markduplicates/sparkrecords/Passthrough.java | 100% <100%> (ø) | 3 <0> (ø) ⬇️ |
| ...ead/markduplicates/sparkrecords/EmptyFragment.java | 78.571% <80%> (-10.317%) | 5 <1> (-4) |
| ...forms/markduplicates/MarkDuplicatesSparkUtils.java | 90.521% <87.097%> (+1.024%) | 63 <10> (+1) ⬆️ |

... and 10 more

@@ -338,6 +339,12 @@ private static long getGroupKey(MarkDuplicatesSparkRecord record) {
}
droazen (Contributor), Jun 11, 2018:

Are you checking the sequence dictionary on startup to make sure that we don't have > 64k contigs? If not, you should!

Collaborator:

Can this method be made an instance method on MarkDuplicatesSparkRecord?

jamesemery (Collaborator, Author):

I added some checks for libraries and contigs so an exception is thrown in the case where we expect the keys to break.
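
A minimal sketch of what such a startup check could look like (assuming htsjdk's SAMFileHeader API; the bound and message are illustrative, not the PR's exact code):

    // Fail fast if contig indexes cannot fit in the bits reserved for them in the key.
    private static void validateSequenceDictionarySize(final SAMFileHeader header) {
        final int numContigs = header.getSequenceDictionary().size();
        if (numContigs > 65535) { // 2^16 - 1, assuming a 16-bit contig field
            throw new GATKException("MarkDuplicatesSpark supports at most 65535 contigs, but the "
                    + "sequence dictionary contains " + numContigs);
        }
    }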

@@ -338,6 +339,12 @@ private static long getGroupKey(MarkDuplicatesSparkRecord record) {
}
}

// Note, this uses bitshift operators in order to perform only a single groupBy operation for all the merged data
private static long getGroupsForPairs(Pair record) {
Contributor:

final
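
As a sketch of the idea behind the bitshift note above (not the PR's actual pipeline code; records stands in for the upstream JavaRDD<MarkDuplicatesSparkRecord>), packing all grouping information into one long key lets a single mapToPair/groupByKey pass handle all the merged data:

    // Sketch: with every record's grouping information packed into one long,
    // fragments and pairs can be grouped together in a single pass.
    final JavaPairRDD<Long, Iterable<MarkDuplicatesSparkRecord>> grouped =
            records.mapToPair(record -> new Tuple2<>(getGroupKey(record), record))
                   .groupByKey();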

@@ -313,8 +314,8 @@ public String toString() {
nonDuplicates.add(bestFragment);
}

if (Utils.isNonEmpty(pairs)) {
nonDuplicates.add(handlePairs(pairs, finder));
for (List<Pair> pairList : pairsStratified) {
Contributor:

final?

args.addInput(getTestFile("hashCollisionedReads.bam"));
runCommandLine(args);

try ( final ReadsDataSource outputReadsSource = new ReadsDataSource(output.toPath()) ) {
Contributor:

Add a comment about this; it's not clear that this bam has reads that have a hash collision.

fleharty (Contributor) left a comment:

@jamesemery Check the read groups. Check to see if there are more than 64k contigs and throw an exception.

@@ -187,6 +187,25 @@ public void testSupplementaryReadUnmappedMate() {
}
}

@Test
public void testHashCollisionHandling() {
File output = createTempFile("supplementaryReadUnmappedMate", "bam");
Contributor:

final

jamesemery (Collaborator, Author):

I'm posting an experimental version of this branch for performance purposes. This is no longer in a state where it should be merged.

jamesemery force-pushed the je_fixHashDeconvolution branch from cd8f5b9 to 25f068b on June 12, 2018 at 20:52
jamesemery (Collaborator, Author):

@lbergelson I updated this branch with the new key representation. After some performance runs it appears that these lead to a slightly faster mapping operation and approximately 15% less serialization for the step where they are used. Can you take a look?

lbergelson (Member) left a comment:

@jamesemery a whole pile of nitpicks. 👍 when they're resolved.

*/
public static Map<String, Byte> constructLibraryIndex(final SAMFileHeader header) {
final List<String> discoveredLibraries = header.getReadGroups().stream()
.map(r -> {String library = r.getLibrary(); return library==null? LibraryIdGenerator.UNKNOWN_LIBRARY : library;} )
lbergelson (Member):

use line breaks when writing a multiline lambda
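
For reference, one way to break that lambda across lines (a sketch of the requested style, not the committed code):

    final List<String> discoveredLibraries = header.getReadGroups().stream()
            .map(rg -> {
                // fall back to the unknown-library sentinel when a read group has no library set
                final String library = rg.getLibrary();
                return library == null ? LibraryIdGenerator.UNKNOWN_LIBRARY : library;
            })
            .distinct()
            .collect(Collectors.toList());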

.distinct()
.collect(Collectors.toList());
if (discoveredLibraries.size() > 255) {
throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries");
lbergelson (Member):

include how many were detected in the error message

/**
* Method which generates a map of the readgroups from the header so they can be serialized as indexes
*/
private static Map<String, Short> getHeaderReadGroupIndexMap(final SAMFileHeader header) {
final List<SAMReadGroupRecord> readGroups = header.getReadGroups();
if (readGroups.size() > ((int)Short.MAX_VALUE)*2) {
throw new GATKException("Detected too many read groups in the header, currently MarkDuplciatesSpark only supports up to 65536 unique readgroup IDs");
lbergelson (Member):

I think there's an off-by-one here: Short.MAX_VALUE * 2 = 32767 * 2 = 65534, but the message says 65536. I think you meant to use 2^16 - 1 = 65535.
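
For what it's worth, if the index is ultimately stored as an unsigned 16-bit value, 2^16 = 65536 distinct IDs (0 through 65535) fit. A sketch of a bound that matches the error message, with the caveat that the right limit depends on how the index is consumed downstream:

    private static final int MAX_READ_GROUPS = 1 << 16; // 65536 distinct unsigned 16-bit values

    if (readGroups.size() > MAX_READ_GROUPS) {
        throw new GATKException("Detected too many read groups in the header: MarkDuplicatesSpark"
                + " supports up to " + MAX_READ_GROUPS + " unique read group IDs, but "
                + readGroups.size() + " were found");
    }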

lbergelson (Member):

It's always a good idea to include how many were detected in the error

*/
public static String keyForRead(final GATKRead read) {
return read.getName();
public static class keyForFragment extends ReadsKey {
lbergelson (Member):

capitalize


Passthrough(GATKRead read, int partitionIndex) {
super(partitionIndex, read.getName());

// use a
lbergelson (Member):

the mystery! use a what?


@Override
public int hashCode() {

lbergelson (Member):

extra line here


lbergelson (Member):

too many line breaks, save our precious line breaks

@@ -187,6 +187,25 @@ public void testSupplementaryReadUnmappedMate() {
}
}

@Test
public void testHashCollisionHandling() {
File output = createTempFile("supplementaryReadUnmappedMate", "bam");
lbergelson (Member):

I think this test is out of date now.

lbergelson (Member):

Could you add tests that generate some keys from the different reads and show that they are not equal for non-equal reads? Nothing too fancy.
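
A minimal sketch of such a test; ArtificialReadUtils is GATK's test helper, and keyForFragment stands in for whichever key factory the PR actually exposes:

    @Test
    public void testNonEqualReadsProduceNonEqualKeys() {
        final SAMFileHeader header = ArtificialReadUtils.createArtificialSamHeader();
        // two reads that differ only in alignment start must not share a key
        final GATKRead read1 = ArtificialReadUtils.createArtificialRead(header, "read1", 0, 1000, 100);
        final GATKRead read2 = ArtificialReadUtils.createArtificialRead(header, "read2", 0, 2000, 100);
        Assert.assertNotEquals(keyForFragment(header, read1), keyForFragment(header, read2));
    }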

jamesemery (Collaborator, Author):

It's not strictly out of date; it's still testing a potential source of bugs given the old hashing scheme. I don't see why not to keep it.

lbergelson (Member):

Fair enough.

}


private static long longKeyForFragment(int strandedUnclippedStart, boolean reverseStrand, int referenceIndex, byte library) {
lbergelson (Member):

It still seems weird to encode these as longs when we could either A) keep them as the individual values and let serialization handle it, or B) convert them to a byte array and save a few bits. We can revisit later if we think it would help I guess. Metrics seem higher priority than shaving a few more bits here.

jamesemery (Collaborator, Author):

I'll add it to the list of things to check when this tool is tied out.

lbergelson (Member):

Do we have a list written somewhere?

lbergelson (Member) left a comment:

@jamesemery some very minor comments and then 👍 to merge

{header, createTestRead("name1", "1", 1010, "10S90M", library1.getReadGroupId(), false),
createTestRead("name1", "1", 1200, "100M", library1.getReadGroupId(), true),
true,
createTestRead("name2", "1", 1000, "100M", library1.getReadGroupId(), false),
lbergelson (Member):

add a case for read 2 on a different contig, otherwise the same
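
That is, a sketch reusing the surrounding createTestRead helper, with only the contig changed so the pair key must differ:

    createTestRead("name2", "2", 1200, "100M", library1.getReadGroupId(), true),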

.distinct()
.collect(Collectors.toList());
if (discoveredLibraries.size() > 255) {
throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries");
throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries but " + discoveredLibraries.size() + " were found");
lbergelson (Member):

typo: MarkDuplciates

}
}


// Helper methods for generating summary longs
private static long longKeyForFragment(int strandedUnclippedStart, boolean reverseStrand, int referenceIndex, byte library) {
return (((long)strandedUnclippedStart) << 32) |
lbergelson (Member):

Could you add a comment explaining why you need to mask these? It wasn't obvious to us at first, so it probably won't be obvious when we come back to look at it later.
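
Something along these lines would capture it (the field widths here are illustrative; sign extension is the point):

    // Java sign-extends when widening a negative int or byte to long, filling the
    // upper bits with 1s and clobbering fields already packed above, so every field
    // except the topmost must be masked down to its intended width.
    private static long longKeyForFragment(final int strandedUnclippedStart, final boolean reverseStrand,
                                           final int referenceIndex, final byte library) {
        return (((long) strandedUnclippedStart) << 32) |
               (((long) referenceIndex & 0xFFFF) << 16) | // mask: a negative index would sign-extend
               ((reverseStrand ? 1L : 0L) << 8) |
               (((long) library) & 0xFF);                 // mask: bytes are signed in Java
    }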

@@ -14,8 +13,8 @@
Passthrough(GATKRead read, int partitionIndex) {
super(partitionIndex, read.getName());

// use a
this.key = ReadsKey.hashKeyForRead(read);
// use a hash key here instead of a normal key because collisions don't matter here
lbergelson (Member):

👍

jamesemery merged commit ba62c2a into master on Jun 14, 2018