Fixed MarkDuplicatesSpark handling of unsorted bams #4732
Conversation
* this should match the results of SAMRecordQueryNameComparator exactly
* it operates on GATKRead instead of SAMRecord
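For context, a comparator with those two properties might look roughly like this (a simplified sketch, not the PR's actual class; the real htsjdk SAMRecordQueryNameComparator applies further tie-breaks, e.g. on negative-strand and secondary/supplementary flags, which an exact match would also need to reproduce):

```java
import java.io.Serializable;
import java.util.Comparator;
import org.broadinstitute.hellbender.utils.read.GATKRead;

// Simplified sketch: compare by read name, then put first-of-pair before
// second-of-pair. The htsjdk comparator has additional tie-breaks that
// an exact port must implement as well.
public final class QueryNameComparatorSketch implements Comparator<GATKRead>, Serializable {
    private static final long serialVersionUID = 1L;

    @Override
    public int compare(final GATKRead first, final GATKRead second) {
        final int byName = first.getName().compareTo(second.getName());
        if (byName != 0) {
            return byName;
        }
        if (first.isPaired() && second.isPaired()) {
            if (first.isFirstOfPair() && second.isSecondOfPair()) {
                return -1;
            }
            if (first.isSecondOfPair() && second.isFirstOfPair()) {
                return 1;
            }
        }
        return 0;
    }
}
```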
Compile warning:
@jamesemery Back to you with a few requests. Looks good, I think.
```java
    sortedReadsForMarking = reads;
} else {
    headerForTool.setSortOrder(SAMFileHeader.SortOrder.queryname);
    sortedReadsForMarking = ReadsSparkSource.putPairsInSamePartition(headerForTool, SparkUtils.querynameSortReads(reads, numReducers), new JavaSparkContext(reads.context()));
```
Pull the sort onto its own line. It's not a great idea to hide really expensive operations inline with other calls.
I might extract this whole sorting operation into a function, "queryNameSortReadsIfNecessary"
done
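For reference, the extracted helper suggested above might look something like this (a sketch assembled from the calls already in this diff; the method name comes from the review comment, and the body may differ from what was actually merged — it also uses the `JavaSparkContext.fromSparkContext` factory suggested later in this review):

```java
// Sketch: sort by queryname only when the header says the input is not
// already queryname sorted, then repair pairs split across partition edges.
private static JavaRDD<GATKRead> queryNameSortReadsIfNecessary(final JavaRDD<GATKRead> reads,
                                                               final SAMFileHeader headerForTool,
                                                               final int numReducers) {
    if (headerForTool.getSortOrder() == SAMFileHeader.SortOrder.queryname) {
        return reads; // already name-grouped; skip the expensive shuffle
    }
    headerForTool.setSortOrder(SAMFileHeader.SortOrder.queryname);
    final JavaRDD<GATKRead> sortedReads = SparkUtils.querynameSortReads(reads, numReducers);
    return ReadsSparkSource.putPairsInSamePartition(headerForTool, sortedReads,
            JavaSparkContext.fromSparkContext(reads.context()));
}
```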
```java
@Test
// Test that asserts the duplicate marking is sorting agnostic, specifically this is testing that when reads are scrambled across
// partitions in the input that all reads in a group are getting properly duplicate marked together as they are for queryname sorted bams
public void testSortOrderParitioningCorrectness() throws IOException {
```
typo paritioning -> partitioning
done
```java
} else {
    readVoidPairs = rddReadPairs.sortByKey(comparator);
}
return readVoidPairs.keys();
```
This method should call the edge fixing method. We don't want to give people the option to do it wrong.
done
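That is, the tail of the sort method would fold the edge fixing in directly, so no caller can skip it (a sketch using names from this diff; the surrounding `header` variable is assumed to be in scope):

```java
// Sketch: return the keys only after putPairsInSamePartition has run,
// so every caller gets partition-safe, queryname-sorted output.
final JavaRDD<GATKRead> sortedReads = readVoidPairs.keys();
return ReadsSparkSource.putPairsInSamePartition(header, sortedReads,
        JavaSparkContext.fromSparkContext(sortedReads.context()));
```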
```java
JavaPairRDD<String, IndexPair<GATKRead>> keyReadPairs = indexedReads.mapToPair(read -> new Tuple2<>(ReadsKey.keyForRead(
        read.getValue()), read));
keyedReads = keyReadPairs.groupByKey(numReducers);
throw new GATKException("MarkDuplicatesSparkUtils.mark() requires input reads to be queryname sorted, yet the header indicated otherwise");
```
Could you have it print the sort order it thinks it's in?
done
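The improved message only needs to interpolate the header's sort order, along these lines (a sketch; the exact wording in the merged code may differ, and `header` is the assumed local variable):

```java
throw new GATKException("MarkDuplicatesSparkUtils.mark() requires input reads to be queryname sorted, " +
        "yet the header indicated the sort order was: " + header.getSortOrder());
```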
```java
public void testSortOrderParitioningCorrectness() throws IOException {

    JavaSparkContext ctx = SparkContextFactory.getTestSparkContext();
    JavaRDD<GATKRead> unsortedReads = generateUnsortedReads(10000,3, ctx, 100, true);
```
stupid nitpick: spaces here are wonky, and on the next line
```java
        samRecordSetBuilder.addPair("READ" + readNameCounter++, 0, start1, start2);
    }
}
final ReadCoordinateComparator coordinateComparitor = new ReadCoordinateComparator(hg19Header);
```
coordinateComparitor is unused, and misspelled.
Yeah, it really does have a poor lot in life, doesn't it?
```java
    }
}

private JavaRDD<GATKRead> generateUnsortedReads(int numReadGroups, int numDuplicatesPerGroup, JavaSparkContext ctx, int numPartitions, boolean coordinate) {
```
these are sorted... rename to generateReadsWithDuplicates
or something like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
could you add a bit of javadoc to this method explaining what it does
done
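The requested javadoc might read something like the following (a sketch inferred from the method's signature and the test code above, not the author's actual wording):

```java
/**
 * Generates an RDD of artificial read pairs containing deliberate duplicates
 * for testing duplicate marking: numReadGroups duplicate groups, each with
 * numDuplicatesPerGroup pairs at the same position, distributed across
 * numPartitions partitions. If coordinate is true the reads are emitted in
 * coordinate-sorted order; otherwise they are grouped by queryname.
 */
```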
```java
sortedHeader.setSortOrder(SAMFileHeader.SortOrder.queryname);

// Using the header flagged as unsorted will result in the reads being sorted again
JavaRDD<GATKRead> unsortedReadsMarked = MarkDuplicatesSpark.mark(unsortedReads,unsortedHeader, MarkDuplicatesScoringStrategy.SUM_OF_BASE_QUALITIES,new OpticalDuplicateFinder(),100,true);
```
this is called unsorted, but isn't it actually coordinate sorted?
Why the different num reducers? Is that to find issues with edge fixing? If it is, I think you'd be better off with a specific (and possibly similar) test for that. Since we're always generating pairs, it seems to me that they might never get split across partitions if we're creating an even number of partitions.
So the reason for the different numbers of partitions was that when I first wrote this test there was no exposed way to do the edge fixing for a queryname sorted bam. I didn't want to deal with the problems of having a mispartitioned bam, so I let the queryname sorted reads reside on one partition so the spanning couldn't be wrong. Since this is a test of the coordinate sorted bam marking across partitions and not the edge fixing, I'm not worried.
These changes look good. 👍 When tests pass.
```java
} else {
    headerForTool.setSortOrder(SAMFileHeader.SortOrder.queryname);
    JavaRDD<GATKRead> sortedReads = SparkUtils.querynameSortReads(reads, numReducers);
    sortedReadsForMarking = ReadsSparkSource.putPairsInSamePartition(headerForTool, sortedReads, new JavaSparkContext(reads.context()));
```
@jamesemery I would use JavaSparkContext.fromSparkContext instead of new JavaSparkContext. I think it's the same, but it might change in future sparks.
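Concretely, the suggested swap is one line (sketch):

```java
// Before: new JavaSparkContext(reads.context())
// After: the static factory, which wraps the existing SparkContext
// rather than appearing to construct a new one
final JavaSparkContext ctx = JavaSparkContext.fromSparkContext(reads.context());
```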
Codecov Report
```
@@             Coverage Diff              @@
##             master     #4732     +/-   ##
===============================================
- Coverage     79.977%   79.954%   -0.023%
- Complexity     17397     17524      +127
===============================================
  Files           1080      1081        +1
  Lines          63093     63963      +870
  Branches       10179     10420      +241
===============================================
+ Hits           50460     51141      +681
- Misses          8648      8771      +123
- Partials        3985      4051       +66
```
Added a global sort to the beginning of the tool to ensure we are always working with name-grouped bams. In the future we should evaluate whether alternatives that avoid sorting are necessary.
Fixes #4701