Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated MarkDuplicates code to rely on the Picard OpticalDuplicatesFinder #4750

Merged
merged 13 commits into from
May 15, 2018

Conversation

jamesemery
Copy link
Collaborator

Fixes #4700

@jamesemery jamesemery requested a review from lbergelson May 8, 2018 19:10
@@ -86,6 +92,9 @@ public int getIndex() {

final JavaPairRDD<String, Iterable<IndexPair<GATKRead>>> keyedReads = getReadsGroupedByName(header, mappedReads, numReducers);

//TODO why in the world does a JavaRDD return a normal sparkcontext instead of the nice wrapped one...?
final Broadcast<Map<String, Short>> headerReadGroupIndexMap = new JavaSparkContext(reads.context()).broadcast( getHeaderReadGroupIndexMap(header));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use JavaSparkContext.fromSparkContext instead I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

prevIdx = i + 1;
}
}
if (prevIdx < readName.length()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAY deleting code!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huzzah!


@Override
public void setY(final short y) { this.y = y; }
public void setY(final int y) { this.y = (short)y; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add checks here that it's not > maxshort.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


/**
* An intermediate class flagged as being serializable so that the Picard OpticalDuplicateFinder can be serialized despite
* it not being marked as such. There is not state outside of construction at the time of creating this pr.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the second sentence is more confusing than helpful

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery A few minor changes and typos. Looks good though. Surprisingly little change.

@@ -142,11 +142,11 @@
/**
* Little class to hold the sequence of a pair of reads and tile location information.
*/
static class PairedReadSequence implements OpticalDuplicateFinder.PhysicalLocation {
static class PairedReadSequence implements picard.sam.util.PhysicalLocation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the full path here?

@@ -1,7 +1,7 @@
package org.broadinstitute.hellbender.utils.read.markduplicates;

/** Little struct-like class to hold read pair (and fragment) end data for duplicate marking. */
public abstract class ReadEnds implements OpticalDuplicateFinder.PhysicalLocation {
public abstract class ReadEnds implements picard.sam.util.PhysicalLocation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shorten the path?

@@ -20,7 +19,7 @@
* during the processing step of MarkDuplicatesSpark
*/
@DefaultSerializer(Pair.Serializer.class)
public final class Pair extends PairedEnds implements OpticalDuplicateFinder.PhysicalLocation {
public final class Pair extends PairedEnds implements picard.sam.util.PhysicalLocation {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

short path?

@@ -225,6 +230,11 @@ public void addMatePair(final String readName,
record2.setAttribute("MC", null);
}

if (readGorup!=null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: readGorup

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public void testOpticalDuplicatesDifferentReadGroups() {
final AbstractMarkDuplicatesTester tester = getTester();
tester.setExpectedOpticalDuplicate(0);
tester.addMatePair("RUNID:7:1203:2886:82292", 19, 19, 485253, 485253, false, false, true, true, "42M59S", "59S42M", true, false, false, false, false, DEFAULT_BASE_QUALITY, "H0164.2"); // duplicate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate these 600 parameter methods... but I'm not sure what to do about them

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a bunch of overloads that help the problem. For the most par every field is necessary to at least one of these tests.

@lbergelson lbergelson assigned jamesemery and unassigned lbergelson May 9, 2018
@jamesemery
Copy link
Collaborator Author

@lbergelson Responded to your comments.

@jamesemery jamesemery assigned lbergelson and unassigned jamesemery May 9, 2018
@codecov-io
Copy link

codecov-io commented May 9, 2018

Codecov Report

Merging #4750 into master will increase coverage by 0.069%.
The diff coverage is 87.037%.

@@               Coverage Diff               @@
##              master     #4750       +/-   ##
===============================================
+ Coverage     80.052%   80.121%   +0.069%     
- Complexity     17413     17492       +79     
===============================================
  Files           1080      1080               
  Lines          63099     63440      +341     
  Branches       10196     10275       +79     
===============================================
+ Hits           50512     50829      +317     
- Misses          8597      8607       +10     
- Partials        3990      4004       +14
Impacted Files Coverage Δ Complexity Δ
...transforms/markduplicates/MarkDuplicatesSpark.java 95.918% <ø> (ø) 16 <0> (ø) ⬇️
...llections/OpticalDuplicatesArgumentCollection.java 100% <ø> (ø) 1 <0> (ø) ⬇️
...icates/sparkrecords/MarkDuplicatesSparkRecord.java 100% <ø> (ø) 7 <0> (ø) ⬇️
...tractOpticalDuplicateFinderCommandLineProgram.java 100% <ø> (ø) 4 <0> (ø) ⬇️
.../markduplicates/EstimateLibraryComplexityGATK.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...hellbender/utils/read/markduplicates/ReadEnds.java 92% <100%> (ø) 15 <5> (ø) ⬇️
...r/utils/read/markduplicates/sparkrecords/Pair.java 93.333% <100%> (+1.199%) 23 <7> (+1) ⬆️
...forms/markduplicates/MarkDuplicatesSparkUtils.java 90% <100%> (+0.36%) 69 <8> (+4) ⬆️
...ates/AbstractMarkDuplicatesCommandLineProgram.java 93.333% <100%> (ø) 24 <4> (ø) ⬇️
...k/pipelines/BwaAndMarkDuplicatesPipelineSpark.java 77.778% <100%> (ø) 4 <0> (ø) ⬇️
... and 13 more

@jamesemery jamesemery force-pushed the je_opticalDuplicateFinding branch from abc1ca6 to 8971ef6 Compare May 10, 2018 14:09
@jamesemery
Copy link
Collaborator Author

@lbergelson Is this branch good to go?

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery Looks like there are a few leftover bits from the short era that need to be updated

@Override
public void setX(final short x) { this.x = x; }
public void setX(final int x) { this.x = (short)x; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you forgot to remove the casts to short when you updated this to int

@@ -24,7 +25,7 @@
// Needed for testing
public void setupOpticalDuplicateFinder() {
this.opticalDuplicateFinder = new OpticalDuplicateFinder(opticalDuplicatesArgumentCollection.READ_NAME_REGEX,
opticalDuplicatesArgumentCollection.OPTICAL_DUPLICATE_PIXEL_DISTANCE, logger);
opticalDuplicatesArgumentCollection.OPTICAL_DUPLICATE_PIXEL_DISTANCE,null);//TODO logger firgure out, logger);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this still has this typo

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson right, so i don't know the answer to this. Switiching over to the picard version of the optical duplicate finder means that we are no longer processing optical duplicates with our logger but the picard one, which happens to not be serializable. Should I just give it a null logger and call it a day then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null logger is fine for now . Could you open a ticket to enable us to pass a logger into it so we remember to look at it again?


@Override
public void setY(final short y) { this.y = y; }
public void setY(final int y) { this.y = (short) y; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover cast to short here

@@ -34,7 +34,7 @@
private final int score;

// Information used to detect optical dupes
private transient short readGroup = -1;
private short readGroupIndex = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a line break between this and the transient ones?

@@ -92,7 +92,7 @@ private Pair(Kryo kryo, Input input){
super(input.readInt(true), input.readString());

// Information used to detect optical dupes
readGroup = -1;
readGroupIndex = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line should be removed?

@Override
public void setY(final short y) { this.y = y; }
public void setY(final int y) { this.y = (short)y; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@lbergelson
Copy link
Member

…ances in that test should not have been considered close...
@jamesemery
Copy link
Collaborator Author

jamesemery commented May 11, 2018

@lbergelson Yeah, so the test was broken because it was assuming that the pairedEnds y value was getting truncated. Now that test is running on two reads that are ACTUALLY close enough for comparison.

final Short readGroup = headerReadGroupIndexMap.getValue().get(firstRead.getValue().getReadGroup());
final Pair pair = MarkDuplicatesSparkRecord.newPair(firstRead, secondRead.getValue(), header, secondRead.getIndex(), scoringStrategy);
// Validate and add the read group to the pair
final Short readGroup = headerReadGroupIndexMap.getValue().get(firstRead.getReadGroup());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add 2 simple tests that show that it explodes with the right errors on each of these cases

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me, merge when tests are green

@jamesemery jamesemery merged commit 6855752 into master May 15, 2018
@davidbenjamin davidbenjamin deleted the je_opticalDuplicateFinding branch February 24, 2021 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants