Added a second layer of deconvolution for pairs that were causing problems in MarkDuplicatesSpark #4878
Conversation
Codecov Report
@@ Coverage Diff @@
## master #4878 +/- ##
===============================================
+ Coverage 80.425% 80.507% +0.082%
- Complexity 17821 18020 +199
===============================================
Files 1089 1090 +1
Lines 64159 64937 +778
Branches 10344 10510 +166
===============================================
+ Hits 51600 52279 +679
- Misses 8498 8569 +71
- Partials 4061 4089 +28
@@ -338,6 +339,12 @@ private static long getGroupKey(MarkDuplicatesSparkRecord record) {
}
Are you checking the sequence dictionary on startup to make sure that we don't have > 64k contigs? If not, you should!
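For reference, a fail-fast check along these lines could be run against the header at startup; this is a sketch with illustrative names and message text, not the PR's actual code, assuming the reference index has to fit in 16 bits of the packed key:

import htsjdk.samtools.SAMFileHeader;
import org.broadinstitute.hellbender.exceptions.GATKException;

// Sketch: refuse to run if a reference index cannot fit in 16 bits of the packed key
private static void validateContigCount(final SAMFileHeader header) {
    final int numContigs = header.getSequenceDictionary().getSequences().size();
    if (numContigs > 65535) {
        throw new GATKException("Detected " + numContigs + " contigs in the sequence dictionary, "
                + "but the packed read keys only support up to 65535");
    }
}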
Can this method be made an instance method on MarkDuplicatesSparkRecord?
I added some checks for libraries and contigs so an exception is thrown in the case where we expect the keys to break.
@@ -338,6 +339,12 @@ private static long getGroupKey(MarkDuplicatesSparkRecord record) {
    }
}

// Note, this uses bitshift operators in order to perform only a single groupBy operation for all the merged data
private static long getGroupsForPairs(Pair record) {
final
@@ -313,8 +314,8 @@ public String toString() {
    nonDuplicates.add(bestFragment);
}

if (Utils.isNonEmpty(pairs)) {
    nonDuplicates.add(handlePairs(pairs, finder));
for (List<Pair> pairList : pairsStratified) {
final?
args.addInput(getTestFile("hashCollisionedReads.bam"));
runCommandLine(args);

try ( final ReadsDataSource outputReadsSource = new ReadsDataSource(output.toPath()) ) {
Add a comment about this, it's not clear that this bam has reads that have a hash collision.
@jamesemery Check the read groups. Check to see if there are more than 64k contigs and throw an exception.
@@ -187,6 +187,25 @@ public void testSupplementaryReadUnmappedMate() {
    }
}

@Test
public void testHashCollisionHandling() {
    File output = createTempFile("supplementaryReadUnmappedMate", "bam");
final
I'm posting an experimental version of this branch for performance testing. This is no longer in a state where it should be merged.
Force-pushed from cd8f5b9 to 25f068b
@lbergelson I updated this branch with the new key representation. After some performance runs it appears that these lead to a slightly faster mapping operation and approximately 15% less serialization for the step where they are used. Can you take a look?
@jamesemery a whole pile of nitpicks. 👍 when they're resolved.
 */
public static Map<String, Byte> constructLibraryIndex(final SAMFileHeader header) {
    final List<String> discoveredLibraries = header.getReadGroups().stream()
            .map(r -> {String library = r.getLibrary(); return library==null? LibraryIdGenerator.UNKNOWN_LIBRARY : library;} )
use line breaks when writing a multiline lambda
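Applied to the hunk above, the lambda might be broken out like this; a sketch that also folds in the error-message suggestions elsewhere in this review (LibraryIdGenerator is the GATK-internal class from the diff; its import is elided):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import htsjdk.samtools.SAMFileHeader;
import org.broadinstitute.hellbender.exceptions.GATKException;

public static Map<String, Byte> constructLibraryIndex(final SAMFileHeader header) {
    final List<String> discoveredLibraries = header.getReadGroups().stream()
            .map(r -> {
                final String library = r.getLibrary();
                return library == null ? LibraryIdGenerator.UNKNOWN_LIBRARY : library;
            })
            .distinct()
            .collect(Collectors.toList());
    if (discoveredLibraries.size() > 255) {
        throw new GATKException("Detected " + discoveredLibraries.size() + " read libraries among the read groups, "
                + "but MarkDuplicatesSpark only supports up to 255 unique library names");
    }
    final Map<String, Byte> libraryIndex = new HashMap<>();
    for (int i = 0; i < discoveredLibraries.size(); i++) {
        libraryIndex.put(discoveredLibraries.get(i), (byte) i); // the cast wraps above 127, but the index is opaque
    }
    return libraryIndex;
}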
        .distinct()
        .collect(Collectors.toList());
if (discoveredLibraries.size() > 255) {
    throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries");
include how many were detected in the error message
/**
 * Method which generates a map of the readgroups from the header so they can be serialized as indexes
 */
private static Map<String, Short> getHeaderReadGroupIndexMap(final SAMFileHeader header) {
    final List<SAMReadGroupRecord> readGroups = header.getReadGroups();
    if (readGroups.size() > ((int)Short.MAX_VALUE)*2) {
        throw new GATKException("Detected too many read groups in the header, currently MarkDuplciatesSpark only supports up to 65536 unique readgroup IDs");
I think there's an off-by-one here. Short.MAX_VALUE * 2 = 32767 * 2 = 65534, but the message says 65536. I think you meant to use 2^16 - 1 = 65535.
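In other words, a corrected version of the check might look like this sketch (65535 = 2^16 - 1, the largest value an index stored in 16 bits can take; names mirror the diff above and the detected count is included in the message):

private static Map<String, Short> getHeaderReadGroupIndexMap(final SAMFileHeader header) {
    final List<SAMReadGroupRecord> readGroups = header.getReadGroups();
    final int maxReadGroups = 65535; // 2^16 - 1: the index must fit in 16 bits
    if (readGroups.size() > maxReadGroups) {
        throw new GATKException("Detected " + readGroups.size() + " read groups in the header, but "
                + "MarkDuplicatesSpark only supports up to " + maxReadGroups + " unique read group IDs");
    }
    final Map<String, Short> readGroupIndex = new HashMap<>();
    for (int i = 0; i < readGroups.size(); i++) {
        readGroupIndex.put(readGroups.get(i).getId(), (short) i); // wraps past 32767; harmless for an opaque index
    }
    return readGroupIndex;
}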
/**
 * Method which generates a map of the readgroups from the header so they can be serialized as indexes
 */
private static Map<String, Short> getHeaderReadGroupIndexMap(final SAMFileHeader header) {
    final List<SAMReadGroupRecord> readGroups = header.getReadGroups();
    if (readGroups.size() > ((int)Short.MAX_VALUE)*2) {
        throw new GATKException("Detected too many read groups in the header, currently MarkDuplciatesSpark only supports up to 65536 unique readgroup IDs");
It's always a good idea to include how many were detected in the error message.
 */
public static String keyForRead(final GATKRead read) {
    return read.getName();
public static class keyForFragment extends ReadsKey {
capitalize
Passthrough(GATKRead read, int partitionIndex) {
    super(partitionIndex, read.getName());

    // use a
the mystery! use a what?
@Override
public int hashCode() {

extra line here
@Override
public int hashCode() {

too many line breaks, save our precious line breaks
@@ -187,6 +187,25 @@ public void testSupplementaryReadUnmappedMate() {
    }
}

@Test
public void testHashCollisionHandling() {
    File output = createTempFile("supplementaryReadUnmappedMate", "bam");
I think this test is out of date now.
Could you add tests that generate some keys from the different reads and show that they are not equal for non-equal reads? Nothing too fancy.
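Something like the sketch below would do it; packKey here is a local stand-in for the branch's actual key factory, with an assumed field layout of start(32) | strand(1) | contig(16) | library(8):

@Test
public void testNonEqualReadsProduceNonEqualKeys() {
    // Different starts, everything else equal
    Assert.assertNotEquals(packKey(1000, false, 0, (byte) 0), packKey(2000, false, 0, (byte) 0));
    // Opposite strands, everything else equal
    Assert.assertNotEquals(packKey(1000, false, 0, (byte) 0), packKey(1000, true, 0, (byte) 0));
    // Different contigs, everything else equal
    Assert.assertNotEquals(packKey(1000, false, 0, (byte) 0), packKey(1000, false, 1, (byte) 0));
}

// Stand-in for the real key construction, not the branch's code
private static long packKey(final int start, final boolean reverse, final int contig, final byte library) {
    return (((long) start) << 32) | ((reverse ? 1L : 0L) << 31) | ((contig & 0xFFFFL) << 8) | (library & 0xFFL);
}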
It's not strictly out of date; it is still testing a potential source of bugs given the old hashing scheme. I don't see why not to keep it.
Fair enough.
}


private static long longKeyForFragment(int strandedUnclippedStart, boolean reverseStrand, int referenceIndex, byte library) {
It still seems weird to encode these as longs when we could either A) keep them as the individual values and let serialization handle it, or B) convert them to a byte array and save a few bits. We can revisit later if we think it would help I guess. Metrics seem higher priority than shaving a few more bits here.
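For the record, option B could be as small as this sketch (same assumed field widths as the long layout, so it is still 8 bytes; savings would only come from trimming the fields). One caveat: byte[] doesn't hash or compare by content, so it would need wrapping before use as a grouping key:

import java.nio.ByteBuffer;

private static byte[] fragmentKeyBytes(final int start, final boolean reverseStrand,
                                       final short referenceIndex, final byte library) {
    return ByteBuffer.allocate(8)
            .putInt(start)                        // 4 bytes: stranded unclipped start
            .put((byte) (reverseStrand ? 1 : 0))  // 1 byte: strand flag
            .putShort(referenceIndex)             // 2 bytes: reference index
            .put(library)                         // 1 byte: library index
            .array();
}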
I'll add it to the list of things to check when this tool is tied out.
Do we have a list written somewhere?
@jamesemery some very minor comments and then 👍 to merge
{header, createTestRead("name1", "1", 1010, "10S90M", library1.getReadGroupId(), false),
        createTestRead("name1", "1", 1200, "100M", library1.getReadGroupId(), true),
        true,
        createTestRead("name2", "1", 1000, "100M", library1.getReadGroupId(), false),
add a case for read 2 on a different contig, otherwise the same
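For instance, a hypothetical row like this one (reusing the provider's own helpers, with the mate moved to contig "2" and the expected-duplicate flag flipped accordingly; not the committed test data):

{header, createTestRead("name1", "1", 1010, "10S90M", library1.getReadGroupId(), false),
        createTestRead("name1", "2", 1200, "100M", library1.getReadGroupId(), true), // mate on a different contig
        false, // no longer a duplicate of the name2 pair once the mate contigs differ
        createTestRead("name2", "1", 1000, "100M", library1.getReadGroupId(), false),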
        .distinct()
        .collect(Collectors.toList());
if (discoveredLibraries.size() > 255) {
    throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries");
    throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries but " + discoveredLibraries.size() + " were found");
typo MarkDuplciates
    }
}


// Helper methods for generating summary longs
private static long longKeyForFragment(int strandedUnclippedStart, boolean reverseStrand, int referenceIndex, byte library) {
    return (((long)strandedUnclippedStart) << 32) |
could you add a comment explaining the reason you need to mask these, it wasn't obvious to us at first, so it probably won't be when we've come back to look at it later.
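To spell out why: when the narrower values are widened to long, Java sign-extends them, so a negative referenceIndex (e.g. -1 for unmapped) or a library byte above 127 would smear 1-bits across the fields packed above it. A commented sketch, with the field layout assumed from the diff above:

private static long longKeyForFragment(final int strandedUnclippedStart, final boolean reverseStrand,
                                       final int referenceIndex, final byte library) {
    // The masks below are required: without them, sign extension during widening
    // would overwrite the higher-order fields with 1-bits.
    return (((long) strandedUnclippedStart) << 32) // upper 32 bits: start (fills the whole upper half, no mask needed)
            | ((reverseStrand ? 1L : 0L) << 31)    // 1 bit: strand
            | ((referenceIndex & 0xFFFFL) << 8)    // 16 bits: reference index, masked before shifting
            | (library & 0xFFL);                   // 8 bits: library index, masked to kill sign extension
}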
@@ -14,8 +13,8 @@
Passthrough(GATKRead read, int partitionIndex) {
    super(partitionIndex, read.getName());

    // use a
    this.key = ReadsKey.hashKeyForRead(read);
    // use a hash key here instead of a normal key because collisions don't matter here
👍
Here is the non-string key solution.