
Added a second layer of deconvolution for pairs that were causing problems in MarkDuplicatesSpark #4878

Merged: 8 commits, Jun 14, 2018

Conversation

jamesemery (Collaborator):

Here is the non-string key solution.

jamesemery requested a review from lbergelson on June 11, 2018 at 17:13
codecov-io commented Jun 11, 2018:

Codecov Report

Merging #4878 into master will increase coverage by 0.082%.
The diff coverage is 90.217%.

@@               Coverage Diff               @@
##              master     #4878       +/-   ##
===============================================
+ Coverage     80.425%   80.507%   +0.082%     
- Complexity     17821     18020      +199     
===============================================
  Files           1089      1090        +1     
  Lines          64159     64937      +778     
  Branches       10344     10510      +166     
===============================================
+ Hits           51600     52279      +679     
- Misses          8498      8569       +71     
- Partials        4061      4089       +28
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...s/read/markduplicates/sparkrecords/PairedEnds.java | 100% <ø> (ø) | 1 <0> (ø) ⬇️ |
| ...roadinstitute/hellbender/utils/read/ReadUtils.java | 80% <100%> (+0.142%) | 202 <3> (+3) ⬆️ |
| ...tools/spark/validation/CompareDuplicatesSpark.java | 84.946% <100%> (+0.502%) | 24 <3> (ø) ⬇️ |
| ...itute/hellbender/engine/spark/GATKRegistrator.java | 100% <100%> (ø) | 3 <0> (ø) ⬇️ |
| ...icates/sparkrecords/MarkDuplicatesSparkRecord.java | 100% <100%> (ø) | 7 <3> (ø) ⬇️ |
| ...ils/read/markduplicates/sparkrecords/Fragment.java | 93.75% <100%> (-6.25%) | 7 <1> (-4) |
| ...r/utils/read/markduplicates/sparkrecords/Pair.java | 97.297% <100%> (+0.631%) | 27 <1> (-2) ⬇️ |
| .../read/markduplicates/sparkrecords/Passthrough.java | 100% <100%> (ø) | 3 <0> (ø) ⬇️ |
| ...ead/markduplicates/sparkrecords/EmptyFragment.java | 78.571% <80%> (-10.317%) | 5 <1> (-4) |
| ...forms/markduplicates/MarkDuplicatesSparkUtils.java | 90.521% <87.097%> (+1.024%) | 63 <10> (+1) ⬆️ |

... and 10 more

@@ -338,6 +339,12 @@ private static long getGroupKey(MarkDuplicatesSparkRecord record) {
}
droazen (Contributor), Jun 11, 2018:

Are you checking the sequence dictionary on startup to make sure that we don't have > 64k contigs? If not, you should!

Collaborator:

Can this method be made an instance method on MarkDuplicatesSparkRecord?

jamesemery (Collaborator, Author):

I added some checks for libraries and contigs so an exception is thrown in the case where we expect the keys to break.
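
A minimal sketch of what such a startup check could look like (assuming htsjdk's SAMFileHeader API; the bound and message are illustrative, not the PR's exact code):

    // Fail fast if contig indexes cannot fit in the bits reserved for them in the key.
    private static void validateSequenceDictionarySize(final SAMFileHeader header) {
        final int numContigs = header.getSequenceDictionary().size();
        if (numContigs > 65535) { // 2^16 - 1, assuming a 16-bit contig field
            throw new GATKException("MarkDuplicatesSpark supports at most 65535 contigs, but the "
                    + "sequence dictionary contains " + numContigs);
        }
    }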

@@ -338,6 +339,12 @@ private static long getGroupKey(MarkDuplicatesSparkRecord record) {
}
}

// Note, this uses bitshift operators in order to perform only a single groupBy operation for all the merged data
private static long getGroupsForPairs(Pair record) {
Contributor:

final
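
As a sketch of the idea behind the bitshift note above (not the PR's actual pipeline code; records stands in for the upstream JavaRDD<MarkDuplicatesSparkRecord>), packing all grouping information into one long key lets a single mapToPair/groupByKey pass handle all the merged data:

    // Sketch: with every record's grouping information packed into one long,
    // fragments and pairs can be grouped together in a single pass.
    final JavaPairRDD<Long, Iterable<MarkDuplicatesSparkRecord>> grouped =
            records.mapToPair(record -> new Tuple2<>(getGroupKey(record), record))
                   .groupByKey();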

@@ -313,8 +314,8 @@ public String toString() {
nonDuplicates.add(bestFragment);
}

if (Utils.isNonEmpty(pairs)) {
nonDuplicates.add(handlePairs(pairs, finder));
for (List<Pair> pairList : pairsStratified) {
Contributor:

final?

args.addInput(getTestFile("hashCollisionedReads.bam"));
runCommandLine(args);

try ( final ReadsDataSource outputReadsSource = new ReadsDataSource(output.toPath()) ) {
Contributor:

Add a comment about this; it's not clear that this bam has reads that have a hash collision.

fleharty (Contributor) left a comment:

@jamesemery Check the read groups. Check to see if there are more than 64k contigs and throw an exception.

@@ -187,6 +187,25 @@ public void testSupplementaryReadUnmappedMate() {
}
}

@Test
public void testHashCollisionHandling() {
File output = createTempFile("supplementaryReadUnmappedMate", "bam");
Contributor:

final

jamesemery (Collaborator, Author):

I'm posting an experimental version of this branch for performance purposes. This is no longer in a state where it should be merged.

jamesemery force-pushed the je_fixHashDeconvolution branch from cd8f5b9 to 25f068b on June 12, 2018 at 20:52
jamesemery (Collaborator, Author):

@lbergelson I updated this branch with the new key representation. After some performance runs it appears that these lead to a slightly faster mapping operation and approximately 15% less serialization for the step where they are used. Can you take a look?

lbergelson (Member) left a comment:

@jamesemery a whole pile of nitpicks. 👍 when they're resolved.

*/
public static Map<String, Byte> constructLibraryIndex(final SAMFileHeader header) {
final List<String> discoveredLibraries = header.getReadGroups().stream()
.map(r -> {String library = r.getLibrary(); return library==null? LibraryIdGenerator.UNKNOWN_LIBRARY : library;} )
lbergelson (Member):

use line breaks when writing a multiline lambda
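
For reference, one way to break that lambda across lines (a sketch of the requested style, not the committed code):

    final List<String> discoveredLibraries = header.getReadGroups().stream()
            .map(rg -> {
                // fall back to the unknown-library sentinel when a read group has no library set
                final String library = rg.getLibrary();
                return library == null ? LibraryIdGenerator.UNKNOWN_LIBRARY : library;
            })
            .distinct()
            .collect(Collectors.toList());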

.distinct()
.collect(Collectors.toList());
if (discoveredLibraries.size() > 255) {
throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries");
lbergelson (Member):

include how many were detected in the error message

/**
* Method which generates a map of the readgroups from the header so they can be serialized as indexes
*/
private static Map<String, Short> getHeaderReadGroupIndexMap(final SAMFileHeader header) {
final List<SAMReadGroupRecord> readGroups = header.getReadGroups();
if (readGroups.size() > ((int)Short.MAX_VALUE)*2) {
throw new GATKException("Detected too many read groups in the header, currently MarkDuplciatesSpark only supports up to 65536 unique readgroup IDs");
lbergelson (Member):

I think there's an off-by-one here: Short.MAX_VALUE * 2 = 32767 * 2 = 65534, but the message says 65536. I think you meant to use 2^16 - 1 = 65535.
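
For what it's worth, if the index is ultimately stored as an unsigned 16-bit value, 2^16 = 65536 distinct IDs (0 through 65535) fit. A sketch of a bound that matches the error message, with the caveat that the right limit depends on how the index is consumed downstream:

    private static final int MAX_READ_GROUPS = 1 << 16; // 65536 distinct unsigned 16-bit values

    if (readGroups.size() > MAX_READ_GROUPS) {
        throw new GATKException("Detected too many read groups in the header: MarkDuplicatesSpark"
                + " supports up to " + MAX_READ_GROUPS + " unique read group IDs, but "
                + readGroups.size() + " were found");
    }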

lbergelson (Member):

It's always a good idea to include how many were detected in the error

*/
public static String keyForRead(final GATKRead read) {
return read.getName();
public static class keyForFragment extends ReadsKey {
lbergelson (Member):

capitalize


Passthrough(GATKRead read, int partitionIndex) {
super(partitionIndex, read.getName());

// use a
lbergelson (Member):

the mystery! use a what?


@Override
public int hashCode() {

lbergelson (Member):

extra line here


lbergelson (Member):

too many line breaks, save our precious line breaks

@@ -187,6 +187,25 @@ public void testSupplementaryReadUnmappedMate() {
}
}

@Test
public void testHashCollisionHandling() {
File output = createTempFile("supplementaryReadUnmappedMate", "bam");
lbergelson (Member):

I think this test is out of date now.

lbergelson (Member):

Could you add tests that generate some keys from the different reads and show that they are not equal for non-equal reads? Nothing too fancy.
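
A minimal sketch of such a test; ArtificialReadUtils is GATK's test helper, and keyForFragment stands in for whichever key factory the PR actually exposes:

    @Test
    public void testNonEqualReadsProduceNonEqualKeys() {
        final SAMFileHeader header = ArtificialReadUtils.createArtificialSamHeader();
        // two reads that differ only in alignment start must not share a key
        final GATKRead read1 = ArtificialReadUtils.createArtificialRead(header, "read1", 0, 1000, 100);
        final GATKRead read2 = ArtificialReadUtils.createArtificialRead(header, "read2", 0, 2000, 100);
        Assert.assertNotEquals(keyForFragment(header, read1), keyForFragment(header, read2));
    }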

jamesemery (Collaborator, Author):

It's not strictly out of date; it's still testing a potential source of bugs given the old hashing scheme. I don't see why not to keep it.

lbergelson (Member):

Fair enough.

}


private static long longKeyForFragment(int strandedUnclippedStart, boolean reverseStrand, int referenceIndex, byte library) {
lbergelson (Member):

It still seems weird to encode these as longs when we could either A) keep them as the individual values and let serialization handle it, or B) convert them to a byte array and save a few bits. We can revisit later if we think it would help I guess. Metrics seem higher priority than shaving a few more bits here.

jamesemery (Collaborator, Author):

I'll add it to the list of things to check when this tool is tied out.

lbergelson (Member):

Do we have a list written somewhere?

lbergelson (Member) left a comment:

@jamesemery some very minor comments and then 👍 to merge

{header, createTestRead("name1", "1", 1010, "10S90M", library1.getReadGroupId(), false),
createTestRead("name1", "1", 1200, "100M", library1.getReadGroupId(), true),
true,
createTestRead("name2", "1", 1000, "100M", library1.getReadGroupId(), false),
lbergelson (Member):

add a case for read 2 on a different contig, otherwise the same
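
That is, a sketch reusing the surrounding createTestRead helper, with only the contig changed so the pair key must differ:

    createTestRead("name2", "2", 1200, "100M", library1.getReadGroupId(), true),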

.distinct()
.collect(Collectors.toList());
if (discoveredLibraries.size() > 255) {
throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries");
throw new GATKException("Detected too many read libraries among read groups header, currently MarkDuplciatesSpark only supports up to 256 unique readgroup libraries but " + discoveredLibraries.size() + " were found");
lbergelson (Member):

typo: MarkDuplciates

}
}


// Helper methods for generating summary longs
private static long longKeyForFragment(int strandedUnclippedStart, boolean reverseStrand, int referenceIndex, byte library) {
return (((long)strandedUnclippedStart) << 32) |
lbergelson (Member):

Could you add a comment explaining why you need to mask these? It wasn't obvious to us at first, so it probably won't be obvious when we come back to look at it later.
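
Something along these lines would capture it (the field widths here are illustrative; sign extension is the point):

    // Java sign-extends when widening a negative int or byte to long, filling the
    // upper bits with 1s and clobbering fields already packed above, so every field
    // except the topmost must be masked down to its intended width.
    private static long longKeyForFragment(final int strandedUnclippedStart, final boolean reverseStrand,
                                           final int referenceIndex, final byte library) {
        return (((long) strandedUnclippedStart) << 32) |
               (((long) referenceIndex & 0xFFFF) << 16) | // mask: a negative index would sign-extend
               ((reverseStrand ? 1L : 0L) << 8) |
               (((long) library) & 0xFF);                 // mask: bytes are signed in Java
    }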

@@ -14,8 +13,8 @@
Passthrough(GATKRead read, int partitionIndex) {
super(partitionIndex, read.getName());

// use a
this.key = ReadsKey.hashKeyForRead(read);
// use a hash key here instead of a normal key because collisions don't matter here
lbergelson (Member):

👍

jamesemery merged commit ba62c2a into master on Jun 14, 2018