Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed a bug where overlapping reads in subsequent regions can have invalid base qualities #6943

Merged
merged 3 commits into from
Nov 24, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -102,20 +102,31 @@ public static void finalizeRegion(final AssemblyRegion region,
}

final byte minTailQualityToUse = errorCorrectReads ? HaplotypeCallerEngine.MIN_TAIL_QUALITY_WITH_ERROR_CORRECTION : minTailQuality;
final List<GATKRead> readsToUse = region.getReads().stream()
// TODO unclipping soft clips may introduce bases that aren't in the extended region if the unclipped bases
// TODO include a deletion w.r.t. the reference. We must remove kmers that occur before the reference haplotype start
.map(read -> dontUseSoftClippedBases || ! ReadUtils.hasWellDefinedFragmentSize(read) ?
ReadClipper.hardClipSoftClippedBases(read) : ReadClipper.revertSoftClippedBases(read))
.map(read -> softClipLowQualityEnds ? ReadClipper.softClipLowQualEnds(read, minTailQualityToUse) :
ReadClipper.hardClipLowQualEnds(read, minTailQualityToUse))
.filter(read -> read.getStart() <= read.getEnd())
.map(read -> read.isUnmapped() ? read : ReadClipper.hardClipAdaptorSequence(read))
.filter(read -> !read.isEmpty() && read.getCigar().getReadLength() > 0)
.map(read -> ReadClipper.hardClipToRegion(read, region.getPaddedSpan().getStart(), region.getPaddedSpan().getEnd() ))
.filter(read -> read.getStart() <= read.getEnd() && read.getLength() > 0 && read.overlaps(region.getPaddedSpan()))
.sorted(new ReadCoordinateComparator(readsHeader)) // TODO: sort may be unnecessary here
.collect(Collectors.toList());

final List<GATKRead> readsToUse = new ArrayList<>();
for (GATKRead originalRead : region.getReads()) {
// TODO unclipping soft clips may introduce bases that aren't in the extended region if the unclipped bases
// TODO include a deletion w.r.t. the reference. We must remove kmers that occur before the reference haplotype start
GATKRead read = (dontUseSoftClippedBases || ! ReadUtils.hasWellDefinedFragmentSize(originalRead) ?
ReadClipper.hardClipSoftClippedBases(originalRead) : ReadClipper.revertSoftClippedBases(originalRead));
read = (softClipLowQualityEnds ? ReadClipper.softClipLowQualEnds(read, minTailQualityToUse) :
ReadClipper.hardClipLowQualEnds(read, minTailQualityToUse));

if (read.getStart() <= read.getEnd()) {
read = (read.isUnmapped() ? read : ReadClipper.hardClipAdaptorSequence(read));

if (!read.isEmpty() && read.getCigar().getReadLength() > 0) {
read = ReadClipper.hardClipToRegion(read, region.getPaddedSpan().getStart(), region.getPaddedSpan().getEnd() );

if (read.getStart() <= read.getEnd() && read.getLength() > 0 && read.overlaps(region.getPaddedSpan())) {
// NOTE: here we make a defensive copy of the read if it has not been modified by the above operations
// which might only make copies in the case that the read is actually clipped
readsToUse.add(read == originalRead? read.copy() : read);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy() method performs a shallow copy -- it ends up calling SAMRecord.clone(), which copies all fields in the SAMRecord as if by assignment (the exception is the attributes array, which is explicitly copied). So the byte[] arrays for the bases and quals point to the same memory location in the copy as in the original. Is this a problem? Do we later modify the bases/quals in-place somewhere, or do we always copy the bases/quals arrays due to the defensive copies in SAMRecordToGATKReadAdapter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do modify them later, specifically when we call cleanOverlappingReadPairs() we modify the base qualities in place for reads that overlap and if any of those reads have not been clipped bby the clipping code here this could result in the next assembly region having invalid/wrong base qualities for the same read. Given how that method is structured it is non-trivial to refactor it to make the copy only in the event it needs to be modified and it seemed easier to just put a check in to make sure that every read is deeply copied at least once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery I disagree with your assessment that cleanOverlappingReadPairs() changes the bases/quals in-place -- it calls getBases() and getBaseQualities() on the read, which perform defensive copies. The question is: is there any code that calls getBasesNoCopy() and/or getBaseQualitiesNoCopy() and truly modifies the bases/quals arrays in place?

Copy link
Contributor

@droazen droazen Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the PileupReadErrorCorrector calls getBaseQualitiesNoCopy() and getBasesNoCopy() and then modifies the bases/quals in-place. This could have the effect of modifying the base/qual arrays in the original reads that will be used in subsequent assembly regions. I think as part of this PR we should patch PileupReadErrorCorrector to call the getters that make copies, and then call the setter methods to update the bases/quals.

We should also check for additional problematic usages of the *NoCopy() methods in the HC/M2 codepaths.

}
}
}
}
readsToUse.sort(new ReadCoordinateComparator(readsHeader));

// handle overlapping read pairs from the same fragment
if (correctOverlappingBaseQualities) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,8 @@ public void testfinalizeRegion() {

SAMLineParser parser = new SAMLineParser(header);
List<GATKRead> reads = new LinkedList<GATKRead>();
// NOTE: These reads are mates that overlap one-another without agreement which means they should have modified base qualities after calling finalize()
// Read2 has a clean cigar, and thus will not be copied by the clipping code before being fed to the overlapping bases code. This test asserts that its still copied.
SAMRecord orgRead0 = parser.parseLine("HWI-ST807:461:C2P0JACXX:4:2204:18080:5857\t83\t1\t42596803\t39\t1S95M5S\t=\t42596891\t-7\tGAATCATCATCAAATGGAATCTAATGGAATCATTGAACAGAATTGAATGGAATCGTCATCGAATGAATTGAATGCAATCATCGAATGGTCTCGAATAGAAT\tDAAAEDCFCCGEEDDBEDDDGCCDEDECDDFDCEECCFEECDCEDBCDBDBCC>DCECC>DBCDDBCBDDBCDDEBCCECC>DBCDBDBGC?FCCBDB>>?\tRG:Z:tumor");
SAMRecord orgRead1 = parser.parseLine("HWI-ST807:461:C2P0JACXX:4:2204:18080:5857\t163\t1\t42596891\t39\t101M\t=\t42596803\t7\tCTCGAATGGAATCATTTTCTACTGGAAAGGAATGGAATCATCGCATAGAATCGAATGGAATTAACATGGAATGGAATCGAATGTAATCATCATCAAATGGA\t>@>:ABCDECCCEDCBBBDDBDDEBCCBEBBCBEBCBCDDCD>DECBGCDCF>CCCFCDDCBABDEDFCDCDFFDDDG?DDEGDDFDHFEGDDGECB@BAA\tRG:Z:tumor");

Expand All @@ -64,9 +66,10 @@ public void testfinalizeRegion() {
activeRegion.addAll(reads);
SampleList sampleList = SampleList.singletonSampleList("tumor");
Byte minbq = 9;
AssemblyBasedCallerUtils.finalizeRegion(activeRegion, false, false, minbq, header, sampleList, false, false);
// NOTE: this test MUST be run with correctOverlappingBaseQualities enabled otherwise this test can succeed even with unsafe code
AssemblyBasedCallerUtils.finalizeRegion(activeRegion, false, false, minbq, header, sampleList, true, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm that the modified version of this test fails without the copy() call you added above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already tested it. It does fail without the above fix. the problem is that when we added correctOverlappingBaseQualities as an argument to finalizeRegion() it was mistakenly set to false for this method, which means this test was doing nothing and thus we missed that the last round of refactoring broke the assertion in this test by allowing un-coppied base qualities to be adjusted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment then mentioning that that boolean argument is critical for the test to be meaningful?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droazen I added comments better explaining how the test works. Is this okay to merge?


// make sure reads are not changed due to finalizeRegion()
// make sure that the original reads are not changed due to finalizeRegion()
Assert.assertTrue(reads.get(0).convertToSAMRecord(header).equals(orgRead0));
Assert.assertTrue(reads.get(1).convertToSAMRecord(header).equals(orgRead1));
}
Expand Down