-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed a bug where overlapping reads in subsequent regions can have invalid base qualities #6943
Conversation
…ets coppied without doing it again for every read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest a slightly different approach here @jamesemery
@@ -114,6 +118,9 @@ public static void finalizeRegion(final AssemblyRegion region, | |||
.filter(read -> !read.isEmpty() && read.getCigar().getReadLength() > 0) | |||
.map(read -> ReadClipper.hardClipToRegion(read, region.getPaddedSpan().getStart(), region.getPaddedSpan().getEnd() )) | |||
.filter(read -> read.getStart() <= read.getEnd() && read.getLength() > 0 && read.overlaps(region.getPaddedSpan())) | |||
// The transient attribute is preserved across copy operations and all of the previous alterations make copies, this simple ensures | |||
// that any reads that have not been copied along the way are copied here for safety. | |||
.map(read -> (read.getTransientAttribute("Original") != read? read : read.copy())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of relying on transient attributes (which, in my opinion, is brittle and a recipe for future problems), compare the memory address of the read pre/post clipping, and make a copy only if the address hasn't changed, as in the original patch in #4926. Rewrite this method to use a loop instead of streaming, if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have rewritten it as a loop at the expense of angering the Aesthetic gods. Can you re-review?
@droazen can this bbe merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery Back to you with a few questions on this one
if (read.getStart() <= read.getEnd() && read.getLength() > 0 && read.overlaps(region.getPaddedSpan())) { | ||
// NOTE: here we make a defensive copy of the read if it has not been modified by the above operations | ||
// which might only make copies in the case that the read is actually clipped | ||
readsToUse.add(read == originalRead? read.copy() : read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The copy()
method performs a shallow copy -- it ends up calling SAMRecord.clone()
, which copies all fields in the SAMRecord as if by assignment (the exception is the attributes array, which is explicitly copied). So the byte[]
arrays for the bases and quals point to the same memory location in the copy as in the original. Is this a problem? Do we later modify the bases/quals in-place somewhere, or do we always copy the bases/quals arrays due to the defensive copies in SAMRecordToGATKReadAdapter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do modify them later, specifically when we call cleanOverlappingReadPairs()
we modify the base qualities in place for reads that overlap and if any of those reads have not been clipped bby the clipping code here this could result in the next assembly region having invalid/wrong base qualities for the same read. Given how that method is structured it is non-trivial to refactor it to make the copy only in the event it needs to be modified and it seemed easier to just put a check in to make sure that every read is deeply copied at least once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery I disagree with your assessment that cleanOverlappingReadPairs()
changes the bases/quals in-place -- it calls getBases()
and getBaseQualities()
on the read, which perform defensive copies. The question is: is there any code that calls getBasesNoCopy()
and/or getBaseQualitiesNoCopy()
and truly modifies the bases/quals arrays in place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the PileupReadErrorCorrector
calls getBaseQualitiesNoCopy()
and getBasesNoCopy()
and then modifies the bases/quals in-place. This could have the effect of modifying the base/qual arrays in the original reads that will be used in subsequent assembly regions. I think as part of this PR we should patch PileupReadErrorCorrector
to call the getters that make copies, and then call the setter methods to update the bases/quals.
We should also check for additional problematic usages of the *NoCopy()
methods in the HC/M2 codepaths.
@@ -64,7 +64,7 @@ public void testfinalizeRegion() { | |||
activeRegion.addAll(reads); | |||
SampleList sampleList = SampleList.singletonSampleList("tumor"); | |||
Byte minbq = 9; | |||
AssemblyBasedCallerUtils.finalizeRegion(activeRegion, false, false, minbq, header, sampleList, false, false); | |||
AssemblyBasedCallerUtils.finalizeRegion(activeRegion, false, false, minbq, header, sampleList, true, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm that the modified version of this test fails without the copy()
call you added above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already tested it. It does fail without the above fix. the problem is that when we added correctOverlappingBaseQualities
as an argument to finalizeRegion()
it was mistakenly set to false for this method, which means this test was doing nothing and thus we missed that the last round of refactoring broke the assertion in this test by allowing un-coppied base qualities to be adjusted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment then mentioning that that boolean argument is critical for the test to be meaningful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droazen I added comments better explaining how the test works. Is this okay to merge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
This fixes the test that was accidentally disabled (when we added a flag for overlapping read adjustment we forgot to set it correctly for the test) and reinstated something like the code from #4926. It should now necessarily be the case that the finalize() method calls read.copy() at least once for every read (though many multiples of once are still quite possible).
Fixes #6882