-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Realignment filter #4698
Realignment filter #4698
Conversation
c309c84
to
3eeef83
Compare
* @param fragmentSize the maximum distance on either side of {@code interval} to look for mates. | ||
* @return a map of reads ot their mates for all reads for which a mate could be found. | ||
*/ | ||
public Map<GATKRead, GATKRead> getReadToMateMap(final int fragmentSize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be more at home in ReadUtils
rather than here, as it's a pretty specific method. You could add a generic query(interval)
method to ReadsContext
to enable this method to move to ReadUtils
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Codecov Report
@@ Coverage Diff @@
## master #4698 +/- ##
===============================================
+ Coverage 80.111% 80.148% +0.037%
- Complexity 17442 17485 +43
===============================================
Files 1083 1083
Lines 63165 63313 +148
Branches 10193 10237 +44
===============================================
+ Hits 50602 50744 +142
+ Misses 8574 8573 -1
- Partials 3989 3996 +7
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some random comments. I am trying to understand the algorithm to be able to apply the method too, so some of them might not make sense...
public static final String MIN_MAPPING_QUALITY_LONG_NAME = "min-mapping-quality"; | ||
@Argument(fullName = MIN_MAPPING_QUALITY_LONG_NAME, | ||
doc="Ignore reads with mapping quality less than this", optional=true) | ||
private int minMappingQuality = DEFAULT_MIN_MAPPING_QUALITY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be a default ReadFilter
(MappingQualityReadFilter), which can be tweaked by the user as a plugin (delete the filter, change the minimum MQ, set a maximum one, etc). Otherwise, the filter can be applied twice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Done.
final List<BwaMemAlignment> alignments = aligner.alignSeqs(Arrays.asList(read), GATKRead::getBasesNoCopy).get(0); | ||
|
||
final List<BwaMemAlignment> nonAltAlignments = alignments.size() == 1 ? alignments : | ||
alignments.stream().filter(a -> a.getRefId() < numberOfRegularContigs).collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is only one alignment, it can still be alt, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, although I haven't seen such a case yet. However, if there's only one alignment then we're not worried about a variant falsely appearing multi-mapped.
} else if (alignments.get(0).getRefId() < 0) { | ||
return new RealignmentResult(false, alignments); | ||
} else if (alignments.size() == 1) { | ||
return new RealignmentResult(true, alignments); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is only one alignment, is it really always in the supposed location?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Empirically, yes. I actually wrote a version of this tool that used a liftover chain file to map the hg38 alignment back to the original reference in order to check this, but it slowed the tool, complicated the code, made an additional resource file (which, unlike the hg38 index, depends on the original reference) necessary, and made no difference in the results.
Basically, hg19 is good enough that even if it misses some alignments its best alignment will always show up somewhere in the hg38 alignments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be the case for other reference genomes (e.g., non-model species). I thought that the filter can be used with other species too, which have much worse assemblies...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this took a while. I get the sense that I still don't get the fundamental ideas of the realignment filter. In particular, the filter considers a realignment a success when the read realigns to a different chromosome - although I could be wrong.
I'd also be curious to hear more about the reasoning behind the filtering criteria, which I think goes as follows:
If more than 3 alt reads fail to realign, FILTER
Else, if we have 2 or more successful alignments, PASS
Else, FILTER if (# failed alignments > # success)
Could we discuss in person on Monday?
* <p> | ||
* The bam input to this tool should be the reassembly bamout produced by HaplotypeCaller or Mutect2 in the process of generating | ||
* the input callset. The original bam will also work but might fail to filter some indels. The reference passed with the -R argument | ||
* but be the reference to which the input bam was realigned. This does not need to correspond to the reference of the BWAS-MEM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: but be -> must be?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: BWAS-MEM -> BWA-MEM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done x 2
return checkAlignments(nonAltAlignments); | ||
} | ||
|
||
public RealignmentResult checkAlignments(final List<BwaMemAlignment> alignments) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
aligner.setFlagOption(BwaMemAligner.MEM_F_ALL); | ||
} | ||
|
||
public RealignmentResult realign(final GATKRead read) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested by IntelliJ, this method can be package-private (i.e. we can remove the public
keyword). I don't know how big of a deal this is. One the one hand, it's good style to keep the visibility as tight as possible. On the other hand I get the sense that GATK code as a whole doesn't think it's that big of a deal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the GATK I think we informally make anything available to a CLI public and use package-private largely for visible-for-testing methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Thanks!
@@ -68,6 +67,7 @@ public void testDreamTumorNormal(final File tumorBam, final String tumorSample, | |||
Utils.resetRandomGenerator(); | |||
final File unfilteredVcf = createTempFile("unfiltered", ".vcf"); | |||
final File filteredVcf = createTempFile("filtered", ".vcf"); | |||
final File alignmentFilteredVcf = createTempFile("alignment-filtered", ".vcf"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public void apply(final VariantContext vc, final ReadsContext readsContext, final ReferenceContext refContext, final FeatureContext fc) { | ||
Trilean passesFilter = Trilean.UNKNOWN; | ||
|
||
if (vc.getNAlleles() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this expression ever evaluate to true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless, the ternary operator here will save a couple lines of code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this expression ever evaluate to true?
I think probably not for vcf in-spec, but I'm being defensive against non-spec vcfs and spec changes.
ternary. . .
done
} | ||
} | ||
|
||
private boolean supportsVariant(final GATKRead read, final VariantContext vc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems risky to me to have no unit tests for these helper methods - it's difficult to know whether they're working as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I will be a good citizen and write some tests.
.filter(m -> m.getRefId() == r.getRefId()) | ||
.filter(m -> Math.abs(m.getRefStart() - r.getRefStart()) < realignmentArgumentCollection.maxReasonableFragmentLength) | ||
.filter(m -> SAMFlag.READ_REVERSE_STRAND.isSet(m.getSamFlag()) == SAMFlag.READ_REVERSE_STRAND.isSet(r.getSamFlag())) | ||
.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m)))
-> .map(m -> new Pair<>(r,m)))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
continue; | ||
} | ||
|
||
// First check the read itself. If it maps uniquely we don't need to bother with the mate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it maps uniquely we don't need to bother with the mate
: is no longer accurate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Deleted.
// TODO: we need to check that contig is the same or equivalent up to hg38 alt contig | ||
// TODO: do this by looking at index.getReferenceContigNames() and bamHeader.getSequenceDictionary().getSequences() | ||
// TODO: in IDE and seeing what the correspondence could be | ||
// TODO: put in check that start position is within eg 10 Mb of original mapping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the alt reads maps not exactly at the original location but within 10 Mb, we consider it a success because it could be the same location when lifted over to grch38?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the idea. This check is an edge case that might never actually arise because if the mapping is unique it's almost certainly in the right place. That's why I feel fine leaving this for later -- or never. But yes, the 10 MB is a liftover fudge factor. In practice I think the largest discrepancy between hg19 and hg38 coordinates is about 4 MB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That might not be true for other assemblies (e.g., non-model species).
.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m))) | ||
.collect(Collectors.toList()); | ||
|
||
if (plausiblePairs.size() <= 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems we consider it successful when a read and its mate originally map to somewhere in chr3 but are both realigned uniquely to chr17
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do, but as I discussed above with @magicDGS this simply never happens in practice. Maybe if someone is using a really ancient reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or any other species assembly. I would like to see this working for non-human data.
e6e5f7d
to
9d603c7
Compare
Back to @takutosato. |
* The algorithm is: | ||
* 1) make two maps of read name --> read for reads overlapping {@code interval}, one for first-of-pair reads and one | ||
* for second-of-pair reads. | ||
* 2) For all reads in an expanded interval padded by {@code fragmentSize} on both sides look for a read of the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"read of the same that..." -> "read of the same name that..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m))) | ||
.collect(Collectors.toList()); | ||
|
||
if (plausiblePairs.size() <= 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
|
||
@Test | ||
public void testSupportsVariant() { | ||
final GATKRead refRead = makeRead("refRead", 1, "ACGTACGT", "8M"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refRead
is not used, but I think we should keep the ref sequence (extended to include "GGGGG") somehow, either as a comment or as an unused variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had meant to make sure that the ref read doesn't support any of the variants and forgot to do so. Fixed.
import java.util.Collection; | ||
import java.util.List; | ||
|
||
public class RealignmentEngineUnitTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great tests!
@davidbenjamin I think there was a typo in a comment but other than that looks great! Thanks for answering my questions - it didn't let me comment back (the reply boxes disappeared for some reason) but I went through them all. Please merge when ready |
9d603c7
to
31c4f06
Compare
@magicDGS you make some good points about other species. We can't devote any time to that right now, but I do think it's worth coming back to. |
@davidbenjamin - can you please open an issue for my concerns and what is necessary for solving the issues? I might have a look at some point! |
Thanks! |
Closes #2930.
@takutosato Here it is, finally. It's working as well as I can get it to work, which is decently but not a slam dunk.