Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Realignment filter #4698

Merged
merged 1 commit into from
May 22, 2018
Merged

Realignment filter #4698

merged 1 commit into from
May 22, 2018

Conversation

davidbenjamin
Copy link
Contributor

Closes #2930.

@takutosato Here it is, finally. It's working as well as I can get it to work, which is decently but not a slam dunk.

@davidbenjamin davidbenjamin added this to the Mutect2 milestone Apr 24, 2018
@davidbenjamin davidbenjamin force-pushed the db_realignment_filter branch 3 times, most recently from c309c84 to 3eeef83 Compare April 24, 2018 19:05
* @param fragmentSize the maximum distance on either side of {@code interval} to look for mates.
* @return a map of reads ot their mates for all reads for which a mate could be found.
*/
public Map<GATKRead, GATKRead> getReadToMateMap(final int fragmentSize) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be more at home in ReadUtils rather than here, as it's a pretty specific method. You could add a generic query(interval) method to ReadsContext to enable this method to move to ReadUtils.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@codecov-io
Copy link

codecov-io commented Apr 24, 2018

Codecov Report

Merging #4698 into master will increase coverage by 0.037%.
The diff coverage is 93.617%.

@@               Coverage Diff               @@
##              master     #4698       +/-   ##
===============================================
+ Coverage     80.111%   80.148%   +0.037%     
- Complexity     17442     17485       +43     
===============================================
  Files           1083      1083               
  Lines          63165     63313      +148     
  Branches       10193     10237       +44     
===============================================
+ Hits           50602     50744      +142     
+ Misses          8574      8573        -1     
- Partials        3989      3996        +7
Impacted Files Coverage Δ Complexity Δ
...ute/hellbender/utils/variant/GATKVCFConstants.java 80% <ø> (ø) 4 <0> (ø) ⬇️
...broadinstitute/hellbender/engine/ReadsContext.java 100% <100%> (ø) 10 <1> (+1) ⬆️
...a/org/broadinstitute/hellbender/utils/Trilean.java 100% <100%> (ø) 3 <2> (+2) ⬆️
...alignmentfilter/RealignmentArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
...e/hellbender/utils/variant/GATKVCFHeaderLines.java 99.301% <100%> (+0.005%) 10 <0> (ø) ⬇️
...roadinstitute/hellbender/utils/read/ReadUtils.java 79.81% <100%> (+0.592%) 196 <6> (+7) ⬆️
...rs/realignmentfilter/FilterAlignmentArtifacts.java 91.228% <91.228%> (ø) 19 <19> (?)
...s/walkers/realignmentfilter/RealignmentEngine.java 93.333% <93.333%> (ø) 35 <35> (?)
...utils/smithwaterman/SmithWatermanIntelAligner.java 50% <0%> (-30%) 1% <0%> (-2%)
...adinstitute/hellbender/utils/spark/SparkUtils.java 72.727% <0%> (-11.797%) 12% <0%> (-9%)
... and 33 more

Copy link
Contributor

@magicDGS magicDGS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some random comments. I am trying to understand the algorithm to be able to apply the method too, so some of them might not make sense...

public static final String MIN_MAPPING_QUALITY_LONG_NAME = "min-mapping-quality";
@Argument(fullName = MIN_MAPPING_QUALITY_LONG_NAME,
doc="Ignore reads with mapping quality less than this", optional=true)
private int minMappingQuality = DEFAULT_MIN_MAPPING_QUALITY;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a default ReadFilter (MappingQualityReadFilter), which can be tweaked by the user as a plugin (delete the filter, change the minimum MQ, set a maximum one, etc). Otherwise, the filter can be applied twice!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Done.

final List<BwaMemAlignment> alignments = aligner.alignSeqs(Arrays.asList(read), GATKRead::getBasesNoCopy).get(0);

final List<BwaMemAlignment> nonAltAlignments = alignments.size() == 1 ? alignments :
alignments.stream().filter(a -> a.getRefId() < numberOfRegularContigs).collect(Collectors.toList());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is only one alignment, it can still be alt, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, although I haven't seen such a case yet. However, if there's only one alignment then we're not worried about a variant falsely appearing multi-mapped.

} else if (alignments.get(0).getRefId() < 0) {
return new RealignmentResult(false, alignments);
} else if (alignments.size() == 1) {
return new RealignmentResult(true, alignments);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is only one alignment, is it really always in the supposed location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirically, yes. I actually wrote a version of this tool that used a liftover chain file to map the hg38 alignment back to the original reference in order to check this, but it slowed the tool, complicated the code, made an additional resource file (which, unlike the hg38 index, depends on the original reference) necessary, and made no difference in the results.

Basically, hg19 is good enough that even if it misses some alignments its best alignment will always show up somewhere in the hg38 alignments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be the case for other reference genomes (e.g., non-model species). I thought that the filter can be used with other species too, which have much worse assemblies...

Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this took a while. I get the sense that I still don't get the fundamental ideas of the realignment filter. In particular, the filter considers a realignment a success when the read realigns to a different chromosome - although I could be wrong.

I'd also be curious to hear more about the reasoning behind the filtering criteria, which I think goes as follows:

If more than 3 alt reads fail to realign, FILTER
Else, if we have 2 or more successful alignments, PASS
Else, FILTER if (# failed alignments > # success)

Could we discuss in person on Monday?

* <p>
* The bam input to this tool should be the reassembly bamout produced by HaplotypeCaller or Mutect2 in the process of generating
* the input callset. The original bam will also work but might fail to filter some indels. The reference passed with the -R argument
* but be the reference to which the input bam was realigned. This does not need to correspond to the reference of the BWAS-MEM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: but be -> must be?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: BWAS-MEM -> BWA-MEM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done x 2

return checkAlignments(nonAltAlignments);
}

public RealignmentResult checkAlignments(final List<BwaMemAlignment> alignments) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

aligner.setFlagOption(BwaMemAligner.MEM_F_ALL);
}

public RealignmentResult realign(final GATKRead read) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested by IntelliJ, this method can be package-private (i.e. we can remove the public keyword). I don't know how big of a deal this is. One the one hand, it's good style to keep the visibility as tight as possible. On the other hand I get the sense that GATK code as a whole doesn't think it's that big of a deal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the GATK I think we informally make anything available to a CLI public and use package-private largely for visible-for-testing methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Thanks!

@@ -68,6 +67,7 @@ public void testDreamTumorNormal(final File tumorBam, final String tumorSample,
Utils.resetRandomGenerator();
final File unfilteredVcf = createTempFile("unfiltered", ".vcf");
final File filteredVcf = createTempFile("filtered", ".vcf");
final File alignmentFilteredVcf = createTempFile("alignment-filtered", ".vcf");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is not used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public void apply(final VariantContext vc, final ReadsContext readsContext, final ReferenceContext refContext, final FeatureContext fc) {
Trilean passesFilter = Trilean.UNKNOWN;

if (vc.getNAlleles() == 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this expression ever evaluate to true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless, the ternary operator here will save a couple lines of code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this expression ever evaluate to true?

I think probably not for vcf in-spec, but I'm being defensive against non-spec vcfs and spec changes.

ternary. . .

done

}
}

private boolean supportsVariant(final GATKRead read, final VariantContext vc) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems risky to me to have no unit tests for these helper methods - it's difficult to know whether they're working as expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I will be a good citizen and write some tests.

.filter(m -> m.getRefId() == r.getRefId())
.filter(m -> Math.abs(m.getRefStart() - r.getRefStart()) < realignmentArgumentCollection.maxReasonableFragmentLength)
.filter(m -> SAMFlag.READ_REVERSE_STRAND.isSet(m.getSamFlag()) == SAMFlag.READ_REVERSE_STRAND.isSet(r.getSamFlag()))
.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m))) -> .map(m -> new Pair<>(r,m)))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

continue;
}

// First check the read itself. If it maps uniquely we don't need to bother with the mate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it maps uniquely we don't need to bother with the mate: is no longer accurate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Deleted.

// TODO: we need to check that contig is the same or equivalent up to hg38 alt contig
// TODO: do this by looking at index.getReferenceContigNames() and bamHeader.getSequenceDictionary().getSequences()
// TODO: in IDE and seeing what the correspondence could be
// TODO: put in check that start position is within eg 10 Mb of original mapping
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the alt reads maps not exactly at the original location but within 10 Mb, we consider it a success because it could be the same location when lifted over to grch38?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the idea. This check is an edge case that might never actually arise because if the mapping is unique it's almost certainly in the right place. That's why I feel fine leaving this for later -- or never. But yes, the 10 MB is a liftover fudge factor. In practice I think the largest discrepancy between hg19 and hg38 coordinates is about 4 MB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might not be true for other assemblies (e.g., non-model species).

.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m)))
.collect(Collectors.toList());

if (plausiblePairs.size() <= 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we consider it successful when a read and its mate originally map to somewhere in chr3 but are both realigned uniquely to chr17

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, but as I discussed above with @magicDGS this simply never happens in practice. Maybe if someone is using a really ancient reference.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or any other species assembly. I would like to see this working for non-human data.

@davidbenjamin davidbenjamin force-pushed the db_realignment_filter branch 3 times, most recently from e6e5f7d to 9d603c7 Compare May 17, 2018 19:25
@davidbenjamin
Copy link
Contributor Author

Back to @takutosato.

* The algorithm is:
* 1) make two maps of read name --> read for reads overlapping {@code interval}, one for first-of-pair reads and one
* for second-of-pair reads.
* 2) For all reads in an expanded interval padded by {@code fragmentSize} on both sides look for a read of the same
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"read of the same that..." -> "read of the same name that..."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

.map(m -> new Pair<BwaMemAlignment, BwaMemAlignment>(r,m)))
.collect(Collectors.toList());

if (plausiblePairs.size() <= 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it


@Test
public void testSupportsVariant() {
final GATKRead refRead = makeRead("refRead", 1, "ACGTACGT", "8M");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refRead is not used, but I think we should keep the ref sequence (extended to include "GGGGG") somehow, either as a comment or as an unused variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had meant to make sure that the ref read doesn't support any of the variants and forgot to do so. Fixed.

import java.util.Collection;
import java.util.List;

public class RealignmentEngineUnitTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great tests!

@takutosato
Copy link
Contributor

@davidbenjamin I think there was a typo in a comment but other than that looks great! Thanks for answering my questions - it didn't let me comment back (the reply boxes disappeared for some reason) but I went through them all.

Please merge when ready

@davidbenjamin davidbenjamin force-pushed the db_realignment_filter branch from 9d603c7 to 31c4f06 Compare May 22, 2018 02:11
@davidbenjamin davidbenjamin merged commit ce0722a into master May 22, 2018
@davidbenjamin davidbenjamin deleted the db_realignment_filter branch May 22, 2018 13:39
@davidbenjamin
Copy link
Contributor Author

@magicDGS you make some good points about other species. We can't devote any time to that right now, but I do think it's worth coming back to.

@magicDGS
Copy link
Contributor

@davidbenjamin - can you please open an issue for my concerns and what is necessary for solving the issues? I might have a look at some point!

@davidbenjamin
Copy link
Contributor Author

@magicDGS Done -- see issue #4803.

@magicDGS
Copy link
Contributor

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants