Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignment filter #2930

Closed
droazen opened this issue Jun 5, 2017 · 1 comment
Closed

Alignment filter #2930

droazen opened this issue Jun 5, 2017 · 1 comment
Assignees
Labels
Milestone

Comments

@droazen
Copy link
Contributor

droazen commented Jun 5, 2017

@davidbenjamin commented on Thu Jan 05 2017

Detect false positives due to mapping as follows: for every purported variant, align all alt reads, with an aligner that outputs multiple candidate alignments with scores, and toss out reads that map well to a different locus.

The simplest version of this is 1) write a VariantWalker that inputs a bam and a vcf and outputs the bases of all alt reads in the ReadsContext at each variant; 2) send these to an external alignment program; 3) read in the alignments in a GATK tool and filter accordingly. The ambitious version is to write our own simple aligner, eg a kmer-based method like BLAT or BBMap but with all the messy parts for handling big indels, RNA, and proteins removed. Writing our own BWA aligner would be wildly impractical.

@takutosato @LeeTL1220 keeping you in the loop.


@davidbenjamin commented on Thu Jan 26 2017

Even better: rely on someone else in the group, such as Ted, to write a Java binding for BWA in memory. See #2367.


@davidbenjamin commented on Sun Apr 23 2017

So. . . given that our pipeline aligns with BWA, it might seem like this is just a redundant and laborious rehashing of the mapping quality score.

However, the mapping quality only considers multi-mapping within the reference, and therefore doesn't account for mapping errors due to incompleteness of the reference. That is, reads from genomic regions that are not part of the reference (because they're hard to assemble, like centromeres etc) might map well to a unique regions within the reference, and therefore will have fine mapping quality even though they are artifacts.

There are published "decoy genomes" -- essentially pseudo-contigs of regions missing from the reference, and mapping with BWA in memory to those might be very helpful.

So, we need to: 1) get our hands on a decoy genome that will play nicely with BWA, and 2) talk to the SV team.


@ldgauthier commented on Mon Apr 24 2017

To be pedantic, the mapping quality also considers how well the read aligns
to its best mapping. In places where a sample has a lot of nearby SNPs
compared to the reference the mapping qualities of the reads are low
compared to reads that contain fewer SNPs. I've been mulling over the
conflation of these two aspects of mapping quality for a while because it
biases our VQSR results, but maybe the new filtering models will be able to
figure it out.

The b37 reference with decoy contigs is here:
/humgen/1kg/reference/human_g1k_v37_decoy.fasta.I believe that the
reference issue that required the decoy in the b37 1000G work was resolved
in the hg38 reference. This is an excellent topic to discuss with Heng
during his office hours when he gets back from China in a few weeks, but I
expect the SV team will also be helpful in the meantime.

On Sun, Apr 23, 2017 at 11:14 PM, David Benjamin notifications@github.com
wrote:

So. . . given that our pipeline aligns with BWA, it might seem like this
is just a redundant and laborious rehashing of the mapping quality score.

However, the mapping quality only considers multi-mapping within the
reference, and therefore doesn't account for mapping errors due to
incompleteness of the reference. That is, reads from genomic regions that
are not part of the reference (because they're hard to assemble, like
centromeres etc) might map well to a unique regions within the reference,
and therefore will have fine mapping quality even though they are artifacts.

There are published "decoy genomes" -- essentially pseudo-contigs of
regions missing from the reference, and mapping with BWA in memory to
those might be very helpful.

So, we need to: 1) get our hands on a decoy genome that will play nicely
with BWA, and 2) talk to the SV team.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
broadinstitute/gatk-protected#844 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AGRhdCJQob4WqdwDN0R8jvbNGT1l0vSCks5rzBOmgaJpZM4Lb8pz
.


@davidbenjamin commented on Wed May 03 2017

Copying comments from closed issue #993. Instead of running an aligner in memory, let's first try preprocessing an alignability (to the ref + decoy) resource file. Then we can simply query this file at each called variant.

ENCODE used a kmer size of 36 bp, which is seriously obsolete and will tend to underestimate alignability. However, the GEM program (paper here: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 and binary here: http://algorithms.cnag.cat/wiki/The_GEM_library#Documentation and blog post on how to run it here: http://blog.kokocinski.net/index.php/sequence-mappability-alignability?blog=2) was used by ENCODE to produce this track and we can easily produce it ourselves with any kmer size and any mismatch threshold.

Furthermore, once we make this track we can store this track in memory eg as a HashedListTargetCollection and therefore we can query it for every read to get an annotation for the number of uniquely mappable reads (up to some error tolerance).

One more thing: we can also query based on the start position of each read's mate.

@droazen droazen added the Mutect label Jun 5, 2017
@davidbenjamin davidbenjamin added this to the Make CGA happy with Mutect 2 milestone Jun 9, 2017
@davidbenjamin
Copy link
Contributor

Once we figure out what to do in #3089, this issue consists of doing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants