You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Detect false positives due to mapping as follows: for every purported variant, align all alt reads, with an aligner that outputs multiple candidate alignments with scores, and toss out reads that map well to a different locus.
The simplest version of this is 1) write a VariantWalker that inputs a bam and a vcf and outputs the bases of all alt reads in the ReadsContext at each variant; 2) send these to an external alignment program; 3) read in the alignments in a GATK tool and filter accordingly. The ambitious version is to write our own simple aligner, eg a kmer-based method like BLAT or BBMap but with all the messy parts for handling big indels, RNA, and proteins removed. Writing our own BWA aligner would be wildly impractical.
So. . . given that our pipeline aligns with BWA, it might seem like this is just a redundant and laborious rehashing of the mapping quality score.
However, the mapping quality only considers multi-mapping within the reference, and therefore doesn't account for mapping errors due to incompleteness of the reference. That is, reads from genomic regions that are not part of the reference (because they're hard to assemble, like centromeres etc) might map well to a unique regions within the reference, and therefore will have fine mapping quality even though they are artifacts.
There are published "decoy genomes" -- essentially pseudo-contigs of regions missing from the reference, and mapping with BWA in memory to those might be very helpful.
So, we need to: 1) get our hands on a decoy genome that will play nicely with BWA, and 2) talk to the SV team.
To be pedantic, the mapping quality also considers how well the read aligns
to its best mapping. In places where a sample has a lot of nearby SNPs
compared to the reference the mapping qualities of the reads are low
compared to reads that contain fewer SNPs. I've been mulling over the
conflation of these two aspects of mapping quality for a while because it
biases our VQSR results, but maybe the new filtering models will be able to
figure it out.
The b37 reference with decoy contigs is here:
/humgen/1kg/reference/human_g1k_v37_decoy.fasta.I believe that the
reference issue that required the decoy in the b37 1000G work was resolved
in the hg38 reference. This is an excellent topic to discuss with Heng
during his office hours when he gets back from China in a few weeks, but I
expect the SV team will also be helpful in the meantime.
So. . . given that our pipeline aligns with BWA, it might seem like this
is just a redundant and laborious rehashing of the mapping quality score.
However, the mapping quality only considers multi-mapping within the
reference, and therefore doesn't account for mapping errors due to
incompleteness of the reference. That is, reads from genomic regions that
are not part of the reference (because they're hard to assemble, like
centromeres etc) might map well to a unique regions within the reference,
and therefore will have fine mapping quality even though they are artifacts.
There are published "decoy genomes" -- essentially pseudo-contigs of
regions missing from the reference, and mapping with BWA in memory to those might be very helpful.
So, we need to: 1) get our hands on a decoy genome that will play nicely
with BWA, and 2) talk to the SV team.
Copying comments from closed issue #993. Instead of running an aligner in memory, let's first try preprocessing an alignability (to the ref + decoy) resource file. Then we can simply query this file at each called variant.
Furthermore, once we make this track we can store this track in memory eg as a HashedListTargetCollection and therefore we can query it for every read to get an annotation for the number of uniquely mappable reads (up to some error tolerance).
One more thing: we can also query based on the start position of each read's mate.
The text was updated successfully, but these errors were encountered:
@davidbenjamin commented on Thu Jan 05 2017
Detect false positives due to mapping as follows: for every purported variant, align all alt reads, with an aligner that outputs multiple candidate alignments with scores, and toss out reads that map well to a different locus.
The simplest version of this is 1) write a
VariantWalker
that inputs a bam and a vcf and outputs the bases of all alt reads in theReadsContext
at each variant; 2) send these to an external alignment program; 3) read in the alignments in a GATK tool and filter accordingly. The ambitious version is to write our own simple aligner, eg a kmer-based method like BLAT or BBMap but with all the messy parts for handling big indels, RNA, and proteins removed. Writing our own BWA aligner would be wildly impractical.@takutosato @LeeTL1220 keeping you in the loop.
@davidbenjamin commented on Thu Jan 26 2017
Even better: rely on someone else in the group, such as Ted, to write a Java binding for BWA in memory. See #2367.
@davidbenjamin commented on Sun Apr 23 2017
So. . . given that our pipeline aligns with BWA, it might seem like this is just a redundant and laborious rehashing of the mapping quality score.
However, the mapping quality only considers multi-mapping within the reference, and therefore doesn't account for mapping errors due to incompleteness of the reference. That is, reads from genomic regions that are not part of the reference (because they're hard to assemble, like centromeres etc) might map well to a unique regions within the reference, and therefore will have fine mapping quality even though they are artifacts.
There are published "decoy genomes" -- essentially pseudo-contigs of regions missing from the reference, and mapping with BWA in memory to those might be very helpful.
So, we need to: 1) get our hands on a decoy genome that will play nicely with BWA, and 2) talk to the SV team.
@ldgauthier commented on Mon Apr 24 2017
To be pedantic, the mapping quality also considers how well the read aligns
to its best mapping. In places where a sample has a lot of nearby SNPs
compared to the reference the mapping qualities of the reads are low
compared to reads that contain fewer SNPs. I've been mulling over the
conflation of these two aspects of mapping quality for a while because it
biases our VQSR results, but maybe the new filtering models will be able to
figure it out.
The b37 reference with decoy contigs is here:
/humgen/1kg/reference/human_g1k_v37_decoy.fasta.I believe that the
reference issue that required the decoy in the b37 1000G work was resolved
in the hg38 reference. This is an excellent topic to discuss with Heng
during his office hours when he gets back from China in a few weeks, but I
expect the SV team will also be helpful in the meantime.
On Sun, Apr 23, 2017 at 11:14 PM, David Benjamin notifications@github.com
wrote:
@davidbenjamin commented on Wed May 03 2017
Copying comments from closed issue #993. Instead of running an aligner in memory, let's first try preprocessing an alignability (to the ref + decoy) resource file. Then we can simply query this file at each called variant.
The text was updated successfully, but these errors were encountered: