-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SAM recommendation of MAPQ=0 for unmapped reads #753
base: master
Are you sure you want to change the base?
Conversation
We already have a recommendation that MAPQ of 255 should not be used. Expand on this to recommend zero is used when unmapped. This is purely a recommendation, made for maximum compatibility, and not a specification requirement. Fixes samtools#727
Should this be further constrained to unmapped records with missing mapping qualities? That is, recommend using 0 instead of 255. Otherwise, is this recommending that all unmapped records have their mapping qualities reset to 0? |
Is there any value in another mapping quality? By definition, if it's not mapped, the mapping quality doesn't make sense. |
The use case for non-zero mapq is for an aligner that can confidently map to a sequence that is not included in the reference genome. The primary use case for this is graph genome aligners. Graph genome aligners can have a mapq w.r.t. to graph but when projected onto a linear reference, that that alignment position is fully contained within an insertion. How should these be represented? The two approaches I've seen in the wild are:
Making this mapq change will cause an existing implementation in widespread usage to break/become non-compliant. Ok to do, but it would be good to have a) some time to update the implementation, and b) spec-defined guideance on how non-reference alignments should be handled (both those with INS liftovers and those that do not map to reference contig (e.g. graph decoy sequence). For example, DRAGEN retains information about originating alignment via a SA-style custom SAM tag. This will be a lossy operation if mapq is forced to 0. |
I disagree. The specification already states "If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800" so adding a recommendation to use zero values for such cases cannot break existing implementations any more than they are already broken. Spec compliant software should ignore those fields. All this PR is doing is trying to encourage people to stop using them in such situations. My recommendation is people should use additional aux fields if they need a way to represent something which is currently outside of the specification, or to raise a PR here requesting things are changed. Mind you, that sentence above also rules out the well established (since day one) strategy of "placed" reads that are unmapped by POSitioned along side their mate pair, which is rather unfortunate. Edit: sorry I'm being forgetful. There is a recommendations section which does define meaning to RNAME and POS for unmapped data (ie the mate-pair strategy above). It's rather poor form though for the same document to be essentially saying "these fields have no meaning" followed by "this is the meaning of these fields"! I accept that it's "no assumptions" rather than no meaning, but we're leading people up the garden path somewhat. The phrasing in section 1.4 ought to be amended to indicate this. Eg: "Bit 0x4 is the only reliable place to tell whether the read is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, and bits 0x2, 0x100, and 0x800. However see the Recommended Practices in section 2 for suggested interpretations." It's far from ideal and I don't believe it should ever have been written this way, but it is what it is. As for the DRAGEN strategy of placing the read where the graph insertion is, I'm fully in agreement with it from a logical view point. Remember why we "place" unmapped reads along side their mate pair. It's because the sample may not be the same as the reference and possibly we have a large insertion. By putting reads together, we can pull out all data for such a region via a single SAM query and then do denovo assembly to construct a sample-specific set of alleles for that locus. Indeed this is exactly what a lot of variant callers do. If we have a better placement for that read via a graph genome, then it is sensible that we place the unmapped data closer to the exact site of insertion. Previously this wouldn't have made much difference with classic small insert sized templates, but we now have longer range library types and also the possibility of very long reads which may be broken up into supplementary reads and placed at different locations. Also, why MAPQ and only MAPQ? If we have a path through a graph and we've flattened those coordinates back to their nearest place on a linear reference, then we have the alignment of this sequence against the graph, including the path in the graph, the cigar string against that path, it's mapping quality and maybe NM / MD tags against that path too. Basically everything we see in GAF format. I'm not arguing for adding meaning to those other fields, but questioning why only one field was chosen (mapq). This is all best done with additional tags rather than changing to meaning of the mandatory columns, especially given our assertion that we cannot make any assumptions on their meaning. The spec really needs a refresh to accommodate such ideas, but IMO it should be driven by the people generating the data and doing the graph alignments. (Ie GA4GH "Driver projects" and similar.) It's also a much larger issue than this one minor tweak. |
On the related issue of POS having no meaning for reads with no aligned bases, I hit this many years ago when implementing Gap5. It's worse than it looks too as it's still problematic even when you do have aligned bases. Insertions basically appear before the next base, so a read that starts with an insertion could have CIGAR of Fundamentally though this breaks indexing. Even the normal case of
A query for chr:5-5 wouldn't give us Seq2. (And a query for chr:6-6 wouldn't give us Seq1.) We need more nuance here. Processing each read iteratively in a pileup orientation also doesn't work and would give us a set of reads with an insertion at position 5 that doesn't include Seq2, as it hasn't yet started. Then when we get to position 6 we're told there was additional data we should have been using for computing the insertion consensus. Doh! In Gap5 I replaced every CIGAR with 1D and decremented POS by 1, and then during processing removed the initial deleted base. VERY ugly, but it made the pileup iterator work as our position coordinates are now the base before an insertion rather than after the insertion. It's a miss-design of the spec IMO, but we are where we are and I don't really know how to retrofit it without resorting to such hacks. |
We already have a recommendation that MAPQ of 255 should not be used. Expand on this to recommend zero is used when unmapped.
This is purely a recommendation, made for maximum compatibility, and not a specification requirement.
Fixes #727