mpileup on 1I1I vs 2I cigar ops #139
Comments
If, in sam.c sam_parse1(), we replace:
with:
and add:
after:
in the following loop, the problem seems fixed.
the results are:
Would it make more sense for the parser to auto-combine duplicated op-codes?
Probably not, Peter. The pileup code doesn't support the P CIGAR operator well either, but it needs to. E.g. say our organism has an inserted triplet Ideally mpileup would emit
In reply to myself, it appears that at some stage I got fed up with the insertion bugs and fixed this, as well as adding support for the P operator. So: 766717e made It still doesn't cope with multiple concatenated deletions though, so parts of the issue remain, e.g.:
Those -1N shouldn't be there.
This means 4M1D1D1D3M is reported as 4M3D3M instead, and importantly "p->indel=-3" for the first 1D and the 2nd 1D has "p->indel=0" (with p->is_del=1, the same as it would for the 2nd base in a 3D cigar op).
Previously samtools mpileup would produce incorrect-looking output for the 1D1D scenario. Fixing this in sam.c means that not only does samtools mpileup now look better, but any tool using the mpileup API will get consistent results.
Note that samtools mpileup already resolved the ...1I1I1I... case, but it did this within the samtools bam_plcmd.c code itself. Hence while the pileup API works, it left p->indel=1 instead of p->indel=3 for this situation. So we also resolve that in a similar fashion.
Note 2P1I1I is reported as p->indel=2 (a 2bp indel) even though bam_plp_insertion would return e.g. +4**AC, as we're reporting the number of bases inserted in this sequence rather than the padded alignment size.
Fixes samtools/samtools#139, or at least the remaining part of the puzzle. Most had previously been fixed already back in 2014.
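To make the sizes quoted in the commit message concrete, here is a minimal sketch of how a signed indel size could be derived from a run of consecutive operations. It is not the htslib code: `indel_size_after` and the `CIG*` macros are hypothetical names, and the convention of passing the index `k` of the operation just before the run is an assumption made for illustration. Consecutive I lengths are summed with P skipped (so 2P1I1I gives 2), and consecutive D lengths are summed and negated (so 1D1D1D gives -3).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* BAM-style CIGAR element: length in the upper 28 bits, op code in the low 4. */
enum { OP_M = 0, OP_I = 1, OP_D = 2, OP_P = 6 };

#define CIG(len, op) (((uint32_t)(len) << 4) | (uint32_t)(op))
#define CIG_OP(c)    ((c) & 0xfu)
#define CIG_LEN(c)   ((c) >> 4)

/* Hypothetical helper (not part of htslib): signed size of the indel that
 * immediately follows cigar[k]. Positive for insertions, negative for
 * deletions, 0 if no indel follows. */
static int indel_size_after(const uint32_t *cigar, size_t n, size_t k)
{
    assert(cigar != NULL && k < n);

    /* Insertion: add up every I in the run, stepping over pads (P), so the
     * result counts inserted bases rather than padded alignment columns. */
    int ins = 0;
    for (size_t i = k + 1; i < n; i++) {
        uint32_t op = CIG_OP(cigar[i]);
        if (op == OP_I)
            ins += (int)CIG_LEN(cigar[i]);
        else if (op != OP_P)
            break;
    }
    if (ins > 0)
        return ins;

    /* Deletion: add up every directly adjacent D. */
    int del = 0;
    for (size_t i = k + 1; i < n && CIG_OP(cigar[i]) == OP_D; i++)
        del += (int)CIG_LEN(cigar[i]);
    return -del;
}
```

Called with `k` pointing at the leading 4M of 4M1D1D1D3M, this sketch yields -3, the same value a single 3D would give.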
CIGAR operations I and D with neighbouring operations of the same type should be concatenated together. I.e. CIGAR 1I1I1I should be treated as 3I and 1D1D1D as 3D.
The failure to do so causes pileup to report +1C or -1C when it really means +3CAG and -3CAG (for example). The deletion case isn't so severe as it then does emit the other remaining deletion characters in the next row, but the insertion output is simply incorrect.
It could be argued that this is a failure of the specification (it doesn't exclude 1I1I1I in favour of the more sensible 3I).
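The equivalence asked for here (1I1I1I treated as 3I, 1D1D1D as 3D) can be sketched as an in-place merge over BAM-encoded CIGAR elements. This is only an illustration of the idea, not what sam.c does: `merge_adjacent_indels` and the `CIG*` macros are hypothetical names, and the sketch assumes the summed length still fits in 28 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* BAM-style CIGAR element: length in the upper 28 bits, op code in the low 4. */
enum { OP_M = 0, OP_I = 1, OP_D = 2, OP_N = 3, OP_S = 4, OP_H = 5, OP_P = 6 };

#define CIG(len, op) (((uint32_t)(len) << 4) | (uint32_t)(op))
#define CIG_OP(c)    ((c) & 0xfu)
#define CIG_LEN(c)   ((c) >> 4)

/* Hypothetical helper (not part of htslib): collapse adjacent I/I and D/D
 * elements in place and return the new element count. Other operations are
 * copied through unchanged, even when repeated. */
static size_t merge_adjacent_indels(uint32_t *cigar, size_t n)
{
    assert(cigar != NULL || n == 0);
    if (n == 0)
        return 0;

    size_t out = 0; /* index of the last element written */
    for (size_t i = 1; i < n; i++) {
        uint32_t op = CIG_OP(cigar[i]);
        if ((op == OP_I || op == OP_D) && op == CIG_OP(cigar[out])) {
            /* Same indel op as the previous element: add the lengths. */
            cigar[out] = CIG(CIG_LEN(cigar[out]) + CIG_LEN(cigar[i]), op);
        } else {
            cigar[++out] = cigar[i];
        }
    }
    return out + 1;
}
```

Applied to 4M1D1D1D3M the array shrinks from five elements to three (4M3D3M), so any downstream consumer sees a single 3bp deletion.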