Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mpileup on 1I1I vs 2I cigar ops #139

Closed
jkbonfield opened this issue Feb 11, 2014 · 4 comments · Fixed by samtools/htslib#1552
Closed

mpileup on 1I1I vs 2I cigar ops #139

jkbonfield opened this issue Feb 11, 2014 · 4 comments · Fixed by samtools/htslib#1552
Milestone

Comments

@jkbonfield
Copy link
Contributor

Cigar operations I and D with neighbouring operations of the same should be concatenated together. Ie CIGAR 1I1I1I should be treated as 3I and 1D1D1D as 3D.

The failure to do so causes pileup to report +1C or -1C when it really means +3CAG and -3CAG (for example). The deletion case isn't so severe as it then does emit the other remaining deletion characters in the next row, but the insertion output is simply incorrect.

It could be argued that this is a failure of the specification (it doesn't exclude 1I1I1I in favour of the more sensible 3I).

@jmarshall jmarshall added this to the wishlist milestone Mar 24, 2014
@kirkmcclure
Copy link

If, in sam.c sam_parse1(), we replace:

        for (q = p; *p && *p != '\t'; ++p)
            if (!isdigit(*p)) ++n_cigar;

with:

        int prevOp = 512, thisOp;
        for (q = p; *p && *p != '\t'; ++p) {
            if (!isdigit(*p)) {
                thisOp = (uint8_t)*p >= 128? -1 : h->cigar_tab[(int)*p];
                if ( prevOp == thisOp && ((thisOp == BAM_CINS) || (thisOp == BAM_CDEL)) ) continue;
                prevOp = thisOp;
                ++n_cigar;
            }
        }

and add:

            while ( ((op == BAM_CINS) || (op == BAM_CDEL)) && isdigit((int)*(q+1)) ) {
                char *cpNextOp = q+1;
                int nextLen = strtol(cpNextOp, &cpNextOp, 10)<<BAM_CIGAR_SHIFT;
                if ((uint8_t)*cpNextOp < 128 && h->cigar_tab[(int)*cpNextOp] == op) {
                    cigar[i] += nextLen;
                    q = cpNextOp;
                }
                else break;
            }

after:

            _parse_err(op < 0, "unrecognized CIGAR operator");

in the following loop, the problem seems fixed.
I see no regression test effects and for:

@SQ SN:c2   LN:9
@CO c2  CTAATAATC
@CO
s1  0   c2  1   0   9M  *   0   0   CTAATAATC   XXXXXXXXX
s1a 0   c2  1   0   9M2I    *   0   0   CTAATAATCGG *
s1aa    0   c2  1   0   9M1I1I  *   0   0   CTAATAATCGG *
s1d 0   c2  1   0   7M2D    *   0   0   CTAATAA *
s1dd    0   c2  1   0   7M1D1D  *   0   0   CTAATAA *
s4  0   c2  1   0   4M2I2I1D2D2M    *   0   0   CTAAGGAATC  *
s5  0   c2  1   0   2M3I5D3I2M  *   0   0   CTGGGGGGTC  *
s5i 0   c2  1   0   2M1I1I1I5D3I2M  *   0   0   CTGGGGGGTC  *
s5d 0   c2  1   0   2M3I2D3D3I2M    *   0   0   CTGGGGGGTC  *

the results are:

c2      1       C       9       ^!.^!.^!.^!.^!.^!.^!.^!.^!.     X~~~~~~~~
c2      2       T       9       .......+3GGG.+3GGG.+3GGG        X~~~~~~~~
c2      3       A       9       ......***       X~~~~~~~~
c2      4       A       9       ......+4GGAA*** X~~~~~~~~
c2      5       T       9       .....****       X~~~~~~~~
c2      6       A       9       .....****       X~~~~~~~~
c2      7       A       9       ....-2TC.-2TC**+3GGT*+3GGT*+3GGT        X~~~~~~~~
c2      8       T       7       ....... X~~~~~~
c2      9       C       7       .$.+2GG$.+2GG$.$.$.$.$  X~~~~~~

@peterjc
Copy link
Contributor

peterjc commented Apr 26, 2015

Does it make more sense to make the parser to auto-combine duplicated op-codes?

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Apr 29, 2015

Probably no Peter. The pileup code doesn't support P cigar well either, but it needs to. Eg say our organism has an inserted triplet AGA on one allele and A*A on the other allele, with the reference having neither. 1I1P1I is a reasonable way to describe the A*A alignment.

Ideally mpileup would emit +3A*A (and +3AGA on the others), but it doesn't deal with P at all and I'm not sure if adding * into insertions breaks anything. Traditionally this would be +2AA.

@jkbonfield
Copy link
Contributor Author

In reply to myself, it appears at some stage I got fed up of the insertion bugs and fixed this, as well as supporting P operator. So:

766717e made 1I1I1I1I become 4I and the above 1I1P1I example becomes +3A*A. It's dated 2014 but took 5 years to get merged! (According to git log, by myself sometime after I became a dev, lol)

It still doesn't cope with multiple dels concatenated though, so parts of the issue remain, eg:

@ seq4c[samtools.../samtools]; cat /tmp/mp_D.sam
@SQ	SN:z	LN:13
@CO	AGCTTAGCAG ref
@CO	AGCT   CAG -3TAG
s1	0	z	2	0	10M	*	0	0	AGCTTAGCAG	*
s1	0	z	2	0	4M3D3M	*	0	0	AGCTCAG	*
s1	0	z	2	0	4M1D1D1D3M	*	0	0	AGCTCAG	*
@ seq4c[samtools.../samtools]; samtools mpileup /tmp/mp_D.sam 
[mpileup] 1 samples in 1 input files
z	2	N	3	^!A^!A^!A	~~~
z	3	N	3	GGG	~~~
z	4	N	3	CCC	~~~
z	5	N	3	TT-3NNNT-1N	~~~
z	6	N	3	T**-1N	~~~
z	7	N	3	A**-1N	~~~
z	8	N	3	G**	~~~
z	9	N	3	CCC	~~~
z	10	N	3	AAA	~~~
z	11	N	3	G$G$G$	~~~

Those -1N shouldn't be there.

jkbonfield added a commit to jkbonfield/htslib that referenced this issue Jan 17, 2023
This means 4M1D1D1D3M is reported as 4M3D3M instead, and importantly
"p->indel=-3" for the first 1D and the 2nd 1D has "p->indel=0" (with
p->is_del=1, the same as it would for the 2nd base in a 3D cigar op).

Previously samtools mpileup would produce incorrect looking output for
the 1D1D scenario.  Fixing this in sam.c means not only is samtools
mpileup now looking better, but any tool using the mpileup API will be
getting consistent results.

Note that samtools mpileup already resolved the ...1I1I1I... case, but
it did this within the samools bam_plcmd.c code itself.  Hence while the
pileup API works, it left p->indel=1 instead of p->indel=3 for this
situation.  So we also resolve that in a similar fashion.

Note 2P1I1I is reported as p->indel=2 (a 2bp indel) even though
bam_plp_insertion would return e.g. +4**AC, as we're reporting the
number of bases inserted in this sequence rather than the padded
alignment size.

Fixes samtools/samtools#139, or at least the remaining part of the
puzzle.  Most had previously been fixed already back in 2014.
daviesrob pushed a commit to samtools/htslib that referenced this issue Jan 24, 2023
This means 4M1D1D1D3M is reported as 4M3D3M instead, and importantly
"p->indel=-3" for the first 1D and the 2nd 1D has "p->indel=0" (with
p->is_del=1, the same as it would for the 2nd base in a 3D cigar op).

Previously samtools mpileup would produce incorrect looking output for
the 1D1D scenario.  Fixing this in sam.c means not only is samtools
mpileup now looking better, but any tool using the mpileup API will be
getting consistent results.

Note that samtools mpileup already resolved the ...1I1I1I... case, but
it did this within the samools bam_plcmd.c code itself.  Hence while the
pileup API works, it left p->indel=1 instead of p->indel=3 for this
situation.  So we also resolve that in a similar fashion.

Note 2P1I1I is reported as p->indel=2 (a 2bp indel) even though
bam_plp_insertion would return e.g. +4**AC, as we're reporting the
number of bases inserted in this sequence rather than the padded
alignment size.

Fixes samtools/samtools#139, or at least the remaining part of the
puzzle.  Most had previously been fixed already back in 2014.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants