Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve mpileup overlap removal #1751

Merged
merged 2 commits into from
Feb 27, 2024
Merged

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Feb 16, 2024

The previous version didn't do overlap removal for deletions, meaning we'd get the same read twice where one copy had a deletion.

This is still true for more complex compound cigar strings as 2M1I5D1I5M in the second record. Possibly this can be improved on further[*], but it's a pretty exotic case that is unlikely to cause problems and the behaviour in this PR there is the same as how it used to be (ie it falls back to the old method of just continuing to the next position).

Fixes samtools/samtools#1992

[*] The cause of 5D1I5M failing is because we're on the "M" cigar and the previous is not "D". We could relax that check, but I wasn't confident this would work in all possible scenarios, eg backwards B skips or N ref-skips.

The code previously just skipped deletions, meaning templates with a
deletion in one copy and not in the other could still be reported
twice in the mpileup.  Note this depended on what came first due to
the order of iref updates.

The new code moves the two iref updates to after the two
cigar_iref2iseq_next so we're always consistently getting the next
reference base regardless.

We then detect deletions and catch the lagging sequence up, applying
the same Q*=0.8 or Q=0 logic that the mismatching base code does.

It's tricky to know if this has a beneficial impact on variant calling
or not!  If there is debate over whether a deletion exists between the
read-pairs, then it is probably not something we wish to take into
account, however it's currently random which read we select.
Previously we'd have had both deletions and not-deletion present in
the results, over-inflating depth but maybe bringing less confidence
to such deletion calls.  However it is at least consistent with the
stated intention of overlap removal.

One possible further mitigation could be to reduce quality more around
such positions (so not just Q*0.8, and also adjusting the before/after
base too for the deletion read).  That hasn't been done here though.

Fixes samtools/samtools#1992
Without this we cannot test that the overlap removal code is working,
which operates by zeroing quality values.
@whitwham whitwham merged commit 6d0dd00 into samtools:develop Feb 27, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Duplicated qname with different reads in pileup file
2 participants