Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MZ:i tag as a check for base modification validity. #1590

Merged
merged 1 commit into from
Apr 6, 2023

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Mar 29, 2023

If a sequence is hard-clipped after calling the base modifications, then the tool may, or may not, update the MM and ML tags accordingly. We have no way of distinguishing these two cases. While the base modification parsing code already detects overflows where the coordinates go beyond the sequence end, this isn't fool proof, especially if the clipping is short.

So instead we have an (as yet unwritten) proposal of MZ:i tag holding the sequence length, to be written at the same time as the MM and ML tags. This can then be used as a sanity check later on, to detect cases where the sequence has changed length via a tool that is unaware of base modifications.

TODO: as a separate PR, we should add a new API that can trim bases off the start/end of MM/ML strings to make it trivial for tools that are doing hard clipping via htslib. (Indeed we don't even have an API for SEQ/QUAL either, so it can do all together). This would make it far easier for people to keep everything in sync, and this code could then also update MZ while it's at it. That's new API though so it can arrive as a separate commit.

See samtools/hts-specs#646

Edit: I forgot to add that I also tweaked the error messages to be slightly more useful by including the read name, but that's just trivial tweakage.

If a sequence is hard-clipped after calling the base modifications,
then the tool may, or may not, update the MM and ML tags accordingly.
We have no way of distinguishing these two cases.  While the base
modification parsing code already detects overflows where the
coordinates go beyond the sequence end, this isn't fool proof,
especially if the clipping is short.

So instead we have an (as yet unwritten) proposal of MZ:i tag holding
the sequence length, to be written at the same time as the MM and ML
tags.  This can then be used as a sanity check later on, to detect
cases where the sequence has changed length via a tool that is unaware
of base modifications.

TODO: as a separate PR, we should add a new API that can trim bases
off the start/end of MM/ML strings to make it trivial for tools that
are doing hard clipping via htslib.  (Indeed we don't even have an API
for SEQ/QUAL either, so it can do all together).  This would make it
far easier for people to keep everything in sync, and this code could
then also update MZ while it's at it.  That's new API though so it can
arrive as a separate commit.

See samtools/hts-specs#646
@vasudeva8 vasudeva8 merged commit a616e85 into samtools:develop Apr 6, 2023
vasudeva8 pushed a commit to vasudeva8/htslib that referenced this pull request May 30, 2023
If a sequence is hard-clipped after calling the base modifications,
then the tool may, or may not, update the MM and ML tags accordingly.
We have no way of distinguishing these two cases.  While the base
modification parsing code already detects overflows where the
coordinates go beyond the sequence end, this isn't fool proof,
especially if the clipping is short.

So instead we have an (as yet unwritten) proposal of MZ:i tag holding
the sequence length, to be written at the same time as the MM and ML
tags.  This can then be used as a sanity check later on, to detect
cases where the sequence has changed length via a tool that is unaware
of base modifications.

TODO: as a separate PR, we should add a new API that can trim bases
off the start/end of MM/ML strings to make it trivial for tools that
are doing hard clipping via htslib.  (Indeed we don't even have an API
for SEQ/QUAL either, so it can do all together).  This would make it
far easier for people to keep everything in sync, and this code could
then also update MZ while it's at it.  That's new API though so it can
arrive as a separate commit.

See samtools/hts-specs#646
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants