Add MZ:i tag as a check for base modification validity. #1590
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
If a sequence is hard-clipped after calling the base modifications, then the tool may, or may not, update the MM and ML tags accordingly. We have no way of distinguishing these two cases. While the base modification parsing code already detects overflows where the coordinates go beyond the sequence end, this isn't fool proof, especially if the clipping is short.
So instead we have an (as yet unwritten) proposal of MZ:i tag holding the sequence length, to be written at the same time as the MM and ML tags. This can then be used as a sanity check later on, to detect cases where the sequence has changed length via a tool that is unaware of base modifications.
TODO: as a separate PR, we should add a new API that can trim bases off the start/end of MM/ML strings to make it trivial for tools that are doing hard clipping via htslib. (Indeed we don't even have an API for SEQ/QUAL either, so it can do all together). This would make it far easier for people to keep everything in sync, and this code could then also update MZ while it's at it. That's new API though so it can arrive as a separate commit.
See samtools/hts-specs#646
Edit: I forgot to add that I also tweaked the error messages to be slightly more useful by including the read name, but that's just trivial tweakage.