Base mod tweaks to improve validity checking. #1749
sam_mods.c
    for (i = 0; i < state->nmods; i++) {
        // Check if any remaining items in MM after hitting the end
        // of the sequence.
        if (!b->core.l_qseq)
Wouldn't this check be best done outside the loop?
172da4d
to
0226248
Compare
It would be good to have a few more tests for bad data here. Adding some extra mods beyond the end of the sequence to records in Also, while playing with this, I noticed that both
Also, while fiddling with this I accidentally ran the wrong program (
which suggests a more worrying problem somewhere...
I'll look at the other programs once I've finished my current PR, but in answer to the first question:
Yes. It's basically a best-effort thing. We can detect errors on the reverse strand because handling it means parsing all the way through the base modification string and reading it backwards, so we detect an overflow right at the start. We could obviously do that full parse everywhere, but that would slow things down. The strategy I took was to report errors as we find them rather than run a full validation pass every time. If you call a
When we fail with "Corrupted aux data" it's often useful to know the BAM flags as well as the read name as this often helps disambiguate which record for this template is the problematic read.
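As a rough illustration of why the flag helps, here is a minimal sketch of such a message. The helper name `format_record_error` and the exact wording are hypothetical and are not HTSlib's; only the flag bit meanings (0x40 first in template, 0x80 last in template) come from the SAM specification.

```c
#include <stdio.h>

/* Hypothetical helper (not an HTSlib function): build an error message that
 * names both the read and its flag, so that the two records of a template,
 * which share a read name, can be told apart. */
int format_record_error(char *buf, size_t len, const char *msg,
                        const char *qname, unsigned flag)
{
    const char *which = "";
    if (flag & 0x40)                 /* SAM flag bit: first segment in template */
        which = " (first in template)";
    else if (flag & 0x80)            /* SAM flag bit: last segment in template */
        which = " (last in template)";

    return snprintf(buf, len, "%s for read \"%s\" flag %u%s",
                    msg, qname, flag, which);
}
```

With a read name alone, both mates of a pair would produce the same message; adding the flag separates them.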
0226248
to
1a172fc
Compare
I think what I have should now pass the memory checkers. I still need to think of more examples, but there are already 2 dedicated test files whose purpose was specifically to test MM strings referring to bases off the ends of sequences. Both test_mod and pileup_mod report errors and return with non-zero exit status, confirming the tests work. What extra tests were you thinking of?
The memory checkers are much happier. For invalid inputs, there's
MM being incompatible with SEQ length does mean MM refers to data beyond the end of the sequence. So it's the same thing.
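To make the equivalence concrete, here is a minimal stand-alone sketch of an up-front check. It does not use HTSlib, and `mm_overflows` is a hypothetical name. It assumes an unreversed record and a well-terminated MM string: each delta in an entry skips that many matching bases and then consumes one more, so an entry needs sum(delta + 1) occurrences of its base in SEQ.

```c
#include <ctype.h>
#include <stdlib.h>

/* Count occurrences of `base` in seq; an 'N' entry in MM matches any base. */
static long count_base(const char *seq, char base)
{
    long n = 0;
    for (; *seq; seq++)
        if (base == 'N' || toupper((unsigned char)*seq) == base)
            n++;
    return n;
}

/* Hypothetical check, forward-strand data only.
 * Returns 1 if some MM entry refers past the end of seq,
 *         0 if every entry fits,
 *        -1 on a syntax error (including a missing ';'). */
int mm_overflows(const char *seq, const char *mm)
{
    while (*mm) {
        char base = *mm++;
        if (!isalpha((unsigned char)base))
            return -1;
        if (*mm != '+' && *mm != '-')
            return -1;
        mm++;

        /* Skip the modification code(s) and any trailing '?' or '.'. */
        while (*mm && *mm != ',' && *mm != ';')
            mm++;

        /* Each delta skips `delta` matching bases, then uses one more. */
        long needed = 0;
        while (*mm == ',') {
            char *end;
            long delta = strtol(mm + 1, &end, 10);
            if (end == mm + 1 || delta < 0)
                return -1;
            needed += delta + 1;
            mm = end;
        }
        if (*mm != ';')
            return -1;
        mm++;

        if (needed > count_base(seq, base))
            return 1;
    }
    return 0;
}
```

For SEQ "ACGC", the string "C+m,0,0;" fits (two C bases, two needed), while "C+m,0,0,0;" needs a third C and so is reported as an overflow. This is the full validation pass that the library avoids doing on every call for speed.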
1a172fc
to
777ac4d
Compare
Added MM overflow detection for +ve strand data. It was already detected on -ve strand as this is done in the initial parse loop. We only detect +ve strand overflow once we hit the end of the iterator, but it's still sufficient for validation. Also improve MM parsing when faced with empty lists, such as "C+m;C+h,0". These parsed correctly before, but left state->MM[0] pointing at the "C" rather than ";" which makes overflow detection harder.
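The idea of catching the overflow only at the end of iteration can be sketched as follows. This is a simplified stand-alone model, not the code in sam_mods.c: the `mod_iter` type and its functions are hypothetical, it handles a single well-formed MM entry, and it does no syntax validation. The cursor is left on ';' once the list is exhausted (including for an empty list such as "C+m;"), so "anything left over?" becomes a simple test.

```c
#include <stdlib.h>

/* Hypothetical iterator over one MM entry such as "C+m,1,0;". */
typedef struct {
    const char *mm;  /* at ',' before the next delta, or at ';' when done */
    char base;       /* canonical base this entry counts */
    long skip;       /* matching bases still to skip before the next hit */
    int pending;     /* 1 if a delta has been loaded but not yet consumed */
} mod_iter;

/* Load the next delta, if any, from the entry. */
static void load_next(mod_iter *it)
{
    if (*it->mm == ',') {
        char *end;
        it->skip = strtol(it->mm + 1, &end, 10);
        it->mm = end;
        it->pending = 1;
    } else {
        it->pending = 0;   /* cursor sits on ';' (or end of string) */
    }
}

void mod_iter_init(mod_iter *it, const char *entry)
{
    it->base = entry[0];
    it->mm = entry;
    /* Move past base, strand and code(s) to the first ',' or ';'. */
    while (*it->mm && *it->mm != ',' && *it->mm != ';')
        it->mm++;
    load_next(it);
}

/* Advance by one sequence base; returns 1 if that base is modified. */
int mod_iter_step(mod_iter *it, char seq_base)
{
    if (!it->pending || seq_base != it->base)
        return 0;
    if (it->skip > 0) {
        it->skip--;
        return 0;
    }
    load_next(it);
    return 1;
}

/* Walk the whole sequence; returns -1 if MM items remain after the end. */
int scan_forward(const char *seq, const char *entry, int *nmod)
{
    mod_iter it;
    mod_iter_init(&it, entry);
    *nmod = 0;
    for (; *seq; seq++)
        *nmod += mod_iter_step(&it, *seq);
    return it.pending ? -1 : 0;
}
```

With SEQ "ACGC", the entry "C+m,1;" marks the second C and finishes cleanly, while "C+m,0,0,0;" marks both C bases and then fails at the end because a third delta is still pending. No error is raised until the iterator runs off the end of the sequence, which is the trade-off described above.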
Similarly for the base mod state on failure in pileup_mod.
This frees memory when destroying earlier than expected, such as during a processing failure. I can't figure out how this has been missed all these years!
777ac4d
to
d028e0d
Compare
I added MM overflow errors to the tests too. I also moved the code performing this check because, as reported, it didn't trigger in all cases. I think when I wrote it I was maybe testing a base modification right at the end of the sequence and also one beyond the end, but I can't be sure now. This does appear to work though and it still fixes things in
In particular, remove an excess ML entry (",169") that is rejected as an error by bam_parse_basemod2() since samtools/htslib#1749, which appeared in HTSlib 1.20. Fixes #1291. Also clean the generated MM-*.bam files.
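For anyone checking their own test data for the same problem, here is a minimal sketch of counting how many ML values an MM string implies. It is not HTSlib code and `expected_ml_count` is a hypothetical name. It assumes the rule that ML carries one value per modification code per listed position, so a combined entry such as "C+mh,5,12,0;" implies six values.

```c
#include <ctype.h>

/* Hypothetical helper: number of ML values implied by an MM string,
 * or -1 if the string is not well formed. */
long expected_ml_count(const char *mm)
{
    long total = 0;

    while (*mm) {
        /* Canonical base, then strand. */
        if (!isalpha((unsigned char)*mm))
            return -1;
        mm++;
        if (*mm != '+' && *mm != '-')
            return -1;
        mm++;

        /* Either one numeric code or one or more single-letter codes. */
        long ncodes = 0;
        if (isdigit((unsigned char)*mm)) {
            ncodes = 1;
            while (isdigit((unsigned char)*mm))
                mm++;
        } else {
            while (isalpha((unsigned char)*mm)) {
                ncodes++;
                mm++;
            }
        }
        if (ncodes == 0)
            return -1;

        /* Optional '?' or '.' status flag. */
        if (*mm == '?' || *mm == '.')
            mm++;

        /* Count the listed positions. */
        long npos = 0;
        while (*mm == ',') {
            mm++;
            if (!isdigit((unsigned char)*mm))
                return -1;
            while (isdigit((unsigned char)*mm))
                mm++;
            npos++;
        }
        if (*mm != ';')
            return -1;
        mm++;

        total += npos * ncodes;
    }
    return total;
}
```

Comparing this count with the actual length of the ML array flags records like the one fixed here, where ML had one value more than MM accounted for.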
Perhaps we want some control over whether this is a hard error or not? If so, maybe add some mitigation (nullifying) instead?