Base mod tweaks to improve validity checking. #1749

jkbonfield · 2024-02-14T14:10:59Z

We may perhaps want to have a control over whether we hard error or not? If so maybe add some mitigation (nullifying) instead?

daviesrob · 2024-02-19T12:29:28Z

sam_mods.c

+        for (i = 0; i < state->nmods; i++) {
+            // Check if any remaining items in MM after hitting the end
+            // of the sequence.
+            if (!b->core.l_qseq)


Wouldn't this check be best done outside the loop?

daviesrob · 2024-02-19T15:21:50Z

It would be good to have a few more tests for bad data here. Adding some extra mods beyond the end of the sequence to records in MM-orient.sam results in no detection on the forward strand, but a complaint for the reverse. Is that expected?

Also, while playing with this, I noticed that both pileup_mod and test_mod leak memory on failure. In the latter case it's due to a missing sam_close() in the failure clean-up path. pileup_mod is used in some tests with a non-zero return. Unfortunately it's not possible to see the leak for them in the CI tests because stderr gets redirected by the test harness, and we don't distinguish the leak-sanitizer non-zero return from the one we expect. It is possible to see when running by hand though:

$ ./test/pileup_mod test/base_mods/MM-MNf1.sam 
[E::bam_parse_basemod2] r1: MM/MN data length is incompatible with SEQ length

=================================================================
==20713==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 11536 byte(s) in 1 object(s) allocated from:
    #0 0x7f65233df7f7 in __interceptor_calloc ../../../../gcc-12.1.0/libsanitizer/asan/asan_malloc_linux.cpp:77
    #1 0x405bf4 in pileup_cd_create test/pileup_mod.c:75

SUMMARY: AddressSanitizer: 11536 byte(s) leaked in 1 allocation(s).

Also, while fiddling with this I accidentally ran the wrong program (test_mod) on the test/base_mods/MM-pileup.sam file, and got this:

==19796==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c00000003f at pc 0x0000004902b8 bp 0x7ffc2c9086f0 sp 0x7ffc2c9086e8
READ of size 1 at 0x60c00000003f thread T0
    #0 0x4902b7 in bam_mods_at_next_pos sam_mods.c:504
    #1 0x492f39 in bam_next_basemod sam_mods.c:611
    #2 0x406488 in main ./test/test_mod.c:187
    #3 0x7f1338598c86 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21c86)
    #4 0x407079 in _start (./test/test_mod+0x407079)

0x60c00000003f is located 1 bytes to the left of 128-byte region [0x60c000000040,0x60c0000000c0)
allocated by thread T0 here:
    #0 0x7f133a609d15 in __interceptor_realloc ../../../../gcc-12.1.0/libsanitizer/asan/asan_malloc_linux.cpp:85
    #1 0x45896c in sam_realloc_bam_data sam.c:446

which suggests a more worrying problem somewhere...

jkbonfield · 2024-02-19T15:28:29Z

I'll look at the other programs once I've finished my current PR, but in answer to the first question:

It would be good to have a few more tests for bad data here. Adding some extra mods beyond the end of the sequence to records in MM-orient.sam results in no detection on the forward strand, but a complaint for the reverse. Is that expected?

Yes. It's a best efforts thing basically. We can detect errors on the reverse strand because it means parsing all the way through the base modification string and reading it backwards. Hence we detect an overflow case right at the start. We could obviously do that full parse everywhere, but that would slow things down. The strategy I took was report errors as we find them rather than a full validation pass every time.

If you call a bam_next_basemod iterator which is what my new samtools fixmate -M code is doing then you'll trigger the error at the other end too. It's possible the API being used by the test harness is taking a different route somewhere.

When we fail with "Corrupted aux data" it's often useful to know the BAM flags as well as the read name as this often helps disambiguate which record for this template is the problematic read.

jkbonfield · 2024-02-20T12:16:05Z

I think what I have should now be passing the memory checkers. I still need to think of more examples, but there are 2 dedicated test files already whose purposes was specifically to test MM strings referring to bases off the ends of sequences. Both test_mod and pileup_mod report errors and return with non-zero exit status, confirming the tests work.

What extra tests were you thinking of?

test/test_mod.c

daviesrob · 2024-02-20T17:22:38Z

The memory checkers are much happier.

For invalid inputs, there's test/base_mods/MM-MNf1.sam and test/base_mods/MM-MNf1.sam, but they both fail with "MM/MN data length is incompatible with SEQ length". I don't think there's anything to test the "MM tag refers to bases beyond sequence length" case you're looking for here (or the equivalent on the reverse strand).

jkbonfield · 2024-02-20T17:24:59Z

The memory checkers are much happier.

For invalid inputs, there's test/base_mods/MM-MNf1.sam and test/base_mods/MM-MNf1.sam, but they both fail with "MM/MN data length is incompatible with SEQ length". I don't think there's anything to test the "MM tag refers to bases beyond sequence length" case you're looking for here (or the equivalent on the reverse strand).

MM being incompatible with SEQ length does mean MM refers to data beyond the end of the sequence. So it's the same thing.

Added MM overflow detection for +ve strand data. It was already detected on -ve strand as this is done in the initial parse loop. We only detect +ve strand overflow once we hit the end of the iterator, but it's still sufficient for validation. Also improve MM parsing when faced with empty lists, such as "C+m;C+h,0". These parsed correctly before, but left state->MM[0] pointing at the "C" rather than ";" which makes overflow detection harder.

Similarly for the base mod state on failure in pileup_mod.

This frees memory when destroying earlier than expected, such as during a processing failure. I can't figure out how this has been missed all these years!

jkbonfield · 2024-02-22T14:52:28Z

I added MM overflow errors too in the tests. I also moved the code performing this test, as as reported it didn't trigger on all cases. I think when I wrote it I was maybe testing a base modification right at the end of the sequence and also one beyond the end, but I can't be sure now. This does appear to work though and it still fixes things in mpileup -M.

In particular, remove an excess ML entry (",169") that is rejected as an error by bam_parse_basemod2() since samtools/htslib#1749, which appeared in HTSlib 1.20. Fixes #1291. Also clean the generated MM-*.bam files.

jkbonfield mentioned this pull request Feb 14, 2024

Add fixmate -M option for base modifications samtools/samtools#1990

Merged

daviesrob self-assigned this Feb 15, 2024

daviesrob reviewed Feb 19, 2024

View reviewed changes

jkbonfield force-pushed the base_mod_tweaks branch from 172da4d to 0226248 Compare February 19, 2024 14:30

Slightly improve aux tag sanity check error message.

280b3d0

When we fail with "Corrupted aux data" it's often useful to know the BAM flags as well as the read name as this often helps disambiguate which record for this template is the problematic read.

jkbonfield force-pushed the base_mod_tweaks branch from 0226248 to 1a172fc Compare February 20, 2024 11:03

daviesrob reviewed Feb 20, 2024

View reviewed changes

test/test_mod.c Show resolved Hide resolved

jkbonfield force-pushed the base_mod_tweaks branch from 1a172fc to 777ac4d Compare February 20, 2024 17:35

jkbonfield added 3 commits February 22, 2024 14:47

Add a missing sam_close call in test/test_mod.c

18a92f8

Similarly for the base mod state on failure in pileup_mod.

Call the pileup destructor in bam_plp_destroy.

d028e0d

This frees memory when destroying earlier than expected, such as during a processing failure. I can't figure out how this has been missed all these years!

jkbonfield force-pushed the base_mod_tweaks branch from 777ac4d to d028e0d Compare February 22, 2024 14:49

daviesrob merged commit 7db7e83 into samtools:develop Feb 23, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Base mod tweaks to improve validity checking. #1749

Base mod tweaks to improve validity checking. #1749

jkbonfield commented Feb 14, 2024

daviesrob Feb 19, 2024

daviesrob commented Feb 19, 2024

jkbonfield commented Feb 19, 2024

jkbonfield commented Feb 20, 2024 •

edited

Loading

daviesrob commented Feb 20, 2024

jkbonfield commented Feb 20, 2024

jkbonfield commented Feb 22, 2024

Base mod tweaks to improve validity checking. #1749

Base mod tweaks to improve validity checking. #1749

Conversation

jkbonfield commented Feb 14, 2024

daviesrob Feb 19, 2024

Choose a reason for hiding this comment

daviesrob commented Feb 19, 2024

jkbonfield commented Feb 19, 2024

jkbonfield commented Feb 20, 2024 • edited Loading

daviesrob commented Feb 20, 2024

jkbonfield commented Feb 20, 2024

jkbonfield commented Feb 22, 2024

jkbonfield commented Feb 20, 2024 •

edited

Loading