Protection against over sized aux tags in CRAM #1613

Merged · 2 commits · May 11, 2023

Conversation

jkbonfield (Contributor)

CRAM encoding can overflow the concatenated block of BAM aux tags used in the decoder when the size of the aux tags becomes excessive.

This can happen in real world data in some bizarre circumstances. For example, very long ONT records with per-base aux tags left intact, passed through an aligner that records secondary alignments as SEQ "*" but leaves the aux tags in place. This means the limit on the number of bases per container is not triggered, giving rise to excessively large containers, and specifically to aux tags that combine to >2GB.

We fix it in two ways.

  1. Protect against existing files with this problem by detecting that the overflow may happen and simply bailing out. This is perhaps overly harsh, but previously this would simply have core dumped, and to date we've only ever had one report of this happening (yesterday), so I expect it's vanishingly rare.

  2. By changing the encoder so it produces new containers using base+aux count rather than just base count (as well as the existing record number count).

It's possible to construct CRAM containers that are extremely large
such that building a block of BAM records representing the container
overflows due to the size of the combined aux fields.

We could change how we construct blocks of data, and work at a more
individual read level, but realistically it's just not good form to be
handling arbitrarily large containers as they may cause excessive
memory use, which brings its own denial-of-service issues.

POTENTIAL SECURITY ISSUE: Note the previous code could overflow cr->aux,
which then went negative and caused negative offsets to be passed to
memcpy.  This would lead to a crash.  I cannot see a way to make this
misbehave without crashing, and hence leak data, but it could form a
denial of service on a remote server using htslib.
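
For illustration, a minimal sketch of the kind of guard this adds (not the actual htslib code; the struct and function names are hypothetical, and only the failure mode of a 32-bit aux offset wrapping negative before memcpy comes from the description above):

```c
/* Illustrative sketch only -- not the real htslib decoder.  Names here
 * are made up for the example. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const unsigned char *data;  /* raw aux bytes for one record */
    size_t len;
} aux_blob;

/* Concatenate per-record aux blobs into one buffer, as a decoder might
 * when rebuilding BAM records.  Returns 0 on success, -1 if the running
 * offset would exceed what a signed 32-bit field can hold (the point at
 * which the old code would have overflowed). */
static int concat_aux(unsigned char *dst, size_t dst_size,
                      const aux_blob *blobs, size_t nblobs) {
    int64_t offset = 0;        /* wide accumulator; cannot wrap at 2GB */
    for (size_t i = 0; i < nblobs; i++) {
        int64_t end = offset + (int64_t)blobs[i].len;
        if (end > INT32_MAX || end > (int64_t)dst_size) {
            fprintf(stderr, "Aux data too large for container; bailing out\n");
            return -1;         /* refuse to decode rather than overflow */
        }
        memcpy(dst + offset, blobs[i].data, blobs[i].len);
        offset = end;
    }
    return 0;
}
```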

Currently CRAM containers can in some circumstances become huge.  To
prevent this we currently limit the number of sequences per container
(default 10,000) and also the number of bases (default 500 * number of
seqs), so long-read technologies don't put too much in a container.

However, if we have 10k reads with jointly under 5Mb of sequence that
also carry over 2GB worth of aux data, then we can trigger the
overflow fixed in the previous commit.

How do we get >430 bytes worth of aux for every base and >214Kb of aux
for every read, in real world data rather than in deliberate stress
testing?  One possibility is with SEQ "*" (eg secondary alignments
from minimap2) on very long-read data with heavy aux tag usage, as this
doesn't increase the base count at all.  The same issue occurs to a
lesser extent with supplementaries and hard-clipping.
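
(For reference, those thresholds appear to come from dividing a 2GB,
i.e. 2^31-byte, limit by the container defaults: 2^31 bytes spread over
10,000 * 500 = 5,000,000 bases is roughly 430 bytes of aux per base,
and spread over 10,000 reads is roughly 214KB per read.)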

We now create new containers when seq+aux goes beyond the specified
limit instead of just seq.  In normal circumstances this will have
a limited effect.
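
As a rough sketch of that change (the counters and limits below are
hypothetical stand-ins for the real encoder state, not the actual
htslib code), the container-flush test now counts aux bytes alongside
bases:

```c
/* Illustrative sketch only -- shows the behavioural change: flush on
 * bases + aux bytes, not bases alone. */
#include <stdint.h>

typedef struct {
    int64_t n_records;    /* records queued in the current container */
    int64_t n_bases;      /* SEQ bases queued */
    int64_t n_aux_bytes;  /* aux bytes queued (newly counted) */
} container_stats;

/* Defaults as described above: 10,000 records, 500 bases per record. */
#define MAX_RECORDS 10000
#define MAX_BASES   (500 * MAX_RECORDS)

/* Nonzero if the current container should be closed before adding the
 * next record. */
static int should_start_new_container(const container_stats *c) {
    /* Old test:  c->n_bases >= MAX_BASES
     * New test:  bases and aux bytes are counted together, so a run of
     * SEQ "*" records carrying huge aux tags still forces a new
     * container before the decode-side 2GB limit can be reached. */
    return c->n_records >= MAX_RECORDS ||
           c->n_bases + c->n_aux_bytes >= MAX_BASES;
}
```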

Thanks to Martin Pollard for triggering and reporting this corner case.