Protection against over sized aux tags in CRAM #1613

Merged · 2 commits · May 11, 2023

Conversation

jkbonfield (Contributor)

CRAM encoding can overflow the concatenated block of BAM aux tags used in the decoder when the size of the aux tags becomes excessive.

This can happen in real world data in some bizarre circumstances. For example, very long ONT records with per-base aux tags left intact, passed through an aligner that records secondary alignments as SEQ "*" but leaves the aux tags in place. This means the limit on the number of bases per container is not triggered, giving rise to excessively large containers, and specifically to aux tags that combine to >2GB.

We fix it in two ways.

  1. Protect against existing files with this problem by detecting that the overflow may happen and simply bailing out. This is perhaps overly harsh, but previously this would simply have core dumped, and to date we've only ever had one report of this happening (yesterday), so I expect it's vanishingly rare.

  2. By changing the encoder so it produces new containers using base+aux count rather than just base count (as well as the existing record number count).

It's possible to construct CRAM containers that are extremely large
such that building a block of BAM records representing the container
overflows due to the size of the combined aux fields.

We could change how we construct blocks of data, and work at a more
individual read level, but realistically it's just not good form to be
handling arbitrarily large containers as they may cause excessive
memory use, which brings its own denial-of-service issues.

POTENTIAL SECURITY ISSUE: Note the previous code could overflow cr->aux,
which then went negative and caused negative offsets to be passed to
memcpy.  This would lead to a crash.  I cannot see a way to make this
misbehave without crashing, and hence leak data, but it could form a
denial of service on a remote server using htslib.
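
For illustration, a minimal sketch of the kind of guard this adds (not the actual htslib code; the struct and function names are hypothetical, and only the failure mode of a 32-bit aux offset wrapping negative before memcpy comes from the description above):

```c
/* Illustrative sketch only -- not the real htslib decoder.  Names here
 * are made up for the example. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const unsigned char *data;  /* raw aux bytes for one record */
    size_t len;
} aux_blob;

/* Concatenate per-record aux blobs into one buffer, as a decoder might
 * when rebuilding BAM records.  Returns 0 on success, -1 if the running
 * offset would exceed what a signed 32-bit field can hold (the point at
 * which the old code would have overflowed). */
static int concat_aux(unsigned char *dst, size_t dst_size,
                      const aux_blob *blobs, size_t nblobs) {
    int64_t offset = 0;        /* wide accumulator; cannot wrap at 2GB */
    for (size_t i = 0; i < nblobs; i++) {
        int64_t end = offset + (int64_t)blobs[i].len;
        if (end > INT32_MAX || end > (int64_t)dst_size) {
            fprintf(stderr, "Aux data too large for container; bailing out\n");
            return -1;         /* refuse to decode rather than overflow */
        }
        memcpy(dst + offset, blobs[i].data, blobs[i].len);
        offset = end;
    }
    return 0;
}
```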

Currently CRAM containers can in some circumstances become huge.  To
prevent this we currently limit the number of sequences per container
(default 10,000) and also the number of bases (default 500 * number of
seqs), so long-read technologies don't put too much in a container.

However, if we have 10k reads with jointly under 5Mb of sequence that
also carry over 2GB worth of aux data, then we can trigger the
overflow fixed in the previous commit.

How do we get >430 bytes worth of aux for every base and >214Kb of aux
for every read, in real world data rather than in deliberate stress
testing?  One possibility is with SEQ "*" (eg secondary alignments
from minimap2) on very long-read data with heavy aux tag usage, as this
doesn't increase the base count at all.  The same issue occurs to a
lesser extent with supplementaries and hard-clipping.
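
(For reference, those thresholds appear to come from dividing a 2GB,
i.e. 2^31-byte, limit by the container defaults: 2^31 bytes spread over
10,000 * 500 = 5,000,000 bases is roughly 430 bytes of aux per base,
and spread over 10,000 reads is roughly 214KB per read.)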

We now create new containers when seq+aux goes beyond the specified
limit instead of just seq.  In normal circumstances this will have
a limited effect.
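
As a rough sketch of that change (the counters and limits below are
hypothetical stand-ins for the real encoder state, not the actual
htslib code), the container-flush test now counts aux bytes alongside
bases:

```c
/* Illustrative sketch only -- shows the behavioural change: flush on
 * bases + aux bytes, not bases alone. */
#include <stdint.h>

typedef struct {
    int64_t n_records;    /* records queued in the current container */
    int64_t n_bases;      /* SEQ bases queued */
    int64_t n_aux_bytes;  /* aux bytes queued (newly counted) */
} container_stats;

/* Defaults as described above: 10,000 records, 500 bases per record. */
#define MAX_RECORDS 10000
#define MAX_BASES   (500 * MAX_RECORDS)

/* Nonzero if the current container should be closed before adding the
 * next record. */
static int should_start_new_container(const container_stats *c) {
    /* Old test:  c->n_bases >= MAX_BASES
     * New test:  bases and aux bytes are counted together, so a run of
     * SEQ "*" records carrying huge aux tags still forces a new
     * container before the decode-side 2GB limit can be reached. */
    return c->n_records >= MAX_RECORDS ||
           c->n_bases + c->n_aux_bytes >= MAX_BASES;
}
```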

Thanks to Martin Pollard for triggering and reporting this corner case.