Protection against oversized aux tags in CRAM #1613
Merged
CRAM encoding can produce containers whose concatenated block of BAM aux tags overflows in the decoder when the combined size of the aux tags becomes excessive.
This can happen in real-world data under some bizarre circumstances: for example, very long ONT records with per-base aux tags left intact, passed through an aligner that records secondary alignments with SEQ "*" but leaves the aux tags in place. Because such records contribute zero bases, the per-container limit on the number of bases is never triggered, giving rise to excessively large containers, and specifically to aux tags that combine to more than 2GB.
We fix this in two ways.
First, we protect against existing files with this problem by detecting when the overflow would occur and simply bailing out. This is perhaps overly harsh, but previously this would simply have core dumped, and to date we've only ever had one report of this happening (yesterday), so I expect it's vanishingly rare.
Second, we change the encoder so it starts new containers based on the combined base+aux byte count rather than the base count alone (in addition to the existing record-count limit).