Commit
Don't create overly large CRAM blocks.
Currently CRAM containers can in some circumstances become huge. To prevent this we have a limit on the number of sequences (default 10,000) and also on the number of bases (default 500 * number of seqs), so long-read technologies don't put too much in a container.

However, if we have 10k reads with jointly under 5Mb of sequence that also have over 2GB worth of aux data, then we can trigger the overflow fixed in the previous commit.

How do we get >430 bytes worth of aux for every base and >214Kb of aux for every read in real-world data rather than in deliberate stress testing? One possibility is SEQ "*" (e.g. secondary alignments from minimap2) on very long-read data with heavy aux tag usage, as this doesn't increase the base count at all. The same issue occurs to a lesser extent with supplementary alignments and hard-clipping.

We now create new containers when seq+aux goes beyond the specified limit instead of just seq. In normal circumstances this will have a limited effect.

Thanks to Martin Pollard for triggering and reporting this corner case.
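The change in accounting can be illustrated with a small model. This is only a sketch of the idea described above, not the actual htslib code: the function names, the `count_aux` flag, and the choice to check the limits before adding each read are assumptions made for illustration. Only the default limits (10,000 reads, 500 bases per read) come from the message itself.

```python
# Illustrative model only; names and check ordering are hypothetical.
SEQS_PER_CONTAINER = 10_000   # default read limit quoted in the commit message
BASES_PER_SEQ = 500           # default per-read multiplier quoted there


def should_start_new_container(n_reads, n_bases, n_aux_bytes, count_aux=True,
                               max_reads=SEQS_PER_CONTAINER,
                               bases_per_read=BASES_PER_SEQ):
    """Return True if the current container is full under the given rule."""
    size_limit = max_reads * bases_per_read
    # Old rule: only bases count. New rule: bases plus aux bytes count.
    size = n_bases + (n_aux_bytes if count_aux else 0)
    return n_reads >= max_reads or size >= size_limit


def count_containers(records, count_aux=True):
    """Count containers needed for an iterable of (seq_len, aux_len) pairs."""
    containers = 0
    n_reads = n_bases = n_aux = 0
    for seq_len, aux_len in records:
        # Close the current container first if it has already hit a limit.
        if n_reads and should_start_new_container(n_reads, n_bases, n_aux,
                                                  count_aux):
            containers += 1
            n_reads = n_bases = n_aux = 0
        n_reads += 1
        n_bases += seq_len
        n_aux += aux_len
    # Count the final, partially filled container if it holds any reads.
    return containers + (1 if n_reads else 0)


# 10,000 reads with SEQ "*" (zero bases) and 250,000 bytes of aux data each.
heavy_aux = [(0, 250_000)] * 10_000
print(count_containers(heavy_aux, count_aux=False))  # seq-only accounting
print(count_containers(heavy_aux, count_aux=True))   # seq+aux accounting
```

In this model, seq-only accounting puts all 2.5GB of aux data into a single container, while seq+aux accounting splits the same reads across many smaller ones. Typical short-read input with modest aux data still hits the read-count limit first, which matches the note that the change has limited effect in normal circumstances.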