Ensure no-stored-sequence reads are counted in container size #1710
In CRAM, sequence bases have to be stored for certain read features, even if the input record has SEQ set to "*" (i.e. no sequence stored). In such cases, the sequence becomes a string of "N"s, which need to be accounted for when calculating how many items are put into a slice. Failing to do this could lead to some data structures overflowing with certain unusual inputs. The number of Ns is set by the query-consuming operators in the CIGAR string, which can become quite large if you have lots of long insertions.
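As a rough illustration of where that count comes from, the sketch below sums the lengths of the query-consuming operators (M, I, S, = and X) over a BAM-encoded CIGAR array. The helper name `implied_query_len` is hypothetical and is not the function used in this change; it only assumes the standard BAM packing of each CIGAR element as `length << 4 | op`. The total is accumulated in 64 bits so that it cannot wrap before it is compared against a limit.

```c
#include <stddef.h>
#include <stdint.h>

/* BAM packs each CIGAR element as (length << 4) | op_code. */
#define CIGAR_OP(c)  ((c) & 0xfu)
#define CIGAR_LEN(c) ((c) >> 4)

/* Returns 1 for the operators that consume query bases:
 * M (0), I (1), S (4), = (7) and X (8). */
static int consumes_query(uint32_t op) {
    return op == 0 || op == 1 || op == 4 || op == 7 || op == 8;
}

/* Hypothetical helper: number of bases ("N"s) implied by the CIGAR
 * when SEQ is "*". Uses a 64-bit accumulator to avoid wrap-around. */
static uint64_t implied_query_len(const uint32_t *cigar, size_t n_cigar) {
    uint64_t total = 0;
    for (size_t i = 0; i < n_cigar; i++) {
        if (consumes_query(CIGAR_OP(cigar[i])))
            total += CIGAR_LEN(cigar[i]);
    }
    return total;
}
```

For a CIGAR of 10M5I3D4S this gives 19, since the deletion consumes reference positions only.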
A limit is also set on the apparent sequence length of these records to prevent excessive time and memory use when trying to encode them. This also prevents integer overflow in cases where the apparent sequence length would exceed INT_MAX.
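A minimal sketch of such a check is shown below. Both the name `checked_fake_seq_len` and the cap `MAX_FAKE_SEQ_LEN` are assumptions made for illustration; the actual limit chosen in this change is not stated here. The point is that the length is validated as a 64-bit value before it is narrowed to `int`.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical cap, for illustration only; not the value used in the PR. */
#define MAX_FAKE_SEQ_LEN ((uint64_t)100000000)

/* Returns 0 and stores the length in *out if it is acceptable,
 * or -1 if it exceeds the cap or would not fit in an int. */
static int checked_fake_seq_len(uint64_t implied_len, int *out) {
    if (implied_len > MAX_FAKE_SEQ_LEN || implied_len > (uint64_t)INT_MAX)
        return -1;
    *out = (int)implied_len;
    return 0;
}
```

A caller would reject the record with an error when this returns -1, rather than attempting to allocate and encode the fake sequence.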
In cram_encode_container() an extra check is added to ensure at least one record is present, to fix a crash that could occur if cram_put_bam_seq() bailed out with an error.
The attached file testcases.zip includes some test cases.
bad1.sam has enough CIGAR operations to overflow INT32 when converting to CRAM. In bad2.sam the CIGAR operations don't overflow, but the block containing the fake sequences does, and it takes a lot of time and memory to get there. good.sam should convert to CRAM successfully.
Credit to OSS-Fuzz
Fixes oss-fuzz 64557