Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure no-stored-sequence reads are counted in container size #1710

Merged
merged 1 commit into from
Dec 4, 2023

Conversation

daviesrob
Copy link
Member

@daviesrob daviesrob commented Nov 30, 2023

In CRAM, sequence bases have to be stored for certain read features, even if the input record has SEQ set to "*" (i.e. no sequence stored). In such cases, the sequence becomes a string of "N"s, which need to be accounted for when calculating how many items are put into a slice. Failing to do this could lead to some data structures overflowing with certain unusual inputs. The number of Ns is set by the query-consuming operators in the CIGAR string, which can become quite large if you have lots of long insertions.

A limit is also set on the apparent sequence length of these records to prevent problems with excessive time and memory use when trying to encode them. This also prevents issues with integer overflow if the apparent sequence length exceeded INT_MAX.

In cram_encode_container() an extra check is added to ensure at least one record is present, to fix a crash that could occur if cram_put_bam_seq() bailed out with an error.

The attached file testcases.zip includes some test cases. bad1.sam has enough CIGAR operations to overflow INT32 when converting to CRAM. In bad2.sam the CIGAR operations don't overflow, but the block containing the fake sequences does, and it takes a lot of time and memory to get there. good.sam should convert to CRAM successfully.

Credit to OSS-Fuzz
Fixes oss-fuzz 64557

In CRAM, sequence bases have to be stored for certain read
features, even if the input record has SEQ set to "*" (i.e. no
sequence stored).  In such cases, the sequence becomes a string
of "N"s, which need to be accounted for when calculating how many
items are put into a slice.  Failing to do this could lead to
some data structures overflowing with certain unusual inputs.
The number of Ns is set by the query-consuming operators in
the CIGAR string, which can become quite large if you have
lots of long insertions.

A limit is also set on the apparent sequence length of these
records to prevent problems with excessive time and memory use
when trying to encode them.  This also prevents issues with
integer overflow if the apparent sequence length exceeded INT_MAX.

In cram_encode_container() an extra check is added to ensure
at least one record is present, to fix a crash that could
occur if cram_put_bam_seq() bailed out with an error.

Credit to OSS-Fuzz
Fixes oss-fuzz 64557
@jkbonfield jkbonfield merged commit 927ed61 into samtools:develop Dec 4, 2023
9 checks passed
@daviesrob daviesrob deleted the overlong_seqs branch December 4, 2023 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants