Ensure no-stored-sequence reads are counted in container size #1710

daviesrob · 2023-11-30T15:59:06Z

In CRAM, sequence bases have to be stored for certain read features, even if the input record has SEQ set to "*" (i.e. no sequence stored). In such cases, the sequence becomes a string of "N"s, which need to be accounted for when calculating how many items are put into a slice. Failing to do this could lead to some data structures overflowing with certain unusual inputs. The number of Ns is set by the query-consuming operators in the CIGAR string, which can become quite large if you have lots of long insertions.
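As a rough illustration of how the number of Ns follows from the CIGAR string, here is a minimal sketch that sums the lengths of the query-consuming operations (M, I, S, = and X) over a BAM-style packed CIGAR array. The helper names `consumes_query` and `apparent_seq_len` are hypothetical and are not taken from this change; the sum is held in 64 bits so that it cannot wrap the way a 32-bit counter could.

```c
#include <stddef.h>
#include <stdint.h>

/* BAM packs each CIGAR operation as (length << 4) | op_code. */
enum { OP_M = 0, OP_I = 1, OP_D = 2, OP_N = 3, OP_S = 4,
       OP_H = 5, OP_P = 6, OP_EQ = 7, OP_X = 8 };

/* Returns 1 if the operation consumes query (read) bases, else 0. */
static int consumes_query(uint32_t op) {
    return op == OP_M || op == OP_I || op == OP_S ||
           op == OP_EQ || op == OP_X;
}

/* Hypothetical helper: the apparent sequence length of a record whose
 * SEQ is "*", i.e. the number of Ns implied by the CIGAR string.
 * Accumulated in 64 bits to avoid wrapping a 32-bit counter. */
static uint64_t apparent_seq_len(const uint32_t *cigar, size_t n_cigar) {
    uint64_t len = 0;
    for (size_t i = 0; i < n_cigar; i++) {
        uint32_t op = cigar[i] & 0xf;     /* low 4 bits: operation code */
        uint32_t op_len = cigar[i] >> 4;  /* high 28 bits: length */
        if (consumes_query(op))
            len += op_len;
    }
    return len;
}
```

For a CIGAR of 5M3I4D2S this gives 10, since the deletion does not consume query bases.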

A limit is also set on the apparent sequence length of these records to prevent excessive time and memory use when encoding them. This also prevents integer overflow in cases where the apparent sequence length would exceed INT_MAX.
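The kind of bounds check described here might look like the sketch below. Both `MAX_APPARENT_SEQ_LEN` and `checked_seq_len` are hypothetical names, and the cap value is picked only for illustration; it is not the limit used by this change.

```c
#include <limits.h>
#include <stdint.h>

/* Hypothetical cap, chosen for illustration only. */
#define MAX_APPARENT_SEQ_LEN 100000000

/* Hypothetical helper: accept an apparent sequence length only if it is
 * within the cap and fits in an int. Returns 0 and stores the length in
 * *out on success, or -1 (leaving *out untouched) if it is too long. */
static int checked_seq_len(uint64_t apparent_len, int *out) {
    if (apparent_len > MAX_APPARENT_SEQ_LEN || apparent_len > INT_MAX)
        return -1;
    *out = (int) apparent_len;
    return 0;
}
```

Rejecting the record before any buffer is sized from the length avoids both the large allocation and the narrowing conversion to int.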

In cram_encode_container() an extra check is added to ensure at least one record is present, to fix a crash that could occur if cram_put_bam_seq() bailed out with an error.

The attached file testcases.zip includes some test cases. bad1.sam has enough CIGAR operations to overflow INT32 when converting to CRAM. In bad2.sam the CIGAR operations don't overflow, but the block containing the fake sequences does, and it takes a lot of time and memory to get there. good.sam should convert to CRAM successfully.

Credit to OSS-Fuzz
Fixes oss-fuzz 64557


jkbonfield merged commit 927ed61 into samtools:develop Dec 4, 2023
9 checks passed

daviesrob deleted the overlong_seqs branch December 4, 2023 13:53
