Add CRAM SQ/M5 header checking when specifying a fasta file. #1522

jkbonfield · 2022-11-07T15:19:55Z

See samtools/samtools#1748 for background, although this isn't fixing that issue and I'm not yet sure what that problem is.

I'm unsure on whether this is a desireable solution. See the benchmarks below; the "fix" isn't free.

Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match.

This does however add a performance hit when processing the header, which can unfairly penalise small files or creation of CRAMs for a small portion of the genome.

# Old htslib 1.16 speed, enabled here by explicitly ignoring md5
$ time samtools view -T $HREF -O cram,ignore_md5 -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m0.807s
user    0m0.702s
sys     0m0.101s

# New performance, dominated by validating SQ M5s
$ time samtools view -T $HREF -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m20.312s
user    0m18.543s
sys     0m1.760s

# Using REF_PATH md5 cache instead is fast
$ time samtools view -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m0.222s
user    0m0.204s
sys     0m0.016s

This significant start-up cost is why this check wasn't previously here. We need to decide between validation of correctness and performance. It's optional now, but perhaps ignore_md5 isn't the correct option (it affects other things) and perhaps it should be opt-in instead of opt-out.

daviesrob · 2022-11-08T15:08:25Z

That is a bit of a penalty, although it may be possible to optimise the loading code a bit.

How difficult would it be to only check the references at the point when they're used in alignment records? That would improve the case above, as you'd only need to check one chromosome. Although it would also mean the error could turn up quite a long way into writing the file, which isn't as ideal as discovering the problem right at the start...

jkbonfield · 2022-12-16T12:59:31Z

Taking the idea suggested of doing the checks on the fly has worked well. Thanks.

On the same file above, converting BAM to CRAM:

ignore_md5=0 using -t $HREF -i reference=$HREF                                  
        10M-10M         10M-20M         10M-50M                                 
real    0m0.886s        0m2.207s        0m5.403s                                
user    0m0.791s        0m2.062s        0m5.199s                                
sys     0m0.092s        0m0.144s        0m0.201s                                
                                                                                
ignore_md5=1                                                                    
real    0m0.612s        0m1.929s        0m5.115s                                
user    0m0.510s        0m1.779s        0m4.899s                                
sys     0m0.100s        0m0.148s        0m0.213s                                
                                                                                
ignore_md5=0 using REF_CACHE (md5)                                              
real    0m0.024s        0m1.278s        0m4.190s                                
user    0m0.017s        0m1.244s        0m4.068s                                
sys     0m0.006s        0m0.032s        0m0.121s

Ignoring the obvious HUGE win here of using a preformatted raw text file found via md5 cache (and mmap for I/O), the difference between validating md5sum and not is now negligible.

We also see the same big speed gain when doing embed_ref=2 on tiny regions, so there's that as an alternative to validating external refs still:

ignore_md5=0 embed_ref=2                                                        
        10M-10M         10M-20M         10M-50M                                 
real    0m0.025s        0m1.692s        0m5.474s                                
user    0m0.014s        0m1.578s        0m5.330s                                
sys     0m0.010s        0m0.072s        0m0.140s

It seems to work on name sorted data too (where it also validates, but once only per ref).

daviesrob · 2023-01-05T18:07:42Z

The revised version looks OK so far, apart maybe from the advice printed when you use the wrong reference:

./test/test_view -C -t /tmp/ce.fa -p /tmp/ce#1.cram test/ce#1.tmp.cram.sam_ 
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references

Trying embed_ref=1 on its own doesn't work:

./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -p /tmp/ce#1.er1.cram test/ce#1.tmp.cram.sam_ 
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references

embed_ref=2 does though, and produces a working cram file, as also does no_ref=1; or embed_ref=1 when combined with ignore_md5=1:

./test/test_view -C -t /tmp/ce.fa -o embed_ref=2 -p /tmp/ce#1.er2.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o no_ref=1 -p /tmp/ce#1.nr1.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -o ignore_md5=1 -p /tmp/ce#1.er1.im1.cram test/ce#1.tmp.cram.sam_

Would it be safe to disable the check in embed_ref=1 mode?

Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match. This does however add a performance hit when processing the header, which can unfairly penalise small files or creation of CRAMs for a small portion of the genome. # Old htslib 1.16 speed, enabled here by explicitly ignoring md5 $ time samtools view -T $HREF -O cram,ignore_md5 -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000 real 0m0.807s user 0m0.702s sys 0m0.101s # New performance, dominated by validating SQ M5s $ time samtools view -T $HREF -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000 real 0m20.312s user 0m18.543s sys 0m1.760s # Using REF_PATH md5 cache instead is fast $ time samtools view -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000 real 0m0.222s user 0m0.204s sys 0m0.016s This significant start-up cost is why this check wasn't previously here. We need to decide between validation of correctness and performance. It's optional now, but perhaps ignore_md5 isn't the correct option (it affects other things) and perhaps it should be opt-in instead of opt-out.

Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match. The references are checked on-the-fly as they're used. For name sorted data this may have a significant initial cost (the same that we see when converting from BAM to CRAM without M5 tags in the BAM header), but for position streaming the cost is only for chromosomes currently in use. This means writing a small CRAM file covering a region from just one chromosome will only incur the CPU cost of validating that one chromosome, as we're not trying to check the header tags are valid (we take them at their word) and are instead trying to ensure the fasta sequence we've been given matches the supplied headers. [SQUASH_ME: This almost completely removes the previous commit, but it's done as a separate commit for ease of review.]

Note this doesn't happen on the command line, so it's something specific to the way test/sam.c is written that triggers this. test/sam.c calls cram_load_reference before the SAM header is known. That in turn means refs2id hasn't been called and fd->refs array isn't in sync with the file header (as we have a subset of refs). Then cram_write_SAM_hdr md5sums the wrong reference when created SQ M5 tags. There was already code in the cram_write_SAM_hdr to recall refs2id to get things in sync, but unfortunately it was after the creation of M5 tags. What's odd is how this ever worked in test/sam. Maybe it just didn't and we never noticed? It was only spotted here due to the new validation code.

jkbonfield · 2023-01-12T16:58:55Z

I decided to update the error message to suggest embed_ref=2. The problem with level 1 is that we're still stating that the external reference is the correct one, when we know there is a disparity between how the file was aligned and the reference on disk. The difference could be significant, including indels that cause every read to be off by 1.

Ignoring the md5sum and continuing could give us a very suboptimal cram, so it's best to be defensive here.

… 1.19 Andrew Whitwham (22): Change stats to use a single structure. Enable the use of read groups in duplicate marking. Added json formatted stats. Changed error handling to use print_error. Added tests and documentation. Stop reads wrongly being removed in hard clip mode. (PR #1722) Changed the default UMI barcode regex. (#1735) Speed up optical duplicate checking. Added a patch to enable uncompressed writing. Update copyright for winter release. January 2023 NEWS update. Take variable initialisation out of for statement. (#1798) Fix bug in split output format and improve man page (PR #1821) Removed line that caused a warning in gcc-13. (#1838) Switched back to openssl for Alpine. Fix a minor memory leak. (#1875) Unmap reads where there are no aligned bases left. Summer 2023 copyright update. News Updates for Summer 2023. (PR #1878) Add the order of filtering to the man page. (#1907) Minor layout changes and added const modifier. Speed up optical duplicate marking. (PR #1952) Armin Toepfer (1): Allow sorting using minimisers w/ fwd strand only David Seifert (1): Use POSIX `grep` Devang Thakkar (2): consistent, more comprehensive flag filtering update documentation James Bonfield (58): Improve samtools sort on lustre. Revision to sort -n name comparison function strnum_cmp. Make consensus -m simple use a more usual definition of min_depth. Also add a --min-BQ option to samtools consensus. Add a bam_sanitize function to fixmate and view. Update bam sanitizer to use the new aux tag iterator. Add depth --excl-flags synonym for -G. Adds --incl-flags and --require-flags to the samtools depth command. Fix samtools depth usage typo. Work around PacBio HiFi quality induced biases. Also change the new pileup function quality scoring for gaps. Fix SEGV in samtoools collate when no filename given. Revise collate to require an input filename. Make faidx/fqidx output line length default to the input line length. (PR #1738) Minor clarification of samtools coverage documentation. Fix samtools consensus buffer overrun with MD:Z handling. Fix a buffer read-overflow on sequences with seq "*". Fixes needed to accommodate samtools/htslib#1522 Permit 1 thread with samtools view. Update samtools tview man page Add consensus --mark-ins option. Fix view -X command line parsing. Add CRAM_OPT_REQUIRED_FIELDS option for view -c. (PR #1776) Make the typical lack of --input-fmt option more explicit. First abandoned attempt at a cram_size sub-command. Improve samtools cram_size subcommand. Allow cirrus-ci to query the htslib PR from samtools PR title. Make test/regression.sh report diffs when failing Improve samtools collate consistency with sort - re: temp file handling. Improvements to samtools consensus for platforms with higher indel error rates. Major revision of the previous consensus-indel PR. General consensus code tidyup Expand fasta/fastq man page Minor fix to wording of mpileup --rf. Change errors produced in samtools view -d validation Expand the mpileup man page default filtering. Fix a sort -M bug when merging sub-blocks. Add minimiser sort option to collate by a fasta (indexed). Adds homopolymer squashing capacity to sort -M Add sort minimiser tests Add error checking to view -e filter expression code. Make calmd "work" on unaligned data or empty files. Extend import --order TAG to --order TAG:length Add a "samtools consensus -aa" mode. (#1851) Add samtools fastq -d TAG:value check Fix samtools reheader CRAM output version. Add long option support to samtools index. Be consistent with rounding of "average length" in samtools stats. Document samtools consensus --qual-calibration Add view -N/-R ^FILE for FILE negation. (#1896) Add samtools sort -N / samtools merge -N option. Bug fix samtools sort --write-index sort mode checks. Add hclen SAM filter documentation. Remove -5 option from samtools consensus. Fix a couple minor samtools view memory problems. News updates for Winter 2023 (1.19) Renamed NEWS to NEWS.md so it's formatted nicely on github. Make samtools mpileup -aa work on empty files. (#1939) John Marshall (10): Ensure searching NEWS for /empty/ finds this "zero-length file" entry Issue templates: Try to direct conda-related issues to Bioconda Apply `samtools fastq --input-fmt-option ...` settings Clarify that view -d argument does not include tag type (#1853) Documentation formatting etc fixes [minor] Rename test output file so it is ignored and cleaned Cast off_t values to type expected by format directive [minor] Add missing Makefile dependencies Check fclose(stdout) at the end of main() Use the spec-defined PL values in man page example Nils Homer (2): bug fix: template coordinate sort when merging in memory bug fix: samtools merge --template-coordinate Pierre Lindenbaum (2): add mapq histogram to samtools stats (PR #1684) new option --plot-depth for samtools coverage Rob Davies (18): Make sort merge in memory before writing temporary files Use the same thread pool for reading and writing in sort Use the thread pool when reading temporary files Limit the number of sort temporary files by merging if too many Fix stats breakage on long deletions when given a reference Make calmd run faster on non-position-sorted data Allow ampliconstats bed file to use fewer refs than the input file Happy New Year Change cram_size man page to cram-size Fix ci clone with no referenced HTSlib PR Minor NEWS adjustments and additional item Minor manual page updates Refactor some names as we now split on tags other than RG Allow `-d RG` to make new files from alignment RG tag values Add a -M option to limit the number of split files made Add tests for split by string aux tag functionality Fix various --write-index issues Add latest NEWS for 1.19 Valeriu Ohan (2): Use TMPDIR environment variable, when looking for a temporary folder. Add -d TAG option, which allows read split by custom aux tags. Vasudeva Easwara Sarma (1): treats empty barcode as no barcode - avoids issues with realloc daviesrob (1): Switch MacOS CI tests to an ARM-based image (#1770) erboone (1): Add plot of read lengths ("RL") from samtools stats output. vasudeva8 (5): update to show line nr reset functionality base update with parse_aux_list leak fix mpileup manual updated with samples updated cat to support multiple non-seekable streams wulj2 (5): add --duplicate-count option in 'samtools markdup' subcommand to record the original read duplication count(include itself) in a 'dc' tag. fix argument of duplicate-count add test data to test samtools markdup --duplicate-count to record the original primary read duplication count , to get the expected result 18_primary_duplicate_count.expected.sam from 18_primary_duplicate_count.sam, run command samtools markdup -t --duplicate-count --barcode-tag BC -S 18_primary_duplicate_count.sam update samtools-markdup --duplicate-count option document item update test data for samtools markdup --duplicate-count option, use sequence data generated by wgsim, no privace concern now; and also update the test.pl part for this test Étienne Mollier (2): fix various typos caught by lintian parsers whatis entry for samtools-fasta and samtools-fastq

jkbonfield added a commit to jkbonfield/samtools that referenced this pull request Nov 7, 2022

Fixes needed to accommodate samtools/htslib#1522

a48c76b

jkbonfield mentioned this pull request Nov 7, 2022

Fixes needed to accommodate samtools/htslib#1522 samtools/samtools#1752

Merged

daviesrob self-assigned this Nov 8, 2022

c4chris pushed a commit to sib-swiss/samtools that referenced this pull request Jan 11, 2023

Fixes needed to accommodate samtools/htslib#1522

711d86e

jkbonfield added 3 commits January 12, 2023 16:55

jkbonfield force-pushed the SQ_M5_check branch from 12ba892 to f8fb853 Compare January 12, 2023 16:57

daviesrob merged commit 4ec92c1 into samtools:develop Jan 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CRAM SQ/M5 header checking when specifying a fasta file. #1522

Add CRAM SQ/M5 header checking when specifying a fasta file. #1522

jkbonfield commented Nov 7, 2022 •

edited

Loading

daviesrob commented Nov 8, 2022

jkbonfield commented Dec 16, 2022

daviesrob commented Jan 5, 2023

jkbonfield commented Jan 12, 2023

Add CRAM SQ/M5 header checking when specifying a fasta file. #1522

Add CRAM SQ/M5 header checking when specifying a fasta file. #1522

Conversation

jkbonfield commented Nov 7, 2022 • edited Loading

daviesrob commented Nov 8, 2022

jkbonfield commented Dec 16, 2022

daviesrob commented Jan 5, 2023

jkbonfield commented Jan 12, 2023

jkbonfield commented Nov 7, 2022 •

edited

Loading