-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CRAM SQ/M5 header checking when specifying a fasta file. #1522
Conversation
That is a bit of a penalty, although it may be possible to optimise the loading code a bit. How difficult would it be to only check the references at the point when they're used in alignment records? That would improve the case above, as you'd only need to check one chromosome. Although it would also mean the error could turn up quite a long way into writing the file, which isn't as ideal as discovering the problem right at the start... |
Taking the idea suggested of doing the checks on the fly has worked well. Thanks. On the same file above, converting BAM to CRAM:
Ignoring the obvious HUGE win here of using a preformatted raw text file found via md5 cache (and mmap for I/O), the difference between validating md5sum and not is now negligible. We also see the same big speed gain when doing embed_ref=2 on tiny regions, so there's that as an alternative to validating external refs still:
It seems to work on name sorted data too (where it also validates, but once only per ref). |
The revised version looks OK so far, apart maybe from the advice printed when you use the wrong reference:
Trying
Would it be safe to disable the check in |
Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match. This does however add a performance hit when processing the header, which can unfairly penalise small files or creation of CRAMs for a small portion of the genome. # Old htslib 1.16 speed, enabled here by explicitly ignoring md5 $ time samtools view -T $HREF -O cram,ignore_md5 -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000 real 0m0.807s user 0m0.702s sys 0m0.101s # New performance, dominated by validating SQ M5s $ time samtools view -T $HREF -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000 real 0m20.312s user 0m18.543s sys 0m1.760s # Using REF_PATH md5 cache instead is fast $ time samtools view -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000 real 0m0.222s user 0m0.204s sys 0m0.016s This significant start-up cost is why this check wasn't previously here. We need to decide between validation of correctness and performance. It's optional now, but perhaps ignore_md5 isn't the correct option (it affects other things) and perhaps it should be opt-in instead of opt-out.
Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match. The references are checked on-the-fly as they're used. For name sorted data this may have a significant initial cost (the same that we see when converting from BAM to CRAM without M5 tags in the BAM header), but for position streaming the cost is only for chromosomes currently in use. This means writing a small CRAM file covering a region from just one chromosome will only incur the CPU cost of validating that one chromosome, as we're not trying to check the header tags are valid (we take them at their word) and are instead trying to ensure the fasta sequence we've been given matches the supplied headers. [SQUASH_ME: This almost completely removes the previous commit, but it's done as a separate commit for ease of review.]
Note this doesn't happen on the command line, so it's something specific to the way test/sam.c is written that triggers this. test/sam.c calls cram_load_reference before the SAM header is known. That in turn means refs2id hasn't been called and fd->refs array isn't in sync with the file header (as we have a subset of refs). Then cram_write_SAM_hdr md5sums the wrong reference when created SQ M5 tags. There was already code in the cram_write_SAM_hdr to recall refs2id to get things in sync, but unfortunately it was after the creation of M5 tags. What's odd is how this ever worked in test/sam. Maybe it just didn't and we never noticed? It was only spotted here due to the new validation code.
12ba892
to
f8fb853
Compare
I decided to update the error message to suggest Ignoring the md5sum and continuing could give us a very suboptimal cram, so it's best to be defensive here. |
… 1.19 Andrew Whitwham (22): Change stats to use a single structure. Enable the use of read groups in duplicate marking. Added json formatted stats. Changed error handling to use print_error. Added tests and documentation. Stop reads wrongly being removed in hard clip mode. (PR #1722) Changed the default UMI barcode regex. (#1735) Speed up optical duplicate checking. Added a patch to enable uncompressed writing. Update copyright for winter release. January 2023 NEWS update. Take variable initialisation out of for statement. (#1798) Fix bug in split output format and improve man page (PR #1821) Removed line that caused a warning in gcc-13. (#1838) Switched back to openssl for Alpine. Fix a minor memory leak. (#1875) Unmap reads where there are no aligned bases left. Summer 2023 copyright update. News Updates for Summer 2023. (PR #1878) Add the order of filtering to the man page. (#1907) Minor layout changes and added const modifier. Speed up optical duplicate marking. (PR #1952) Armin Toepfer (1): Allow sorting using minimisers w/ fwd strand only David Seifert (1): Use POSIX `grep` Devang Thakkar (2): consistent, more comprehensive flag filtering update documentation James Bonfield (58): Improve samtools sort on lustre. Revision to sort -n name comparison function strnum_cmp. Make consensus -m simple use a more usual definition of min_depth. Also add a --min-BQ option to samtools consensus. Add a bam_sanitize function to fixmate and view. Update bam sanitizer to use the new aux tag iterator. Add depth --excl-flags synonym for -G. Adds --incl-flags and --require-flags to the samtools depth command. Fix samtools depth usage typo. Work around PacBio HiFi quality induced biases. Also change the new pileup function quality scoring for gaps. Fix SEGV in samtoools collate when no filename given. Revise collate to require an input filename. Make faidx/fqidx output line length default to the input line length. (PR #1738) Minor clarification of samtools coverage documentation. Fix samtools consensus buffer overrun with MD:Z handling. Fix a buffer read-overflow on sequences with seq "*". Fixes needed to accommodate samtools/htslib#1522 Permit 1 thread with samtools view. Update samtools tview man page Add consensus --mark-ins option. Fix view -X command line parsing. Add CRAM_OPT_REQUIRED_FIELDS option for view -c. (PR #1776) Make the typical lack of --input-fmt option more explicit. First abandoned attempt at a cram_size sub-command. Improve samtools cram_size subcommand. Allow cirrus-ci to query the htslib PR from samtools PR title. Make test/regression.sh report diffs when failing Improve samtools collate consistency with sort - re: temp file handling. Improvements to samtools consensus for platforms with higher indel error rates. Major revision of the previous consensus-indel PR. General consensus code tidyup Expand fasta/fastq man page Minor fix to wording of mpileup --rf. Change errors produced in samtools view -d validation Expand the mpileup man page default filtering. Fix a sort -M bug when merging sub-blocks. Add minimiser sort option to collate by a fasta (indexed). Adds homopolymer squashing capacity to sort -M Add sort minimiser tests Add error checking to view -e filter expression code. Make calmd "work" on unaligned data or empty files. Extend import --order TAG to --order TAG:length Add a "samtools consensus -aa" mode. (#1851) Add samtools fastq -d TAG:value check Fix samtools reheader CRAM output version. Add long option support to samtools index. Be consistent with rounding of "average length" in samtools stats. Document samtools consensus --qual-calibration Add view -N/-R ^FILE for FILE negation. (#1896) Add samtools sort -N / samtools merge -N option. Bug fix samtools sort --write-index sort mode checks. Add hclen SAM filter documentation. Remove -5 option from samtools consensus. Fix a couple minor samtools view memory problems. News updates for Winter 2023 (1.19) Renamed NEWS to NEWS.md so it's formatted nicely on github. Make samtools mpileup -aa work on empty files. (#1939) John Marshall (10): Ensure searching NEWS for /empty/ finds this "zero-length file" entry Issue templates: Try to direct conda-related issues to Bioconda Apply `samtools fastq --input-fmt-option ...` settings Clarify that view -d argument does not include tag type (#1853) Documentation formatting etc fixes [minor] Rename test output file so it is ignored and cleaned Cast off_t values to type expected by format directive [minor] Add missing Makefile dependencies Check fclose(stdout) at the end of main() Use the spec-defined PL values in man page example Nils Homer (2): bug fix: template coordinate sort when merging in memory bug fix: samtools merge --template-coordinate Pierre Lindenbaum (2): add mapq histogram to samtools stats (PR #1684) new option --plot-depth for samtools coverage Rob Davies (18): Make sort merge in memory before writing temporary files Use the same thread pool for reading and writing in sort Use the thread pool when reading temporary files Limit the number of sort temporary files by merging if too many Fix stats breakage on long deletions when given a reference Make calmd run faster on non-position-sorted data Allow ampliconstats bed file to use fewer refs than the input file Happy New Year Change cram_size man page to cram-size Fix ci clone with no referenced HTSlib PR Minor NEWS adjustments and additional item Minor manual page updates Refactor some names as we now split on tags other than RG Allow `-d RG` to make new files from alignment RG tag values Add a -M option to limit the number of split files made Add tests for split by string aux tag functionality Fix various --write-index issues Add latest NEWS for 1.19 Valeriu Ohan (2): Use TMPDIR environment variable, when looking for a temporary folder. Add -d TAG option, which allows read split by custom aux tags. Vasudeva Easwara Sarma (1): treats empty barcode as no barcode - avoids issues with realloc daviesrob (1): Switch MacOS CI tests to an ARM-based image (#1770) erboone (1): Add plot of read lengths ("RL") from samtools stats output. vasudeva8 (5): update to show line nr reset functionality base update with parse_aux_list leak fix mpileup manual updated with samples updated cat to support multiple non-seekable streams wulj2 (5): add --duplicate-count option in 'samtools markdup' subcommand to record the original read duplication count(include itself) in a 'dc' tag. fix argument of duplicate-count add test data to test samtools markdup --duplicate-count to record the original primary read duplication count , to get the expected result 18_primary_duplicate_count.expected.sam from 18_primary_duplicate_count.sam, run command samtools markdup -t --duplicate-count --barcode-tag BC -S 18_primary_duplicate_count.sam update samtools-markdup --duplicate-count option document item update test data for samtools markdup --duplicate-count option, use sequence data generated by wgsim, no privace concern now; and also update the test.pl part for this test Étienne Mollier (2): fix various typos caught by lintian parsers whatis entry for samtools-fasta and samtools-fastq
See samtools/samtools#1748 for background, although this isn't fixing that issue and I'm not yet sure what that problem is.
I'm unsure on whether this is a desireable solution. See the benchmarks below; the "fix" isn't free.
Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match.
This does however add a performance hit when processing the header, which can unfairly penalise small files or creation of CRAMs for a small portion of the genome.
This significant start-up cost is why this check wasn't previously here. We need to decide between validation of correctness and performance. It's optional now, but perhaps ignore_md5 isn't the correct option (it affects other things) and perhaps it should be opt-in instead of opt-out.