Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CRAM SQ/M5 header checking when specifying a fasta file. #1522

Merged
merged 3 commits into from
Jan 13, 2023

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Nov 7, 2022

See samtools/samtools#1748 for background, although this isn't fixing that issue and I'm not yet sure what that problem is.

I'm unsure on whether this is a desireable solution. See the benchmarks below; the "fix" isn't free.


Given the possibility of creating a CRAM which cannot be decoded again due to specifying a fasta file that differs to the M5 tags specified in the header, we're now more militant in enforcing these match.

This does however add a performance hit when processing the header, which can unfairly penalise small files or creation of CRAMs for a small portion of the genome.

# Old htslib 1.16 speed, enabled here by explicitly ignoring md5
$ time samtools view -T $HREF -O cram,ignore_md5 -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m0.807s
user    0m0.702s
sys     0m0.101s

# New performance, dominated by validating SQ M5s
$ time samtools view -T $HREF -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m20.312s
user    0m18.543s
sys     0m1.760s

# Using REF_PATH md5 cache instead is fast
$ time samtools view -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
real    0m0.222s
user    0m0.204s
sys     0m0.016s

This significant start-up cost is why this check wasn't previously here. We need to decide between validation of correctness and performance. It's optional now, but perhaps ignore_md5 isn't the correct option (it affects other things) and perhaps it should be opt-in instead of opt-out.

jkbonfield added a commit to jkbonfield/samtools that referenced this pull request Nov 7, 2022
@daviesrob daviesrob self-assigned this Nov 8, 2022
@daviesrob
Copy link
Member

That is a bit of a penalty, although it may be possible to optimise the loading code a bit.

How difficult would it be to only check the references at the point when they're used in alignment records? That would improve the case above, as you'd only need to check one chromosome. Although it would also mean the error could turn up quite a long way into writing the file, which isn't as ideal as discovering the problem right at the start...

@jkbonfield
Copy link
Contributor Author

Taking the idea suggested of doing the checks on the fly has worked well. Thanks.

On the same file above, converting BAM to CRAM:

ignore_md5=0 using -t $HREF -i reference=$HREF                                  
        10M-10M         10M-20M         10M-50M                                 
real    0m0.886s        0m2.207s        0m5.403s                                
user    0m0.791s        0m2.062s        0m5.199s                                
sys     0m0.092s        0m0.144s        0m0.201s                                
                                                                                
ignore_md5=1                                                                    
real    0m0.612s        0m1.929s        0m5.115s                                
user    0m0.510s        0m1.779s        0m4.899s                                
sys     0m0.100s        0m0.148s        0m0.213s                                
                                                                                
ignore_md5=0 using REF_CACHE (md5)                                              
real    0m0.024s        0m1.278s        0m4.190s                                
user    0m0.017s        0m1.244s        0m4.068s                                
sys     0m0.006s        0m0.032s        0m0.121s                                

Ignoring the obvious HUGE win here of using a preformatted raw text file found via md5 cache (and mmap for I/O), the difference between validating md5sum and not is now negligible.

We also see the same big speed gain when doing embed_ref=2 on tiny regions, so there's that as an alternative to validating external refs still:

ignore_md5=0 embed_ref=2                                                        
        10M-10M         10M-20M         10M-50M                                 
real    0m0.025s        0m1.692s        0m5.474s                                
user    0m0.014s        0m1.578s        0m5.330s                                
sys     0m0.010s        0m0.072s        0m0.140s                                

It seems to work on name sorted data too (where it also validates, but once only per ref).

@daviesrob
Copy link
Member

The revised version looks OK so far, apart maybe from the advice printed when you use the wrong reference:

./test/test_view -C -t /tmp/ce.fa -p /tmp/ce#1.cram test/ce#1.tmp.cram.sam_ 
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references

Trying embed_ref=1 on its own doesn't work:

./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -p /tmp/ce#1.er1.cram test/ce#1.tmp.cram.sam_ 
[E::validate_md5] SQ header M5 tag discrepancy for reference 'CHROMOSOME_I'
[E::validate_md5] Please use the correct reference, or consider using embedded references

embed_ref=2 does though, and produces a working cram file, as also does no_ref=1; or embed_ref=1 when combined with ignore_md5=1:

./test/test_view -C -t /tmp/ce.fa -o embed_ref=2 -p /tmp/ce#1.er2.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o no_ref=1 -p /tmp/ce#1.nr1.cram test/ce#1.tmp.cram.sam_
./test/test_view -C -t /tmp/ce.fa -o embed_ref=1 -o ignore_md5=1 -p /tmp/ce#1.er1.im1.cram test/ce#1.tmp.cram.sam_

Would it be safe to disable the check in embed_ref=1 mode?

c4chris pushed a commit to sib-swiss/samtools that referenced this pull request Jan 11, 2023
Given the possibility of creating a CRAM which cannot be decoded again
due to specifying a fasta file that differs to the M5 tags specified
in the header, we're now more militant in enforcing these match.

This does however add a performance hit when processing the header,
which can unfairly penalise small files or creation of CRAMs for a
small portion of the genome.

   # Old htslib 1.16 speed, enabled here by explicitly ignoring md5
   $ time samtools view -T $HREF -O cram,ignore_md5 -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
   real    0m0.807s
   user    0m0.702s
   sys     0m0.101s

   # New performance, dominated by validating SQ M5s
   $ time samtools view -T $HREF -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
   real    0m20.312s
   user    0m18.543s
   sys     0m1.760s

   # Using REF_PATH md5 cache instead is fast
   $ time samtools view -o /tmp/_.cram ~/scratch/data/9827_2#49.bam 10:10000000-11000000
   real    0m0.222s
   user    0m0.204s
   sys     0m0.016s

This significant start-up cost is why this check wasn't previously
here.  We need to decide between validation of correctness and
performance.  It's optional now, but perhaps ignore_md5 isn't the
correct option (it affects other things) and perhaps it should be
opt-in instead of opt-out.
Given the possibility of creating a CRAM which cannot be decoded again
due to specifying a fasta file that differs to the M5 tags specified
in the header, we're now more militant in enforcing these match.

The references are checked on-the-fly as they're used.  For name
sorted data this may have a significant initial cost (the same that we
see when converting from BAM to CRAM without M5 tags in the BAM
header), but for position streaming the cost is only for chromosomes
currently in use.

This means writing a small CRAM file covering a region from just one
chromosome will only incur the CPU cost of validating that one
chromosome, as we're not trying to check the header tags are valid (we
take them at their word) and are instead trying to ensure the fasta
sequence we've been given matches the supplied headers.

[SQUASH_ME: This almost completely removes the previous commit, but
it's done as a separate commit for ease of review.]
Note this doesn't happen on the command line, so it's something
specific to the way test/sam.c is written that triggers this.

test/sam.c calls cram_load_reference before the SAM header is known.
That in turn means refs2id hasn't been called and fd->refs array isn't
in sync with the file header (as we have a subset of refs).  Then
cram_write_SAM_hdr md5sums the wrong reference when created SQ M5
tags.

There was already code in the cram_write_SAM_hdr to recall refs2id to
get things in sync, but unfortunately it was after the creation of M5
tags.

What's odd is how this ever worked in test/sam.  Maybe it just didn't
and we never noticed?  It was only spotted here due to the new
validation code.
@jkbonfield
Copy link
Contributor Author

I decided to update the error message to suggest embed_ref=2. The problem with level 1 is that we're still stating that the external reference is the correct one, when we know there is a disparity between how the file was aligned and the reference on disk. The difference could be significant, including indels that cause every read to be off by 1.

Ignoring the md5sum and continuing could give us a very suboptimal cram, so it's best to be defensive here.

@daviesrob daviesrob merged commit 4ec92c1 into samtools:develop Jan 13, 2023
clrpackages pushed a commit to clearlinux-pkgs/samtools that referenced this pull request Dec 28, 2023
… 1.19

Andrew Whitwham (22):
      Change stats to use a single structure.
      Enable the use of read groups in duplicate marking.
      Added json formatted stats.
      Changed error handling to use print_error.
      Added tests and documentation.
      Stop reads wrongly being removed in hard clip mode. (PR #1722)
      Changed the default UMI barcode regex. (#1735)
      Speed up optical duplicate checking.
      Added a patch to enable uncompressed writing.
      Update copyright for winter release.
      January 2023 NEWS update.
      Take variable initialisation out of for statement. (#1798)
      Fix bug in split output format and improve man page (PR #1821)
      Removed line that caused a warning in gcc-13. (#1838)
      Switched back to openssl for Alpine.
      Fix a minor memory leak. (#1875)
      Unmap reads where there are no aligned bases left.
      Summer 2023 copyright update.
      News Updates for Summer 2023. (PR #1878)
      Add the order of filtering to the man page. (#1907)
      Minor layout changes and added const modifier.
      Speed up optical duplicate marking. (PR #1952)

Armin Toepfer (1):
      Allow sorting using minimisers w/ fwd strand only

David Seifert (1):
      Use POSIX `grep`

Devang Thakkar (2):
      consistent, more comprehensive flag filtering
      update documentation

James Bonfield (58):
      Improve samtools sort on lustre.
      Revision to sort -n name comparison function strnum_cmp.
      Make consensus -m simple use a more usual definition of min_depth.
      Also add a --min-BQ option to samtools consensus.
      Add a bam_sanitize function to fixmate and view.
      Update bam sanitizer to use the new aux tag iterator.
      Add depth --excl-flags synonym for -G.
      Adds --incl-flags and --require-flags to the samtools depth command.
      Fix samtools depth usage typo.
      Work around PacBio HiFi quality induced biases.
      Also change the new pileup function quality scoring for gaps.
      Fix SEGV in samtoools collate when no filename given.
      Revise collate to require an input filename.
      Make faidx/fqidx output line length default to the input line length. (PR #1738)
      Minor clarification of samtools coverage documentation.
      Fix samtools consensus buffer overrun with MD:Z handling.
      Fix a buffer read-overflow on sequences with seq "*".
      Fixes needed to accommodate samtools/htslib#1522
      Permit 1 thread with samtools view.
      Update samtools tview man page
      Add consensus --mark-ins option.
      Fix view -X command line parsing.
      Add CRAM_OPT_REQUIRED_FIELDS option for view -c. (PR #1776)
      Make the typical lack of --input-fmt option more explicit.
      First abandoned attempt at a cram_size sub-command.
      Improve samtools cram_size subcommand.
      Allow cirrus-ci to query the htslib PR from samtools PR title.
      Make test/regression.sh report diffs when failing
      Improve samtools collate consistency with sort - re: temp file handling.
      Improvements to samtools consensus for platforms with higher indel error rates.
      Major revision of the previous consensus-indel PR.
      General consensus code tidyup
      Expand fasta/fastq man page
      Minor fix to wording of mpileup --rf.
      Change errors produced in samtools view -d validation
      Expand the mpileup man page default filtering.
      Fix a sort -M bug when merging sub-blocks.
      Add minimiser sort option to collate by a fasta (indexed).
      Adds homopolymer squashing capacity to sort -M
      Add sort minimiser tests
      Add error checking to view -e filter expression code.
      Make calmd "work" on unaligned data or empty files.
      Extend import --order TAG to --order TAG:length
      Add a "samtools consensus -aa" mode. (#1851)
      Add samtools fastq -d TAG:value check
      Fix samtools reheader CRAM output version.
      Add long option support to samtools index.
      Be consistent with rounding of "average length" in samtools stats.
      Document samtools consensus --qual-calibration
      Add view -N/-R ^FILE for FILE negation. (#1896)
      Add samtools sort -N / samtools merge -N option.
      Bug fix samtools sort --write-index sort mode checks.
      Add hclen SAM filter documentation.
      Remove -5 option from samtools consensus.
      Fix a couple minor samtools view memory problems.
      News updates for Winter 2023 (1.19)
      Renamed NEWS to NEWS.md so it's formatted nicely on github.
      Make samtools mpileup -aa work on empty files. (#1939)

John Marshall (10):
      Ensure searching NEWS for /empty/ finds this "zero-length file" entry
      Issue templates: Try to direct conda-related issues to Bioconda
      Apply `samtools fastq --input-fmt-option ...` settings
      Clarify that view -d argument does not include tag type (#1853)
      Documentation formatting etc fixes [minor]
      Rename test output file so it is ignored and cleaned
      Cast off_t values to type expected by format directive [minor]
      Add missing Makefile dependencies
      Check fclose(stdout) at the end of main()
      Use the spec-defined PL values in man page example

Nils Homer (2):
      bug fix: template coordinate sort when merging in memory
      bug fix: samtools merge --template-coordinate

Pierre Lindenbaum (2):
      add mapq histogram to samtools stats (PR #1684)
      new option --plot-depth for samtools coverage

Rob Davies (18):
      Make sort merge in memory before writing temporary files
      Use the same thread pool for reading and writing in sort
      Use the thread pool when reading temporary files
      Limit the number of sort temporary files by merging if too many
      Fix stats breakage on long deletions when given a reference
      Make calmd run faster on non-position-sorted data
      Allow ampliconstats bed file to use fewer refs than the input file
      Happy New Year
      Change cram_size man page to cram-size
      Fix ci clone with no referenced HTSlib PR
      Minor NEWS adjustments and additional item
      Minor manual page updates
      Refactor some names as we now split on tags other than RG
      Allow `-d RG` to make new files from alignment RG tag values
      Add a -M option to limit the number of split files made
      Add tests for split by string aux tag functionality
      Fix various --write-index issues
      Add latest NEWS for 1.19

Valeriu Ohan (2):
      Use TMPDIR environment variable, when looking for a temporary folder.
      Add -d TAG option, which allows read split by custom aux tags.

Vasudeva Easwara Sarma (1):
      treats empty barcode as no barcode - avoids issues with realloc

daviesrob (1):
      Switch MacOS CI tests to an ARM-based image (#1770)

erboone (1):
      Add plot of read lengths ("RL") from samtools stats output.

vasudeva8 (5):
      update to show line nr
      reset functionality base
      update with parse_aux_list leak fix
      mpileup manual updated with samples
      updated cat to support multiple non-seekable streams

wulj2 (5):
      add --duplicate-count option in 'samtools markdup' subcommand to record the original read duplication count(include itself) in a 'dc' tag.
      fix argument of duplicate-count
      add test data to test samtools markdup --duplicate-count to record the original primary read duplication count , to get the expected result 18_primary_duplicate_count.expected.sam from 18_primary_duplicate_count.sam, run command samtools markdup -t --duplicate-count --barcode-tag BC -S 18_primary_duplicate_count.sam
      update samtools-markdup --duplicate-count option document item
      update test data for samtools markdup --duplicate-count option, use sequence data generated by wgsim, no privace concern now; and also update the test.pl part for this test

Étienne Mollier (2):
      fix various typos caught by lintian parsers
      whatis entry for samtools-fasta and samtools-fastq
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants