Staden io_lib 1.14.13
Version 1.14.13
This release has a mixture of ongoing CRAM 4 work (not compatible with previous CRAM 4) and some more general quality-of-life improvements for all CRAM versions, including speed-ups and better multi-threading.
Note that both CRAM 3.1 and 4.0 should still be considered unofficial CRAM extensions.
Updates:
-
Scramble can now filter-in or filter-out aux tags during transcoding. This is done using -d and -D options. For example:
scramble -D OQ,BI,BD in.bam out.cram
removes the GATK added OQ, BI and BD aux tags.
Requested by @jhaezebrouck in issue #24.
-
The Scramble -X options are now implemented using a CRAM_OPT_PROFILE option. This simplifies the scramble code and makes it easier to call from a library. This also fixes a number of bugs in the order of argument parsing.
-
Improved CRAM writing speeds.
The bam_copy function now copies only the bytes in use rather than all allocated bytes; the used size can sometimes be substantially smaller than the allocation. As this copy is done in the main thread, the change may have a significant benefit when multi-threading.
-
Added libdeflate support to CRAM too (in addition to the existing support in BAM). This isn't a huge change to CRAM speeds except at high levels (-8 and -9), which are now slower but give a better compression ratio. A modest 2-3% speed gain is visible at low and mid levels, and at -1/-2 to -4 the compression ratio is also improved.
-
CRAM 3.1 compression level -1 is now 25% faster, but 4% larger. This is achieved by a different choice of compression codecs, most notably disabling the name tokeniser at level 1. Use level 2 for something comparable to the old behaviour.
-
Added an io_lib/version.h to make it easier to detect the version being compiled against using IOLIB_VERSION macros.
Requested by German Tischler in issue #25.
-
Refactored the cram encoding interface used by biobambam.
Implemented by German Tischler in PR#27.
-
CRAM 4 now uses E_CONST instead of a uni-value version of E_HUFFMAN. Also added an offset field to VARINT_SIGNED and VARINT_UNSIGNED, which helps for data series that have values from -1 to MAXINT.
-
CRAM 4 container structure has changed so that all values are variable sized integers instead of fixed size.
-
Further improvements with CRAM 4's use of signed values.
- Ref_seq_id in container and slice headers is now signed.
- RI (ref ID) data series and NS (mate ref ID) are also now signed as -1 is a valid value.
- Embedded ref id is now 0 for unused instead of -1.
-
Reverted the use of CRAM 4 delta encoding for the B array. It only helps at the moment for ONT signal data, so it needs more work to make it auto-detect when delta makes sense. (Enabling it globally for CRAM4 B aux tags was accidental.)
-
Htscodecs submodule has gained support for big-endian platforms.
Other big-endian improvements to parts of CRAM4 too.
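
To illustrate why an offset field helps for data series whose values run from -1 to MAXINT, here is a minimal C sketch. It is not io_lib or htscodecs code: the function names (put_uvarint, get_uvarint, put_with_offset, get_with_offset) are hypothetical, and the byte layout is a generic little-endian base-128 varint, assumed for illustration rather than taken from the CRAM 4 format. The point is only that adding an offset of 1 before encoding maps -1 to 0, which fits in a single byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Write v as a little-endian base-128 varint; returns the number of bytes
 * written. Illustrative layout only, NOT the byte format CRAM 4 uses. */
static size_t put_uvarint(uint8_t *out, uint64_t v) {
    size_t n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80);  /* low 7 bits plus continuation bit */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;               /* final byte, continuation bit clear */
    return n;
}

/* Read a varint written by put_uvarint; returns the number of bytes consumed. */
static size_t get_uvarint(const uint8_t *in, uint64_t *v) {
    size_t n = 0;
    int shift = 0;
    uint64_t r = 0;
    for (;;) {
        uint8_t b = in[n++];
        r |= (uint64_t)(b & 0x7f) << shift;
        if (!(b & 0x80))
            break;
        shift += 7;
    }
    *v = r;
    return n;
}

/* Hypothetical helper: store (value + offset) so that the smallest expected
 * value (here -1, with offset 1) becomes 0 and needs only one byte. */
static size_t put_with_offset(uint8_t *out, int64_t value, int64_t offset) {
    return put_uvarint(out, (uint64_t)(value + offset));
}

/* Hypothetical helper: undo the offset after decoding. */
static size_t get_with_offset(const uint8_t *in, int64_t *value, int64_t offset) {
    uint64_t u;
    size_t n = get_uvarint(in, &u);
    *value = (int64_t)u - offset;
    return n;
}
```

In this sketch, put_with_offset(buf, -1, 1) writes a single zero byte, whereas casting -1 to an unsigned 64-bit value and passing it straight to put_uvarint takes ten bytes. Non-negative values keep roughly the size they would have had without the offset.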
Bug fixes:
-
Fixed CRAM MD tag generation when using the "b" feature code. (NB: unused by known CRAM encoders).
Also see samtools/htslib#1086 for more details.
-
Fixed CRAM quality string when using "q" feature code (unused by encoders?) and in lossy-quality mode (maybe utilised in old Cramtools).
Also see samtools/htslib#1094 for more details.
-
Fixed some minor memory leaks.
-
"Scramble -X archive -1" enabled lzma, which should only have arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)
-
Removed minor compilation warning in printf debugging.
-
Fixed a 7-year-old bug in scram_pileup which couldn't cope with soft-clips being followed by hard-clips.
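
The "-X archive -1" fix above involves a classic C pitfall: comparing a digit character with an integer. The following is a minimal sketch of that class of bug, not the actual scramble source. The function names use_lzma_buggy and use_lzma_fixed are hypothetical, and the sketch assumes the level arrives as the character following the '-' on the command line.

```c
#include <stdbool.h>

/* Buggy form: `arg` holds the option character, e.g. '1' for "-1".
 * In ASCII '1' is 49, so the comparison with integer 7 is true for
 * every digit and lzma would be enabled at all levels. */
static bool use_lzma_buggy(char arg) {
    return arg >= 7;
}

/* Fixed form: convert the digit character to its numeric value first,
 * then compare against the intended threshold of level 7. */
static bool use_lzma_fixed(char arg) {
    int level = arg - '0';
    return level >= 7;
}
```

With these definitions, use_lzma_buggy('1') is true while use_lzma_fixed('1') is false, and use_lzma_fixed only becomes true for '7', '8' and '9'.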