Enable auto-vectorisation in CRAM 3.1 codecs. #1669

jkbonfield · 2023-09-06T09:37:59Z

I suspect this was initially hard-coded as on, but was later changed to something we explicitly enable, and we forgot to add that code to htslib.

On Illumina it made little difference, but wasn't detrimental. I need bigger data sets, but they're mostly on unavailable systems right now. Small tests demonstrate utility though, specifically on decode speeds.

For other platforms:

Ultima Genomics

Orig

real 0m25.784s
user 0m24.506s
sys 0m1.189s

real 0m9.155s
user 0m7.775s
sys 0m1.379s

RANS_ORDER_SIMD_AUTO

real 0m24.987s
user 0m23.699s
sys 0m1.219s

real 0m8.097s
user 0m6.635s
sys 0m1.461s

That's 13% quicker decode and 3% quicker encode.

The data using the 32-way codec is mostly QS and tags:

$ ~/samtools/samtools cram-size -v _.cram|grep 32x16
BLOCK 10 617823 77895 12.61% r32x16-o1
BLOCK 12 911236491 188134803 20.65% r32x16-o1R QS
BLOCK 27 232221 38816 16.72% r32x16-o0 FC
BLOCK 31 54067 10718 19.82% r32x16-o0 BS
BLOCK 7614554 917596491 50148593 5.47% r32x16-o1 t0Z
BLOCK 7630914 931877007 108982153 11.69% r32x16-o1R tpB

ONT

Orig

real 0m3.018s
user 0m2.854s
sys 0m0.130s

real 0m0.578s
user 0m0.538s
sys 0m0.040s

RANS_ORDER_SIMD_AUTO

real 0m2.912s
user 0m2.740s
sys 0m0.120s

real 0m0.500s
user 0m0.430s
sys 0m0.070s

That's 16% quicker decode and 4% quicker encode, but sample size is admittedly tiny for both tests.
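As a rough check on the percentages quoted above, the sketch below computes the gain as the ratio of wall-clock times, `orig / new - 1`. It assumes that in each pair of timings the first block is the encode and the second is the decode; the function name `speedup_percent` is hypothetical and not part of htslib.

```c
/* Hypothetical helper: percentage by which the new run is quicker than the
 * original, expressed as a throughput gain: (orig / new - 1) * 100.
 * Inputs are the "real" (wall-clock) times in seconds. */
double speedup_percent(double orig_secs, double new_secs)
{
    return (orig_secs / new_secs - 1.0) * 100.0;
}
```

With the "real" times above, `speedup_percent(9.155, 8.097)` and `speedup_percent(25.784, 24.987)` cover the Ultima Genomics decode and encode, and `speedup_percent(0.578, 0.500)` and `speedup_percent(3.018, 2.912)` cover ONT.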

File size changes are under 0.1% growth, mainly due to 32 rANS states instead of 4. The RANS_ORDER_SIMD_AUTO flag basically enables the 32-way rANS if the block is sufficiently large (>50kb), so the extra 112-byte state overhead isn't significant.
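To illustrate the size argument, here is a minimal sketch of the arithmetic. It is not the htscodecs implementation: all names are hypothetical, the 4-byte state size is an assumption chosen because it matches the 112-byte figure (28 extra states), and ">50kb" is read as 50,000 bytes.

```c
#include <stddef.h>

/* Assumptions for illustration only; none of these names exist in htslib. */
enum { SCALAR_WAYS = 4, SIMD_WAYS = 32, STATE_BYTES = 4 };
#define SIMD_MIN_BLOCK ((size_t)50 * 1000) /* assumed meaning of ">50kb" */

/* Extra bytes stored when 32 rANS states are written instead of 4. */
size_t extra_state_bytes(void)
{
    return (size_t)(SIMD_WAYS - SCALAR_WAYS) * STATE_BYTES;
}

/* Pick the number of interleaved rANS states for a block of this size. */
int choose_ways(size_t block_len)
{
    return block_len > SIMD_MIN_BLOCK ? SIMD_WAYS : SCALAR_WAYS;
}

/* Extra state bytes as a percentage of the given block size. */
double overhead_percent(size_t block_len)
{
    return 100.0 * (double)extra_state_bytes() / (double)block_len;
}
```

Under these assumptions the fixed 112-byte cost shrinks relative to the data as blocks get larger, which is why the switch is only made above a size threshold.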


jkbonfield force-pushed the cram_simd branch from 618af7c to 8365127 on September 6, 2023 10:08

daviesrob assigned whitwham Sep 12, 2023

whitwham merged commit 30211d8 into samtools:develop Sep 15, 2023
8 checks passed
