Slightly speed up various cram decoding functions #1580

jkbonfield · 2023-03-03T16:04:27Z

None of this is huge, but it all adds up.

bam_set1 has been refactored so -O3 is more likely to do unrolling and vectorisation.

    // Old          time   inst        cyc
    // gcc -O2      12.36  78936832183 36853852204
    // gcc -O3      12.37  78713347525 36867027825
    // clang13 -O2  12.43  77451926728 37012866717
    // clang13 -O3  12.32  77627221907 36691623424
    // gcc12 -O2    12.43  78895089091 37081260172
    // gcc12 -O3    12.36  78505904437 36829216967

    // New
    // gcc -O2      12.47  78832021505 37200597109 +
    // gcc -O3      12.14  76499369401 36390334338 --
    // clang13 -O2  12.38  76678460761 36920111561 ~
    // clang13 -O3  12.26  76678023071 36548488492 ~
    // gcc12 -O2    12.38  78581694397 36880034181 -
    // gcc12 -O3    12.15  76356625541 36293921439 --

Improve the MD/NM generation in CRAM decoding. With decode_md=1 (default) by decode changed from 12.91s to 12.57s With decode_md=0 it's 11.92, so that's 1/3rd of the overhead removed.
Changed the block_resize to resize in slightly smaller chunks and to use integer maths.
Reduce excessive pointer redirection in cram_decode_seq.

Unsure if this speeds things up much (sometimes it seems to), but it provides tidier code too.

Combined before and after on 10 million NovaSeq CRAM (v3.1)

epyc 7543

               before   after
gcc(7)  -O2    7.67     7.63   -0.5%
gcc12   -O2    7.59     7.60   +0.1%
clang7  -O2    8.12     7.57   -6.8%
clang13 -O2    8.06     7.54   -6.5%

gcc(7)  -O3    7.73     7.46   -3.5%
gcc12   -O3    7.46     7.35   -1.5%
clang7  -O3    8.08     7.57   -6.3%
clang13 -O3    7.95     7.66   -3.6%

Xeon Gold 6142

               before   after
gcc(7)  -O2    9.74     9.14   -6.2%
gcc12   -O2    9.43     8.45  -10.4%
clang7  -O2    9.61     8.64  -10.0%
clang13 -O2    9.95     8.85  -11.1%

gcc(7)  -O3    9.51     8.81   -7.4%
gcc12   -O3    9.15     8.42   -8.0%
clang7  -O3    9.92     8.72  -12.1%
clang13 -O3    9.68     8.91   -8.0%

Biggest change is with clang, but also on Intel we see bigger changes than AMD too.

jkbonfield · 2023-03-06T12:05:28Z

Extra data for other data sets (including duplicating Novaseq data from above). I stuck with a clang 13 -O2 and one CPU rather than testing everything, as that combination seemed both realistic and showed a considerable benefit. Pleasing to see it applies well on other data too.

Xeon Gold 6142, clang13 -O2, diff data sets
novaseq	       9.95	8.85  -11.1%
revio	      23.33    19.46  -16.6%
ultima	     195.20   177.71   -9.0%
ONT	      68.27    60.67  -11.1%

daviesrob · 2023-03-09T09:43:05Z

cram/cram_decode.c

+                                if (refp[i] != 'N')
+                                    continue;
+
+                                if (add_md_char(s, decode_md, refp[i],


Doesn't this go wrong if there are two 'N's in range? The old code would increment md_dist between them, but in this version the update is outside the inner loop so the second would always get md_dist = 0.

The old code either incremented nm or md_dist, so the number of md_dist increments is equal to length - nm increment. Thus I optimised it out by counting up one only.

It does however look like the seq[seq_pos-1+i] = base; line got lost, hmm.

Edit: nm - it's just hard to see here due to the way it's presented. It's been folded into the memcpy below.

The problem with md_dist is that add_md_char() needs it, and before it was being incremented inside the for (i = 0; i < pos - seq_pos; i++) {} loop, but now it isn't. So I don't see how add_md_char() could be getting the same value passed in as it used to?

Oh sorry! Yes I see...

It probably needs this adding before the add_md_char bit:

md_dist += pos - seq_pos - (nm - nm_start); nm_start = nm;

I also wonder if the loop could be further optimised by using memchr to find the next 'N', so we efficiently keep skipping forward. Sometimes the str or mem fucntions are glacial, memchr seemed to work well above.

Thinking more, we can start the loop off at the correct location too (first N) as we used memchr to start with to detect if it's worth doing. So the entire thing is now looking like a while (s=memchr(...)) loop.

None of this is huge, but it all adds up. - bam_set1 has been refactored so -O3 is more likely to do unrolling and vectorisation. // Old time inst cyc // gcc -O2 12.36 78936832183 36853852204 // gcc -O3 12.37 78713347525 36867027825 // clang13 -O2 12.43 77451926728 37012866717 // clang13 -O3 12.32 77627221907 36691623424 // gcc12 -O2 12.43 78895089091 37081260172 // gcc12 -O3 12.36 78505904437 36829216967 // New // gcc -O2 12.47 78832021505 37200597109 + // gcc -O3 12.14 76499369401 36390334338 -- // clang13 -O2 12.38 76678460761 36920111561 ~ // clang13 -O3 12.26 76678023071 36548488492 ~ // gcc12 -O2 12.38 78581694397 36880034181 - // gcc12 -O3 12.15 76356625541 36293921439 -- - Improve the MD/NM generation in CRAM decoding. With decode_md=1 (default) by decode changed from 12.91s to 12.57s With decode_md=0 it's 11.92, so that's 1/3rd of the overhead removed. - Changed the block_resize to resize in slightly smaller chunks and to use integer maths. - Reduce excessive pointer redirection in cram_decode_seq. Unsure if this speeds things up much (sometimes it seems to), but it provides tidier code too. Combined before and after on 10 million NovaSeq CRAM (v3.1) epyc 7543 before after gcc(7) -O2 7.67 7.63 -0.5% gcc12 -O2 7.59 7.60 +0.1% clang7 -O2 8.12 7.57 -6.8% clang13 -O2 8.06 7.54 -6.5% gcc(7) -O3 7.73 7.46 -3.5% gcc12 -O3 7.46 7.35 -1.5% clang7 -O3 8.08 7.57 -6.3% clang13 -O3 7.95 7.66 -3.6% Xeon Gold 6142 before after gcc(7) -O2 9.74 9.14 -6.2% gcc12 -O2 9.43 8.45 -10.4% clang7 -O2 9.61 8.64 -10.0% clang13 -O2 9.95 8.85 -11.1% gcc(7) -O3 9.51 8.81 -7.4% gcc12 -O3 9.15 8.42 -8.0% clang7 -O3 9.92 8.72 -12.1% clang13 -O3 9.68 8.91 -8.0%

jkbonfield · 2023-03-13T15:05:30Z

Working on fixing it! Turns out my trivial 2 line SAM file for testing wasn't exactly enough. :/

Rewrite bits of the decode_md loop again, fixing bug in previous commit and finding a perhaps faster route too. Comparisons with Dev(/D) and this commit (/4) on Revio (re/) and NovaSeq (nv/) with a variety of compilers and optimisations. Figures are cycle counts from perf stat Xeon E5-2660 Xeon Gold 6142 re/D gcc12-O2 85699982958 74752510144 re/4 gcc12-O2 82265084038 71947558666 -3.7/3.7 re/D gcc12-O3 85837077212 74392223354 re/4 gcc12-O3 82024293685 71861154116 -4.4/3.4 re/D clang12-3 85608876213 73934329619 re/4 clang12-3 84390364926 73961392095 -1.4/0 re/D clang12-2 86861787827 74255338533 re/4 clang12-2 83186843797 72421845542 -4.2/2.5; better than O3 nv/D gcc12-O2 36694089398 31444641828 nv/4 gcc12-O2 34949122875 30061074125 -4.8/-4.4 nv/D gcc12-O3 36528573980 30792932748 nv/4 gcc12-O3 35069572111 30066058127 -4.0/2.4 nv/D clang12-3 37906764004 32459168883 nv/4 clang12-3 36344679534 30786987972 -4.1/-5.2 nv/D clang12-2 38443827308 32304948037 nv/4 clang12-2 36361384580 31022553379 -5.4/-4.0 Revised benchmarks on 10 million NovaSeq records, now showing billions of cycles as more robust than CPU time. EPYC 7543 before after gcc(7) -O2 28.6 28.3 -1.0 gcc12 -O2 28.2 28.3 +0.4 clang7 -O2 30.2 28.2 -6.6 clang13 -O2 29.9 28.2 -5.7 gcc(7) -O3 28.7 28.2 -1.7 gcc12 -O3 28.0 27.2 -2.9 clang7 -O3 30.1 28.3 -6.0 clang13 -O3 29.7 28.3 -4.7 Xeon Gold 6142 before after gcc(7) -O2 32.8 30.5 -7.0 gcc12 -O2 31.8 30.1 -5.3 clang7 -O2 33.1 29.9 -9.7 clang13 -O2 34.1 30.8 -9.7 gcc(7) -O3 32.7 30.2 -7.6 gcc12 -O3 31.6 29.1 -7.9 clang7 -O3 34.3 30.0 -12.5 clang13 -O3 33.3 30.9 -7.2

daviesrob self-assigned this Mar 7, 2023

daviesrob reviewed Mar 9, 2023

View reviewed changes

jkbonfield force-pushed the cram_yak_shaving branch from b9514e2 to d7a7052 Compare March 13, 2023 13:46

jkbonfield force-pushed the cram_yak_shaving branch from d7a7052 to d16e43a Compare March 14, 2023 15:55

daviesrob merged commit 46bcc36 into samtools:develop Mar 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slightly speed up various cram decoding functions #1580

Slightly speed up various cram decoding functions #1580

jkbonfield commented Mar 3, 2023

jkbonfield commented Mar 6, 2023

daviesrob Mar 9, 2023

jkbonfield Mar 9, 2023

jkbonfield Mar 9, 2023 •

edited

Loading

daviesrob Mar 9, 2023

jkbonfield Mar 9, 2023

jkbonfield Mar 9, 2023 •

edited

Loading

jkbonfield commented Mar 13, 2023

Slightly speed up various cram decoding functions #1580

Slightly speed up various cram decoding functions #1580

Conversation

jkbonfield commented Mar 3, 2023

jkbonfield commented Mar 6, 2023

daviesrob Mar 9, 2023

Choose a reason for hiding this comment

jkbonfield Mar 9, 2023

Choose a reason for hiding this comment

jkbonfield Mar 9, 2023 • edited Loading

Choose a reason for hiding this comment

daviesrob Mar 9, 2023

Choose a reason for hiding this comment

jkbonfield Mar 9, 2023

Choose a reason for hiding this comment

jkbonfield Mar 9, 2023 • edited Loading

Choose a reason for hiding this comment

jkbonfield commented Mar 13, 2023

jkbonfield Mar 9, 2023 •

edited

Loading

jkbonfield Mar 9, 2023 •

edited

Loading