Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slightly speed up various cram decoding functions #1580

Merged
merged 2 commits into from
Mar 15, 2023

Conversation

jkbonfield
Copy link
Contributor

None of this is huge, but it all adds up.

  • bam_set1 has been refactored so -O3 is more likely to do unrolling and vectorisation.
    // Old          time   inst        cyc
    // gcc -O2      12.36  78936832183 36853852204
    // gcc -O3      12.37  78713347525 36867027825
    // clang13 -O2  12.43  77451926728 37012866717
    // clang13 -O3  12.32  77627221907 36691623424
    // gcc12 -O2    12.43  78895089091 37081260172
    // gcc12 -O3    12.36  78505904437 36829216967

    // New
    // gcc -O2      12.47  78832021505 37200597109 +
    // gcc -O3      12.14  76499369401 36390334338 --
    // clang13 -O2  12.38  76678460761 36920111561 ~
    // clang13 -O3  12.26  76678023071 36548488492 ~
    // gcc12 -O2    12.38  78581694397 36880034181 -
    // gcc12 -O3    12.15  76356625541 36293921439 --
  • Improve the MD/NM generation in CRAM decoding. With decode_md=1 (default) by decode changed from 12.91s to 12.57s With decode_md=0 it's 11.92, so that's 1/3rd of the overhead removed.

  • Changed the block_resize to resize in slightly smaller chunks and to use integer maths.

  • Reduce excessive pointer redirection in cram_decode_seq.

    Unsure if this speeds things up much (sometimes it seems to), but it provides tidier code too.

Combined before and after on 10 million NovaSeq CRAM (v3.1)

epyc 7543

               before   after
gcc(7)  -O2    7.67     7.63   -0.5%
gcc12   -O2    7.59     7.60   +0.1%
clang7  -O2    8.12     7.57   -6.8%
clang13 -O2    8.06     7.54   -6.5%

gcc(7)  -O3    7.73     7.46   -3.5%
gcc12   -O3    7.46     7.35   -1.5%
clang7  -O3    8.08     7.57   -6.3%
clang13 -O3    7.95     7.66   -3.6%

Xeon Gold 6142

               before   after
gcc(7)  -O2    9.74     9.14   -6.2%
gcc12   -O2    9.43     8.45  -10.4%
clang7  -O2    9.61     8.64  -10.0%
clang13 -O2    9.95     8.85  -11.1%

gcc(7)  -O3    9.51     8.81   -7.4%
gcc12   -O3    9.15     8.42   -8.0%
clang7  -O3    9.92     8.72  -12.1%
clang13 -O3    9.68     8.91   -8.0%

Biggest change is with clang, but also on Intel we see bigger changes than AMD too.

@jkbonfield
Copy link
Contributor Author

Extra data for other data sets (including duplicating Novaseq data from above). I stuck with a clang 13 -O2 and one CPU rather than testing everything, as that combination seemed both realistic and showed a considerable benefit. Pleasing to see it applies well on other data too.

Xeon Gold 6142, clang13 -O2, diff data sets
novaseq	       9.95	8.85  -11.1%
revio	      23.33    19.46  -16.6%
ultima	     195.20   177.71   -9.0%
ONT	      68.27    60.67  -11.1%

@daviesrob daviesrob self-assigned this Mar 7, 2023
if (refp[i] != 'N')
continue;

if (add_md_char(s, decode_md, refp[i],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this go wrong if there are two 'N's in range? The old code would increment md_dist between them, but in this version the update is outside the inner loop so the second would always get md_dist = 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code either incremented nm or md_dist, so the number of md_dist increments is equal to length - nm increment. Thus I optimised it out by counting up one only.

Copy link
Contributor Author

@jkbonfield jkbonfield Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does however look like the seq[seq_pos-1+i] = base; line got lost, hmm.

Edit: nm - it's just hard to see here due to the way it's presented. It's been folded into the memcpy below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with md_dist is that add_md_char() needs it, and before it was being incremented inside the for (i = 0; i < pos - seq_pos; i++) {} loop, but now it isn't. So I don't see how add_md_char() could be getting the same value passed in as it used to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry! Yes I see...

It probably needs this adding before the add_md_char bit:

md_dist += pos - seq_pos - (nm - nm_start);
nm_start = nm;

Copy link
Contributor Author

@jkbonfield jkbonfield Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder if the loop could be further optimised by using memchr to find the next 'N', so we efficiently keep skipping forward. Sometimes the str or mem fucntions are glacial, memchr seemed to work well above.

Thinking more, we can start the loop off at the correct location too (first N) as we used memchr to start with to detect if it's worth doing. So the entire thing is now looking like a while (s=memchr(...)) loop.

None of this is huge, but it all adds up.

- bam_set1 has been refactored so -O3 is more likely to do unrolling
  and vectorisation.

    // Old          time   inst        cyc
    // gcc -O2      12.36  78936832183 36853852204
    // gcc -O3      12.37  78713347525 36867027825
    // clang13 -O2  12.43  77451926728 37012866717
    // clang13 -O3  12.32  77627221907 36691623424
    // gcc12 -O2    12.43  78895089091 37081260172
    // gcc12 -O3    12.36  78505904437 36829216967

    // New
    // gcc -O2      12.47  78832021505 37200597109 +
    // gcc -O3      12.14  76499369401 36390334338 --
    // clang13 -O2  12.38  76678460761 36920111561 ~
    // clang13 -O3  12.26  76678023071 36548488492 ~
    // gcc12 -O2    12.38  78581694397 36880034181 -
    // gcc12 -O3    12.15  76356625541 36293921439 --

- Improve the MD/NM generation in CRAM decoding.
  With decode_md=1 (default) by decode changed from 12.91s to 12.57s
  With decode_md=0 it's 11.92, so that's 1/3rd of the overhead
  removed.

- Changed the block_resize to resize in slightly smaller chunks and to
  use integer maths.

- Reduce excessive pointer redirection in cram_decode_seq.

  Unsure if this speeds things up much (sometimes it seems to), but it
  provides tidier code too.

Combined before and after on 10 million NovaSeq CRAM (v3.1)

epyc 7543
                   before   after
    gcc(7)  -O2    7.67     7.63   -0.5%
    gcc12   -O2    7.59     7.60   +0.1%
    clang7  -O2    8.12     7.57   -6.8%
    clang13 -O2    8.06     7.54   -6.5%

    gcc(7)  -O3    7.73     7.46   -3.5%
    gcc12   -O3    7.46     7.35   -1.5%
    clang7  -O3    8.08     7.57   -6.3%
    clang13 -O3    7.95     7.66   -3.6%

Xeon Gold 6142
                   before   after
    gcc(7)  -O2    9.74     9.14   -6.2%
    gcc12   -O2    9.43     8.45  -10.4%
    clang7  -O2    9.61     8.64  -10.0%
    clang13 -O2    9.95     8.85  -11.1%

    gcc(7)  -O3    9.51     8.81   -7.4%
    gcc12   -O3    9.15     8.42   -8.0%
    clang7  -O3    9.92     8.72  -12.1%
    clang13 -O3    9.68     8.91   -8.0%
@jkbonfield
Copy link
Contributor Author

Working on fixing it! Turns out my trivial 2 line SAM file for testing wasn't exactly enough. :/

Rewrite bits of the decode_md loop again, fixing bug in previous
commit and finding a perhaps faster route too.

Comparisons with Dev(/D) and this commit (/4) on Revio (re/) and
NovaSeq (nv/) with a variety of compilers and optimisations.  Figures
are cycle counts from perf stat

                     Xeon E5-2660         Xeon Gold 6142
re/D gcc12-O2        85699982958          74752510144
re/4 gcc12-O2        82265084038          71947558666 -3.7/3.7

re/D gcc12-O3        85837077212          74392223354
re/4 gcc12-O3        82024293685          71861154116 -4.4/3.4

re/D clang12-3       85608876213          73934329619
re/4 clang12-3       84390364926          73961392095 -1.4/0

re/D clang12-2       86861787827          74255338533
re/4 clang12-2       83186843797          72421845542 -4.2/2.5; better than O3

nv/D gcc12-O2        36694089398          31444641828
nv/4 gcc12-O2        34949122875          30061074125 -4.8/-4.4

nv/D gcc12-O3        36528573980          30792932748
nv/4 gcc12-O3        35069572111          30066058127 -4.0/2.4

nv/D clang12-3       37906764004          32459168883
nv/4 clang12-3       36344679534          30786987972 -4.1/-5.2

nv/D clang12-2       38443827308          32304948037
nv/4 clang12-2       36361384580          31022553379 -5.4/-4.0

Revised benchmarks on 10 million NovaSeq records, now showing billions
of cycles as more robust than CPU time.

    EPYC 7543
                       before   after
        gcc(7)  -O2    28.6     28.3    -1.0
        gcc12   -O2    28.2     28.3    +0.4
        clang7  -O2    30.2     28.2    -6.6
        clang13 -O2    29.9     28.2    -5.7

        gcc(7)  -O3    28.7     28.2    -1.7
        gcc12   -O3    28.0     27.2    -2.9
        clang7  -O3    30.1     28.3    -6.0
        clang13 -O3    29.7     28.3    -4.7

    Xeon Gold 6142
                       before   after
        gcc(7)  -O2    32.8     30.5    -7.0
        gcc12   -O2    31.8     30.1    -5.3
        clang7  -O2    33.1     29.9    -9.7
        clang13 -O2    34.1     30.8    -9.7

        gcc(7)  -O3    32.7     30.2    -7.6
        gcc12   -O3    31.6     29.1    -7.9
        clang7  -O3    34.3     30.0    -12.5
        clang13 -O3    33.3     30.9    -7.2
@daviesrob daviesrob merged commit 46bcc36 into samtools:develop Mar 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants