Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slightly speed up various cram decoding functions #1580

Merged
merged 2 commits into from
Mar 15, 2023

Commits on Mar 9, 2023

  1. Slightly speed up various cram decoding functions

    None of this is huge, but it all adds up.
    
    - bam_set1 has been refactored so -O3 is more likely to do unrolling
      and vectorisation.
    
        // Old          time   inst        cyc
        // gcc -O2      12.36  78936832183 36853852204
        // gcc -O3      12.37  78713347525 36867027825
        // clang13 -O2  12.43  77451926728 37012866717
        // clang13 -O3  12.32  77627221907 36691623424
        // gcc12 -O2    12.43  78895089091 37081260172
        // gcc12 -O3    12.36  78505904437 36829216967
    
        // New
        // gcc -O2      12.47  78832021505 37200597109 +
        // gcc -O3      12.14  76499369401 36390334338 --
        // clang13 -O2  12.38  76678460761 36920111561 ~
        // clang13 -O3  12.26  76678023071 36548488492 ~
        // gcc12 -O2    12.38  78581694397 36880034181 -
        // gcc12 -O3    12.15  76356625541 36293921439 --
    
    - Improve the MD/NM generation in CRAM decoding.
      With decode_md=1 (default) by decode changed from 12.91s to 12.57s
      With decode_md=0 it's 11.92, so that's 1/3rd of the overhead
      removed.
    
    - Changed the block_resize to resize in slightly smaller chunks and to
      use integer maths.
    
    - Reduce excessive pointer redirection in cram_decode_seq.
    
      Unsure if this speeds things up much (sometimes it seems to), but it
      provides tidier code too.
    
    Combined before and after on 10 million NovaSeq CRAM (v3.1)
    
    epyc 7543
                       before   after
        gcc(7)  -O2    7.67     7.63   -0.5%
        gcc12   -O2    7.59     7.60   +0.1%
        clang7  -O2    8.12     7.57   -6.8%
        clang13 -O2    8.06     7.54   -6.5%
    
        gcc(7)  -O3    7.73     7.46   -3.5%
        gcc12   -O3    7.46     7.35   -1.5%
        clang7  -O3    8.08     7.57   -6.3%
        clang13 -O3    7.95     7.66   -3.6%
    
    Xeon Gold 6142
                       before   after
        gcc(7)  -O2    9.74     9.14   -6.2%
        gcc12   -O2    9.43     8.45  -10.4%
        clang7  -O2    9.61     8.64  -10.0%
        clang13 -O2    9.95     8.85  -11.1%
    
        gcc(7)  -O3    9.51     8.81   -7.4%
        gcc12   -O3    9.15     8.42   -8.0%
        clang7  -O3    9.92     8.72  -12.1%
        clang13 -O3    9.68     8.91   -8.0%
    jkbonfield committed Mar 9, 2023
    Configuration menu
    Copy the full SHA
    f6598a9 View commit details
    Browse the repository at this point in the history

Commits on Mar 14, 2023

  1. [SQUASH ME]

    Rewrite bits of the decode_md loop again, fixing bug in previous
    commit and finding a perhaps faster route too.
    
    Comparisons with Dev(/D) and this commit (/4) on Revio (re/) and
    NovaSeq (nv/) with a variety of compilers and optimisations.  Figures
    are cycle counts from perf stat
    
                         Xeon E5-2660         Xeon Gold 6142
    re/D gcc12-O2        85699982958          74752510144
    re/4 gcc12-O2        82265084038          71947558666 -3.7/3.7
    
    re/D gcc12-O3        85837077212          74392223354
    re/4 gcc12-O3        82024293685          71861154116 -4.4/3.4
    
    re/D clang12-3       85608876213          73934329619
    re/4 clang12-3       84390364926          73961392095 -1.4/0
    
    re/D clang12-2       86861787827          74255338533
    re/4 clang12-2       83186843797          72421845542 -4.2/2.5; better than O3
    
    nv/D gcc12-O2        36694089398          31444641828
    nv/4 gcc12-O2        34949122875          30061074125 -4.8/-4.4
    
    nv/D gcc12-O3        36528573980          30792932748
    nv/4 gcc12-O3        35069572111          30066058127 -4.0/2.4
    
    nv/D clang12-3       37906764004          32459168883
    nv/4 clang12-3       36344679534          30786987972 -4.1/-5.2
    
    nv/D clang12-2       38443827308          32304948037
    nv/4 clang12-2       36361384580          31022553379 -5.4/-4.0
    
    Revised benchmarks on 10 million NovaSeq records, now showing billions
    of cycles as more robust than CPU time.
    
        EPYC 7543
                           before   after
            gcc(7)  -O2    28.6     28.3    -1.0
            gcc12   -O2    28.2     28.3    +0.4
            clang7  -O2    30.2     28.2    -6.6
            clang13 -O2    29.9     28.2    -5.7
    
            gcc(7)  -O3    28.7     28.2    -1.7
            gcc12   -O3    28.0     27.2    -2.9
            clang7  -O3    30.1     28.3    -6.0
            clang13 -O3    29.7     28.3    -4.7
    
        Xeon Gold 6142
                           before   after
            gcc(7)  -O2    32.8     30.5    -7.0
            gcc12   -O2    31.8     30.1    -5.3
            clang7  -O2    33.1     29.9    -9.7
            clang13 -O2    34.1     30.8    -9.7
    
            gcc(7)  -O3    32.7     30.2    -7.6
            gcc12   -O3    31.6     29.1    -7.9
            clang7  -O3    34.3     30.0    -12.5
            clang13 -O3    33.3     30.9    -7.2
    jkbonfield committed Mar 14, 2023
    Configuration menu
    Copy the full SHA
    d16e43a View commit details
    Browse the repository at this point in the history