Slightly speed up various cram decoding functions #1580
Conversation
Extra data for other data sets (including duplicating the NovaSeq data from above). I stuck with clang 13 -O2 and one CPU rather than testing everything, as that combination seemed both realistic and showed a considerable benefit. Pleasing to see it applies well to other data too.
cram/cram_decode.c (outdated diff):

    if (refp[i] != 'N')
        continue;

    if (add_md_char(s, decode_md, refp[i],
Doesn't this go wrong if there are two 'N's in range? The old code would increment md_dist between them, but in this version the update is outside the inner loop, so the second would always get md_dist = 0.
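To make the concern concrete, here is a much-simplified sketch of the two loop shapes being discussed (hypothetical names, not the real cram_decode.c code; emit_md stands in for add_md_char, and md_dist is the match run-length since the last MD event):

    /* Stand-in for add_md_char(): emits md_dist matches plus the ref base. */
    static void emit_md(int md_dist, char ref_base) { (void)md_dist; (void)ref_base; }

    /* Old shape: md_dist is advanced on every matching base, so a second
     * 'N' sees the number of matches accumulated since the first one. */
    static void old_shape(const char *refp, int len, int *md_dist, int *nm)
    {
        for (int i = 0; i < len; i++) {
            if (refp[i] == 'N') {
                emit_md(*md_dist, refp[i]);
                *md_dist = 0;
                (*nm)++;
            } else {
                (*md_dist)++;
            }
        }
    }

    /* New shape as reviewed: only 'N's are visited and md_dist is never
     * advanced between them, so a second 'N' in range gets md_dist == 0. */
    static void new_shape(const char *refp, int len, int *md_dist, int *nm)
    {
        for (int i = 0; i < len; i++) {
            if (refp[i] != 'N')
                continue;
            emit_md(*md_dist, refp[i]);
            *md_dist = 0;
            (*nm)++;
        }
    }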
The old code incremented either nm or md_dist, so the number of md_dist increments equals the stretch length minus the number of nm increments. Thus I optimised it out and only count one of them; for example, over a 10-base stretch containing two 'N's the old loop does eight md_dist increments and two nm increments, so md_dist can be advanced by 10 - 2 in one go.
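A tiny self-contained check of that identity, using a hypothetical 10-base stretch:

    #include <assert.h>
    #include <string.h>

    int main(void) {
        const char ref[] = "ACGTNNACGT";   /* hypothetical stretch, two 'N's */
        int len = (int)strlen(ref), nm = 0, md_steps = 0;

        for (int i = 0; i < len; i++) {
            if (ref[i] == 'N') nm++;       /* old code: MD event */
            else               md_steps++; /* old code: md_dist++ */
        }
        assert(md_steps == len - nm);      /* 8 == 10 - 2 */
        return 0;
    }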
It does however look like the seq[seq_pos-1+i] = base; line got lost, hmm.
Edit: never mind - it's just hard to see here due to the way it's presented. It's been folded into the memcpy below.
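For reference, the per-base copy being discussed can be folded into a single bulk copy along these lines (a simplified sketch with hypothetical buffer names, assuming the stored base is simply the reference base and glossing over the surrounding logic):

    #include <string.h>

    /* Per-base copy of the matching stretch into the sequence buffer ... */
    static void copy_per_base(char *seq, int seq_pos, const char *refp, int run_len)
    {
        for (int i = 0; i < run_len; i++)
            seq[seq_pos - 1 + i] = refp[i];
    }

    /* ... is equivalent to one memcpy, which is what the diff folds it into. */
    static void copy_bulk(char *seq, int seq_pos, const char *refp, int run_len)
    {
        memcpy(&seq[seq_pos - 1], refp, (size_t)run_len);
    }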
The problem with md_dist is that add_md_char() needs it, and before it was being incremented inside the for (i = 0; i < pos - seq_pos; i++) {} loop, but now it isn't. So I don't see how add_md_char() could be getting the same value passed in as it used to?
Oh sorry! Yes I see...
It probably needs this added before the add_md_char bit:

    md_dist += pos - seq_pos - (nm - nm_start);
    nm_start = nm;
I also wonder if the loop could be further optimised by using memchr to find the next 'N', so we efficiently keep skipping forward. Sometimes the str or mem functions are glacial, but memchr seemed to work well above.
Thinking more, we can start the loop off at the correct location too (the first N), as we used memchr to begin with to detect whether it's worth doing at all. So the entire thing now looks like a while (s = memchr(...)) loop.
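A minimal sketch of that memchr-driven loop, again with hypothetical names (emit_md stands in for add_md_char; md_dist and nm as in the discussion) rather than the actual cram_decode.c code:

    #include <string.h>

    /* Stand-in for add_md_char(): emits md_dist matches plus the ref base. */
    static void emit_md(int md_dist, char ref_base) { (void)md_dist; (void)ref_base; }

    /* Skip from 'N' to 'N' with memchr instead of testing every base.
     * md_dist carries the number of matching bases since the last MD event. */
    static void scan_ns(const char *refp, int len, int *md_dist, int *nm)
    {
        const char *p = refp, *end = refp + len;
        const char *n;

        while ((n = memchr(p, 'N', end - p)) != NULL) {
            *md_dist += n - p;   /* matches between the previous event and this N */
            emit_md(*md_dist, *n);
            *md_dist = 0;
            (*nm)++;
            p = n + 1;           /* resume the search just past this N */
        }
        *md_dist += end - p;     /* matches trailing the last N carry forward */
    }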
None of this is huge, but it all adds up.

- bam_set1 has been refactored so -O3 is more likely to do unrolling and vectorisation.

      //  Old          time   inst         cyc
      //  gcc -O2      12.36  78936832183  36853852204
      //  gcc -O3      12.37  78713347525  36867027825
      //  clang13 -O2  12.43  77451926728  37012866717
      //  clang13 -O3  12.32  77627221907  36691623424
      //  gcc12 -O2    12.43  78895089091  37081260172
      //  gcc12 -O3    12.36  78505904437  36829216967
      //  New
      //  gcc -O2      12.47  78832021505  37200597109  +
      //  gcc -O3      12.14  76499369401  36390334338  --
      //  clang13 -O2  12.38  76678460761  36920111561  ~
      //  clang13 -O3  12.26  76678023071  36548488492  ~
      //  gcc12 -O2    12.38  78581694397  36880034181  -
      //  gcc12 -O3    12.15  76356625541  36293921439  --

- Improve the MD/NM generation in CRAM decoding. With decode_md=1 (the default) the decode time changed from 12.91s to 12.57s. With decode_md=0 it's 11.92s, so that's 1/3rd of the overhead removed.

- Changed block_resize to resize in slightly smaller chunks and to use integer maths.

- Reduce excessive pointer redirection in cram_decode_seq. Unsure if this speeds things up much (sometimes it seems to), but it provides tidier code too.

Combined before and after on 10 million NovaSeq CRAM (v3.1)

EPYC 7543
                  before  after
    gcc(7)  -O2   7.67    7.63   -0.5%
    gcc12   -O2   7.59    7.60   +0.1%
    clang7  -O2   8.12    7.57   -6.8%
    clang13 -O2   8.06    7.54   -6.5%
    gcc(7)  -O3   7.73    7.46   -3.5%
    gcc12   -O3   7.46    7.35   -1.5%
    clang7  -O3   8.08    7.57   -6.3%
    clang13 -O3   7.95    7.66   -3.6%

Xeon Gold 6142
                  before  after
    gcc(7)  -O2   9.74    9.14   -6.2%
    gcc12   -O2   9.43    8.45   -10.4%
    clang7  -O2   9.61    8.64   -10.0%
    clang13 -O2   9.95    8.85   -11.1%
    gcc(7)  -O3   9.51    8.81   -7.4%
    gcc12   -O3   9.15    8.42   -8.0%
    clang7  -O3   9.92    8.72   -12.1%
    clang13 -O3   9.68    8.91   -8.0%
Force-pushed from b9514e2 to d7a7052.
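The block_resize change above is about the growth policy rather than the data path. Purely as an illustration of the idea (hypothetical names and growth numbers, not the htslib implementation), growing by an integer-maths factor instead of a floating-point multiplier looks something like this:

    #include <stdlib.h>

    /* Hypothetical growable byte buffer, only to illustrate the idea. */
    typedef struct {
        unsigned char *data;
        size_t size, alloc;
    } byte_block;

    /* Ensure room for at least 'extra' more bytes, growing by roughly
     * 1.25x using integer arithmetic (alloc + alloc/4) rather than
     * floating-point scaling, and in smaller chunks than doubling. */
    static int block_reserve(byte_block *b, size_t extra)
    {
        size_t need = b->size + extra;
        if (need <= b->alloc)
            return 0;

        size_t new_alloc = b->alloc ? b->alloc : 256;
        while (new_alloc < need)
            new_alloc += new_alloc / 4 + 16;

        unsigned char *tmp = realloc(b->data, new_alloc);
        if (!tmp)
            return -1;
        b->data = tmp;
        b->alloc = new_alloc;
        return 0;
    }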
Working on fixing it! Turns out my trivial 2-line SAM file for testing wasn't exactly enough. :/
Rewrite bits of the decode_md loop again, fixing a bug in the previous commit and finding a perhaps faster route too.

Comparisons with Dev (/D) and this commit (/4) on Revio (re/) and NovaSeq (nv/) with a variety of compilers and optimisations. Figures are cycle counts from perf stat.

                     Xeon E5-2660  Xeon Gold 6142
    re/D gcc12-O2    85699982958   74752510144
    re/4 gcc12-O2    82265084038   71947558666   -3.7/3.7
    re/D gcc12-O3    85837077212   74392223354
    re/4 gcc12-O3    82024293685   71861154116   -4.4/3.4
    re/D clang12-3   85608876213   73934329619
    re/4 clang12-3   84390364926   73961392095   -1.4/0
    re/D clang12-2   86861787827   74255338533
    re/4 clang12-2   83186843797   72421845542   -4.2/2.5; better than O3

    nv/D gcc12-O2    36694089398   31444641828
    nv/4 gcc12-O2    34949122875   30061074125   -4.8/-4.4
    nv/D gcc12-O3    36528573980   30792932748
    nv/4 gcc12-O3    35069572111   30066058127   -4.0/2.4
    nv/D clang12-3   37906764004   32459168883
    nv/4 clang12-3   36344679534   30786987972   -4.1/-5.2
    nv/D clang12-2   38443827308   32304948037
    nv/4 clang12-2   36361384580   31022553379   -5.4/-4.0

Revised benchmarks on 10 million NovaSeq records, now showing billions of cycles as this is more robust than CPU time.

EPYC 7543
                  before  after
    gcc(7)  -O2   28.6    28.3   -1.0
    gcc12   -O2   28.2    28.3   +0.4
    clang7  -O2   30.2    28.2   -6.6
    clang13 -O2   29.9    28.2   -5.7
    gcc(7)  -O3   28.7    28.2   -1.7
    gcc12   -O3   28.0    27.2   -2.9
    clang7  -O3   30.1    28.3   -6.0
    clang13 -O3   29.7    28.3   -4.7

Xeon Gold 6142
                  before  after
    gcc(7)  -O2   32.8    30.5   -7.0
    gcc12   -O2   31.8    30.1   -5.3
    clang7  -O2   33.1    29.9   -9.7
    clang13 -O2   34.1    30.8   -9.7
    gcc(7)  -O3   32.7    30.2   -7.6
    gcc12   -O3   31.6    29.1   -7.9
    clang7  -O3   34.3    30.0   -12.5
    clang13 -O3   33.3    30.9   -7.2
Force-pushed from d7a7052 to d16e43a.
None of this is huge, but it all adds up.
- Improve the MD/NM generation in CRAM decoding. With decode_md=1 (the default) the decode time changed from 12.91s to 12.57s; with decode_md=0 it's 11.92s, so that's 1/3rd of the overhead removed.
- Changed block_resize to resize in slightly smaller chunks and to use integer maths.
- Reduce excessive pointer redirection in cram_decode_seq. Unsure if this speeds things up much (sometimes it seems to), but it provides tidier code too; see the sketch after this list.
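The pointer-redirection point is about repeatedly chasing the same chain of struct members inside a hot loop. A generic sketch of the idea with hypothetical struct names, not the real cram_decode_seq code:

    #include <stddef.h>

    /* Hypothetical nested structures standing in for the decoder state. */
    typedef struct { unsigned char *data; size_t pos; } buf_t;
    typedef struct { buf_t *seq_blk; } slice_t;
    typedef struct { slice_t *slice; } decoder_t;

    /* Before: every access re-walks decoder -> slice -> seq_blk. */
    static void fill_before(decoder_t *d, int len)
    {
        for (int i = 0; i < len; i++)
            d->slice->seq_blk->data[d->slice->seq_blk->pos++] = 'A';
    }

    /* After: hoist the chain into locals once, so the compiler no longer
     * has to re-prove the intermediate pointers on every iteration. */
    static void fill_after(decoder_t *d, int len)
    {
        buf_t *b = d->slice->seq_blk;
        unsigned char *data = b->data;
        size_t pos = b->pos;

        for (int i = 0; i < len; i++)
            data[pos++] = 'A';

        b->pos = pos;   /* write the cursor back once at the end */
    }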
Combined before and after on 10 million NovaSeq CRAM (v3.1), on EPYC 7543 and Xeon Gold 6142; the per-compiler tables are in the commit messages above.
The biggest change is with clang, and the gains are larger on Intel than on AMD.