1.5x speedup from gcc8 to gcc9 in C++ (__muldc3 overhead in gcc8 - for complex numbers?) #117
A flamegraph is again useful. This is pmpe04 with gcc8
This is pmpe04 with gcc9
The difference between the two is a 0.6 second overhead (on top of 1.3s of hard CPU) spent in __muldc3. Note this post that links __muldc3 to complex number multiplication. Takeaways:
Ok, this is essentially understood. It needs to be revisited on the latest code. As for gcc8 vs gcc9, it looks like it is better to use gcc9 for any performance tests for the paper? (If the only downside is cuda-gdb, this might be ok... and again, it remains to be understood if/why I had issues in cuda-gdb with gcc9.) |
Suggestion by Olivier: check the compilation flags (fast math?). Also, in his C++ to Fortran comparison he had observed issues that may be related... |
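For context, here is a minimal sketch (not project code; the function and variable names are made up) of where __muldc3 comes from: a plain std::complex multiplication like the ones inside the HELAS/FFV functions. The compiler behaviour described in the comments is what the flamegraphs and the gcc discussion in this thread suggest, not something re-verified here, and the compiler invocations are only illustrative.

```c++
// Illustrative only: a plain complex multiply like the ones inside the
// HELAS/FFV functions (function and variable names are made up).
//
// Expected behaviour, based on the flamegraphs and the gcc discussion in
// this thread (not re-verified here):
//   g++-8 -O3                        : emits a call to libgcc's __muldc3
//   g++-9 -O3                        : inlines the fast path, calls __muldc3
//                                      only when the result is NaN+i*NaN
//   -ffast-math / -fcx-limited-range : plain 4-mul/2-add formula, no call
#include <complex>

std::complex<double> amp(const std::complex<double>& coupling,
                         const std::complex<double>& wavefunction)
{
    // C99 Annex G requires NaN/Inf "recovery" here; __muldc3 implements it.
    return coupling * wavefunction;
}
```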
I confirm that I still see a factor 2+ (even more than a factor 1.5!!) between gcc8 and gcc9, even on the current latest master. This clearly means that we should use gcc9 and not gcc8. This is the latest log and the code I use:
A few additional observations:
@oliviermattelaer, I think you are right that, in gcc8, fast math would also solve the issue. However, that would result in an "incorrect" handling of NaN and Inf. I think that using gcc9 is a much better option: this is explained where the patch was created (note that it was not backported to gcc8), https://gcc.gnu.org/bugzilla//show_bug.cgi?id=70291, or see also https://stackoverflow.com/questions/49438158/ By the way, I am now using cuda 11.0, which is happy with gcc9 (while cuda 10.2 requires gcc8). My new reference will therefore be
|
Note also that gcc10 is not supported by cuda 11.0 yet. I will stick with gcc9 and not try gcc10 yet. |
Interesting.
But in madgraph NaN/Inf should not appear at any stage, so we can/should use the hard flag (or other tricks) for that.
(For example, PY8 has its own complex multiplication class to avoid all that handling/slowdown of NaN/Inf; a minimal sketch of that idea is included just below.)
|
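To illustrate the PY8 trick Olivier mentions, here is a hedged sketch of the idea (this is not the actual Pythia8 class and not code from this repository): a minimal complex type whose operator* uses the plain formula, so the Annex G NaN/Inf handling, and hence __muldc3, is bypassed regardless of compiler flags.

```c++
// Hedged sketch of a "home-made" complex type with a naive product
// (illustration only; Pythia8's real class is more complete).
struct cxd
{
    double re, im;
};

inline cxd operator*(const cxd& a, const cxd& b)
{
    // Plain 4-mul/2-add formula: no check that a NaN+i*NaN result should
    // have been an infinity, i.e. no __muldc3 and no Annex G recovery.
    return { a.re * b.re - a.im * b.im,
             a.re * b.im + a.im * b.re };
}
```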
Hi Olivier, thanks. Ok I will also try the fast math and see if it speeds things up then! |
Thanks Olivier, you are right :-) Using fast math speeds up both gcc8 (from 4.0E5 MEs/s to 1.09E6 MEs/s) and gcc9 (from 8.4E5 MEs/s to 1.16E6 MEs/s). The difference between gcc8 and gcc9 decreases a lot, but gcc9 is still a bit faster. One thing that is peculiar is that FFV1P0_3 has completely disappeared from the gcc9 fast math flamegraph. Maybe it is somehow optimized away? Of course the code must pass through there. Very strange... I will try to see if there is anything we can do to improve the perf, and also to get rid of those 'unknown' frames... More generally, @oliviermattelaer: I am revisiting all past numbers, but for the moment note that these 1.1E6 throughputs in C++ (without vectorization and without OpenMP) are already quite a bit better than what I observe in Fortran, around 6E5 at most (timing only the ME part in a production madevent run). This is almost a factor 2 better in C++ than in Fortran. Is this possible? Are you using fast math in Fortran, by the way? It does not look like it: https://github.com/madgraph5/madgraph4gpu/blob/master/epoch1/gridpack/eemumu/madevent/Source/makefile |
(I tried yum install elfutils-libelf-devel libunwind-devel audit-libs-devel slang-devel to get rid of the 'unknown' frames, but it had no effect: https://unix.stackexchange.com/questions/276179/missing-stack-symbols-with-perf-events-perf-report-despite-fno-omit-frame-poi) |
Ok, I found it, and of course it was on Brendan Gregg's webpage all the time. First I tried to rebuild using -fno-omit-frame-pointer: this changes the flamegraphs, adding a few more things, but the result is still incomplete and unsatisfactory. The next tip worked: add "--call-graph dwarf" to perf. Note that this depends on libunwind, so my previous addition of libunwind-devel (which ALSO installed libunwind) was necessary, I think. I will commit tomorrow the better flamegraphs and a few modified scripts. Note that indeed FFV1P0_3 is reported as "(inlined)", so libunwind is able to see it somehow, but it is more tricky than other functions. |
Here is a flamegraph for gcc9 with the latest script using dwarf. It is much nicer. The graph for gcc8 is almost indistinguishable (both with fast math).
This is for
About build options: apart from fast math, I added nothing specific for the flamegraph (neither -fno-omit-frame-pointer nor -fno-inline nor -g). It's best to let dwarf handle it. About libunwind: I removed it and all is ok, there was no need to install it. Probably dwarf uses it internally, statically. Note dwarf is http://wiki.dwarfstd.org. I will commit the new flamegraphs. First I will also check the Fortran with fast math. |
I have also checked Fortran with fast math. It makes a big difference. See the timings here. Note that with the most aggressive compilation flags (fast math and -O3 in both C++ and Fortran), I get throughputs of 1.15E6/s in C++ and 1.50E6/s in Fortran. These are timing two different things (sigmakin in the standalone C++ application, matrix1 in the gridpack madevent Fortran application), but in principle they should be comparable. The Fortran throughput is a factor 1.3 higher (30%) than C++. This difference of 30% between Fortran and C++ with the most aggressive flags looks comparable to what @oliviermattelaer had found in earlier tests: https://indico.cern.ch/event/907278/contributions/3818707/attachments/2020732/3378680/standalone_speed.pdf Fast math essentially intervenes here because it breaks IEEE 754 compliance, for instance for NaN and Inf handling in complex number arithmetic. See these two interesting links. All this said, this should settle the question of defining a reasonable environment for comparing our C++/CUDA with the production Fortran. I will use CentOS7, gcc9, fast math, and then CUDA11. It would still be interesting to understand what causes the 30% higher throughput in Fortran (disassemble with godbolt?), but that is probably too much. Final comment: one should check that NaN and Inf are correctly propagated (and those events discarded) in madevent and the other samplers like our standalone driver. I opened issue #129 about this. |
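To make the "fast math breaks IEEE 754 / Annex G compliance for complex arithmetic" point concrete, here is a toy example (my own illustration, not taken from the madgraph code); the exact printed values depend on the compiler and flags, so the comments only describe the expected tendency.

```c++
#include <complex>
#include <cstdio>
#include <limits>

int main()
{
    const double inf = std::numeric_limits<double>::infinity();
    const std::complex<double> i(0.0, 1.0);
    const std::complex<double> z(inf, inf); // an "infinite" complex value

    // Naive formula: re = 0*inf - 1*inf = NaN, im = 0*inf + 1*inf = NaN,
    // so with -ffast-math / -fcx-limited-range the infinity may silently
    // become NaN+i*NaN. With default flags, __muldc3 should detect the
    // NaN+i*NaN result and rescue it into an infinite value, as Annex G asks.
    const std::complex<double> w = i * z;
    std::printf("(%g, %g)\n", w.real(), w.imag());
    return 0;
}
```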
For the record, I tried to use "-O3 -fcx-fortran-rules -fcx-limited-range" in both Fortran and C++. Both decrease the speed by about 20-30%, and Fortran remains faster than C++. This is a bit different from what Olivier had found. Ok, probably not much point in investigating these compiler flags further. |
A small update after this morning's findings on NaN (issue #144): using fast math is quite dangerous. Preferably we should make sure we get no NaN whatsoever, otherwise our MC integration is unreliable? Anyway, with double precision I have seen none. With single precision I had to implement an ad-hoc NaN checker, just to exclude those events (a sketch of the general idea is below). Another small update after Hadrien's useful talk on cutter this afternoon: https://indico.cern.ch/event/1003975/
|
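For reference, here is a hedged sketch of the kind of ad-hoc NaN check that keeps working under -ffast-math, where std::isnan and x != x may be optimized away; this is the generic bit-pattern test for a 64-bit IEEE double, not necessarily the checker that was actually committed.

```c++
#include <cstdint>
#include <cstring>

// True if x is a NaN: all-ones exponent and a non-zero fraction.
// Works on the raw bits, so -ffast-math cannot fold it away.
inline bool isNaNBits(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));                  // safe type-punning
    const std::uint64_t exp  = (bits >> 52) & 0x7ffULL;    // 11 exponent bits
    const std::uint64_t frac = bits & 0xfffffffffffffULL;  // 52 fraction bits
    return exp == 0x7ffULL && frac != 0;
}
```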
A few more compiler flag suggestions from Stephan (thanks!) on vectorisation/performance:
And again the page about fp math the speaker shared yesterday, so it's all in one place: |
…h (issue madgraph5#117) Strangely enough, this does NOT speed up the code in any way...?
Performance is the same even if fast math was added (issue madgraph5#117) - was it really added? ME values have changed, I imagine because I moved the neppR alignment?
I am closing this issue because it is very old. There is an open "standing" issue #252 about compiler flags like -O3, fast math etc. I think that's a better place to reassess these various options, with our latest code (also on vectorized ggttgg and not only eemumu) and our latest compilers. Note that I have moved from gcc9.2 to gcc10.3 (with cuda 11.4 in both cases) as the new baseline. See #269. Closing this. |
… checks are located (madgraph5#117) This is the cleanest way to get rid of "tautological" warnings for isnan in fast math mode on clang. It supersedes the pragmas to disable fast math, but keep these anyway for extra security.
This is a followup to #116 (comment)
On a very old version of the code, while trying to understand another issue (BEFORE I understood that it is actually a hardware problem on a VM: the tsc clock is not used, hence a large overhead from system calls), I tried to evaluate whether gcc9 could fix the issue. On the buggy hardware node this had no effect, but on a good node gcc9 was a factor 1.5 faster than gcc8 (5.5E5 throughput instead of 3.5E5).
The difference is in the gcc setup used:
. /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0/x86_64-centos7/setup.sh
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
Again, this was on an old version of the code. But one should understand if this speedup also exists for more recent versions - or in any case it would be nice to understand what is going on.
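As a possible way to reproduce the effect outside the full application, here is a hedged, self-contained micro-benchmark sketch (my own code, not from the repository) that spends its time in complex multiplications. Building it with each of the two gcc setups above, with and without fast math (e.g. plain -O3 versus -O3 -ffast-math), should show whether the gap is really dominated by __muldc3.

```c++
#include <chrono>
#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = 1 << 20;
    std::vector<std::complex<double>> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        a[i] = std::complex<double>(1.0 + 1e-9 * i, 0.5);
        b[i] = std::complex<double>(0.25, 2.0 - 1e-9 * i);
    }
    std::complex<double> sum(0.0, 0.0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] * b[i]; // the multiply that goes through __muldc3 on gcc8 -O3
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("sum = (%g, %g), time = %g s\n", sum.real(), sum.imag(),
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```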