1.5x speedup from gcc8 to gcc9 in C++ (__muldc3 overhead in gcc8 - for complex numbers?) #117
A flamegraph is again useful. This is pmpe04 with gcc8
This is pmpe04 with gcc9
The difference between the two is a 0.6 second overhead (on top of 1.3s of hard CPU) spent in __muldc3. Note this post that links __muldc3 to complex number multiplication. Takeaways:
Ok, this is essentially understood. It needs to be revisited on the latest code. As for gcc8 vs gcc9, it looks like it is better to use gcc9 for any performance tests for the paper? (If the only downside is cuda-gdb, this might be ok... and again, it remains to be understood if/why I had issues in cuda-gdb with gcc9.) |
Suggestion by Olivier: check the compilation flags (fast math?). Also, in his C++ to Fortran comparison he had observed issues that may be related... |
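For context, here is a minimal sketch (not project code; the function and variable names are made up) of where __muldc3 comes from: a plain std::complex multiplication like the ones inside the HELAS/FFV functions. The compiler behaviour described in the comments is what the flamegraphs and the gcc discussion in this thread suggest, not something re-verified here, and the compiler invocations are only illustrative.

```c++
// Illustrative only: a plain complex multiply like the ones inside the
// HELAS/FFV functions (function and variable names are made up).
//
// Expected behaviour, based on the flamegraphs and the gcc discussion in
// this thread (not re-verified here):
//   g++-8 -O3                        : emits a call to libgcc's __muldc3
//   g++-9 -O3                        : inlines the fast path, calls __muldc3
//                                      only when the result is NaN+i*NaN
//   -ffast-math / -fcx-limited-range : plain 4-mul/2-add formula, no call
#include <complex>

std::complex<double> amp(const std::complex<double>& coupling,
                         const std::complex<double>& wavefunction)
{
    // C99 Annex G requires NaN/Inf "recovery" here; __muldc3 implements it.
    return coupling * wavefunction;
}
```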
I confirm that I still see a factor 2+ (even more than a factor 1.5!!) between gcc8 and gcc9, even on the current latest master. This clearly means that we should use gcc9 and not gcc8. This is the latest log and the code I use:
A few additional observations:
@oliviermattelaer, I think you are right that, in gcc8, fast math would also solve the issue. However, that would result in an "incorrect" handling of NaN and Inf. I think that using gcc9 is a much better option: this is explained where the patch was created (note that it was not backported to gcc8), https://gcc.gnu.org/bugzilla//show_bug.cgi?id=70291, or see also https://stackoverflow.com/questions/49438158/ By the way, I am now using cuda 11.0, which is happy with gcc9 (while cuda 10.2 requires gcc8). My new reference will therefore be
|
Note also that gcc10 is not supported by cuda 11.0 yet. I will stick with gcc9 and not try gcc10 yet. |
Interesting.
But in madgraph NaN/Inf should not appear at any stage, so we can/should use the hard flag (or other tricks) for that.
(For example, PY8 has its own complex multiplication class to avoid all that handling/slowdown of NaN/Inf; a minimal sketch of that idea is included just below.)
|
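To illustrate the PY8 trick Olivier mentions, here is a hedged sketch of the idea (this is not the actual Pythia8 class and not code from this repository): a minimal complex type whose operator* uses the plain formula, so the Annex G NaN/Inf handling, and hence __muldc3, is bypassed regardless of compiler flags.

```c++
// Hedged sketch of a "home-made" complex type with a naive product
// (illustration only; Pythia8's real class is more complete).
struct cxd
{
    double re, im;
};

inline cxd operator*(const cxd& a, const cxd& b)
{
    // Plain 4-mul/2-add formula: no check that a NaN+i*NaN result should
    // have been an infinity, i.e. no __muldc3 and no Annex G recovery.
    return { a.re * b.re - a.im * b.im,
             a.re * b.im + a.im * b.re };
}
```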
Hi Olivier, thanks. Ok I will also try the fast math and see if it speeds things up then! |
Thanks Olivier, you are right :-) Using fast math speeds up both gcc8 (from 4.0E5 MEs/s to 1.09E6 MEs/s) and gcc9 (from 8.4E5 MEs/s to 1.16E6 MEs/s). The difference between gcc8 and gcc9 decreases a lot, but gcc9 is still a bit faster. One thing that is peculiar is that FFV1P0_3 has completely disappeared from the gcc9 fast math flamegraph. Maybe it is somehow optimized away? Of course the code must pass through there. Very strange... I will try to see if there is anything we can do to improve the perf, and also to get rid of those 'unknown' frames... More generally, @oliviermattelaer: I am revisiting all past numbers, but for the moment note that these 1.1E6 throughputs in C++ (without vectorization and without OpenMP) are already quite a bit better than what I observe in Fortran, around 6E5 at most (timing only the ME part in a production madevent run). This is almost a factor 2 better in C++ than in Fortran. Is this possible? Are you using fast math in Fortran, by the way? It does not look like it: https://github.com/madgraph5/madgraph4gpu/blob/master/epoch1/gridpack/eemumu/madevent/Source/makefile |
(I tried yum install elfutils-libelf-devel libunwind-devel audit-libs-devel slang-devel to get rid of the 'unknown' frames, but it had no effect: https://unix.stackexchange.com/questions/276179/missing-stack-symbols-with-perf-events-perf-report-despite-fno-omit-frame-poi) |
Ok, I found it, and of course it was on Brendan Gregg's webpage all the time. First I tried to rebuild using -fno-omit-frame-pointer: this changes the flamegraphs, adding a few more things, but the result is still incomplete and unsatisfactory. The next tip worked: add "--call-graph dwarf" to perf. Note that this depends on libunwind, so my previous addition of libunwind-devel (which ALSO installed libunwind) was necessary, I think. I will commit tomorrow the better flamegraphs and a few modified scripts. Note that indeed FFV1P0_3 is reported as "(inlined)", so libunwind is able to see it somehow, but it is more tricky than other functions. |
Here is a flamegraph for gcc9 with the latest script using dwarf. It is much nicer. The graph for gcc8 is almost indistinguishable (both with fast math).
This is for
About build options: apart from fast math, I added nothing specific for the flamegraph (neither -fno-omit-frame-pointer nor -fno-inline nor -g). It's best to let dwarf handle it. About libunwind: I removed it and all is ok, there was no need to install it. Probably dwarf uses it internally, statically. Note dwarf is http://wiki.dwarfstd.org. I will commit the new flamegraphs. First I will also check the Fortran with fast math. |
I have also checked Fortran with fast math. It makes a big difference. See the timings here. Note that with the most aggressive compilation flags (fast math and -O3 in both C++ and Fortran), I get throughputs of 1.15E6/s in C++ and 1.50E6/s in Fortran. These are timing two different things (sigmakin in the standalone C++ application, matrix1 in the gridpack madevent Fortran application), but in principle they should be comparable. The Fortran throughput is a factor 1.3 higher (30%) than C++. This difference of 30% between Fortran and C++ with the most aggressive flags looks comparable to what @oliviermattelaer had found in earlier tests: https://indico.cern.ch/event/907278/contributions/3818707/attachments/2020732/3378680/standalone_speed.pdf Fast math essentially intervenes here because it breaks IEEE 754 compliance, for instance for NaN and Inf handling in complex number arithmetic. See these two interesting links. All this said, this should settle the question of defining a reasonable environment for comparing our C++/CUDA with the production Fortran. I will use CentOS7, gcc9, fast math, and then CUDA11. It would still be interesting to understand what causes the 30% higher throughput in Fortran (disassemble with godbolt?), but that is probably too much. Final comment: one should check that NaN and Inf are correctly propagated (and those events discarded) in madevent and the other samplers like our standalone driver. I opened issue #129 about this. |
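To make the "fast math breaks IEEE 754 / Annex G compliance for complex arithmetic" point concrete, here is a toy example (my own illustration, not taken from the madgraph code); the exact printed values depend on the compiler and flags, so the comments only describe the expected tendency.

```c++
#include <complex>
#include <cstdio>
#include <limits>

int main()
{
    const double inf = std::numeric_limits<double>::infinity();
    const std::complex<double> i(0.0, 1.0);
    const std::complex<double> z(inf, inf); // an "infinite" complex value

    // Naive formula: re = 0*inf - 1*inf = NaN, im = 0*inf + 1*inf = NaN,
    // so with -ffast-math / -fcx-limited-range the infinity may silently
    // become NaN+i*NaN. With default flags, __muldc3 should detect the
    // NaN+i*NaN result and rescue it into an infinite value, as Annex G asks.
    const std::complex<double> w = i * z;
    std::printf("(%g, %g)\n", w.real(), w.imag());
    return 0;
}
```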
For the record, I tried to use "-O3 -fcx-fortran-rules -fcx-limited-range" in both Fortran and C++. Both decrease the speed by about 20-30%, and Fortran remains faster than C++. This is a bit different from what Olivier had found. Ok, probably not much point in investigating these compiler flags further. |
A small update after this morning's findings on NaN (issue #144): using fast math is quite dangerous. Preferably we should make sure we get no NaN whatsoever, otherwise our MC integration is unreliable? Anyway, with double precision I have seen none. With single precision I had to implement an ad-hoc NaN checker, just to exclude those events (a sketch of the general idea is below). Another small update after Hadrien's useful talk on cutter this afternoon: https://indico.cern.ch/event/1003975/
|
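For reference, here is a hedged sketch of the kind of ad-hoc NaN check that keeps working under -ffast-math, where std::isnan and x != x may be optimized away; this is the generic bit-pattern test for a 64-bit IEEE double, not necessarily the checker that was actually committed.

```c++
#include <cstdint>
#include <cstring>

// True if x is a NaN: all-ones exponent and a non-zero fraction.
// Works on the raw bits, so -ffast-math cannot fold it away.
inline bool isNaNBits(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof(bits));                  // safe type-punning
    const std::uint64_t exp  = (bits >> 52) & 0x7ffULL;    // 11 exponent bits
    const std::uint64_t frac = bits & 0xfffffffffffffULL;  // 52 fraction bits
    return exp == 0x7ffULL && frac != 0;
}
```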
A few more compiler flag suggestions from Stephan (thanks!) on vectorisation/performance:
And again the page about fp math the speaker shared yesterday, so it's all in one place: |
…h (issue madgraph5#117) Strangely enough, this does NOT speed up the code in any way...?
Performance is the same even if fast math was added (issue madgraph5#117) - was it really added? ME values have changed, I imagine because I moved the neppR alignment?
I am closing this issue because it is very old. There is an open "standing" issue #252 about compiler flags like -O3, fast math etc. I think that's a better place to reassess these various options, with our latest code (also on vectorized ggttgg and not only eemumu) and our latest compilers. Note that I have moved from gcc9.2 to gcc10.3 (with cuda 11.4 in both cases) as the new baseline. See #269. Closing this. |
… checks are located (madgraph5#117) This is the cleanest way to get rid of "tautological" warnings for isnan in fast math mode on clang. It supersedes the pragmas to disable fast math, but keep these anyway for extra security.
This is a followup to #116 (comment)
On a very old version of the code, while trying to understand another issue (BEFORE I understood that it is actually a hardware problem on a VM: the tsc clock is not used, hence a large overhead from system calls), I tried to evaluate whether gcc9 could fix the issue. On the buggy hardware node this had no effect, but on a good node gcc9 was a factor 1.5 faster than gcc8 (5.5E5 throughput instead of 3.5E5).
The difference is in the gcc setup used:
. /cvmfs/sft.cern.ch/lcg/releases/gcc/8.3.0/x86_64-centos7/setup.sh
. /cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0/x86_64-centos7/setup.sh
Again, this was on an old version of the code. But one should understand if this speedup also exists for more recent versions - or in any case it would be nice to understand what is going on.
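As a possible way to reproduce the effect outside the full application, here is a hedged, self-contained micro-benchmark sketch (my own code, not from the repository) that spends its time in complex multiplications. Building it with each of the two gcc setups above, with and without fast math (e.g. plain -O3 versus -O3 -ffast-math), should show whether the gap is really dominated by __muldc3.

```c++
#include <chrono>
#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = 1 << 20;
    std::vector<std::complex<double>> a(n), b(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        a[i] = std::complex<double>(1.0 + 1e-9 * i, 0.5);
        b[i] = std::complex<double>(0.25, 2.0 - 1e-9 * i);
    }
    std::complex<double> sum(0.0, 0.0);
    const auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < 100; ++rep)
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] * b[i]; // the multiply that goes through __muldc3 on gcc8 -O3
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("sum = (%g, %g), time = %g s\n", sum.real(), sum.imag(),
                std::chrono::duration<double>(t1 - t0).count());
    return 0;
}
```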