-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
segfaults and slowness for some build options #346
Comments
Thanks, @srlantz! Few comments (from MIC meeting, August 27):
|
As for the difference in performance between icc and gcc, it was noticed that icc and gcc performance diverged, most likely due to vectorization: At the time of https://github.com/trackreco/mkFit/releases/tag/V3.0.0-0%2Bpr315, difference looked larger: As the difference after PR #340 is concentrated in transition regions, suspicion is that this is related to hit selection windows. From PR #344, a floor is used also in transition regions. I have not tested the difference after such PR within CMSSW. From standalone validation in devel (PR #345), no difference is observed when comparing hit multiplicities across different compilers (reference [black] = make -j 32 AVX2:=1; other [red] = "CXX=g++ VEC_GCC='-msse3'"): |
I started looking at segfaults, and the first one I stumbled on is at Lines 263 to 266 in cb7089b
where
which according to our docs means
I'm assuming this only affects the standalone mkFit reading memory files, so maybe it isn't a priority, but suggestions for an appropriate fix? Should we trim off -9 hits at the end of the track? |
It seems the real problem here is that we now have two types of input files: those containing various types of seeds, suitable for multiple iterations, and those containing seeds suitable just for initialStep. The older build methods, when given a file of the new, expanded type, end up doing pointless calculations on seeds that only make sense for iterations beyond initialStep. The result is "garbage in, garbage out". That being the case, perhaps there should be a quick check of the input file prior to building, to ensure that the contents of the input file are compatible with the chosen build option. I don't think the solution is to always assume that the input is of the multiple-iteration type, because (at least for CE) this case is already covered by The fact that there are two types of files is revealed by the fact that in buildtestMPlex.cc, |
I used gdb to determine that gcc's -ffast-math causes mkFit to crash in propagateHelixToZMPlex, specifically in the (inlined) call to applyMaterialEffects, where the optimizer apparently generates an out-of bounds reference when accessing The crash can be sidestepped by preceding propagateHelixToZMPlex with
Leaving aside the question of why the optimizer screws up the inlining, I ran performance tests in which all the rest of the code was compiled with -ffast-math to see if that option, with and without -funroll-all-loops, would run any faster and possibly use vectorized trig functions from libmvec. Unfortunately, the observed speedup was only a few percent. Furthermore, searching for the symbol names beginning with By contrast, the same types of searches on icc-compiled code--again applying either objdump or strings to the object files--turned up plenty of references to I did the above performance tests on lnx7188 with gcc 8, 9, and 10 in both CentOS 7 and 8 (the latter in a Singularity container). The best marginal performance gain came with gcc 10 in a CentOS 8 container obtained from cvmfs. |
In V3.1.1.+pr344, for num-thr>16, the following methods generally segfault after a non-deterministic number of events: CE, STD, BH. In other words, basically all of the earlier, non-MIMI build methods are not functioning correctly at present.
Also, if num-thr<=16, even though segfaults are mostly (not always!) avoided, the original build methods CE, STD, BH are slower by 1-2 orders of magnitude. Given
--num-thr 16
, for example,--build-mimi --num-iters-cmssw 1
has an event loop time of 4.2 sec (or 3.2 sec. with icc AVX-512), but--build-ce
runs in 147 sec. (or 342 sec. with icc AVX-512), and--build-bh
runs in 107 sec.Finally,
--build-mimi --num-iters-cmssw 1
(just initialStep), which is the closest currently working equivalent of --build-ce from (e.g.) stable-2020 does not segfault and runs at a speed more consistent with past code versions, but it takes about 70% longer than previously. This is not really making an apples-to-apples comparison because the input files include more fields and are much bigger (among other factors).What I did to build, after setting up the shell with
source xeon_scripts/init-env.sh
:make -j 32 AVX2:=1 CXX:=g++
What I did to start a typical test run (always omitting --remove-dup):
./mkFit/mkFit --input-file /data2/slava77/samples/2021/11834.0_TTbar_14TeV+2021/AVE_50_BX01_25ns/memoryFile.fv5.default.210729-d1dc487.bin --cmssw-n2seeds --num-events 640 --num-thr 16 --num-thr-ev 16 --build-XXX
Tests were mostly conducted on lnx7188, but similar behavior was observed on phi3.
Addendum 8/30/21: when compiling with icc 2021.3 and AVX2, the V3.1.1+pr344 version with
--build-mimi --num-iters-cmssw 1
is similarly 90% slower than the stable-2020 version with--build-ce
. Times on lnx7188 for a PU50 test with 6400 events, using the input file appropriate to each version, were 13.57 sec. vs. 7.13 sec. For comparison, the times for the same two mkFit versions when compiled with gcc 8.3.1 were 18.65 vs. 10.74 sec. (All tests were run twice in succession to ensure the input file was cached in RAM.)The text was updated successfully, but these errors were encountered: