Segfault in testxxx runTest.exe for debug builds (need separate cpu/gpu namespaces) #725

Closed
valassi opened this issue Jul 18, 2023 · 5 comments · Fixed by #723
valassi commented Jul 18, 2023

Segfault in testxxx runTest.exe for debug builds on itscrd90

This is related to #701. I wanted to test clang #724 on my MR #723 for this issue, but on Alma8 itscrd80 I have not set up clang, so I went to Alma9 itscrd90. Before trying clang I tried the default gcc11.3. In normal builds the tests succeed, but in debug builds they fail. This was on my fpe branch for MR #723. As a cross check, I went back to upstream/master... and the segfault also happens there!

Rephrased: this is a new segfault which I found while investigating #701, but which probably has nothing to do with it. It affects upstream/master too.

[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> git log --oneline -n1
d1d87b649 (HEAD -> fpe, upstream/master) Merge pull request #722 from valassi/f2py
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> make cleanall; make -j debug
...
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> ./runTest.exe 
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
[==========] Running 6 tests from 6 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX (0 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_CPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_CPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_CPU_MISC.testmisc (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_MISC (0 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
Segmentation fault (core dumped)
...
[avalassi@itscrd90 gcc11/usr] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> gdb ./runTest.exe 
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-10.el9
Copyright (C) 2021 Free Software Foundation, Inc.
...
(gdb) run
Starting program: /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/runTest.exe 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
[New Thread 0x7fffef9de000 (LWP 2673216)]
[New Thread 0x7fffeefa7000 (LWP 2673217)]
[==========] Running 6 tests from 6 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_CPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_XXX (0 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_CPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_CPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_CPU_MISC.testmisc (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_CPU_MISC (0 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx

Thread 1 "runTest.exe" received signal SIGSEGV, Segmentation fault.
0x00000000004113d8 in mgOnGpu::cxtype_v::cxtype_v (this=0x8000000000000000) at ../../src/mgOnGpuVectors.h:75
75          cxtype_v() : m_real{ 0 }, m_imag{ 0 } {} // RRRR=0000 IIII=0000
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64 nvidia-driver-cuda-libs-530.30.02-1.el9.x86_64
(gdb) where
#0  0x00000000004113d8 in mgOnGpu::cxtype_v::cxtype_v (this=0x8000000000000000) at ../../src/mgOnGpuVectors.h:75
#1  0x0000000000412944 in cxzero_sv () at ../../src/mgOnGpuVectors.h:818
#2  0x000000000046ca13 in mg5amcGpu::ixxxxx<KernelAccessMomenta<false>, KernelAccessWavefunctions<false> > (
    momenta=0x7fffcdc0f200, fmass=0, nhel=1, nsf=-1, wavefunctions=0x7fffffff61b0, ipar=0) at ../../src/HelAmps_sm.h:286
#3  0x000000000046af0c in SIGMA_SM_GG_TTX_GPU_XXX_testxxx_Test::TestBody (this=0x1465d30)
    at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/testxxx.cc:247
#4  0x00000000004ae42f in void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#5  0x00000000004a7aeb in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#6  0x00000000004810d6 in testing::Test::Run() ()
#7  0x00000000004819cd in testing::TestInfo::Run() ()
#8  0x000000000048219c in testing::TestSuite::Run() ()
#9  0x0000000000490c58 in testing::internal::UnitTestImpl::RunAllTests() ()
#10 0x00000000004af1bb in bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#11 0x00000000004a8a67 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) ()
#12 0x000000000048f68c in testing::UnitTest::Run() ()
#13 0x00000000004bdf07 in RUN_ALL_TESTS() ()
#14 0x00000000004bdea0 in main ()
@valassi valassi changed the title Segfault in testxxx runTest.exe for debug builds on itscrd90 Segfault in testxxx runTest.exe for debug builds Jul 18, 2023

valassi commented Jul 18, 2023

I have checked that this happens also on itscrd80. And it has always been there! It was already present in this first commit for gg_tt.sa:

commit ac1cffe3efe24c34eb19308e9b910a818874c22e (HEAD -> fpe)
Author: Andrea Valassi <andrea.valassi@cern.ch>
Date:   Mon Jun 13 18:31:00 2022 +0200

    [gh] RENAME FIVE <proc>.AUTO AS <proc>.SA DIRECTORIES IN THE REPO #478


valassi commented Jul 18, 2023

Using my latest fpe branch now for simplicity.

This is really strange: I get different results depending on which AVX mode I choose. The tests that segfault are the GPU tests, so (to first approximation) they should not even be affected by AVX? With AVX=none this test succeeds, while with all other AVX modes it segfaults:

./runTest.exe --gtest_filter=*GPU*xxx

Note also that, in debug mode with AVX=none, the Compare test on the GPU fails instead...

[ RUN      ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
MadgraphTest.h:298: Failure
Value of: momentumErrors.str().empty()
  Actual: false
Expected: true

particle 0      component 1
         madGraph:   7.500000000000000e+02
         reference:  0.000000000000000e+00
         rel delta:                    inf exceeds tolerance of 1.000000000000000e-10
particle 0      component 2
         madGraph:   7.500000000000000e+02
         reference:  0.000000000000000e+00
         rel delta:                    inf exceeds tolerance of 1.000000000000000e-10
Google Test trace:
MadgraphTest.h:280: In comparing event 0 from iteration 0
   0  7.500000000000000e+02  7.500000000000000e+02  7.500000000000000e+02  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  0.000000000000000e+00  0.000000000000000e+00  0.000000000000000e+00  0.000000000000000e+00
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  0.000000000000000e+00  0.000000000000000e+00  0.000000000000000e+00  0.000000000000000e+00
ref2  7.500000000000000e+02  5.849331413473452e+02 -3.138365726669761e+02 -3.490842674916366e+02

   3  7.500000000000000e+02  7.500000000000000e+02  7.500000000000000e+02  7.500000000000000e+02
ref3  7.500000000000001e+02 -5.849331413473452e+02  3.138365726669761e+02  3.490842674916364e+02

  ME  6.797636301916560e-01
r.ME  6.797636301916544e-01

[  FAILED  ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x183ccd0 (158 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU/MadgraphTest (158 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (158 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x183ccd0

 1 FAILED TEST

With AVX=sse4 in debug mode I also get a similar failure:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> make cleanall; make -j AVX=sse4 debug
...
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> ./runTest.exe --gtest_filter=*GPU*Comp*
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *GPU*Comp*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
MadgraphTest.h:298: Failure
Value of: momentumErrors.str().empty()
  Actual: false
Expected: true

particle 0      component 1
         madGraph:   7.500000000000000e+02
         reference:  0.000000000000000e+00
         rel delta:                    inf exceeds tolerance of 1.000000000000000e-10
particle 0      component 3
         madGraph:   0.000000000000000e+00
         reference:  7.500000000000000e+02
         rel delta:  1.000000000000000e+00 exceeds tolerance of 1.000000000000000e-10
Google Test trace:
MadgraphTest.h:280: In comparing event 0 from iteration 0
   0  7.500000000000000e+02  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02  7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  7.500000000000000e+02  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00
ref2  7.500000000000000e+02  5.849331413473452e+02 -3.138365726669761e+02 -3.490842674916366e+02

   3  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02 -7.500000000000000e+02
ref3  7.500000000000001e+02 -5.849331413473452e+02  3.138365726669761e+02  3.490842674916364e+02

  ME  6.797636301916560e-01
r.ME  6.797636301916544e-01

[  FAILED  ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0xc12cd0 (158 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU/MadgraphTest (158 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (158 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0xc12cd0

 1 FAILED TEST

With avx2 and 512y the test succeeds, but it fails again with 512z:

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx> ./runTest.exe --gtest_filter=*GPU*Comp*
Running main() from /data/avalassi/GPU2023/madgraph4gpuX/test/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *GPU*Comp*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU/MadgraphTest
[ RUN      ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
MadgraphTest.h:298: Failure
Value of: momentumErrors.str().empty()
  Actual: false
Expected: true

particle 0      component 2
         madGraph:   7.500000000000000e+02
         reference:  0.000000000000000e+00
         rel delta:                    inf exceeds tolerance of 1.000000000000000e-10
particle 0      component 3
         madGraph:   0.000000000000000e+00
         reference:  7.500000000000000e+02
         rel delta:  1.000000000000000e+00 exceeds tolerance of 1.000000000000000e-10
Google Test trace:
MadgraphTest.h:280: In comparing event 0 from iteration 0
   0  7.500000000000000e+02  0.000000000000000e+00  7.500000000000000e+02  0.000000000000000e+00
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  7.500000000000001e+02 -3.138365726669762e+02  7.500000000000003e+02  3.138365726669763e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  7.500000000000000e+02  0.000000000000000e+00  7.500000000000000e+02  0.000000000000000e+00
ref2  7.500000000000000e+02  5.849331413473452e+02 -3.138365726669761e+02 -3.490842674916366e+02

   3  7.500000000000000e+02 -7.417834499166065e+02  7.500000000000003e+02  7.417834499166069e+02
ref3  7.500000000000001e+02 -5.849331413473452e+02  3.138365726669761e+02  3.490842674916364e+02

  ME  6.797636301916560e-01
r.ME  6.797636301916544e-01

[  FAILED  ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x1907cd0 (158 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU/MadgraphTest (158 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (158 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SIGMA_SM_GG_TTX_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x1907cd0

 1 FAILED TEST


valassi commented Jul 18, 2023

On second thought, this is very weird, from above:

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx

Thread 1 "runTest.exe" received signal SIGSEGV, Segmentation fault.
0x00000000004113d8 in mgOnGpu::cxtype_v::cxtype_v (this=0x8000000000000000) at ../../src/mgOnGpuVectors.h:75
75          cxtype_v() : m_real{ 0 }, m_imag{ 0 } {} // RRRR=0000 IIII=0000
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.34-60.el9.x86_64 libgcc-11.3.1-4.3.el9.alma.x86_64 libgomp-11.3.1-4.3.el9.alma.x86_64 libstdc++-11.3.1-4.3.el9.alma.x86_64 nvidia-driver-cuda-libs-530.30.02-1.el9.x86_64
(gdb) where
#0  0x00000000004113d8 in mgOnGpu::cxtype_v::cxtype_v (this=0x8000000000000000) at ../../src/mgOnGpuVectors.h:75
#1  0x0000000000412944 in cxzero_sv () at ../../src/mgOnGpuVectors.h:818
#2  0x000000000046ca13 in mg5amcGpu::ixxxxx<KernelAccessMomenta<false>, KernelAccessWavefunctions<false> > (
    momenta=0x7fffcdc0f200, fmass=0, nhel=1, nsf=-1, wavefunctions=0x7fffffff61b0, ipar=0) at ../../src/HelAmps_sm.h:286
#3  0x000000000046af0c in SIGMA_SM_GG_TTX_GPU_XXX_testxxx_Test::TestBody (this=0x1465d30)
    at /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.sa/SubProcesses/P1_Sigma_sm_gg_ttx/testxxx.cc:247

In other words: the GPU test is giving a segfault in a cxtype_v type that is only meant to exist for SIMD CPUs!

This is a clear example of #602: we should make CUDA and C++ builds completely independent from each other (see also #680 and #674). At the very least, we should avoid having a single runTest.exe executable where we mix both types of code. While I have used two different Cpu and Gpu namespaces for some parts of the code, basic types like mgOnGpu::cxtype have different meanings in the two implementations. Mixing them is a very bad idea... Another option is to separate the namespaces (see #318)?

valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 19, 2023
…nt namespaces

This CONCLUDES the cleanup of namespaces in ggtt.sa: everything builds and runs ok

NB: in debug mode, now runTest succeeds! As intended, this fixes the segfault in madgraph5#725
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 19, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 19, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 19, 2023
… conflict:

"[namespace] in ggtt.sa, fix testmisc.cc and testxxx.cc to use different namespaces
This CONCLUDES the cleanup of namespaces in ggtt.sa: everything builds and runs ok
NB: in debug mode, now runTest succeeds! As intended, this fixes the segfault in madgraph5#725"

valassi commented Jul 19, 2023

This will be fixed in #723 (which also fixes other things; the fix for ONLY this bug is in #728, which however I will close).


valassi commented Jul 19, 2023

Note, as discussed in #723, the fact that I was also getting strange failures when mixing vector sizes is most likely due to the fact that I was mixing the CPU fptype_v and the GPU fptype_v in the same executable...

@valassi valassi self-assigned this Jul 19, 2023
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 19, 2023
… conflict:

"[namespace] in ggtt.sa, fix testmisc.cc and testxxx.cc to use different namespaces
This CONCLUDES the cleanup of namespaces in ggtt.sa: everything builds and runs ok
NB: in debug mode, now runTest succeeds! As intended, this fixes the segfault in madgraph5#725"

Note: runTest.exe now succeeds in all AVX modes, both in debug and no-debug mode
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 21, 2023
…n tput and tmad for easier merging

This completes the fpe and namespace patches, addressing madgraph5#701 and madgraph5#725, respectively.

Unfortunately, I tested that this patch only fixes the IEEE_DIVIDE_BY_ZERO part of madgraph5#701,
but there are still other issues remaining (being debugged in branch nobm).

Revert "[fpe] rerun 15 tmad - ggttgg tests fail again madgraph5#655 as expected"
This reverts commit 9212960.

Revert "[fpe] rerun 78 tput alltees, all ok"
This reverts commit 9a68868.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 21, 2023
… easier merging

This ~completes the fpe and namespace patches, addressing madgraph5#701 and madgraph5#725, respectively.
(HOWEVER, the CI on MacOS failed for this with madgraph5#730 - still a few things to change before merging).

Unfortunately, I tested that this patch only fixes the IEEE_DIVIDE_BY_ZERO part of madgraph5#701,
but there are still other issues remaining (being debugged in branch nobm).

Revert "[fpe] rerun 15 tmad - ggttgg tests fail again madgraph5#655 as expected"
This reverts commit 9212960.

Revert "[fpe] rerun 78 tput alltees, all ok"
This reverts commit 9a68868.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 21, 2023
…madgraph5#730 and madgraph5#731

This completes the fpe and namespace patches, addressing madgraph5#701 and madgraph5#725, respectively.

Unfortunately, I tested that this patch only fixes the IEEE_DIVIDE_BY_ZERO part of madgraph5#701,
but there are still other issues remaining (being debugged in branch nobm and in madgraph5#733):
  IEEE_INVALID_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
@valassi valassi changed the title Segfault in testxxx runTest.exe for debug builds Segfault in testxxx runTest.exe for debug builds (need separate cpu/gpu namespaces) Aug 17, 2023