[epoch1] Single precision: fix build failures, improve NaN determination #144

Merged
merged 12 commits into from
Mar 31, 2021

valassi (Member) commented Mar 31, 2021

This patch is another spinoff from issue #139 about merging epoch2 and epoch1.

At some point I started looking at single precision, because epoch2 had one hardcoded "double" which I changed to "fptype". For completeness I tried to build in single precision, and this opened another Pandora's box about single precision, in epoch1 as well. There are two issues, both addressed (partially) by this PR:

valassi added 12 commits March 30, 2021 19:06
…rs as in epoch1

Indeed, check.cc was not compiling in SINGLE mode otherwise:

Makefile:44: CUDA_HOME is not set or is invalid. Export CUDA_HOME to compile with cuda
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++  -O3  -std=c++11 -I. -I../../src -I../../../../../tools  -Wall -Wshadow -Wextra -fopenmp -DMGONGPU_COMMONRAND_ONHOST -ffast-math   -c check.cc -o check.o
check.cc: In function ‘int main(int, char**)’:
check.cc:312:81: error: conversion from ‘vector<float>’ to non-scalar type ‘vector<double>’ requested
  312 |     std::vector<double> commonRnd = commonRandomPromises[iiter].get_future().get();
      |                                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
make: *** [check.o] Error 1
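
The error above comes from a hardcoded std::vector<double> receiving a std::vector<float> in a single precision build. A minimal sketch of the pattern follows. It assumes a hypothetical EXAMPLE_FPTYPE_FLOAT build switch and hypothetical makeCommonRnd/fetchCommonRnd helpers (none of these names are taken from the repository) and simply spells every such container in terms of fptype:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: EXAMPLE_FPTYPE_FLOAT is a hypothetical build switch used only in this sketch.
#ifdef EXAMPLE_FPTYPE_FLOAT
typedef float fptype; // single precision
#else
typedef double fptype; // double precision (default)
#endif

// Hypothetical stand-in for the code that produces the common random numbers.
std::vector<fptype> makeCommonRnd(std::size_t n)
{
  assert(n > 0); // sanity check: at least one number requested
  return std::vector<fptype>(n, static_cast<fptype>(0.5));
}

// The receiving side uses fptype too, so it compiles in both precisions.
std::vector<fptype> fetchCommonRnd(std::size_t n)
{
  std::vector<fptype> commonRnd = makeCommonRnd(n); // std::vector<double> here is what failed above
  return commonRnd;
}
```

With the receiving vector declared as std::vector<fptype>, the same line should build whether fptype is float or double.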

Note (issue madgraph5#143) that neither epoch2 nor epoch1 builds in single precision, anyway...
madgraph5#143)

However, check.exe gives NaNs for 2048/256/12 (but not for fewer events!)
time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.872868e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.788832e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.403606e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.559249e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 6.229583e+00                 )  sec
MeanTimeInMatrixElems       = ( 5.191319e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 5.188715e-01 ,  5.195643e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.991314e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.077535e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.009932e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( -nan +- -nan )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.004423e-03 ,  4.260640e-02 ]  GeV^0
StdDevMatrixElemValue       = ( -nan                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000401 sec
0b MemAlloc :     0.037210 sec
0c GenCreat :     0.000448 sec
1b GenRnGen :     0.084036 sec
2a RamboIni :     0.072432 sec
2b RamboFin :     1.486817 sec
3a SigmaKin :     6.229582 sec
4a DumpLoop :     0.066444 sec
8a CompStat :     0.016088 sec
9a GenDestr :     0.000003 sec
9b DumpScrn :     0.009665 sec
9c DumpJson :     0.000005 sec
TOTAL       :     8.003131 sec
TOTAL (123) :     7.872867 sec
TOTAL  (23) :     7.788831 sec
TOTAL   (1) :     0.084036 sec
TOTAL   (2) :     1.559249 sec
TOTAL   (3) :     6.229582 sec
***********************************************************************
real    0m8.024s
user    0m8.203s
sys     0m0.247s

time ./check.exe -p 64 256 12
***********************************************************************
NumBlocksPerGrid            = 64
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 2.440562e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 2.419238e-01                 )  sec
TotalTime[RndNumGen]    (1) = ( 2.132474e-03                 )  sec
TotalTime[Rambo]        (2) = ( 4.697915e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.949446e-01                 )  sec
MeanTimeInMatrixElems       = ( 1.624538e-02                 )  sec
[Min,Max]TimeInMatrixElems  = [ 1.623230e-02 ,  1.628157e-02 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 196608
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.055848e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.126858e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.008533e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 196608
MeanMatrixElemValue         = ( 1.373064e-02 +- 1.849783e-05 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.069088e-03 ,  3.721447e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202031e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000356 sec
0b MemAlloc :     0.001231 sec
0c GenCreat :     0.000292 sec
1b GenRnGen :     0.002132 sec
2a RamboIni :     0.001163 sec
2b RamboFin :     0.045817 sec
3a SigmaKin :     0.194945 sec
4a DumpLoop :     0.001906 sec
8a CompStat :     0.000395 sec
9a GenDestr :     0.000001 sec
9b DumpScrn :     0.004020 sec
9c DumpJson :     0.000004 sec
TOTAL       :     0.252260 sec
TOTAL (123) :     0.244056 sec
TOTAL  (23) :     0.241924 sec
TOTAL   (1) :     0.002132 sec
TOTAL   (2) :     0.046979 sec
TOTAL   (3) :     0.194945 sec
***********************************************************************
real    0m0.259s
user    0m0.259s
sys     0m0.011s
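
In the 2048/256/12 run above the mean and standard deviation are -nan while min and max stay finite and the counter says nan=0. One plausible reading (an assumption, not something the logs prove) is that a few NaN MEs went undetected and poisoned the sums, while leaving min/max untouched because comparisons against NaN are false. A minimal sketch of the effect, with hypothetical names naiveMean and filteredStats, and relying on IEEE semantics (no -ffast-math):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive mean: a single NaN in the input makes the whole sum NaN.
double naiveMean(const std::vector<float>& mes)
{
  assert(!mes.empty());
  double sum = 0;
  for (float me : mes) sum += me;
  return sum / mes.size();
}

struct MeStats
{
  std::size_t nNan;  // entries rejected as NaN
  std::size_t nGood; // entries used in the statistics
  double mean;
  float min;
  float max;
};

// Mean/min/max over the non-NaN entries only.
MeStats filteredStats(const std::vector<float>& mes)
{
  MeStats s = { 0, 0, 0., 0.f, 0.f };
  double sum = 0;
  for (float me : mes)
  {
    if (std::isnan(me)) { ++s.nNan; continue; } // skip and count NaNs
    if (s.nGood == 0 || me < s.min) s.min = me;
    if (s.nGood == 0 || me > s.max) s.max = me;
    sum += me;
    ++s.nGood;
  }
  if (s.nGood > 0) s.mean = sum / s.nGood;
  return s;
}
```

The filtered version only helps if the NaN test itself can be trusted, which is exactly what global fast math puts in doubt.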
…graph5#129

time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=5)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.021230e+01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 1.012776e+01                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.454761e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.987170e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 8.140586e+00                 )  sec
MeanTimeInMatrixElems       = ( 6.783822e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 6.779236e-01 ,  6.788596e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.160663e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.212093e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 7.728505e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291451
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268987e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 1.084707e-03 ,  8.123530e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.199524e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000362 sec
0b MemAlloc :     0.037545 sec
0c GenCreat :     0.000468 sec
1b GenRnGen :     0.084548 sec
2a RamboIni :     0.072931 sec
2b RamboFin :     1.914239 sec
3a SigmaKin :     8.140587 sec
4a DumpLoop :     0.065026 sec
8a CompStat :     0.042349 sec
9a GenDestr :     0.000002 sec
9b DumpScrn :     0.008876 sec
9c DumpJson :     0.000007 sec
TOTAL       :    10.366940 sec
TOTAL (123) :    10.212305 sec
TOTAL  (23) :    10.127757 sec
TOTAL   (1) :     0.084548 sec
TOTAL   (2) :     1.987170 sec
TOTAL   (3) :     8.140587 sec
***********************************************************************
real    0m10.384s
user    0m10.569s
sys     0m0.233s
Declare an ME to be NaN if both ME==0 and ME==1 are true.
For future studies, also report the number of MEs found with ME==0.

This run is without fast math (where NaNs would be found correctly anyway).
Note that the MEs which are NaN are not also equal to zero.
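
The heuristic described above can be sketched as follows. This is an illustration only: the names meLooksNan and countNanAndZero are hypothetical, and the std::isnan branch is an addition of this sketch for builds where it is reliable. Under IEEE rules no value equals both 0 and 1, so the second branch can only fire when fast math has made comparisons on NaN meaningless, and a compiler is free to fold it away entirely:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MeCounts
{
  std::size_t nNan;  // MEs flagged as NaN
  std::size_t nZero; // MEs equal to zero (and not flagged as NaN)
};

// Flag an ME as NaN if isnan says so, or if it "equals" both 0 and 1.
bool meLooksNan(float me)
{
  return std::isnan(me) || (me == 0.0f && me == 1.0f);
}

// Count NaN-like and zero MEs separately, as in the "nan=..., zero=..." printout.
MeCounts countNanAndZero(const std::vector<float>& mes)
{
  MeCounts c = { 0, 0 };
  for (float me : mes)
  {
    if (meLooksNan(me)) ++c.nNan;
    else if (me == 0.0f) ++c.nZero;
  }
  assert(c.nNan + c.nZero <= mes.size()); // each ME is counted at most once
  return c;
}
```

Counting zeros separately makes it visible when a NaN is being misreported as zero, which is what the later "zero=6" outputs suggest.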

time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
WARNING! ME[310744] is nan
WARNING! ME[451171] is nan
WARNING! ME[3007871] is nan
WARNING! ME[3163868] is nan
WARNING! ME[4471038] is nan
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=5, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.022795e+01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 1.014364e+01                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.430775e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.985546e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 8.158094e+00                 )  sec
MeanTimeInMatrixElems       = ( 6.798411e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 6.793488e-01 ,  6.803043e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.151240e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.202366e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 7.711919e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291451
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268987e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 1.084707e-03 ,  8.123530e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.199524e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000296 sec
0b MemAlloc :     0.037358 sec
0c GenCreat :     0.000326 sec
1b GenRnGen :     0.084308 sec
2a RamboIni :     0.073522 sec
2b RamboFin :     1.912024 sec
3a SigmaKin :     8.158094 sec
4a DumpLoop :     0.068626 sec
8a CompStat :     0.089024 sec
9a GenDestr :     0.000005 sec
9b DumpScrn :     0.009184 sec
9c DumpJson :     0.000008 sec
TOTAL       :    10.432773 sec
TOTAL (123) :    10.227947 sec
TOTAL  (23) :    10.143640 sec
TOTAL   (1) :     0.084308 sec
TOTAL   (2) :     1.985546 sec
TOTAL   (3) :     8.158094 sec
***********************************************************************
real    0m10.450s
user    0m10.621s
sys     0m0.284s
This is without fast math:
time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
DEBUG [310744] ME=nan isnan=1 isfinite=0 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=1
WARNING! ME[310744] is nan
WARNING! ME[451171] is nan
WARNING! ME[3007871] is nan
WARNING! ME[3163868] is nan
WARNING! ME[4471038] is nan
DEBUG [5473927] ME=0.0124186 isnan=0 isfinite=1 isnormal=1 is0=0 is1=0 abs(ME)=0.0124186 isnan=0
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=5, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.021769e+01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 1.013290e+01                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.479511e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.983129e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 8.149767e+00                 )  sec
MeanTimeInMatrixElems       = ( 6.791473e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 6.788551e-01 ,  6.795027e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.157414e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.208941e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 7.719798e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291451
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268987e-06 )  GeV^0
WARNING! Fast math is very unreliable... With fast math enabled globally,
and before I disabled it on selected functions, I was getting contradictory
results in the same unit: for instance, me==0 and me_is_nan(me) were giving
different results depending on the order of some calls.
This is only a temporary patch to get the tests to pass, but for production
usage for physics this must be carefully checked...
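
For illustration, disabling fast math on a single function can be sketched with GCC's optimize function attribute. This is an assumption about the mechanism, not a copy of the patch: the macro EXAMPLE_NO_FAST_MATH and the helper exampleIsNan are hypothetical names, the attribute is GCC-specific (the guard below leaves other compilers with a plain function), and the GCC documentation itself recommends this attribute for debugging rather than production:

```cpp
#include <cassert>
#include <cmath>

// Assumption: GCC only; other compilers get an empty macro and a plain function.
#if defined(__GNUC__) && !defined(__clang__)
#define EXAMPLE_NO_FAST_MATH __attribute__((optimize("no-fast-math")))
#else
#define EXAMPLE_NO_FAST_MATH
#endif

// Hypothetical helper: asks for IEEE semantics in this function even if the
// rest of the file is built with -ffast-math.
EXAMPLE_NO_FAST_MATH bool exampleIsNan(float me);

bool exampleIsNan(float me)
{
  return me != me; // only a NaN compares unequal to itself under IEEE rules
}

// Quick self-check of the helper on an ordinary value.
inline void exampleSelfCheck()
{
  assert(!exampleIsNan(1.0f));
}
```

Whether the attribute fully restores IEEE behaviour for a given GCC version is something to verify on the actual build, in line with the warning above.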

Without global fast math:
time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
DEBUG[310744] ME=nan (me==me)=0 (me==me+1)=0 meisnan=1 isnan=1 isfinite=0 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=1
WARNING! ME[310744] is nan
WARNING! ME[451171] is nan
WARNING! ME[3007871] is nan
WARNING! ME[3163868] is nan
WARNING! ME[4471038] is nan
DEBUG[5473927] ME=0.0124186 (me==me)=1 (me==me+1)=0 meisnan=0 isnan=0 isfinite=1 isnormal=1 is0=0 is1=0 abs(ME)=0.0124186 isnan=0
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=5, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.021900e+01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 1.013451e+01                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.449329e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.996315e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 8.138196e+00                 )  sec
MeanTimeInMatrixElems       = ( 6.781830e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 6.778324e-01 ,  6.785845e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.156624e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.207953e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 7.730775e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291451
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268987e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 1.084707e-03 ,  8.123530e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.199524e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000320 sec
0b MemAlloc :     0.037336 sec
0c GenCreat :     0.000242 sec
1b GenRnGen :     0.084493 sec
2a RamboIni :     0.073379 sec
2b RamboFin :     1.922936 sec
3a SigmaKin :     8.138196 sec
4a DumpLoop :     0.065426 sec
8a CompStat :     0.047453 sec
9a GenDestr :     0.000011 sec
9b DumpScrn :     0.008653 sec
9c DumpJson :     0.000007 sec
TOTAL       :    10.378454 sec
TOTAL (123) :    10.219004 sec
TOTAL  (23) :    10.134511 sec
TOTAL   (1) :     0.084493 sec
TOTAL   (2) :     1.996315 sec
TOTAL   (3) :     8.138196 sec
***********************************************************************
real    0m10.395s
user    0m10.569s
sys     0m0.221s
Without global fast math:
time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
DEBUG[310744] ME=nan fpisabnormal=1 fpclass=NaN (me==me)=0 (me==me+1)=0 isnan=1 isfinite=0 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=1
WARNING! ME[310744] is NaN/abnormal
WARNING! ME[451171] is NaN/abnormal
WARNING! ME[3007871] is NaN/abnormal
WARNING! ME[3163868] is NaN/abnormal
WARNING! ME[4471038] is NaN/abnormal
DEBUG[5473927] ME=0.0124186 fpisabnormal=0 fpclass=normal (me==me)=1 (me==me+1)=0 isnan=0 isfinite=1 isnormal=1 is0=0 is1=0 abs(ME)=0.0124186 isnan=0
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.021716e+01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 1.013309e+01                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.406736e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.989944e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 8.143144e+00                 )  sec
MeanTimeInMatrixElems       = ( 6.785954e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 6.782546e-01 ,  6.792746e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.157737e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.208824e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 7.726077e+05                 )  sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291451
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268987e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 1.084707e-03 ,  8.123530e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.199524e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000337 sec
0b MemAlloc :     0.037063 sec
0c GenCreat :     0.000455 sec
1b GenRnGen :     0.084067 sec
2a RamboIni :     0.071985 sec
2b RamboFin :     1.917959 sec
3a SigmaKin :     8.143145 sec
4a DumpLoop :     0.065788 sec
8a CompStat :     0.047362 sec
9a GenDestr :     0.000010 sec
9b DumpScrn :     0.008671 sec
9c DumpJson :     0.000007 sec
TOTAL       :    10.376849 sec
TOTAL (123) :    10.217155 sec
TOTAL  (23) :    10.133088 sec
TOTAL   (1) :     0.084067 sec
TOTAL   (2) :     1.989944 sec
TOTAL   (3) :     8.143145 sec
***********************************************************************
real    0m10.393s
user    0m10.575s
sys     0m0.218s
Note that std::isnormal is false but fpclassify says 'normal'... ?!

Presently this says nan=6 zero=0, but before I added the per-function nofastmath,
I was getting nan=6 zero=6 or nan=0 zero=6 (and perfect averages!) depending
on other lines of the code...
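
To make the isnormal/fpclassify comparison concrete, here is a small sketch (hypothetical names fpClassName and assertConsistentClassification) of the kind of printout used in the DEBUG lines. Under IEEE semantics the two must agree; the fast math log below is a case where they do not:

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Map std::fpclassify categories to readable names for debug printouts.
std::string fpClassName(float x)
{
  switch (std::fpclassify(x))
  {
    case FP_NAN: return "NaN";
    case FP_INFINITE: return "infinite";
    case FP_ZERO: return "zero";
    case FP_SUBNORMAL: return "subnormal";
    case FP_NORMAL: return "normal";
    default: return "unknown";
  }
}

// Aborts (in a debug build) if fpclassify and isnormal disagree.
void assertConsistentClassification(float x)
{
  assert((std::fpclassify(x) == FP_NORMAL) == static_cast<bool>(std::isnormal(x)));
}
```

Built without fast math, fpClassName gives "NaN" for a NaN and "normal" for a value such as 0.0124186, matching the fpclass field in the run without global fast math.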

With global fast math now:
time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
DEBUG[310744] ME=-nan fpisabnormal=1 fpclass=normal (me==me)=0 (me==me+1)=0 isnan=0 isfinite=1 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=0
WARNING! ME[310744] is NaN/abnormal
WARNING! ME[451171] is NaN/abnormal
WARNING! ME[3007871] is NaN/abnormal
WARNING! ME[3163868] is NaN/abnormal
WARNING! ME[4471038] is NaN/abnormal
DEBUG[5473927] ME=-nan fpisabnormal=1 fpclass=normal (me==me)=0 (me==me+1)=0 isnan=0 isfinite=1 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=0
WARNING! ME[5473927] is NaN/abnormal
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.878380e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.792592e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.578802e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.548612e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 6.243980e+00                 )  sec
MeanTimeInMatrixElems       = ( 5.203317e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 5.201201e-01 ,  5.208153e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.985722e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.073637e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.007603e+06                 )  sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291450
MeanMatrixElemValue         = ( 1.371779e-02 +- 3.268970e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 1.088710e-03 ,  6.299551e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.199479e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000295 sec
0b MemAlloc :     0.037309 sec
0c GenCreat :     0.000520 sec
1b GenRnGen :     0.085788 sec
2a RamboIni :     0.073077 sec
2b RamboFin :     1.475536 sec
3a SigmaKin :     6.243980 sec
4a DumpLoop :     0.066375 sec
8a CompStat :     0.069531 sec
9a GenDestr :     0.000004 sec
9b DumpScrn :     0.008606 sec
9c DumpJson :     0.000007 sec
TOTAL       :     8.061027 sec
TOTAL (123) :     7.878381 sec
TOTAL  (23) :     7.792593 sec
TOTAL   (1) :     0.085788 sec
TOTAL   (2) :     1.548612 sec
TOTAL   (3) :     6.243980 sec
***********************************************************************
real    0m8.078s
user    0m8.252s
sys     0m0.224s

For reference, a very old piece of code (unclear which one) with global fast math,
before I added the per-function nofastmath:
time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
DEBUG[310744] ME=-nan me==me 0 meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=0 abs(ME)=nan isnan=0
WARNING! ME[310744] is nan
WARNING! ME[451171] is nan
WARNING! ME[3007871] is nan
WARNING! ME[3163868] is nan
WARNING! ME[4471038] is nan
DEBUG[5473927] ME=-nan me==me 0 meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=0 abs(ME)=nan isnan=0
WARNING! ME[5473927] is nan
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=6, zero=6)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.868979e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.784011e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.496877e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.553193e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 6.230817e+00                 )  sec
MeanTimeInMatrixElems       = ( 5.192348e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 5.190312e-01 ,  5.196109e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.995263e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.082538e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.009732e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291450
MeanMatrixElemValue         = ( 1.371779e-02 +- 3.268970e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 1.088710e-03 ,  6.299551e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.199479e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000299 sec
0b MemAlloc :     0.037553 sec
0c GenCreat :     0.000527 sec
1b GenRnGen :     0.084969 sec
2a RamboIni :     0.072797 sec
2b RamboFin :     1.480396 sec
3a SigmaKin :     6.230817 sec
4a DumpLoop :     0.067709 sec
8a CompStat :     0.076509 sec
9a GenDestr :     0.000004 sec
9b DumpScrn :     0.009063 sec
9c DumpJson :     0.000008 sec
TOTAL       :     8.060652 sec

And another with some lines moved around:
(note how weird: there are 6 more events, yet the minimum ME is higher...
I guess the < and > operators were giving unreliable results)
time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
DEBUG[310744] ME=-nan meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=1 is1=1 abs(ME)=nan isnan=0
DEBUG[5473927] ME=-nan meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=1 is1=1 abs(ME)=nan isnan=0
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (nan=0, zero=6)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.910227e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.824203e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 8.602432e-02                 )  sec
TotalTime[Rambo]        (2) = ( 1.573157e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 6.251046e+00                 )  sec
MeanTimeInMatrixElems       = ( 5.209205e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 5.206928e-01 ,  5.211877e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.953572e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.041018e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.006464e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.371779e-02 +- 3.268966e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.061680e-03 ,  6.299551e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.199475e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000294 sec
0b MemAlloc :     0.037586 sec
0c GenCreat :     0.000289 sec
1b GenRnGen :     0.086024 sec
2a RamboIni :     0.083310 sec
2b RamboFin :     1.489847 sec
3a SigmaKin :     6.251046 sec
4a DumpLoop :     0.065824 sec
8a CompStat :     0.062115 sec
9a GenDestr :     0.000003 sec
9b DumpScrn :     0.008629 sec
9c DumpJson :     0.000005 sec
TOTAL       :     8.084972 sec
TOTAL (123) :     7.910227 sec
…set".

Not surprisingly, the events that are problematic in C++ differ when using curand,
but there are 6 of them out of 6M in both cases.

Some events are also problematic in CUDA, but only 2 out of 6M,
and they differ from those in C++ (with the same curand).
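
The debug lines above show `isnan=0` for an ME printed as `-nan`, which is consistent with `-ffast-math` letting the compiler assume that NaNs never occur. A minimal sketch of one way to detect NaNs that does not rely on floating-point comparisons is to inspect the IEEE-754 bit pattern directly. The names `isNanBits`, `MeStats` and `summarize` are hypothetical and are not taken from check.cc; the sketch assumes a 32-bit IEEE-754 `float`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical helper (not from check.cc): true if x has a NaN bit pattern.
// Uses integer operations only, so it does not depend on how the compiler
// treats floating-point comparisons under fast math.
inline bool isNanBits(float x) {
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
  static_assert(std::numeric_limits<float>::is_iec559, "expects IEEE-754 float");
  std::uint32_t bits = 0;
  std::memcpy(&bits, &x, sizeof(bits)); // well-defined way to read the bits
  const std::uint32_t expMask = 0x7F800000u; // 8 exponent bits
  const std::uint32_t manMask = 0x007FFFFFu; // 23 mantissa bits
  // NaN: exponent all ones and a non-zero mantissa (infinity has mantissa 0)
  return (bits & expMask) == expMask && (bits & manMask) != 0;
}

// Hypothetical summary of a batch of matrix elements.
struct MeStats {
  std::size_t nGood; // values used in min/max
  std::size_t nNan;  // values skipped because they are NaN
  float min;         // stays at the sentinel if nGood == 0
  float max;         // stays at the sentinel if nGood == 0
};

// Count NaNs first and keep them out of the min/max comparisons.
inline MeStats summarize(const std::vector<float>& mes) {
  MeStats s{0, 0, std::numeric_limits<float>::max(),
            std::numeric_limits<float>::lowest()};
  for (float me : mes) {
    if (isNanBits(me)) {
      ++s.nNan;
      continue;
    }
    ++s.nGood;
    if (me < s.min) s.min = me;
    if (me > s.max) s.max = me;
  }
  return s;
}
```

Filtering before comparing keeps the `<` and `>` tests away from NaN operands, so the reported minimum and maximum depend only on the events that are counted as good.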

C++, fast math, curand:
time ./check.exe -p 2048 256 12 -d
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 4
DEBUG: ${OMP_NUM_THREADS}    = '[not set]'
DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread
DEBUG: omp_get_num_threads() = 1
DEBUG: omp_get_max_threads() = 1
WARNING! ME[578162] is NaN/abnormal
WARNING! ME[1725762] is NaN/abnormal
WARNING! ME[2163579] is NaN/abnormal
WARNING! ME[5407629] is NaN/abnormal
WARNING! ME[5435532] is NaN/abnormal
WARNING! ME[6014690] is NaN/abnormal
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 8.106228e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.780294e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.259343e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.541099e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 6.239194e+00                 )  sec
MeanTimeInMatrixElems       = ( 5.199329e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 5.196525e-01 ,  5.203252e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.761262e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.086399e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.008376e+06                 )  sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291450
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 2.430001e-03 ,  1.086722e-01 ]  GeV^0
StdDevMatrixElemValue       = ( 8.203006e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000331 sec
0b MemAlloc :     0.037504 sec
0c GenCreat :     0.000976 sec
1a GenSeed  :     0.000023 sec
1b GenRnGen :     0.325912 sec
2a RamboIni :     0.072463 sec
2b RamboFin :     1.468636 sec
3a SigmaKin :     6.239194 sec
4a DumpLoop :     0.045577 sec
8a CompStat :     0.069998 sec
9a GenDestr :     0.000115 sec
9b DumpScrn :     0.009210 sec
9c DumpJson :     0.000002 sec
TOTAL       :     8.269940 sec
TOTAL (123) :     8.106228 sec
TOTAL  (23) :     7.780294 sec
TOTAL   (1) :     0.325934 sec
TOTAL   (2) :     1.541100 sec
TOTAL   (3) :     6.239194 sec
***********************************************************************
real    0m8.290s
user    0m8.208s
sys     0m0.079s

CUDA, fast math, curand:
time ./gcheck.exe -p 2048 256 12 -d
WARNING! ME[596016] is NaN/abnormal
WARNING! ME[1446938] is NaN/abnormal
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[8]
Momenta memory layout       = AOSOA[8]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 5.650546e-02                 )  sec
TotalTime[Rambo+ME]    (23) = ( 4.909613e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.409333e-03                 )  sec
TotalTime[Rambo]        (2) = ( 4.459424e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 4.501892e-03                 )  sec
MeanTimeInMatrixElems       = ( 3.751577e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 3.693620e-04 ,  3.823010e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.113424e+08                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 1.281457e+08                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.397514e+09                 )  sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291454
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 1.463952e-03 ,  4.733844e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.202616e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     0.916559 sec
0a ProcInit :     0.000461 sec
0b MemAlloc :     0.018835 sec
0c GenCreat :     0.010257 sec
0d SGoodHel :     0.000711 sec
1a GenSeed  :     0.000022 sec
1b GenRnGen :     0.007387 sec
2a RamboIni :     0.000104 sec
2b RamboFin :     0.000048 sec
2c CpDTHwgt :     0.004187 sec
2d CpDTHmom :     0.040255 sec
3a SigmaKin :     0.000086 sec
3b CpDTHmes :     0.004416 sec
4a DumpLoop :     0.050881 sec
8a CompStat :     0.045280 sec
9a GenDestr :     0.000060 sec
9b DumpScrn :     0.000176 sec
9c DumpJson :     0.000002 sec
TOTAL       :     1.099729 sec
TOTAL (123) :     0.056505 sec
TOTAL  (23) :     0.049096 sec
TOTAL   (1) :     0.007409 sec
TOTAL   (2) :     0.044594 sec
TOTAL   (3) :     0.004502 sec
***********************************************************************
real    0m1.394s
user    0m0.378s
sys     0m0.793s
Recover the usual 1.15E6 for C++ and 6.2E8 for CUDA (MatrixElems events per second)

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.054557e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 9.802793e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.427770e-03                 )  sec
TotalTime[Rambo]        (2) = ( 8.781144e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.021650e-02                 )  sec
MeanTimeInMatrixElems       = ( 8.513748e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 7.957010e-04 ,  8.747180e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.965970e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.418024e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.158134e+08                 )  sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************

time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (NaN/abnormal=0, zero=0 )
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.707499e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.381717e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.257818e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.940907e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.440811e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.534009e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.532392e-01 ,  4.536635e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.162772e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.523025e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.156345e+06                 )  sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
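
The throughput figures in these logs are consistent with a simple relation: the total event count is blocks times threads times iterations, and `EvtsPerSec[MatrixElems]` is that count divided by `TotalTime[MatrixElems]`. The sketch below reproduces the arithmetic from the printed numbers; the function names are hypothetical and this is not the code that check.cc uses.

```cpp
#include <cstdint>

// Hypothetical helper: events processed for "-p blocks threads iterations".
constexpr std::uint64_t totalEvents(std::uint64_t blocks,
                                    std::uint64_t threads,
                                    std::uint64_t iterations) {
  return blocks * threads * iterations;
}

// Hypothetical helper: throughput in events per second.
inline double eventsPerSecond(std::uint64_t nEvents, double seconds) {
  return static_cast<double>(nEvents) / seconds;
}

// Values copied from the double-precision logs above:
//   totalEvents(2048, 256, 12)            gives 6291456 (TotalEventsComputed)
//   eventsPerSecond(6291456, 5.440811)    is about 1.156e6 (C++)
//   eventsPerSecond(6291456, 1.021650e-2) is about 6.158e8 (CUDA)
```

Because the logged times are rounded to seven significant digits, the recomputed throughputs agree with the printed ones only to about that precision.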
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Mar 31, 2021
…piler

This also requires adding Process::getCompiler to ep2 CPPProcess.cc/h.

Now check.cc is identical in both epoch2 and epoch1
(and runTest.cc is almost identical, except for the test name).
Will now include PR madgraph5#144 for single precision in epoch1,
and will copy check.cc again (and runTest.cc with some changes).

The epoch2 baseline remains 1.10e6 for C++ and 6.6e8 for CUDA

time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.968041e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.643061e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.249804e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.928639e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.714422e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.762018e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.760149e-01 ,  4.765775e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.895863e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.231592e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.100979e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.044441e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 9.709213e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.351930e-03                 )  sec
TotalTime[Rambo]        (2) = ( 8.758798e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 9.504147e-03                 )  sec
MeanTimeInMatrixElems       = ( 7.920122e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 7.825940e-04 ,  8.001750e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.023757e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.479882e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.619696e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
@valassi
Member Author

valassi commented Mar 31, 2021

OK, I am ready to include this PR in branch ep2toep1, so I self-merge.
