[epoch1] Single precision: fix build failures, improve NaN determination #144
Merged
…rs as in epoch1
Indeed, check.cc was not compiling in SINGLE mode otherwise:
Makefile:44: CUDA_HOME is not set or is invalid. Export CUDA_HOME to compile with cuda
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++ -O3 -std=c++11 -I. -I../../src -I../../../../../tools -Wall -Wshadow -Wextra -fopenmp -DMGONGPU_COMMONRAND_ONHOST -ffast-math -c check.cc -o check.o
check.cc: In function ‘int main(int, char**)’:
check.cc:312:81: error: conversion from ‘vector<float>’ to non-scalar type ‘vector<double>’ requested
312 | std::vector<double> commonRnd = commonRandomPromises[iiter].get_future().get();
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
make: *** [check.o] Error 1
Note (issue madgraph5#143) that neither epoch2 nor epoch1 builds in single precision anyway...
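The compile error above comes from declaring the receiving vector as std::vector<double> while the producer yields std::vector<float>. A minimal sketch of the usual remedy follows: one precision alias shared by producer and consumer. The names `fptype`, `USE_SINGLE_PRECISION` and `generateCommonRandom` are assumptions made for this sketch, and the future/promise plumbing of check.cc is left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical precision switch: the macro name is an assumption of this sketch.
#ifdef USE_SINGLE_PRECISION
typedef float fptype;
#else
typedef double fptype;
#endif

// Hypothetical producer: returns nevt numbers in [0,1) in the configured precision.
inline std::vector<fptype> generateCommonRandom( std::size_t nevt )
{
  std::vector<fptype> rnd( nevt );
  for ( std::size_t i = 0; i < nevt; ++i )
    rnd[i] = static_cast<fptype>( i ) / static_cast<fptype>( nevt );
  return rnd;
}

// The consumer uses the same alias, so the declaration compiles in both precisions.
inline std::size_t countCommonRandom( std::size_t nevt )
{
  std::vector<fptype> commonRnd = generateCommonRandom( nevt ); // no float/double mismatch
  return commonRnd.size();
}
```

With a single alias, switching precision becomes a one-line change instead of an edit at every declaration.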
madgraph5#143) However check.exe gives nans for 2048/256/12 (but not for fewer events!) time ./check.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 7.872868e+00 ) sec TotalTime[Rambo+ME] (23) = ( 7.788832e+00 ) sec TotalTime[RndNumGen] (1) = ( 8.403606e-02 ) sec TotalTime[Rambo] (2) = ( 1.559249e+00 ) sec TotalTime[MatrixElems] (3) = ( 6.229583e+00 ) sec MeanTimeInMatrixElems = ( 5.191319e-01 ) sec [Min,Max]TimeInMatrixElems = [ 5.188715e-01 , 5.195643e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.991314e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 8.077535e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.009932e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( -nan +- -nan ) GeV^0 [Min,Max]MatrixElemValue = [ 6.004423e-03 , 4.260640e-02 ] GeV^0 StdDevMatrixElemValue = ( -nan ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000401 sec 0b MemAlloc : 0.037210 sec 0c GenCreat : 0.000448 sec 1b GenRnGen : 0.084036 sec 2a RamboIni : 0.072432 sec 2b RamboFin : 1.486817 sec 3a SigmaKin : 6.229582 sec 4a DumpLoop : 0.066444 sec 8a CompStat : 0.016088 sec 9a 
GenDestr : 0.000003 sec 9b DumpScrn : 0.009665 sec 9c DumpJson : 0.000005 sec TOTAL : 8.003131 sec TOTAL (123) : 7.872867 sec TOTAL (23) : 7.788831 sec TOTAL (1) : 0.084036 sec TOTAL (2) : 1.559249 sec TOTAL (3) : 6.229582 sec *********************************************************************** real 0m8.024s user 0m8.203s sys 0m0.247s time ./check.exe -p 64 256 12 *********************************************************************** NumBlocksPerGrid = 64 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 2.440562e-01 ) sec TotalTime[Rambo+ME] (23) = ( 2.419238e-01 ) sec TotalTime[RndNumGen] (1) = ( 2.132474e-03 ) sec TotalTime[Rambo] (2) = ( 4.697915e-02 ) sec TotalTime[MatrixElems] (3) = ( 1.949446e-01 ) sec MeanTimeInMatrixElems = ( 1.624538e-02 ) sec [Min,Max]TimeInMatrixElems = [ 1.623230e-02 , 1.628157e-02 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 196608 EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.055848e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 8.126858e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.008533e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 196608 MeanMatrixElemValue = ( 1.373064e-02 +- 1.849783e-05 ) GeV^0 [Min,Max]MatrixElemValue = [ 6.069088e-03 , 3.721447e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202031e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) 
*********************************************************************** 0a ProcInit : 0.000356 sec 0b MemAlloc : 0.001231 sec 0c GenCreat : 0.000292 sec 1b GenRnGen : 0.002132 sec 2a RamboIni : 0.001163 sec 2b RamboFin : 0.045817 sec 3a SigmaKin : 0.194945 sec 4a DumpLoop : 0.001906 sec 8a CompStat : 0.000395 sec 9a GenDestr : 0.000001 sec 9b DumpScrn : 0.004020 sec 9c DumpJson : 0.000004 sec TOTAL : 0.252260 sec TOTAL (123) : 0.244056 sec TOTAL (23) : 0.241924 sec TOTAL (1) : 0.002132 sec TOTAL (2) : 0.046979 sec TOTAL (3) : 0.194945 sec *********************************************************************** real 0m0.259s user 0m0.259s sys 0m0.011s
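In the 2048-block run above, the mean and standard deviation are -nan while min and max are finite and the header reports nan=0. That pattern is consistent with NaN values entering the sums undetected: any sum containing a NaN is NaN, whereas `<` and `>` comparisons with a NaN are false, so min/max silently skip it. A minimal sketch of this effect, with hypothetical names (`MEStats`, `computeStats`) and assuming a build without -ffast-math:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical summary of a set of matrix elements (MEs).
struct MEStats
{
  std::size_t nAbnormal = 0;
  double mean = 0;
  double min = std::numeric_limits<double>::max();
  double max = std::numeric_limits<double>::lowest();
};

// skipAbnormal=false mimics a NaN detection that fails to fire.
inline MEStats computeStats( const std::vector<float>& mes, bool skipAbnormal )
{
  MEStats s;
  double sum = 0;
  std::size_t nUsed = 0;
  for ( float me : mes )
  {
    if ( skipAbnormal && !std::isfinite( me ) )
    {
      ++s.nAbnormal; // count and exclude NaN/Inf values
      continue;
    }
    sum += me;                    // a NaN here poisons the sum
    if ( me < s.min ) s.min = me; // false for NaN: min is unaffected
    if ( me > s.max ) s.max = me; // false for NaN: max is unaffected
    ++nUsed;
  }
  s.mean = ( nUsed > 0 ? sum / nUsed : 0 );
  return s;
}
```

Calling `computeStats( mes, false )` on data containing a NaN gives a NaN mean with finite min/max, while `computeStats( mes, true )` excludes the NaN and reports it in `nAbnormal`.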
…graph5#129 time ./check.exe -p 2048 256 12 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=5) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 1.021230e+01 ) sec TotalTime[Rambo+ME] (23) = ( 1.012776e+01 ) sec TotalTime[RndNumGen] (1) = ( 8.454761e-02 ) sec TotalTime[Rambo] (2) = ( 1.987170e+00 ) sec TotalTime[MatrixElems] (3) = ( 8.140586e+00 ) sec MeanTimeInMatrixElems = ( 6.783822e-01 ) sec [Min,Max]TimeInMatrixElems = [ 6.779236e-01 , 6.788596e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.160663e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 6.212093e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 7.728505e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291451 MeanMatrixElemValue = ( 1.371780e-02 +- 3.268987e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 1.084707e-03 , 8.123530e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.199524e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000362 sec 0b MemAlloc : 0.037545 sec 0c GenCreat : 0.000468 sec 1b GenRnGen : 0.084548 sec 2a RamboIni : 0.072931 sec 2b RamboFin : 1.914239 sec 3a SigmaKin : 8.140587 sec 4a DumpLoop : 0.065026 sec 8a CompStat : 0.042349 sec 9a GenDestr : 0.000002 sec 9b DumpScrn : 0.008876 sec 9c 
DumpJson : 0.000007 sec TOTAL : 10.366940 sec TOTAL (123) : 10.212305 sec TOTAL (23) : 10.127757 sec TOTAL (1) : 0.084548 sec TOTAL (2) : 1.987170 sec TOTAL (3) : 8.140587 sec *********************************************************************** real 0m10.384s user 0m10.569s sys 0m0.233s
Declare an ME to be NaN if both ME==0 and ME==1 evaluate to true. For future studies, also count the MEs found equal to zero.
The run below is without fast math (where NaNs would be found correctly anyway). Note that the MEs which are NaN are not also equal to zero.
time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 WARNING! ME[310744] is nan WARNING! ME[451171] is nan WARNING! ME[3007871] is nan WARNING! ME[3163868] is nan WARNING! ME[4471038] is nan *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=5, zero=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 1.022795e+01 ) sec TotalTime[Rambo+ME] (23) = ( 1.014364e+01 ) sec TotalTime[RndNumGen] (1) = ( 8.430775e-02 ) sec TotalTime[Rambo] (2) = ( 1.985546e+00 ) sec TotalTime[MatrixElems] (3) = ( 8.158094e+00 ) sec MeanTimeInMatrixElems = ( 6.798411e-01 ) sec [Min,Max]TimeInMatrixElems = [ 6.793488e-01 , 6.803043e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.151240e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 6.202366e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 7.711919e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291451 MeanMatrixElemValue = ( 1.371780e-02 +- 3.268987e-06 ) GeV^0 
[Min,Max]MatrixElemValue = [ 1.084707e-03 , 8.123530e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.199524e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000296 sec 0b MemAlloc : 0.037358 sec 0c GenCreat : 0.000326 sec 1b GenRnGen : 0.084308 sec 2a RamboIni : 0.073522 sec 2b RamboFin : 1.912024 sec 3a SigmaKin : 8.158094 sec 4a DumpLoop : 0.068626 sec 8a CompStat : 0.089024 sec 9a GenDestr : 0.000005 sec 9b DumpScrn : 0.009184 sec 9c DumpJson : 0.000008 sec TOTAL : 10.432773 sec TOTAL (123) : 10.227947 sec TOTAL (23) : 10.143640 sec TOTAL (1) : 0.084308 sec TOTAL (2) : 1.985546 sec TOTAL (3) : 8.158094 sec *********************************************************************** real 0m10.450s user 0m10.621s sys 0m0.284s
This is without fast math: time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 DEBUG [310744] ME=nan isnan=1 isfinite=0 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=1 WARNING! ME[310744] is nan WARNING! ME[451171] is nan WARNING! ME[3007871] is nan WARNING! ME[3163868] is nan WARNING! ME[4471038] is nan DEBUG [5473927] ME=0.0124186 isnan=0 isfinite=1 isnormal=1 is0=0 is1=0 abs(ME)=0.0124186 isnan=0 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=5, zero=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 1.021769e+01 ) sec TotalTime[Rambo+ME] (23) = ( 1.013290e+01 ) sec TotalTime[RndNumGen] (1) = ( 8.479511e-02 ) sec TotalTime[Rambo] (2) = ( 1.983129e+00 ) sec TotalTime[MatrixElems] (3) = ( 8.149767e+00 ) sec MeanTimeInMatrixElems = ( 6.791473e-01 ) sec [Min,Max]TimeInMatrixElems = [ 6.788551e-01 , 6.795027e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.157414e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 6.208941e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 7.719798e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291451 MeanMatrixElemValue = ( 1.371780e-02 +- 3.268987e-06 ) GeV^0
WARNING! fast math is very unreliable... When I enabled fast math globally, and before I disabled it on selected functions, I was getting contradictory results within the same unit: for instance, me==0 or me_is_nan(me) gave different results depending on the order of some calls. This is only a temporary patch to get the tests to pass; for production physics usage this must be carefully checked...
Without global fast math:
time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 DEBUG[310744] ME=nan (me==me)=0 (me==me+1)=0 meisnan=1 isnan=1 isfinite=0 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=1 WARNING! ME[310744] is nan WARNING! ME[451171] is nan WARNING! ME[3007871] is nan WARNING! ME[3163868] is nan WARNING! ME[4471038] is nan DEBUG[5473927] ME=0.0124186 (me==me)=1 (me==me+1)=0 meisnan=0 isnan=0 isfinite=1 isnormal=1 is0=0 is1=0 abs(ME)=0.0124186 isnan=0 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=5, zero=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 1.021900e+01 ) sec TotalTime[Rambo+ME] (23) = ( 1.013451e+01 ) sec TotalTime[RndNumGen] (1) = ( 8.449329e-02 ) sec TotalTime[Rambo] (2) = ( 1.996315e+00 ) sec TotalTime[MatrixElems] (3) = ( 8.138196e+00 ) sec MeanTimeInMatrixElems = ( 6.781830e-01 ) sec [Min,Max]TimeInMatrixElems = [ 6.778324e-01 , 6.785845e-01 ] sec 
----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.156624e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 6.207953e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 7.730775e+05 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291451 MeanMatrixElemValue = ( 1.371780e-02 +- 3.268987e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 1.084707e-03 , 8.123530e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.199524e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000320 sec 0b MemAlloc : 0.037336 sec 0c GenCreat : 0.000242 sec 1b GenRnGen : 0.084493 sec 2a RamboIni : 0.073379 sec 2b RamboFin : 1.922936 sec 3a SigmaKin : 8.138196 sec 4a DumpLoop : 0.065426 sec 8a CompStat : 0.047453 sec 9a GenDestr : 0.000011 sec 9b DumpScrn : 0.008653 sec 9c DumpJson : 0.000007 sec TOTAL : 10.378454 sec TOTAL (123) : 10.219004 sec TOTAL (23) : 10.134511 sec TOTAL (1) : 0.084493 sec TOTAL (2) : 1.996315 sec TOTAL (3) : 8.138196 sec *********************************************************************** real 0m10.395s user 0m10.569s sys 0m0.221s
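The commit message above mentions disabling fast math on selected functions. With GCC, one way to request this is the `optimize` function attribute; the log does not show the exact mechanism used in this PR, so the following is only a sketch under that assumption. The macro `NO_FAST_MATH` and the function `fpIsNanOrInf` are hypothetical names, and the GCC documentation cautions that the `optimize` attribute is intended for debugging rather than production use.

```cpp
#include <cmath>

// GCC-only attribute; on other compilers the macro expands to nothing,
// so the check is then subject to whatever global flags are in effect.
#if defined( __GNUC__ ) && !defined( __clang__ )
#define NO_FAST_MATH __attribute__( ( optimize( "-fno-fast-math" ) ) )
#else
#define NO_FAST_MATH
#endif

// GCC expects the attribute on a declaration, not on the definition itself.
bool fpIsNanOrInf( float fp ) NO_FAST_MATH;

// Hypothetical check compiled (on GCC) without fast math even if the
// rest of the translation unit uses -ffast-math.
bool fpIsNanOrInf( float fp )
{
  return std::isnan( fp ) || std::isinf( fp );
}
```

As the warnings in this PR suggest, results obtained this way should still be cross-checked against a build without global fast math.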
Without global fast math: time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 DEBUG[310744] ME=nan fpisabnormal=1 fpclass=NaN (me==me)=0 (me==me+1)=0 isnan=1 isfinite=0 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=1 WARNING! ME[310744] is NaN/abnormal WARNING! ME[451171] is NaN/abnormal WARNING! ME[3007871] is NaN/abnormal WARNING! ME[3163868] is NaN/abnormal WARNING! ME[4471038] is NaN/abnormal DEBUG[5473927] ME=0.0124186 fpisabnormal=0 fpclass=normal (me==me)=1 (me==me+1)=0 isnan=0 isfinite=1 isnormal=1 is0=0 is1=0 abs(ME)=0.0124186 isnan=0 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (NaN/abnormal=5, zero=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 1.021716e+01 ) sec TotalTime[Rambo+ME] (23) = ( 1.013309e+01 ) sec TotalTime[RndNumGen] (1) = ( 8.406736e-02 ) sec TotalTime[Rambo] (2) = ( 1.989944e+00 ) sec TotalTime[MatrixElems] (3) = ( 8.143144e+00 ) sec MeanTimeInMatrixElems = ( 6.785954e-01 ) sec [Min,Max]TimeInMatrixElems = [ 6.782546e-01 , 6.792746e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.157737e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 6.208824e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 7.726077e+05 ) sec^-1 
*********************************************************************** NumMatrixElems(notAbnormal) = 6291451 MeanMatrixElemValue = ( 1.371780e-02 +- 3.268987e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 1.084707e-03 , 8.123530e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.199524e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000337 sec 0b MemAlloc : 0.037063 sec 0c GenCreat : 0.000455 sec 1b GenRnGen : 0.084067 sec 2a RamboIni : 0.071985 sec 2b RamboFin : 1.917959 sec 3a SigmaKin : 8.143145 sec 4a DumpLoop : 0.065788 sec 8a CompStat : 0.047362 sec 9a GenDestr : 0.000010 sec 9b DumpScrn : 0.008671 sec 9c DumpJson : 0.000007 sec TOTAL : 10.376849 sec TOTAL (123) : 10.217155 sec TOTAL (23) : 10.133088 sec TOTAL (1) : 0.084067 sec TOTAL (2) : 1.989944 sec TOTAL (3) : 8.143145 sec *********************************************************************** real 0m10.393s user 0m10.575s sys 0m0.218s
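The debug lines above print an `fpclass=` label next to `fpisabnormal=`. A minimal sketch of how such a label and flag could be derived from `std::fpclassify` follows; the helper names `fpClassName` and `fpClassIsAbnormal` are hypothetical, and the sketch assumes a build without global fast math, since the logs in this PR show the classification becoming unreliable otherwise.

```cpp
#include <cmath>
#include <string>

// Map the category returned by std::fpclassify to a printable label.
inline std::string fpClassName( float fp )
{
  switch ( std::fpclassify( fp ) )
  {
    case FP_NAN: return "NaN";
    case FP_INFINITE: return "Inf";
    case FP_ZERO: return "zero";
    case FP_SUBNORMAL: return "subnormal";
    case FP_NORMAL: return "normal";
    default: return "unknown"; // implementation-defined extra categories
  }
}

// Treat only NaN and Inf as abnormal; zero and subnormal values are kept.
inline bool fpClassIsAbnormal( float fp )
{
  const int c = std::fpclassify( fp );
  return c == FP_NAN || c == FP_INFINITE;
}
```

Whether subnormal values should also count as abnormal is a design choice; this sketch keeps them, which is an assumption rather than something the logs confirm.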
Note that std::isnormal is false but fpclassify says 'normal'... ?! Presently this says nan=6 zero=0, but before I added the per-function nofastmath, I was getting nan=6 zero=6 or nan=0 zero=6 (and perfect averages!) depending on other lines of the code... With global fast math now: time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 DEBUG[310744] ME=-nan fpisabnormal=1 fpclass=normal (me==me)=0 (me==me+1)=0 isnan=0 isfinite=1 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=0 WARNING! ME[310744] is NaN/abnormal WARNING! ME[451171] is NaN/abnormal WARNING! ME[3007871] is NaN/abnormal WARNING! ME[3163868] is NaN/abnormal WARNING! ME[4471038] is NaN/abnormal DEBUG[5473927] ME=-nan fpisabnormal=1 fpclass=normal (me==me)=0 (me==me+1)=0 isnan=0 isfinite=1 isnormal=0 is0=0 is1=0 abs(ME)=nan isnan=0 WARNING! 
ME[5473927] is NaN/abnormal *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (NaN/abnormal=6, zero=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 7.878380e+00 ) sec TotalTime[Rambo+ME] (23) = ( 7.792592e+00 ) sec TotalTime[RndNumGen] (1) = ( 8.578802e-02 ) sec TotalTime[Rambo] (2) = ( 1.548612e+00 ) sec TotalTime[MatrixElems] (3) = ( 6.243980e+00 ) sec MeanTimeInMatrixElems = ( 5.203317e-01 ) sec [Min,Max]TimeInMatrixElems = [ 5.201201e-01 , 5.208153e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.985722e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 8.073637e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.007603e+06 ) sec^-1 *********************************************************************** NumMatrixElems(notAbnormal) = 6291450 MeanMatrixElemValue = ( 1.371779e-02 +- 3.268970e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 1.088710e-03 , 6.299551e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.199479e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000295 sec 0b MemAlloc : 0.037309 sec 0c GenCreat : 0.000520 sec 1b GenRnGen : 0.085788 sec 2a RamboIni : 0.073077 sec 2b RamboFin : 1.475536 sec 3a SigmaKin : 6.243980 sec 4a DumpLoop : 0.066375 sec 8a CompStat : 0.069531 sec 9a GenDestr : 0.000004 sec 9b DumpScrn : 0.008606 sec 9c 
DumpJson : 0.000007 sec TOTAL : 8.061027 sec TOTAL (123) : 7.878381 sec TOTAL (23) : 7.792593 sec TOTAL (1) : 0.085788 sec TOTAL (2) : 1.548612 sec TOTAL (3) : 6.243980 sec *********************************************************************** real 0m8.078s user 0m8.252s sys 0m0.224s For reference, a very old piece of code (unclear which one) with global fast math, before I added per function nofastmath: time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 DEBUG[310744] ME=-nan me==me 0 meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=0 abs(ME)=nan isnan=0 WARNING! ME[310744] is nan WARNING! ME[451171] is nan WARNING! ME[3007871] is nan WARNING! ME[3163868] is nan WARNING! ME[4471038] is nan DEBUG[5473927] ME=-nan me==me 0 meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=0 abs(ME)=nan isnan=0 WARNING! 
ME[5473927] is nan *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=6, zero=6) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 7.868979e+00 ) sec TotalTime[Rambo+ME] (23) = ( 7.784011e+00 ) sec TotalTime[RndNumGen] (1) = ( 8.496877e-02 ) sec TotalTime[Rambo] (2) = ( 1.553193e+00 ) sec TotalTime[MatrixElems] (3) = ( 6.230817e+00 ) sec MeanTimeInMatrixElems = ( 5.192348e-01 ) sec [Min,Max]TimeInMatrixElems = [ 5.190312e-01 , 5.196109e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.995263e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 8.082538e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.009732e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291450 MeanMatrixElemValue = ( 1.371779e-02 +- 3.268970e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 1.088710e-03 , 6.299551e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.199479e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000299 sec 0b MemAlloc : 0.037553 sec 0c GenCreat : 0.000527 sec 1b GenRnGen : 0.084969 sec 2a RamboIni : 0.072797 sec 2b RamboFin : 1.480396 sec 3a SigmaKin : 6.230817 sec 4a DumpLoop : 0.067709 sec 8a CompStat : 0.076509 sec 9a GenDestr : 0.000004 sec 9b DumpScrn : 0.009063 sec 9c DumpJson : 0.000008 
sec TOTAL : 8.060652 sec And another with some lines moved around: (note how weird: there are 6 more events, yet the minimum ME is higher... I guess the < and > operators were giving unreliable results) time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 DEBUG[310744] ME=-nan meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=1 is1=1 abs(ME)=nan isnan=0 DEBUG[5473927] ME=-nan meisnan=1 isnan=0 isfinite=1 isnormal=0 is0=1 is1=1 abs(ME)=nan isnan=0 *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (nan=0, zero=6) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = COMMON RANDOM (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 7.910227e+00 ) sec TotalTime[Rambo+ME] (23) = ( 7.824203e+00 ) sec TotalTime[RndNumGen] (1) = ( 8.602432e-02 ) sec TotalTime[Rambo] (2) = ( 1.573157e+00 ) sec TotalTime[MatrixElems] (3) = ( 6.251046e+00 ) sec MeanTimeInMatrixElems = ( 5.209205e-01 ) sec [Min,Max]TimeInMatrixElems = [ 5.206928e-01 , 5.211877e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.953572e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 8.041018e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.006464e+06 ) sec^-1 *********************************************************************** NumMatrixElements(notNan) = 6291456 MeanMatrixElemValue = ( 1.371779e-02 +- 3.268966e-06 ) 
GeV^0 [Min,Max]MatrixElemValue = [ 6.061680e-03 , 6.299551e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.199475e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000294 sec 0b MemAlloc : 0.037586 sec 0c GenCreat : 0.000289 sec 1b GenRnGen : 0.086024 sec 2a RamboIni : 0.083310 sec 2b RamboFin : 1.489847 sec 3a SigmaKin : 6.251046 sec 4a DumpLoop : 0.065824 sec 8a CompStat : 0.062115 sec 9a GenDestr : 0.000003 sec 9b DumpScrn : 0.008629 sec 9c DumpJson : 0.000005 sec TOTAL : 8.084972 sec TOTAL (123) : 7.910227 sec
…set". Not surprisingly, the events which are problematic in c++ differ when using curand. But they are 6 in both cases, on 6M. Some events are also problematic in cuda, but only 2 in 6M, and they are different from those of c++ (with the same curand). C++, fast math, curand: time ./check.exe -p 2048 256 12 -d DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 4 DEBUG: ${OMP_NUM_THREADS} = '[not set]' DEBUG: OMP_NUM_THREADS is not set: will use only 1 thread DEBUG: omp_get_num_threads() = 1 DEBUG: omp_get_max_threads() = 1 WARNING! ME[578162] is NaN/abnormal WARNING! ME[1725762] is NaN/abnormal WARNING! ME[2163579] is NaN/abnormal WARNING! ME[5407629] is NaN/abnormal WARNING! ME[5435532] is NaN/abnormal WARNING! ME[6014690] is NaN/abnormal *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (NaN/abnormal=6, zero=0) Complex type = STD::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Random number generation = CURAND (C++ code) OMP threads / `nproc --all` = 1 / 4 MatrixElements compiler = gcc (GCC) 9.2.0 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 8.106228e+00 ) sec TotalTime[Rambo+ME] (23) = ( 7.780294e+00 ) sec TotalTime[RndNumGen] (1) = ( 3.259343e-01 ) sec TotalTime[Rambo] (2) = ( 1.541099e+00 ) sec TotalTime[MatrixElems] (3) = ( 6.239194e+00 ) sec MeanTimeInMatrixElems = ( 5.199329e-01 ) sec [Min,Max]TimeInMatrixElems = [ 5.196525e-01 , 5.203252e-01 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.761262e+05 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 8.086399e+05 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.008376e+06 ) sec^-1 
*********************************************************************** NumMatrixElems(notAbnormal) = 6291450 MeanMatrixElemValue = ( 1.371707e-02 +- 3.270376e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 2.430001e-03 , 1.086722e-01 ] GeV^0 StdDevMatrixElemValue = ( 8.203006e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 0a ProcInit : 0.000331 sec 0b MemAlloc : 0.037504 sec 0c GenCreat : 0.000976 sec 1a GenSeed : 0.000023 sec 1b GenRnGen : 0.325912 sec 2a RamboIni : 0.072463 sec 2b RamboFin : 1.468636 sec 3a SigmaKin : 6.239194 sec 4a DumpLoop : 0.045577 sec 8a CompStat : 0.069998 sec 9a GenDestr : 0.000115 sec 9b DumpScrn : 0.009210 sec 9c DumpJson : 0.000002 sec TOTAL : 8.269940 sec TOTAL (123) : 8.106228 sec TOTAL (23) : 7.780294 sec TOTAL (1) : 0.325934 sec TOTAL (2) : 1.541100 sec TOTAL (3) : 6.239194 sec *********************************************************************** real 0m8.290s user 0m8.208s sys 0m0.079s CUDA, fast math, curand: time ./gcheck.exe -p 2048 256 12 -d WARNING! ME[596016] is NaN/abnormal WARNING! 
ME[1446938] is NaN/abnormal *********************************************************************** NumBlocksPerGrid = 2048 NumThreadsPerBlock = 256 NumIterations = 12 ----------------------------------------------------------------------- FP precision = FLOAT (NaN/abnormal=2, zero=0) Complex type = THRUST::COMPLEX RanNumb memory layout = AOSOA[8] Momenta memory layout = AOSOA[8] Wavefunction GPU memory = LOCAL Random number generation = CURAND DEVICE (CUDA code) MatrixElements compiler = nvcc 11.0.221 ----------------------------------------------------------------------- NumberOfEntries = 12 TotalTime[Rnd+Rmb+ME] (123) = ( 5.650546e-02 ) sec TotalTime[Rambo+ME] (23) = ( 4.909613e-02 ) sec TotalTime[RndNumGen] (1) = ( 7.409333e-03 ) sec TotalTime[Rambo] (2) = ( 4.459424e-02 ) sec TotalTime[MatrixElems] (3) = ( 4.501892e-03 ) sec MeanTimeInMatrixElems = ( 3.751577e-04 ) sec [Min,Max]TimeInMatrixElems = [ 3.693620e-04 , 3.823010e-04 ] sec ----------------------------------------------------------------------- TotalEventsComputed = 6291456 EvtsPerSec[Rnd+Rmb+ME](123) = ( 1.113424e+08 ) sec^-1 EvtsPerSec[Rmb+ME] (23) = ( 1.281457e+08 ) sec^-1 EvtsPerSec[MatrixElems] (3) = ( 1.397514e+09 ) sec^-1 *********************************************************************** NumMatrixElems(notAbnormal) = 6291454 MeanMatrixElemValue = ( 1.371686e-02 +- 3.270219e-06 ) GeV^0 [Min,Max]MatrixElemValue = [ 1.463952e-03 , 4.733844e-02 ] GeV^0 StdDevMatrixElemValue = ( 8.202616e-03 ) GeV^0 MeanWeight = ( 4.515827e-01 +- 0.000000e+00 ) [Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ] StdDevWeight = ( 0.000000e+00 ) *********************************************************************** 00 CudaFree : 0.916559 sec 0a ProcInit : 0.000461 sec 0b MemAlloc : 0.018835 sec 0c GenCreat : 0.010257 sec 0d SGoodHel : 0.000711 sec 1a GenSeed : 0.000022 sec 1b GenRnGen : 0.007387 sec 2a RamboIni : 0.000104 sec 2b RamboFin : 0.000048 sec 2c CpDTHwgt : 0.004187 sec 2d CpDTHmom : 0.040255 sec 3a 
SigmaKin : 0.000086 sec 3b CpDTHmes : 0.004416 sec 4a DumpLoop : 0.050881 sec 8a CompStat : 0.045280 sec 9a GenDestr : 0.000060 sec 9b DumpScrn : 0.000176 sec 9c DumpJson : 0.000002 sec TOTAL : 1.099729 sec TOTAL (123) : 0.056505 sec TOTAL (23) : 0.049096 sec TOTAL (1) : 0.007409 sec TOTAL (2) : 0.044594 sec TOTAL (3) : 0.004502 sec *********************************************************************** real 0m1.394s user 0m0.378s sys 0m0.793s
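The `NaN/abnormal` counts in these logs depend on classifying each matrix element reliably. These runs are built with fast math (`-ffast-math` appears on the g++ command line), and in that mode a compiler is allowed to assume that NaNs never occur, so checks such as `std::isnan( x )` or `x != x` may be optimised away. The following is a minimal sketch of one way around this, not necessarily what check.cc does: it inspects the IEEE-754 bit pattern using integer operations only. The helper names `isNanOrInfBits` and `countAbnormal` are hypothetical, and treating "abnormal" as "NaN or infinity" is an assumption made for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

// Hypothetical helper (not taken from check.cc): returns true if x is NaN or
// +-infinity, decided from the bit pattern so that floating-point
// optimisations cannot remove the test.
inline bool isNanOrInfBits( float x )
{
  static_assert( sizeof( float ) == sizeof( std::uint32_t ), "expects a 32-bit float" );
  static_assert( std::numeric_limits<float>::is_iec559, "expects IEEE-754 floats" );
  std::uint32_t bits;
  std::memcpy( &bits, &x, sizeof( bits ) ); // well-defined way to read the bits
  // In IEEE-754 single precision, an all-ones exponent field means NaN or infinity
  return ( bits & 0x7F800000u ) == 0x7F800000u;
}

// Hypothetical helper: counts the entries that should be excluded from the
// mean and standard deviation.
inline std::size_t countAbnormal( const std::vector<float>& matrixElements )
{
  std::size_t nAbnormal = 0;
  for( float me : matrixElements )
    if( isNanOrInfBits( me ) ) ++nAbnormal;
  return nAbnormal;
}
```

Excluding the flagged entries before accumulating the sums is what keeps a single NaN from turning the mean and standard deviation into `-nan`.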
Recover the usual 1.15E6 for C++ and 6.2E8 for CUDA

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid = 2048
NumThreadsPerBlock = 256
NumIterations = 12
-----------------------------------------------------------------------
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Wavefunction GPU memory = LOCAL
Random number generation = CURAND DEVICE (CUDA code)
MatrixElements compiler = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.054557e-01 ) sec
TotalTime[Rambo+ME] (23) = ( 9.802793e-02 ) sec
TotalTime[RndNumGen] (1) = ( 7.427770e-03 ) sec
TotalTime[Rambo] (2) = ( 8.781144e-02 ) sec
TotalTime[MatrixElems] (3) = ( 1.021650e-02 ) sec
MeanTimeInMatrixElems = ( 8.513748e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.957010e-04 , 8.747180e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.965970e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23) = ( 6.418024e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.158134e+08 ) sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291456
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.200854e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************

time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid = 2048
NumThreadsPerBlock = 256
NumIterations = 12
-----------------------------------------------------------------------
FP precision = DOUBLE (NaN/abnormal=0, zero=0 )
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.707499e+00 ) sec
TotalTime[Rambo+ME] (23) = ( 7.381717e+00 ) sec
TotalTime[RndNumGen] (1) = ( 3.257818e-01 ) sec
TotalTime[Rambo] (2) = ( 1.940907e+00 ) sec
TotalTime[MatrixElems] (3) = ( 5.440811e+00 ) sec
MeanTimeInMatrixElems = ( 4.534009e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 4.532392e-01 , 4.536635e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.162772e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23) = ( 8.523025e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.156345e+06 ) sec^-1
***********************************************************************
NumMatrixElems(notAbnormal) = 6291456
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071582e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.200854e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
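The throughput figures quoted in these logs follow directly from the command-line arguments and the measured times: the event count is blocks × threads × iterations, and each `EvtsPerSec` entry is that count divided by the corresponding `TotalTime`. A small sketch of the arithmetic, using illustrative helper names that are not part of check.cc:

```cpp
#include <cstdint>

// Illustrative helper: number of events processed for "-p blocks threads iterations"
inline std::uint64_t totalEvents( std::uint64_t blocks, std::uint64_t threads, std::uint64_t iterations )
{
  return blocks * threads * iterations;
}

// Illustrative helper: throughput in events per second for a measured time
inline double eventsPerSecond( std::uint64_t nEvents, double seconds )
{
  return static_cast<double>( nEvents ) / seconds;
}
```

For `-p 2048 256 12`, `totalEvents( 2048, 256, 12 )` gives 6291456, matching `TotalEventsComputed`. Dividing by the 5.440811 s spent in matrix elements in the double-precision C++ run gives roughly 1.156e6 events per second, consistent with the reported `EvtsPerSec[MatrixElems]`.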
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Mar 31, 2021
…piler
This also requires adding Process::getCompiler to ep2 CPPProcess.cc/h.
Now check.cc is identical in both epoch2 and epoch1 (and runTest.cc is almost identical, except for the test name).
Will now include PR madgraph5#144 for single precision in epoch1, and will copy check.cc again (and runTest.cc with some changes).
Epoch2 baseline remains epoch2 C++ 1.10e6, cuda 6.6e8

time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid = 2048
NumThreadsPerBlock = 256
NumIterations = 12
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = STD::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Random number generation = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.968041e+00 ) sec
TotalTime[Rambo+ME] (23) = ( 7.643061e+00 ) sec
TotalTime[RndNumGen] (1) = ( 3.249804e-01 ) sec
TotalTime[Rambo] (2) = ( 1.928639e+00 ) sec
TotalTime[MatrixElems] (3) = ( 5.714422e+00 ) sec
MeanTimeInMatrixElems = ( 4.762018e-01 ) sec
[Min,Max]TimeInMatrixElems = [ 4.760149e-01 , 4.765775e-01 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.895863e+05 ) sec^-1
EvtsPerSec[Rmb+ME] (23) = ( 8.231592e+05 ) sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.100979e+06 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071581e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.200854e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid = 2048
NumThreadsPerBlock = 256
NumIterations = 12
-----------------------------------------------------------------------
FP precision = DOUBLE (nan=0)
Complex type = THRUST::COMPLEX
RanNumb memory layout = AOSOA[4]
Momenta memory layout = AOSOA[4]
Wavefunction GPU memory = LOCAL
Random number generation = CURAND DEVICE (CUDA code)
MatrixElements compiler = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.044441e-01 ) sec
TotalTime[Rambo+ME] (23) = ( 9.709213e-02 ) sec
TotalTime[RndNumGen] (1) = ( 7.351930e-03 ) sec
TotalTime[Rambo] (2) = ( 8.758798e-02 ) sec
TotalTime[MatrixElems] (3) = ( 9.504147e-03 ) sec
MeanTimeInMatrixElems = ( 7.920122e-04 ) sec
[Min,Max]TimeInMatrixElems = [ 7.825940e-04 , 8.001750e-04 ] sec
-----------------------------------------------------------------------
TotalEventsComputed = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.023757e+07 ) sec^-1
EvtsPerSec[Rmb+ME] (23) = ( 6.479882e+07 ) sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.619696e+08 ) sec^-1
***********************************************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue = ( 1.372152e-02 +- 3.269516e-06 ) GeV^0
[Min,Max]MatrixElemValue = [ 6.071581e-03 , 3.374925e-02 ] GeV^0
StdDevMatrixElemValue = ( 8.200854e-03 ) GeV^0
MeanWeight = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight = [ 4.515827e-01 , 4.515827e-01 ]
StdDevWeight = ( 0.000000e+00 )
***********************************************************************
OK, I am ready to include this PR in branch ep2toep1, so I am self-merging.
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Mar 31, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Mar 31, 2021
…adgraph5#144 Note that now ep2 and ep1 runTest.cc are identical except for the test name EP2/EP1
This was referenced Mar 31, 2021
This patch is another spinoff from issue #139 about merging epoch2 and epoch1.
At some point I started looking at single precision, because in epoch2 there was one hardcoded "double" which I changed to "fptype". For completeness I tried to build in single precision, and this opened another Pandora's box, in epoch1 as well. There are two issues, both addressed (partially) by this PR: