
Merge epoch2 and epoch1 - second part (still without CPPProcess) #149

Merged (22 commits) on Apr 1, 2021

Conversation

valassi (Member) commented Mar 31, 2021

This is the PR to complete issue #139.

I keep it as WIP for now: it is about 80% done, but it still needs a few tweaks (quite important ones, as they are performance-relevant).

valassi added 21 commits March 30, 2021 16:41
… - copy it to ep2

Epoch2 before fastmath:
time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 9.806006e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 9.456839e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.491671e-01                 )  sec
TotalTime[Rambo]        (2)= ( 2.018251e+00                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.438588e+00                 )  sec
MeanTimeInMatrixElems      = ( 6.198823e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 6.183559e-01 ,  6.259246e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.415921e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 6.652811e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 8.457864e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000397 sec
0b MemAlloc :     0.000043 sec
0c GenCreat :     0.000955 sec
1a GenSeed  :     0.000031 sec
1b GenRnGen :     0.349136 sec
2a RamboIni :     0.138318 sec
2b RamboFin :     1.879934 sec
3a SigmaKin :     7.438588 sec
4a DumpLoop :     0.087978 sec
8a CompStat :     0.045155 sec
9a GenDestr :     0.000113 sec
9b DumpScrn :     0.000223 sec
9c DumpJson :     0.000001 sec
TOTAL       :     9.940873 sec
TOTAL (123) :     9.806006 sec
TOTAL  (23) :     9.456840 sec
TOTAL   (1) :     0.349167 sec
TOTAL   (2) :     2.018251 sec
TOTAL   (3) :     7.438588 sec
***********************************************************************
real    0m9.971s
user    0m9.812s
sys     0m0.157s

Epoch2 after fastmath: NOT FASTER (?!)
time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 9.747692e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 9.397507e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.501850e-01                 )  sec
TotalTime[Rambo]        (2)= ( 1.976519e+00                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.420988e+00                 )  sec
MeanTimeInMatrixElems      = ( 6.184157e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 6.178201e-01 ,  6.216142e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.454303e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 6.694814e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 8.477922e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000400 sec
0b MemAlloc :     0.000043 sec
0c GenCreat :     0.001004 sec
1a GenSeed  :     0.000032 sec
1b GenRnGen :     0.350153 sec
2a RamboIni :     0.140705 sec
2b RamboFin :     1.835814 sec
3a SigmaKin :     7.420989 sec
4a DumpLoop :     0.083478 sec
8a CompStat :     0.045091 sec
9a GenDestr :     0.000119 sec
9b DumpScrn :     0.000269 sec
9c DumpJson :     0.000001 sec
TOTAL       :     9.878097 sec
TOTAL (123) :     9.747692 sec
TOTAL  (23) :     9.397507 sec
TOTAL   (1) :     0.350185 sec
TOTAL   (2) :     1.976519 sec
TOTAL   (3) :     7.420989 sec
***********************************************************************
real    0m9.908s
user    0m9.769s
sys     0m0.138s
…osmetics and copy ep1 to ep2

What ep1 had which is now added also to ep2: OMP, fastmath, Wextra, clang patch, host info

Using fastmath also here, the speed does increase in epoch2
(note that HelAmps is compiled here via an include, so it makes sense)

Epoch2: time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 8.066252e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 7.716077e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.501755e-01                 )  sec
TotalTime[Rambo]        (2)= ( 1.981157e+00                 )  sec
TotalTime[MatrixElems]  (3)= ( 5.734920e+00                 )  sec
MeanTimeInMatrixElems      = ( 4.779100e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 4.771928e-01 ,  4.813840e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.799726e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 8.153698e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.097043e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000383 sec
0b MemAlloc :     0.000041 sec
0c GenCreat :     0.001009 sec
1a GenSeed  :     0.000049 sec
1b GenRnGen :     0.350127 sec
2a RamboIni :     0.137961 sec
2b RamboFin :     1.843195 sec
3a SigmaKin :     5.734920 sec
4a DumpLoop :     0.085327 sec
8a CompStat :     0.027027 sec
9a GenDestr :     0.000147 sec
9b DumpScrn :     0.000251 sec
9c DumpJson :     0.000001 sec
TOTAL       :     8.180439 sec
TOTAL (123) :     8.066252 sec
TOTAL  (23) :     7.716077 sec
TOTAL   (1) :     0.350176 sec
TOTAL   (2) :     1.981157 sec
TOTAL   (3) :     5.734920 sec
***********************************************************************
real    0m8.211s
user    0m8.072s
sys     0m0.137s

Note that epoch1 is always a bit faster...
Epoch1: time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.710680e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.382994e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.276863e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.939835e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.443159e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.535966e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.533969e-01 ,  4.538179e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.159405e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.521551e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.155846e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000411 sec
0b MemAlloc :     0.074275 sec
0c GenCreat :     0.000958 sec
1a GenSeed  :     0.000023 sec
1b GenRnGen :     0.327663 sec
2a RamboIni :     0.100796 sec
2b RamboFin :     1.839039 sec
3a SigmaKin :     5.443159 sec
4a DumpLoop :     0.082644 sec
8a CompStat :     0.027072 sec
9a GenDestr :     0.000104 sec
9b DumpScrn :     0.013933 sec
9c DumpJson :     0.000006 sec
TOTAL       :     7.910083 sec
TOTAL (123) :     7.710680 sec
TOTAL  (23) :     7.382994 sec
TOTAL   (1) :     0.327686 sec
TOTAL   (2) :     1.939835 sec
TOTAL   (3) :     5.443159 sec
***********************************************************************
real    0m7.939s
user    0m7.790s
sys     0m0.147s

Conversely, epoch2 is 10% faster than epoch1 in CUDA???

Epoch2: time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 1.042367e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 9.679775e-02                 )  sec
TotalTime[RndNumGen]    (1)= ( 7.438907e-03                 )  sec
TotalTime[Rambo]        (2)= ( 8.743204e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 9.365707e-03                 )  sec
MeanTimeInMatrixElems      = ( 7.804756e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.767680e-04 ,  7.837020e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.035742e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 6.499589e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.717545e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     2.037707 sec
0a ProcInit :     0.000523 sec
0b MemAlloc :     0.035856 sec
0c GenCreat :     0.009784 sec
0d SGoodHel :     0.001597 sec
1a GenSeed  :     0.000021 sec
1b GenRnGen :     0.007418 sec
2a RamboIni :     0.000088 sec
2b RamboFin :     0.000045 sec
2c CpDTHwgt :     0.007396 sec
2d CpDTHmom :     0.079903 sec
3a SigmaKin :     0.000087 sec
3b CpDTHmes :     0.009279 sec
4a DumpLoop :     0.087360 sec
8a CompStat :     0.044967 sec
9a GenDestr :     0.000068 sec
9b DumpScrn :     0.000254 sec
9c DumpJson :     0.000002 sec
TOTAL       :     2.322353 sec
TOTAL (123) :     0.104237 sec
TOTAL  (23) :     0.096798 sec
TOTAL   (1) :     0.007439 sec
TOTAL   (2) :     0.087432 sec
TOTAL   (3) :     0.009366 sec
***********************************************************************
real    0m2.630s
user    0m0.426s
sys     0m0.781s

Epoch1: time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.056586e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 9.805914e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.599440e-03                 )  sec
TotalTime[Rambo]        (2) = ( 8.761816e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.044098e-02                 )  sec
MeanTimeInMatrixElems       = ( 8.700821e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 8.588060e-04 ,  8.841980e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.954515e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.415981e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.025730e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     1.039487 sec
0a ProcInit :     0.000524 sec
0b MemAlloc :     0.035999 sec
0c GenCreat :     0.011516 sec
0d SGoodHel :     0.001738 sec
1a GenSeed  :     0.000021 sec
1b GenRnGen :     0.007579 sec
2a RamboIni :     0.000098 sec
2b RamboFin :     0.000061 sec
2c CpDTHwgt :     0.007369 sec
2d CpDTHmom :     0.080091 sec
3a SigmaKin :     0.000084 sec
3b CpDTHmes :     0.010357 sec
4a DumpLoop :     0.087430 sec
8a CompStat :     0.045176 sec
9a GenDestr :     0.000067 sec
9b DumpScrn :     0.000222 sec
9c DumpJson :     0.000002 sec
TOTAL       :     1.327819 sec
TOTAL (123) :     0.105659 sec
TOTAL  (23) :     0.098059 sec
TOTAL   (1) :     0.007599 sec
TOTAL   (2) :     0.087618 sec
TOTAL   (3) :     0.010441 sec
***********************************************************************
real    0m1.636s
user    0m0.523s
sys     0m0.867s
…smetics) to ep2

Minimal changes in epoch1:
- remove unused headers in epoch1
- remove two empty lines in the code doing the performance dump

Port to epoch2 many changes from epoch1:
- add omp.h in epoch2
- use the ep1 printout about '-d' also in epoch2
- use the ep1 printout about OMP_NUM_THREADS also in epoch2
- export OMP_NUM_THREADS=1 if not set also in epoch2
- initialize T() in hstMakeUnique also in epoch2
- comment out unused stdwtim also in epoch2
- add one space per line in the performance dump also in epoch2
- add OMP info in the performance dump also in epoch2
- add gcc compiler info in the performance dump also in epoch2
- return 0 at the end of main also in epoch2
…smetics) to ep2

Minimal changes in epoch1:
- remove unused headers in epoch1
- remove two empty lines in the code doing the performance dump

Port to epoch2 many changes from epoch1:
- add omp.h in epoch2
- use the ep1 printout about '-d' also in epoch2
- use the ep1 printout about OMP_NUM_THREADS also in epoch2
- export OMP_NUM_THREADS=1 if not set also in epoch2
- initialize T() in hstMakeUnique also in epoch2
- comment out unused stdwtim also in epoch2
- add one space per line in the performance dump also in epoch2
- add OMP info in the performance dump also in epoch2
- [commented out] add gcc compiler info in the performance dump also in epoch2
- return 0 at the end of main also in epoch2

No change in performance in epoch2: c++ 1.09E6, cuda 6.71E8
…rs as in epoch1

Indeed, check.cc was not compiling in SINGLE mode otherwise:

Makefile:44: CUDA_HOME is not set or is invalid. Export CUDA_HOME to compile with cuda
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++  -O3  -std=c++11 -I. -I../../src -I../../../../../tools  -Wall -Wshadow -Wextra -fopenmp -DMGONGPU_COMMONRAND_ONHOST -ffast-math   -c check.cc -o check.o
check.cc: In function ‘int main(int, char**)’:
check.cc:312:81: error: conversion from ‘vector<float>’ to non-scalar type ‘vector<double>’ requested
  312 |     std::vector<double> commonRnd = commonRandomPromises[iiter].get_future().get();
      |                                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
make: *** [check.o] Error 1

Note (issue madgraph5#143) that neither epoch2 nor epoch1 builds in single precision anyway...
…piler

This also requires adding Process::getCompiler to ep2 CPPProcess.cc/h.

Now check.cc is identical in both epoch2 and epoch1
(and runTest.cc is almost identical, except for the test name).
Will now include PR madgraph5#144 for single precision in epoch1,
and will copy check.cc again (and runTest.cc with some changes).

Epoch2 baseline remains C++ 1.10e6, cuda 6.6e8

time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.968041e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.643061e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.249804e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.928639e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.714422e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.762018e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.760149e-01 ,  4.765775e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.895863e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.231592e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.100979e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.044441e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 9.709213e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.351930e-03                 )  sec
TotalTime[Rambo]        (2) = ( 8.758798e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 9.504147e-03                 )  sec
MeanTimeInMatrixElems       = ( 7.920122e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 7.825940e-04 ,  8.001750e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.023757e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.479882e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.619696e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
…adgraph5#144

Note that now ep2 and ep1 runTest.cc are identical except for the test name EP2/EP1
valassi marked this pull request as draft on Mar 31, 2021
valassi added the label: enhancement (A feature we want to develop) on Mar 31, 2021
valassi changed the title to "Merge epoch2 and epoch1 - first part (without CPPProcess)" on Apr 1, 2021
valassi marked this pull request as ready for review on Apr 1, 2021 14:49
valassi changed the title to "Merge epoch2 and epoch1 - second part (still without CPPProcess)" on Apr 1, 2021
valassi (Member, Author) commented Apr 1, 2021

I have decided to split this further into two PRs. I have done everything except CPPProcess, which is the most complex part (and where I actually see minor performance differences). I will split that out into a third PR.

Recap about issue #139

More detail about this PR #149 below.


In src:

  1. Parameters_sm.h
    Remove "using namespace std;" in epoch2. Otherwise almost identical.
    Copy epoch2 to epoch1.

  2. Parameters_sm.cc
    Add explicit std:: in epoch2. Otherwise almost identical.
    Copy epoch2 to epoch1.

  3. read_slha.h
    Identical but for indentation: fix them manually and make them equal.
    (clang-format would bring too many changes)

  4. read_slha.cc
    Identical but for a default parameter value in implementation in epoch2.
    Fix by copying epoch1 to epoch2.

  5. rambo.h/cc
    Identical in epoch2 and epoch1, nothing to do

  6. mgOnGpuConfig.h
    Identical, except for a comment (did the percent sign disturb the metacode?).
    Fix by copying epoch1 to epoch2.

  7. mgOnGpuTypes.h
    Identical.

  8. Makefile
    Almost identical, but ep1 has OMP, fastmath, Wextra.
    Fix by copying epoch1 to epoch2.

  9. HelAmps.h/cc
    MISSING IN EPOCH1! Do this later...


In SubProcesses and below:

  1. timer.h
    Identical

  2. Makefile
    Almost identical, but epoch1 had much more; fixed cosmetics and copied ep1 to ep2.
    Now added to ep2, as in epoch1: OMP, fastmath, Wextra, clang patch, host info

Note: at this stage, epoch1 is slightly faster than epoch2 in c++, but the inverse is true in CUDA.

  3. Memory.h, nvtx.h, perf.py
    Identical (but a symlink is missing, to be added in epoch1)

  4. timermap.h
    Copy epoch1 to epoch2 to add missing gcc pragmas for nvtx warnings

  5. perf/data
    Only in epoch1 - one json file, keep it there

  6. profile.sh
    Only in epoch1 - should bring it forward eventually
    (anyway the basis will be epoch1)

  7. runTest.cc
    Initially identical, but the tests had different names (e.g. EP1_CUDA_GPU vs EP2_CUDA_GPU). Fixed by adding epoch_process_id.h, where a different macro is defined per epoch; runTest.cc is now identical.

  8. check.cc

First batch of changes

Minimal changes in epoch1:

  • remove unused headers in epoch1
  • remove two empty lines in the code doing the performance dump

Port to epoch2 many changes from epoch1:

  • add omp.h in epoch2
  • use the ep1 printout about '-d' also in epoch2
  • use the ep1 printout about OMP_NUM_THREADS also in epoch2
  • export OMP_NUM_THREADS=1 if not set also in epoch2
  • initialize T() in hstMakeUnique also in epoch2
  • comment out unused stdwtim also in epoch2
  • add one space per line in the performance dump also in epoch2
  • add OMP info in the performance dump also in epoch2
  • [commented out] add gcc compiler info in the performance dump also in epoch2
  • return 0 at the end of main also in epoch2

7bis) runTest.cc
8bis) check.cc

A large batch of additional changes (mainly in PR #144) came from fixing epoch2 check.cc to use fptype for random numbers as in epoch1. This triggered many additional checks about single precision, included in PR #144, which also includes a better treatment of NaNs.


This is all at the time of this PR (after some previous ones). Then the rest will be about CPPProcess.

valassi (Member, Author) commented Apr 1, 2021

Self-merging.

@valassi valassi merged commit d047ea4 into madgraph5:master Apr 1, 2021