
Merge epoch2 and epoch1 - second part (still without CPPProcess) #149

Merged (22 commits) on Apr 1, 2021

Conversation

valassi (Member) commented Mar 31, 2021

This is the PR to complete issue #139.

I keep it as WIP for now: it is about 80% done, but it still needs a few tweaks (quite important ones, as they are performance-relevant).

valassi added 21 commits March 30, 2021 16:41
… - copy it to ep2

Epoch2 before fastmath:
time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 9.806006e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 9.456839e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.491671e-01                 )  sec
TotalTime[Rambo]        (2)= ( 2.018251e+00                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.438588e+00                 )  sec
MeanTimeInMatrixElems      = ( 6.198823e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 6.183559e-01 ,  6.259246e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.415921e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 6.652811e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 8.457864e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000397 sec
0b MemAlloc :     0.000043 sec
0c GenCreat :     0.000955 sec
1a GenSeed  :     0.000031 sec
1b GenRnGen :     0.349136 sec
2a RamboIni :     0.138318 sec
2b RamboFin :     1.879934 sec
3a SigmaKin :     7.438588 sec
4a DumpLoop :     0.087978 sec
8a CompStat :     0.045155 sec
9a GenDestr :     0.000113 sec
9b DumpScrn :     0.000223 sec
9c DumpJson :     0.000001 sec
TOTAL       :     9.940873 sec
TOTAL (123) :     9.806006 sec
TOTAL  (23) :     9.456840 sec
TOTAL   (1) :     0.349167 sec
TOTAL   (2) :     2.018251 sec
TOTAL   (3) :     7.438588 sec
***********************************************************************
real    0m9.971s
user    0m9.812s
sys     0m0.157s

Epoch2 after fastmath: NOT FASTER (?!)
time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 9.747692e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 9.397507e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.501850e-01                 )  sec
TotalTime[Rambo]        (2)= ( 1.976519e+00                 )  sec
TotalTime[MatrixElems]  (3)= ( 7.420988e+00                 )  sec
MeanTimeInMatrixElems      = ( 6.184157e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 6.178201e-01 ,  6.216142e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.454303e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 6.694814e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 8.477922e+05                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000400 sec
0b MemAlloc :     0.000043 sec
0c GenCreat :     0.001004 sec
1a GenSeed  :     0.000032 sec
1b GenRnGen :     0.350153 sec
2a RamboIni :     0.140705 sec
2b RamboFin :     1.835814 sec
3a SigmaKin :     7.420989 sec
4a DumpLoop :     0.083478 sec
8a CompStat :     0.045091 sec
9a GenDestr :     0.000119 sec
9b DumpScrn :     0.000269 sec
9c DumpJson :     0.000001 sec
TOTAL       :     9.878097 sec
TOTAL (123) :     9.747692 sec
TOTAL  (23) :     9.397507 sec
TOTAL   (1) :     0.350185 sec
TOTAL   (2) :     1.976519 sec
TOTAL   (3) :     7.420989 sec
***********************************************************************
real    0m9.908s
user    0m9.769s
sys     0m0.138s
…osmetics and copy ep1 to ep2

What ep1 had which is now added also to ep2: OMP, fastmath, Wextra, clang patch, host info

Using fastmath also here, the speed does increase in epoch2
(note that HelAmps is compiled here via an include, so it makes sense)

Epoch2: time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = STD::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Random number generation   = CURAND (C++ code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 8.066252e+00                 )  sec
TotalTime[Rambo+ME]    (23)= ( 7.716077e+00                 )  sec
TotalTime[RndNumGen]    (1)= ( 3.501755e-01                 )  sec
TotalTime[Rambo]        (2)= ( 1.981157e+00                 )  sec
TotalTime[MatrixElems]  (3)= ( 5.734920e+00                 )  sec
MeanTimeInMatrixElems      = ( 4.779100e-01                 )  sec
[Min,Max]TimeInMatrixElems = [ 4.771928e-01 ,  4.813840e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 7.799726e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 8.153698e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 1.097043e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000383 sec
0b MemAlloc :     0.000041 sec
0c GenCreat :     0.001009 sec
1a GenSeed  :     0.000049 sec
1b GenRnGen :     0.350127 sec
2a RamboIni :     0.137961 sec
2b RamboFin :     1.843195 sec
3a SigmaKin :     5.734920 sec
4a DumpLoop :     0.085327 sec
8a CompStat :     0.027027 sec
9a GenDestr :     0.000147 sec
9b DumpScrn :     0.000251 sec
9c DumpJson :     0.000001 sec
TOTAL       :     8.180439 sec
TOTAL (123) :     8.066252 sec
TOTAL  (23) :     7.716077 sec
TOTAL   (1) :     0.350176 sec
TOTAL   (2) :     1.981157 sec
TOTAL   (3) :     5.734920 sec
***********************************************************************
real    0m8.211s
user    0m8.072s
sys     0m0.137s

Note that epoch1 is always a bit faster...
Epoch1: time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.710680e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.382994e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.276863e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.939835e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.443159e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.535966e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.533969e-01 ,  4.538179e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 8.159405e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.521551e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.155846e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
0a ProcInit :     0.000411 sec
0b MemAlloc :     0.074275 sec
0c GenCreat :     0.000958 sec
1a GenSeed  :     0.000023 sec
1b GenRnGen :     0.327663 sec
2a RamboIni :     0.100796 sec
2b RamboFin :     1.839039 sec
3a SigmaKin :     5.443159 sec
4a DumpLoop :     0.082644 sec
8a CompStat :     0.027072 sec
9a GenDestr :     0.000104 sec
9b DumpScrn :     0.013933 sec
9c DumpJson :     0.000006 sec
TOTAL       :     7.910083 sec
TOTAL (123) :     7.710680 sec
TOTAL  (23) :     7.382994 sec
TOTAL   (1) :     0.327686 sec
TOTAL   (2) :     1.939835 sec
TOTAL   (3) :     5.443159 sec
***********************************************************************
real    0m7.939s
user    0m7.790s
sys     0m0.147s

Conversely, epoch2 is 10% faster than epoch1 in CUDA???

Epoch2: time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid           = 2048
NumThreadsPerBlock         = 256
NumIterations              = 12
-----------------------------------------------------------------------
FP precision               = DOUBLE (nan=0)
Complex type               = THRUST::COMPLEX
RanNumb memory layout      = AOSOA[4]
Momenta memory layout      = AOSOA[4]
Wavefunction GPU memory    = LOCAL
Random number generation   = CURAND DEVICE (CUDA code)
-----------------------------------------------------------------------
NumberOfEntries            = 12
TotalTime[Rnd+Rmb+ME] (123)= ( 1.042367e-01                 )  sec
TotalTime[Rambo+ME]    (23)= ( 9.679775e-02                 )  sec
TotalTime[RndNumGen]    (1)= ( 7.438907e-03                 )  sec
TotalTime[Rambo]        (2)= ( 8.743204e-02                 )  sec
TotalTime[MatrixElems]  (3)= ( 9.365707e-03                 )  sec
MeanTimeInMatrixElems      = ( 7.804756e-04                 )  sec
[Min,Max]TimeInMatrixElems = [ 7.767680e-04 ,  7.837020e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed        = 6291456
EvtsPerSec[Rnd+Rmb+ME](123)= ( 6.035742e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23)= ( 6.499589e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3)= ( 6.717545e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)  = 6291456
MeanMatrixElemValue        = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue   = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue      = ( 8.200854e-03                 )  GeV^0
MeanWeight                 = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight            = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight               = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     2.037707 sec
0a ProcInit :     0.000523 sec
0b MemAlloc :     0.035856 sec
0c GenCreat :     0.009784 sec
0d SGoodHel :     0.001597 sec
1a GenSeed  :     0.000021 sec
1b GenRnGen :     0.007418 sec
2a RamboIni :     0.000088 sec
2b RamboFin :     0.000045 sec
2c CpDTHwgt :     0.007396 sec
2d CpDTHmom :     0.079903 sec
3a SigmaKin :     0.000087 sec
3b CpDTHmes :     0.009279 sec
4a DumpLoop :     0.087360 sec
8a CompStat :     0.044967 sec
9a GenDestr :     0.000068 sec
9b DumpScrn :     0.000254 sec
9c DumpJson :     0.000002 sec
TOTAL       :     2.322353 sec
TOTAL (123) :     0.104237 sec
TOTAL  (23) :     0.096798 sec
TOTAL   (1) :     0.007439 sec
TOTAL   (2) :     0.087432 sec
TOTAL   (3) :     0.009366 sec
***********************************************************************
real    0m2.630s
user    0m0.426s
sys     0m0.781s

Epoch1: time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.056586e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 9.805914e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.599440e-03                 )  sec
TotalTime[Rambo]        (2) = ( 8.761816e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 1.044098e-02                 )  sec
MeanTimeInMatrixElems       = ( 8.700821e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 8.588060e-04 ,  8.841980e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 5.954515e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.415981e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.025730e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071582e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
00 CudaFree :     1.039487 sec
0a ProcInit :     0.000524 sec
0b MemAlloc :     0.035999 sec
0c GenCreat :     0.011516 sec
0d SGoodHel :     0.001738 sec
1a GenSeed  :     0.000021 sec
1b GenRnGen :     0.007579 sec
2a RamboIni :     0.000098 sec
2b RamboFin :     0.000061 sec
2c CpDTHwgt :     0.007369 sec
2d CpDTHmom :     0.080091 sec
3a SigmaKin :     0.000084 sec
3b CpDTHmes :     0.010357 sec
4a DumpLoop :     0.087430 sec
8a CompStat :     0.045176 sec
9a GenDestr :     0.000067 sec
9b DumpScrn :     0.000222 sec
9c DumpJson :     0.000002 sec
TOTAL       :     1.327819 sec
TOTAL (123) :     0.105659 sec
TOTAL  (23) :     0.098059 sec
TOTAL   (1) :     0.007599 sec
TOTAL   (2) :     0.087618 sec
TOTAL   (3) :     0.010441 sec
***********************************************************************
real    0m1.636s
user    0m0.523s
sys     0m0.867s
…smetics) to ep2

Minimal changes in epoch1:
- remove unused headers in epoch1
- remove two empty lines in the code doing the performance dump

Port to epoch2 many changes from epoch1:
- add omp.h in epoch2
- use the ep1 printout about '-d' also in epoch2
- use the ep1 printout about OMP_NUM_THREADS also in epoch2
- export OMP_NUM_THREADS=1 if not set also in epoch2
- initialize T() in hstMakeUnique also in epoch2
- comment out unused stdwtim also in epoch2
- add one space per line in the performance dump also in epoch2
- add OMP info in the performance dump also in epoch2
- add gcc compiler info in the performance dump also in epoch2
- return 0 at the end of main also in epoch2
…smetics) to ep2

Minimal changes in epoch1:
- remove unused headers in epoch1
- remove two empty lines in the code doing the performance dump

Port to epoch2 many changes from epoch1:
- add omp.h in epoch2
- use the ep1 printout about '-d' also in epoch2
- use the ep1 printout about OMP_NUM_THREADS also in epoch2
- export OMP_NUM_THREADS=1 if not set also in epoch2
- initialize T() in hstMakeUnique also in epoch2
- comment out unused stdwtim also in epoch2
- add one space per line in the performance dump also in epoch2
- add OMP info in the performance dump also in epoch2
- [commented out] add gcc compiler info in the performance dump also in epoch2
- return 0 at the end of main also in epoch2

No change in performance in epoch2: c++ 1.09E6, cuda 6.71E8
…rs as in epoch1

Indeed, check.cc was not compiling in SINGLE mode otherwise:

Makefile:44: CUDA_HOME is not set or is invalid. Export CUDA_HOME to compile with cuda
/cvmfs/sft.cern.ch/lcg/releases/gcc/9.2.0-afc57/x86_64-centos7/bin/g++  -O3  -std=c++11 -I. -I../../src -I../../../../../tools  -Wall -Wshadow -Wextra -fopenmp -DMGONGPU_COMMONRAND_ONHOST -ffast-math   -c check.cc -o check.o
check.cc: In function ‘int main(int, char**)’:
check.cc:312:81: error: conversion from ‘vector<float>’ to non-scalar type ‘vector<double>’ requested
  312 |     std::vector<double> commonRnd = commonRandomPromises[iiter].get_future().get();
      |                                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~
make: *** [check.o] Error 1

Note (issue madgraph5#143) that neither epoch2 nor epoch1 builds in single precision anyway...
…piler

This also requires adding Process::getCompiler to ep2 CPPProcess.cc/h.

Now check.cc is identical in both epoch2 and epoch1
(and runTest.cc is almost identical, except for the test name).
Will now include PR madgraph5#144 for single precision in epoch1,
and will copy check.cc again (and runTest.cc with some changes).

Epoch2 baseline remains C++ 1.10e6, cuda 6.6e8

time ./check.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = STD::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Random number generation    = CURAND (C++ code)
OMP threads / `nproc --all` = 1 / 4
MatrixElements compiler     = gcc (GCC) 9.2.0
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 7.968041e+00                 )  sec
TotalTime[Rambo+ME]    (23) = ( 7.643061e+00                 )  sec
TotalTime[RndNumGen]    (1) = ( 3.249804e-01                 )  sec
TotalTime[Rambo]        (2) = ( 1.928639e+00                 )  sec
TotalTime[MatrixElems]  (3) = ( 5.714422e+00                 )  sec
MeanTimeInMatrixElems       = ( 4.762018e-01                 )  sec
[Min,Max]TimeInMatrixElems  = [ 4.760149e-01 ,  4.765775e-01 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 7.895863e+05                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 8.231592e+05                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 1.100979e+06                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************

time ./gcheck.exe -p 2048 256 12
***********************************************************************
NumBlocksPerGrid            = 2048
NumThreadsPerBlock          = 256
NumIterations               = 12
-----------------------------------------------------------------------
FP precision                = DOUBLE (nan=0)
Complex type                = THRUST::COMPLEX
RanNumb memory layout       = AOSOA[4]
Momenta memory layout       = AOSOA[4]
Wavefunction GPU memory     = LOCAL
Random number generation    = CURAND DEVICE (CUDA code)
MatrixElements compiler     = nvcc 11.0.221
-----------------------------------------------------------------------
NumberOfEntries             = 12
TotalTime[Rnd+Rmb+ME] (123) = ( 1.044441e-01                 )  sec
TotalTime[Rambo+ME]    (23) = ( 9.709213e-02                 )  sec
TotalTime[RndNumGen]    (1) = ( 7.351930e-03                 )  sec
TotalTime[Rambo]        (2) = ( 8.758798e-02                 )  sec
TotalTime[MatrixElems]  (3) = ( 9.504147e-03                 )  sec
MeanTimeInMatrixElems       = ( 7.920122e-04                 )  sec
[Min,Max]TimeInMatrixElems  = [ 7.825940e-04 ,  8.001750e-04 ]  sec
-----------------------------------------------------------------------
TotalEventsComputed         = 6291456
EvtsPerSec[Rnd+Rmb+ME](123) = ( 6.023757e+07                 )  sec^-1
EvtsPerSec[Rmb+ME]     (23) = ( 6.479882e+07                 )  sec^-1
EvtsPerSec[MatrixElems] (3) = ( 6.619696e+08                 )  sec^-1
***********************************************************************
NumMatrixElements(notNan)   = 6291456
MeanMatrixElemValue         = ( 1.372152e-02 +- 3.269516e-06 )  GeV^0
[Min,Max]MatrixElemValue    = [ 6.071581e-03 ,  3.374925e-02 ]  GeV^0
StdDevMatrixElemValue       = ( 8.200854e-03                 )  GeV^0
MeanWeight                  = ( 4.515827e-01 +- 0.000000e+00 )
[Min,Max]Weight             = [ 4.515827e-01 ,  4.515827e-01 ]
StdDevWeight                = ( 0.000000e+00                 )
***********************************************************************
…adgraph5#144

Note that now ep2 and ep1 runTest.cc are identical except for the test name EP2/EP1
valassi marked this pull request as draft on Mar 31, 2021
valassi added the label: enhancement (A feature we want to develop) on Mar 31, 2021
valassi changed the title to "Merge epoch2 and epoch1 - first part (without CPPProcess)" on Apr 1, 2021
valassi marked this pull request as ready for review on Apr 1, 2021 14:49
valassi changed the title to "Merge epoch2 and epoch1 - second part (still without CPPProcess)" on Apr 1, 2021
valassi (Member, Author) commented Apr 1, 2021

I have decided to split this further into two PRs. I have done everything except CPPProcess, which is the most complex part (and where I actually see minor performance differences). I will split that out into a third PR.

Recap about issue #139

More detail about this PR #149 below.


In src:

  1. Parameters_sm.h
    Remove "using namespace std;" in epoch2. Otherwise almost identical.
    Copy epoch2 to epoch1.

  2. Parameters_sm.cc
    Add explicit std:: in epoch2. Otherwise almost identical.
    Copy epoch2 to epoch1.

  3. read_slha.h
    Identical but for indentation: fix them manually and make them equal.
    (clang-format would bring too many changes)

  4. read_slha.cc
    Identical but for a default parameter value in implementation in epoch2.
    Fix by copying epoch1 to epoch2.

  5. rambo.h/cc
    Identical in epoch2 and epoch1, nothing to do

  6. mgOnGpuConfig.h
    Identical, except for a comment (did the percent sign disturb the metacode?).
    Fix by copying epoch1 to epoch2.

  7. mgOnGpuTypes.h
    Identical.

  8. Makefile
    Almost identical, but ep1 has OMP, fastmath, Wextra.
    Fix by copying epoch1 to epoch2.

  9. HelAmps.h/cc
    MISSING IN EPOCH1! Do this later...


In SubProcesses and below:

  1. timer.h
    Identical

  2. Makefile
    Almost identical, but epoch1 had much more; fixed cosmetics and copied ep1 to ep2.
    Now added to ep2, as in epoch1: OMP, fastmath, Wextra, clang patch, host info

Note: at this stage, epoch1 is slightly faster than epoch2 in c++, but the inverse is true in CUDA.

  3. Memory.h, nvtx.h, perf.py
    Identical (but a symlink is missing, to be added in epoch1)

  4. timermap.h
    Copy epoch1 to epoch2 to add missing gcc pragmas for nvtx warnings

  5. perf/data
    Only in epoch1 - one json file, keep it there

  6. profile.sh
    Only in epoch1 - should bring it forward eventually
    (anyway the basis will be epoch1)

  7. runTest.cc
    Initially identical, but the tests had different names (e.g. EP1_CUDA_GPU vs EP2_CUDA_GPU). Fixed by adding epoch_process_id.h, where a different macro is defined per epoch; runTest.cc is now identical.

  8. check.cc

First batch of changes

Minimal changes in epoch1:

  • remove unused headers in epoch1
  • remove two empty lines in the code doing the performance dump

Port to epoch2 many changes from epoch1:

  • add omp.h in epoch2
  • use the ep1 printout about '-d' also in epoch2
  • use the ep1 printout about OMP_NUM_THREADS also in epoch2
  • export OMP_NUM_THREADS=1 if not set also in epoch2
  • initialize T() in hstMakeUnique also in epoch2
  • comment out unused stdwtim also in epoch2
  • add one space per line in the performance dump also in epoch2
  • add OMP info in the performance dump also in epoch2
  • [commented out] add gcc compiler info in the performance dump also in epoch2
  • return 0 at the end of main also in epoch2

7bis) runTest.cc
8bis) check.cc

A large batch of additional changes (mainly in PR #144) came from fixing epoch2 check.cc to use fptype for random numbers as in epoch1. This triggered many additional checks about single precision, included in PR #144, which also includes a better treatment of NaNs.


This is all at the time of this PR (after some previous ones). Then the rest will be about CPPProcess.

valassi (Member, Author) commented Apr 1, 2021

Self-merging.

@valassi valassi merged commit d047ea4 into madgraph5:master Apr 1, 2021