
Single precision average ME is not the same for CUDA and C++ in single-precision (ggttgg and eemumu) #212

Open
valassi opened this issue Jun 14, 2021 · 3 comments


valassi commented Jun 14, 2021

As discussed in PR #211, in single precision the average ME is not the same for CUDA and C++ for ggttgg.

See for instance valassi@a75ee3b#diff-45e40fdc2f6b7c71419c9f5e7e36267d7951e21c32488d6ecf35de3ec28ced57

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.610975e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     5.920932 sec
    15,536,792,487      cycles                    #    2.654 GHz
    28,689,538,755      instructions              #    1.85  insn per cycle
       6.207201648 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786471e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.183867 sec
    24,604,689,155      cycles                    #    2.677 GHz
    73,872,471,813      instructions              #    3.00  insn per cycle
       9.193035302 seconds time elapsed

In double precision, the results are similar to those above but not identical, and CUDA and C++ agree with each other to more digits:
valassi@33e7c04#diff-45e40fdc2f6b7c71419c9f5e7e36267d7951e21c32488d6ecf35de3ec28ced57

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.438062e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     5.929722 sec
    14,377,877,684      cycles                    #    2.653 GHz
    24,406,140,862      instructions              #    1.70  insn per cycle
       6.229614368 seconds time elapsed

FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825369e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.991304 sec
    24,089,227,557      cycles                    #    2.677 GHz
    73,968,938,757      instructions              #    3.07  insn per cycle
       8.999893583 seconds time elapsed

Note that for eemumu, in single precision the same average ME is printed out (if I remember correctly?)

NO, I remembered wrong. For eemumu, on MANY more events, I get a different number of NaNs!
And as a consequence, also a different average ME:
7173757#diff-6716e7ab4317b4e76c92074d38021be37ad0eda68f248fb16f11e679f26114a6

On lxplus770.cern.ch (T4):
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.3.58]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.304735e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     1.016515 sec
real    0m1.137s
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 8.3.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.190025e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.257611 sec
real    0m7.274s
=Symbols in CPPProcess.o= (~sse4:  540) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 8.3.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 9.141683e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.633856 sec
real    0m2.651s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2941) (512y:   89) (512z:    0)
-------------------------------------------------------------------------

So there is clearly some numerical precision issue to investigate also for eemumu.


valassi commented Feb 9, 2022

(This is related to #5 by the way).

A quick update on this after a few months. The issue is still there: in single precision there are a few NaNs, for example:

FP precision = FLOAT (NaN/abnormal=6, zero=0)

I had even done some minimal debugging at some point (mainly to understand how to detect "NaN" at all when fast math is enabled!). See

// - check.exe/commonrand: ME[310744,451171,3007871,3163868,4471038,5473927] with fast math
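
For the record, one fast-math-safe way to flag abnormal values is to test the raw bit pattern. This is only a minimal sketch, not the code actually used in the repo (the name `fpIsAbnormal` is made up here): with `-ffinite-math-only`, which `-ffast-math` implies, the compiler may fold `std::isnan` to a constant `false`, but an integer test on the IEEE-754 bits cannot be optimised away.

```cpp
#include <cstdint>
#include <cstring>

// Flag NaN or Inf even when fast math is enabled: std::isnan/std::isinf
// may be compiled away under -ffinite-math-only, but a test on the raw
// IEEE-754 bit pattern survives any floating point optimisation.
inline bool fpIsAbnormal( const float f )
{
  uint32_t u;
  std::memcpy( &u, &f, sizeof( u ) );    // type-pun safely via memcpy
  return ( ( u >> 23 ) & 0xff ) == 0xff; // exponent all ones: Inf or NaN
}
```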

There is some interesting work to be done here, which however is largely debugging. For instance:

  • clearly identify some events in some process (eg the 6 events in eemumu above) for which results are abnormal
  • debug in detail why these events are abnormal: are there formulas that get close to 0, with a sqrt whose argument goes negative? or a division by something close to 0? or something else?
  • (bonus, further puzzling question: why is the number of 'nans' different for different vectorization levels, if it is the exact same data and the exact same formulas??)
  • (cross check, is the current implementation of 'is abnormal' reasonable? are there other cases we are not detecting?!)
  • understand if it is possible to prototype alternative implementations of the same formulas that are less sensitive to numerical instabilities (example: the ME averaging done in this class uses some tricks to get more stable results for averages and standard deviations; a minimal illustration is sketched after this list):
    void CrossSectionKernelHost::updateEventStatistics( const bool debug )
  • (further idea, maybe try to use always the same random numbers in float and double, eg just generate them in double and convert them to float if needed, and see if the final ME average is different)
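
As an illustration of the kind of 'tricks' mentioned above for averages and standard deviations (a sketch only, not the actual `CrossSectionKernelHost::updateEventStatistics` implementation): Welford's online algorithm keeps a running mean and variance without accumulating large sums that lose precision.

```cpp
#include <cmath>

// Welford's online algorithm: numerically stable running mean and variance,
// avoiding the catastrophic cancellation of the naive sum / sum-of-squares
// approach. Illustration only; names and interface are made up here.
struct RunningStats
{
  long n = 0;
  double mean = 0., m2 = 0.; // m2 accumulates squared deviations from the mean
  void add( const double x )
  {
    n++;
    const double delta = x - mean;
    mean += delta / n;          // incremental mean: no large running total
    m2 += delta * ( x - mean ); // stable update of the squared deviations
  }
  double variance() const { return m2 / ( n - 1 ); }                // sample variance
  double meanError() const { return std::sqrt( variance() / n ); } // the '+-' on the mean
};
```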

This is not an academic exercise. The final goal of this study is to try and understand whether the matrix element calculations can be moved from double to single precision. This would mean a factor 2 speedup both in vectorized C++ (twice as many elements in SIMD vectors) and in CUDA (typically, twice as many FLOPs on Nvidia data center cards).


valassi commented Feb 9, 2022

(This is also related to #117, where fast math first appeared.)

valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 9, 2022
…sults are the same in double, but nans differ in float
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 9, 2022
…ower, results are the same in double, but nans differ in float"

This reverts commit 45b7b33.

valassi commented Feb 9, 2022

I have just made a small test in a PR that I am about to merge
45b7b33

I have disabled fast math in eemumu and run both double and float. Results:

  • Throughputs are around 30% slower if fast math is disabled (it would be interesting to see what happens also in ggtt etc.).
  • There are NaNs in float also without fast math: in other words, it is not the fast math switch that causes the problem. The fast math switch is only something that potentially makes it more difficult to use the results.
  • In any case, the goal of rewriting the formulas should be to have 0 NaNs in the calculation. Having even a single NaN may give unreliable physics results...
  • Further thing to understand: why is the number of NaNs observed different with and without fast math? (One plausible mechanism is sketched below.)
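
On the last point, a hedged guess at one plausible mechanism (a sketch, not a confirmed diagnosis for these events): fast math allows value-changing optimisations such as reassociation and FMA contraction, so intermediates round differently between builds, and a quantity sitting right at the edge of a function's domain can cross it in one build only. The function below is hypothetical, not taken from the repo:

```cpp
#include <cmath>

// Hypothetical formula whose NaN count can depend on the build flags: with
// fast math, b*b - 4*a*c may be reassociated or contracted into a fused
// multiply-add, changing the last-bit rounding of the intermediate. A
// discriminant that rounds to exactly 0.f in one build can round to a tiny
// negative value in another, so std::sqrt returns NaN in one build only.
float rootOfDiscriminant( const float a, const float b, const float c )
{
  const float disc = b * b - 4.f * a * c; // rounding depends on contraction/reassociation
  return std::sqrt( disc );               // NaN iff disc rounds below zero
}
```

The same mechanism (different operation order, different rounding) could also be why the scalar and SIMD builds report different NaN counts on identical inputs.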

valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
…y done for C++) - now 'make FPTYPE=f check' succeeds! - see madgraph5#5, madgraph5#212
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 23, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 24, 2022
…sults are the same in double, but nans differ in float
valassi added a commit to valassi/madgraph4gpu that referenced this issue Feb 24, 2022
…ower, results are the same in double, but nans differ in float"

This reverts commit 45b7b33.
valassi added a commit to mg5amcnlo/mg5amcnlo_cudacpp that referenced this issue Aug 16, 2023