
[ggttgg] fix float support in ggttgg (+ fix sqrt/sqrtf/std::sqrt in cuda/c++ also in eemumu) #211

Merged · 12 commits · Jun 15, 2021

Conversation


valassi (Member) commented Jun 14, 2021

Hi @roiser @oliviermattelaer @hageboeck @cvuosalo this is a comprehensive PR to fix float support in ggttgg, using the same code and techniques as in eemumu (which may be debatable, but at least are consistent).

Do you have any comments? Thanks
Andrea

PS Strangely, I do not get the exact same results in CUDA and C++, unlike in eemumu. Maybe this is a real numerical instability? To be checked, but the basics look ok.

valassi added 8 commits June 14, 2021 18:30
perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.435680e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.623691 sec
     2,196,798,887      cycles                    #    2.646 GHz
     2,967,729,989      instructions              #    1.35  insn per cycle
       0.910385717 seconds time elapsed
perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.431257e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     5.664586 sec
    14,242,402,487      cycles                    #    2.645 GHz
    24,438,431,992      instructions              #    1.72  insn per cycle
       5.955651007 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.827445e+03                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     8.981804 sec
    24,058,806,303      cycles                    #    2.677 GHz
    73,925,907,974      instructions              #    3.07  insn per cycle
       8.991724619 seconds time elapsed
All is ok for cuda, but C++ has some issues...

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.468617e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.605243 sec
     2,198,121,835      cycles                    #    2.654 GHz
     2,959,832,649      instructions              #    1.35  insn per cycle
       0.890934664 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.831296e+03                 )  sec^-1
MeanMatrixElemValue        = ( -nan +- -nan )  GeV^-4
TOTAL       :     8.962130 sec
    24,011,718,485      cycles                    #    2.678 GHz
    73,861,287,531      instructions              #    3.08  insn per cycle
       8.970276172 seconds time elapsed
… SIMD

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.443390e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     6.297783 sec
    14,185,425,916      cycles                    #    2.652 GHz
    24,428,982,656      instructions              #    1.72  insn per cycle
       6.595729859 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825301e+03                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     8.992239 sec
    24,087,433,131      cycles                    #    2.677 GHz
    73,957,118,424      instructions              #    3.07  insn per cycle
       9.001373453 seconds time elapsed
…as in eemumu

All is ok, except that C++ and CUDA now give different results

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.610975e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     5.920932 sec
    15,536,792,487      cycles                    #    2.654 GHz
    28,689,538,755      instructions              #    1.85  insn per cycle
       6.207201648 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786471e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.183867 sec
    24,604,689,155      cycles                    #    2.677 GHz
    73,872,471,813      instructions              #    3.00  insn per cycle
       9.193035302 seconds time elapsed

This is not due to a neppR mismatch (which explains float vs double differences instead).
It seems to be due to intrinsic numerical instabilities?

./check.exe -v 1 8 1 | tail -20
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  1.005777e+02 -2.778140e+01 -8.595747e+01  4.421965e+01
   4  6.388790e+02  6.255574e+02  5.316766e+01 -1.183958e+02
   5  6.289353e+02 -6.113242e+02 -9.114253e+01  1.163415e+02
   6  1.316082e+02  1.354830e+01  1.239324e+02 -4.216534e+01
--------------------------------------------------------------------------------
 Matrix element = 5.42301e-07 GeV^-4
--------------------------------------------------------------------------------
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  5.907211e+02  3.267731e+01  2.475092e+02 -5.353716e+02
   4  4.160566e+02 -3.375994e+02 -7.378165e+01  2.317025e+02
   5  2.329562e+02  1.924018e+02 -1.046342e+02  7.938418e+01
   6  2.602661e+02  1.125203e+02 -6.909338e+01  2.242849e+02
--------------------------------------------------------------------------------
 Matrix element = 8.71459e-07 GeV^-4
--------------------------------------------------------------------------------

./gcheck.exe -v 1 8 1 | tail -20
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  1.005777e+02 -2.778136e+01 -8.595747e+01  4.421965e+01
   4  6.388790e+02  6.255573e+02  5.316774e+01 -1.183958e+02
   5  6.289354e+02 -6.113242e+02 -9.114262e+01  1.163414e+02
   6  1.316082e+02  1.354829e+01  1.239324e+02 -4.216533e+01
--------------------------------------------------------------------------------
 Matrix element = 5.42298e-07 GeV^-4
--------------------------------------------------------------------------------
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  5.907211e+02  3.267728e+01  2.475092e+02 -5.353716e+02
   4  4.160566e+02 -3.375993e+02 -7.378172e+01  2.317025e+02
   5  2.329562e+02  1.924017e+02 -1.046341e+02  7.938418e+01
   6  2.602661e+02  1.125203e+02 -6.909334e+01  2.242850e+02
--------------------------------------------------------------------------------
 Matrix element = 8.7146e-07 GeV^-4
--------------------------------------------------------------------------------

To be investigated... anyway, the port itself can be considered complete
 perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.410098e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.605451 sec
     2,201,358,677      cycles                    #    2.653 GHz
     2,950,126,260      instructions              #    1.34  insn per cycle
       0.891091212 seconds time elapsed

FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.826984e+03                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     8.983166 sec
    24,063,706,157      cycles                    #    2.677 GHz
    73,968,944,348      instructions              #    3.07  insn per cycle
       8.991664842 seconds time elapsed
This ensures the same physics results for float and double

Note that in double precision I get the same physics in CUDA and C++
(this is not exactly so in single precision...)

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.438062e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     5.929722 sec
    14,377,877,684      cycles                    #    2.653 GHz
    24,406,140,862      instructions              #    1.70  insn per cycle
       6.229614368 seconds time elapsed

FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825369e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.991304 sec
    24,089,227,557      cycles                    #    2.677 GHz
    73,968,938,757      instructions              #    3.07  insn per cycle
       8.999893583 seconds time elapsed
hageboeck (Member) left a comment

It's too much to read in detail, but I have an idea for testing it:
We could add a build to the CI that compiles with -DMGONGPU_FPTYPE_FLOAT. This should prevent future compile problems. I suspect that tests would fail, but at least we could add compilation without testing in float mode as a first step.

@@ -822,24 +1614,24 @@ __device__ void FFV1_1(const cxtype F2[], const cxtype V3[], const cxtype COUP,
P1[3]) - M1 * (M1 - cI * W1));
F1[2] = denom * cI * (F2[2] * (P1[0] * (-V3[2] + V3[5]) + (P1[1] * (V3[3] -
cI * (V3[4])) + (P1[2] * (+cI * (V3[3]) + V3[4]) + P1[3] * (-V3[2] +
V3[5])))) + (F2[3] * (P1[0] * (V3[3] + cI * (V3[4])) + (P1[1] * (-1.) *
(V3[2] + V3[5]) + (P1[2] * (-1.) * (+cI * (V3[2] + V3[5])) + P1[3] *
V3[5])))) + (F2[3] * (P1[0] * (V3[3] + cI * (V3[4])) + (P1[1] * (-one) *
hageboeck (Member) commented on this diff:
This could also be done as ((fptype)-1.) if desired.

valassi (Member Author) replied:
> This could also be done as ((fptype)-1.) if desired.

Yes, that is also an option I had considered. I think that in the end I had so many 1., 2. and 1/2 literals all over the place that this seemed the more readable form.
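
For illustration, a minimal sketch of why either spelling works (this is not the generated HelAmps code; the function and names below are invented for the example):

```cpp
// Illustration only: a stand-in for the generated helicity-amplitude code.
// Here fptype is a template parameter; in the real code it is a global typedef.
template<typename fptype>
fptype negate( const fptype x )
{
  const fptype one = 1.;  // named constant, the spelling used in this PR
  return x * ( -one );    // equivalent to x * ( (fptype)-1. ): no implicit double arithmetic when fptype is float
}
```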


valassi commented Jun 14, 2021

> It's too much to read in detail, but I have an idea for testing it:
> We could add a build to the CI that compiles with -DMGONGPU_FPTYPE_FLOAT. This should prevent future compile problems. I suspect that tests would fail, but at least we could add compilation without testing in float mode as a first step.

Adding CI tests for floats would certainly be useful at some point. I would start with eemumu however, because there were many issues in the googletest infrastructure for floats, which I fixed in the meantime (for manual tests but not the CI).

The only issue is that we should clean up those #defines and just allow a definition from outside. But then some guards must be added all over the place to make sure that exactly one of the options is defined, and not more. It has been on my mental to-do list for some time.
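
For what it's worth, a rough sketch of what such a guard could look like (MGONGPU_FPTYPE_FLOAT is the flag mentioned above; the MGONGPU_FPTYPE_DOUBLE companion and the header layout are only assumptions for this illustration, not the actual configuration header):

```cpp
// Sketch only: select the floating-point type from a single compile-time flag,
// with guards ensuring that exactly one of the two options is active.
#if defined MGONGPU_FPTYPE_DOUBLE && defined MGONGPU_FPTYPE_FLOAT
#error Only one of MGONGPU_FPTYPE_DOUBLE and MGONGPU_FPTYPE_FLOAT may be defined
#endif
#if !defined MGONGPU_FPTYPE_DOUBLE && !defined MGONGPU_FPTYPE_FLOAT
#define MGONGPU_FPTYPE_DOUBLE 1 // default when no -D flag is passed from outside
#endif
#ifdef MGONGPU_FPTYPE_DOUBLE
typedef double fptype; // double precision throughout
#else
typedef float fptype;  // single precision throughout
#endif
```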

Otherwise, about the tests of ggttgg, one would need to include the changes from eemumu, similarly to what I have done there. I think that most of them are in
#168
plus later additions in upstream/master.
I actually have the impression that the ggttgg tests are disabled (runTest is not built), for issues similar to those I had in eemumu. Probably it should now be easier to port that, using the changes from eemumu.

@valassi
Copy link
Member Author

valassi commented Jun 14, 2021

Ah, actually I see that #148 was exactly about these two issues for eemumu.

valassi added 4 commits June 14, 2021 22:07
…shageboe

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.157971e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364536e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.309761 sec
     3,360,485,158      cycles                    #    2.629 GHz
     4,797,157,032      instructions              #    1.43  insn per cycle
       1.611265391 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.303193e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.177774 sec
    19,206,775,927      cycles                    #    2.674 GHz
    48,584,293,198      instructions              #    2.53  insn per cycle
       7.188780540 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.905824e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.586663 sec
     9,075,146,996      cycles                    #    2.528 GHz
    16,500,290,915      instructions              #    1.82  insn per cycle
       3.597145062 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.549522e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.277927e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     1.017317 sec
     2,897,857,903      cycles                    #    2.650 GHz
     4,226,294,279      instructions              #    1.46  insn per cycle
       1.304468210 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.210885e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.116317 sec
    19,058,179,119      cycles                    #    2.676 GHz
    47,729,322,065      instructions              #    2.50  insn per cycle
       7.127149489 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  578) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.869307e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.640200 sec
     6,751,569,345      cycles                    #    2.550 GHz
    12,567,251,666      instructions              #    1.86  insn per cycle
       2.651457868 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2917) (512y:   81) (512z:    0)
=========================================================================
…td::sqrt

Peculiarly, however, in CUDA the float version is only 1.5x faster (and uses many more CPU cycles?)

=========================================================================
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.404051e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     0.610642 sec
     2,230,873,353      cycles                    #    2.651 GHz
     2,977,902,011      instructions              #    1.33  insn per cycle
       0.907521540 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
FP precision               = DOUBLE (nan=0)
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.994013 sec
    24,096,614,355      cycles                    #    2.678 GHz
    73,974,244,492      instructions              #    3.07  insn per cycle
       9.002209325 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1251) (avx2:    0) (512y:    0) (512z:    0)
=========================================================================

=========================================================================
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.611444e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     6.359318 sec
    15,575,819,415      cycles                    #    2.652 GHz
    28,679,182,891      instructions              #    1.84  insn per cycle
       6.644244352 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
FP precision               = FLOAT (nan=0)
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.177608 sec
    24,591,918,482      cycles                    #    2.678 GHz
    73,870,482,502      instructions              #    3.00  insn per cycle
       9.185536632 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1133) (avx2:    0) (512y:    0) (512z:    0)
=========================================================================
No apparent change in performance or in register pressure

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.157887e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.359792e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.080823 sec
     3,363,679,269      cycles                    #    2.648 GHz
     4,814,545,325      instructions              #    1.43  insn per cycle
       1.387019116 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================

=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.520431e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.271315e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.897779 sec
     2,900,559,004      cycles                    #    2.648 GHz
     4,227,863,147      instructions              #    1.46  insn per cycle
       1.186573538 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
No change in performance

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.439796e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     5.562262 sec
    14,405,502,552      cycles                    #    2.653 GHz
    24,414,816,540      instructions              #    1.69  insn per cycle
       5.861623903 seconds time elapsed

perf stat -d ./check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825615e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.989909 sec
    24,094,848,239      cycles                    #    2.678 GHz
    73,973,772,724      instructions              #    3.07  insn per cycle
       8.999129092 seconds time elapsed

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.564200e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     0.578372 sec
     2,127,469,936      cycles                    #    2.649 GHz
     2,844,920,010      instructions              #    1.34  insn per cycle
       0.864958378 seconds time elapsed

perf stat -d ./check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786116e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.185647 sec
    24,612,863,893      cycles                    #    2.678 GHz
    73,875,084,250      instructions              #    3.00  insn per cycle
       9.193727253 seconds time elapsed

valassi commented Jun 15, 2021

OK, I have done a couple of changes following some hints from @hageboeck (thanks!):

  • replace sqrt and sqrtf by std::sqrt in C++
  • conversely, add the distinction between sqrt and sqrtf in CUDA: previously sqrt was used also for floats... the change does not seem to make any difference in results or performance, in any case (a sketch of this dispatch is shown after this list)
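
For reference, a minimal sketch of this kind of dispatch (the fpsqrt wrapper name is invented for this example and is not claimed to be the PR's actual API):

```cpp
#include <cmath>
// Sketch only: one square-root entry point per precision, per backend.
#ifdef __CUDACC__
__device__ inline float  fpsqrt( const float x )  { return sqrtf( x ); } // CUDA device: single-precision intrinsic
__device__ inline double fpsqrt( const double x ) { return sqrt( x ); }  // CUDA device: double-precision intrinsic
#else
inline float  fpsqrt( const float x )  { return std::sqrt( x ); } // C++: resolves to the float overload
inline double fpsqrt( const double x ) { return std::sqrt( x ); } // C++: resolves to the double overload
#endif
```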

I will merge.

valassi changed the title from "[ggttgg] fix float support in ggttgg" to "[ggttgg] fix float support in ggttgg (+ fix sqrt/sqrtf/std::sqrt in cuda/c++ also in eemumu)" on Jun 15, 2021
valassi merged commit ed7c076 into madgraph5:master on Jun 15, 2021