
[ggttgg] fix float support in ggttgg (+ fix sqrt/sqrtf/std::sqrt in cuda/c++ also in eemumu) #211

Merged · 12 commits · Jun 15, 2021

Conversation


valassi (Member) commented Jun 14, 2021

Hi @roiser @oliviermattelaer @hageboeck @cvuosalo this is a comprehensive PR to fix float support in ggttgg, using the same code and techniques as in eemumu (which may be debatable, but at least are consistent).

Do you have any comments? Thanks
Andrea

PS Strangely, I do not get the exact same results in CUDA and C++, unlike in eemumu. Maybe this is a real numerical instability? To be checked, but the basics look ok.

valassi added 8 commits June 14, 2021 18:30
perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.435680e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.623691 sec
     2,196,798,887      cycles                    #    2.646 GHz
     2,967,729,989      instructions              #    1.35  insn per cycle
       0.910385717 seconds time elapsed
perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.431257e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     5.664586 sec
    14,242,402,487      cycles                    #    2.645 GHz
    24,438,431,992      instructions              #    1.72  insn per cycle
       5.955651007 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.827445e+03                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     8.981804 sec
    24,058,806,303      cycles                    #    2.677 GHz
    73,925,907,974      instructions              #    3.07  insn per cycle
       8.991724619 seconds time elapsed
All is ok for cuda, but C++ has some issues...

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.468617e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.605243 sec
     2,198,121,835      cycles                    #    2.654 GHz
     2,959,832,649      instructions              #    1.35  insn per cycle
       0.890934664 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.831296e+03                 )  sec^-1
MeanMatrixElemValue        = ( -nan +- -nan )  GeV^-4
TOTAL       :     8.962130 sec
    24,011,718,485      cycles                    #    2.678 GHz
    73,861,287,531      instructions              #    3.08  insn per cycle
       8.970276172 seconds time elapsed
… SIMD

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.443390e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     6.297783 sec
    14,185,425,916      cycles                    #    2.652 GHz
    24,428,982,656      instructions              #    1.72  insn per cycle
       6.595729859 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825301e+03                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     8.992239 sec
    24,087,433,131      cycles                    #    2.677 GHz
    73,957,118,424      instructions              #    3.07  insn per cycle
       9.001373453 seconds time elapsed
…as in eemumu

All is ok, except that C++ and CUDA now give different results

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.610975e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     5.920932 sec
    15,536,792,487      cycles                    #    2.654 GHz
    28,689,538,755      instructions              #    1.85  insn per cycle
       6.207201648 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786471e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.183867 sec
    24,604,689,155      cycles                    #    2.677 GHz
    73,872,471,813      instructions              #    3.00  insn per cycle
       9.193035302 seconds time elapsed

This is not due to a neppR mismatch (which explains float vs double differences instead).
It seems to be due to intrinsic numerical instabilities?

./check.exe -v 1 8 1 | tail -20
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  1.005777e+02 -2.778140e+01 -8.595747e+01  4.421965e+01
   4  6.388790e+02  6.255574e+02  5.316766e+01 -1.183958e+02
   5  6.289353e+02 -6.113242e+02 -9.114253e+01  1.163415e+02
   6  1.316082e+02  1.354830e+01  1.239324e+02 -4.216534e+01
--------------------------------------------------------------------------------
 Matrix element = 5.42301e-07 GeV^-4
--------------------------------------------------------------------------------
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  5.907211e+02  3.267731e+01  2.475092e+02 -5.353716e+02
   4  4.160566e+02 -3.375994e+02 -7.378165e+01  2.317025e+02
   5  2.329562e+02  1.924018e+02 -1.046342e+02  7.938418e+01
   6  2.602661e+02  1.125203e+02 -6.909338e+01  2.242849e+02
--------------------------------------------------------------------------------
 Matrix element = 8.71459e-07 GeV^-4
--------------------------------------------------------------------------------

./gcheck.exe -v 1 8 1 | tail -20
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  1.005777e+02 -2.778136e+01 -8.595747e+01  4.421965e+01
   4  6.388790e+02  6.255573e+02  5.316774e+01 -1.183958e+02
   5  6.289354e+02 -6.113242e+02 -9.114262e+01  1.163414e+02
   6  1.316082e+02  1.354829e+01  1.239324e+02 -4.216533e+01
--------------------------------------------------------------------------------
 Matrix element = 5.42298e-07 GeV^-4
--------------------------------------------------------------------------------
Momenta:
   1  7.500000e+02  0.000000e+00  0.000000e+00  7.500000e+02
   2  7.500000e+02  0.000000e+00  0.000000e+00 -7.500000e+02
   3  5.907211e+02  3.267728e+01  2.475092e+02 -5.353716e+02
   4  4.160566e+02 -3.375993e+02 -7.378172e+01  2.317025e+02
   5  2.329562e+02  1.924017e+02 -1.046341e+02  7.938418e+01
   6  2.602661e+02  1.125203e+02 -6.909334e+01  2.242850e+02
--------------------------------------------------------------------------------
 Matrix element = 8.7146e-07 GeV^-4
--------------------------------------------------------------------------------

To be investigated... anyway, the port itself can be considered complete
 perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.410098e+05                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     0.605451 sec
     2,201,358,677      cycles                    #    2.653 GHz
     2,950,126,260      instructions              #    1.34  insn per cycle
       0.891091212 seconds time elapsed

FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.826984e+03                 )  sec^-1
MeanMatrixElemValue        = ( 5.532387e+01 +- 5.501866e+01 )  GeV^-4
TOTAL       :     8.983166 sec
    24,063,706,157      cycles                    #    2.677 GHz
    73,968,944,348      instructions              #    3.07  insn per cycle
       8.991664842 seconds time elapsed
This ensures the same physics results for float and double

Note that in double precision I get the same physics in CUDA and C++
(this is not exactly so in single precision...)

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.438062e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     5.929722 sec
    14,377,877,684      cycles                    #    2.653 GHz
    24,406,140,862      instructions              #    1.70  insn per cycle
       6.229614368 seconds time elapsed

FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825369e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.991304 sec
    24,089,227,557      cycles                    #    2.677 GHz
    73,968,938,757      instructions              #    3.07  insn per cycle
       8.999893583 seconds time elapsed
hageboeck (Member) left a comment

It's too much to read in detail, but I have an idea for testing it:
We could add a build to the CI that compiles with -DMGONGPU_FPTYPE_FLOAT. This should prevent future compile problems. I suspect that tests would fail, but at least we could add compilation without testing in float mode as a first step.

@@ -822,24 +1614,24 @@ __device__ void FFV1_1(const cxtype F2[], const cxtype V3[], const cxtype COUP,
P1[3]) - M1 * (M1 - cI * W1));
F1[2] = denom * cI * (F2[2] * (P1[0] * (-V3[2] + V3[5]) + (P1[1] * (V3[3] -
cI * (V3[4])) + (P1[2] * (+cI * (V3[3]) + V3[4]) + P1[3] * (-V3[2] +
V3[5])))) + (F2[3] * (P1[0] * (V3[3] + cI * (V3[4])) + (P1[1] * (-1.) *
(V3[2] + V3[5]) + (P1[2] * (-1.) * (+cI * (V3[2] + V3[5])) + P1[3] *
V3[5])))) + (F2[3] * (P1[0] * (V3[3] + cI * (V3[4])) + (P1[1] * (-one) *
hageboeck (Member) commented on this diff:
This could also be done as ((fptype)-1.) if desired.

valassi (Member Author) replied:
> This could also be done as ((fptype)-1.) if desired.

Yes, that is also an option I had considered. I think that in the end I had so many 1., 2. and 1/2 literals all over the place that this seemed the more readable form.
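
For illustration, a minimal sketch of why either spelling works (this is not the generated HelAmps code; the function and names below are invented for the example):

```cpp
// Illustration only: a stand-in for the generated helicity-amplitude code.
// Here fptype is a template parameter; in the real code it is a global typedef.
template<typename fptype>
fptype negate( const fptype x )
{
  const fptype one = 1.;  // named constant, the spelling used in this PR
  return x * ( -one );    // equivalent to x * ( (fptype)-1. ): no implicit double arithmetic when fptype is float
}
```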


valassi commented Jun 14, 2021

> It's too much to read in detail, but I have an idea for testing it:
> We could add a build to the CI that compiles with -DMGONGPU_FPTYPE_FLOAT. This should prevent future compile problems. I suspect that tests would fail, but at least we could add compilation without testing in float mode as a first step.

Adding CI tests for floats would certainly be useful at some point. I would start with eemumu however, because there were many issues in the googletest infrastructure for floats, which I fixed in the meantime (for manual tests but not the CI).

The only issue is that we should clean up those #defines and just allow a definition from outside. But then some guards must be added all over the place to make sure that exactly one of the options is defined, and not more. It has been on my mental to-do list for some time.
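
For what it's worth, a rough sketch of what such a guard could look like (MGONGPU_FPTYPE_FLOAT is the flag mentioned above; the MGONGPU_FPTYPE_DOUBLE companion and the header layout are only assumptions for this illustration, not the actual configuration header):

```cpp
// Sketch only: select the floating-point type from a single compile-time flag,
// with guards ensuring that exactly one of the two options is active.
#if defined MGONGPU_FPTYPE_DOUBLE && defined MGONGPU_FPTYPE_FLOAT
#error Only one of MGONGPU_FPTYPE_DOUBLE and MGONGPU_FPTYPE_FLOAT may be defined
#endif
#if !defined MGONGPU_FPTYPE_DOUBLE && !defined MGONGPU_FPTYPE_FLOAT
#define MGONGPU_FPTYPE_DOUBLE 1 // default when no -D flag is passed from outside
#endif
#ifdef MGONGPU_FPTYPE_DOUBLE
typedef double fptype; // double precision throughout
#else
typedef float fptype;  // single precision throughout
#endif
```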

Otherwise, about the tests of ggttgg, one would need to include the changes from eemumu, similarly to what I have done there. I think that most of them are in
#168
plus later additions in upstream/master.
I actually have the impression that the ggttgg tests are disabled (runTest is not built), for issues similar to those I had in eemumu. Probably it should now be easier to port that, using the changes from eemumu.

@valassi
Copy link
Member Author

valassi commented Jun 14, 2021

Ah, actually I see that #148 was exactly about these two issues for eemumu.

valassi added 4 commits June 14, 2021 22:07
…shageboe

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.157971e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364536e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.309761 sec
     3,360,485,158      cycles                    #    2.629 GHz
     4,797,157,032      instructions              #    1.43  insn per cycle
       1.611265391 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.303193e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.177774 sec
    19,206,775,927      cycles                    #    2.674 GHz
    48,584,293,198      instructions              #    2.53  insn per cycle
       7.188780540 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.905824e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.586663 sec
     9,075,146,996      cycles                    #    2.528 GHz
    16,500,290,915      instructions              #    1.82  insn per cycle
       3.597145062 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.549522e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.277927e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     1.017317 sec
     2,897,857,903      cycles                    #    2.650 GHz
     4,226,294,279      instructions              #    1.46  insn per cycle
       1.304468210 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.210885e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.116317 sec
    19,058,179,119      cycles                    #    2.676 GHz
    47,729,322,065      instructions              #    2.50  insn per cycle
       7.127149489 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  578) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.869307e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.640200 sec
     6,751,569,345      cycles                    #    2.550 GHz
    12,567,251,666      instructions              #    1.86  insn per cycle
       2.651457868 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2917) (512y:   81) (512z:    0)
=========================================================================
…td::sqrt

Peculiarly, however, in CUDA the float version is only 1.5x faster (and uses many more CPU cycles?)

=========================================================================
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.404051e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     0.610642 sec
     2,230,873,353      cycles                    #    2.651 GHz
     2,977,902,011      instructions              #    1.33  insn per cycle
       0.907521540 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
FP precision               = DOUBLE (nan=0)
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.994013 sec
    24,096,614,355      cycles                    #    2.678 GHz
    73,974,244,492      instructions              #    3.07  insn per cycle
       9.002209325 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1251) (avx2:    0) (512y:    0) (512z:    0)
=========================================================================

=========================================================================
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.611444e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     6.359318 sec
    15,575,819,415      cycles                    #    2.652 GHz
    28,679,182,891      instructions              #    1.84  insn per cycle
       6.644244352 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
FP precision               = FLOAT (nan=0)
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.177608 sec
    24,591,918,482      cycles                    #    2.678 GHz
    73,870,482,502      instructions              #    3.00  insn per cycle
       9.185536632 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1133) (avx2:    0) (512y:    0) (512z:    0)
=========================================================================
No apparent change in performance or in register pressure

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.157887e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.359792e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.080823 sec
     3,363,679,269      cycles                    #    2.648 GHz
     4,814,545,325      instructions              #    1.43  insn per cycle
       1.387019116 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================

=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.520431e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.271315e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.897779 sec
     2,900,559,004      cycles                    #    2.648 GHz
     4,227,863,147      instructions              #    1.46  insn per cycle
       1.186573538 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
No change in performance

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.439796e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     5.562262 sec
    14,405,502,552      cycles                    #    2.653 GHz
    24,414,816,540      instructions              #    1.69  insn per cycle
       5.861623903 seconds time elapsed

perf stat -d ./check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825615e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.063123e+00 +- 2.368970e+00 )  GeV^-4
TOTAL       :     8.989909 sec
    24,094,848,239      cycles                    #    2.678 GHz
    73,973,772,724      instructions              #    3.07  insn per cycle
       8.999129092 seconds time elapsed

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.564200e+05                 )  sec^-1
MeanMatrixElemValue        = ( 4.059594e+00 +- 2.368052e+00 )  GeV^-4
TOTAL       :     0.578372 sec
     2,127,469,936      cycles                    #    2.649 GHz
     2,844,920,010      instructions              #    1.34  insn per cycle
       0.864958378 seconds time elapsed

perf stat -d ./check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL       :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision               = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786116e+03                 )  sec^-1
MeanMatrixElemValue        = ( 4.060118e+00 +- 2.367901e+00 )  GeV^-4
TOTAL       :     9.185647 sec
    24,612,863,893      cycles                    #    2.678 GHz
    73,875,084,250      instructions              #    3.00  insn per cycle
       9.193727253 seconds time elapsed

valassi commented Jun 15, 2021

OK, I have done a couple of changes following some hints from @hageboeck (thanks!):

  • replace sqrt and sqrtf by std::sqrt in C++
  • conversely, add the distinction between sqrt and sqrtf in CUDA: previously sqrt was used also for floats... the change does not seem to make any difference in results or performance, in any case (a sketch of this dispatch is shown after this list)
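
For reference, a minimal sketch of this kind of dispatch (the fpsqrt wrapper name is invented for this example and is not claimed to be the PR's actual API):

```cpp
#include <cmath>
// Sketch only: one square-root entry point per precision, per backend.
#ifdef __CUDACC__
__device__ inline float  fpsqrt( const float x )  { return sqrtf( x ); } // CUDA device: single-precision intrinsic
__device__ inline double fpsqrt( const double x ) { return sqrt( x ); }  // CUDA device: double-precision intrinsic
#else
inline float  fpsqrt( const float x )  { return std::sqrt( x ); } // C++: resolves to the float overload
inline double fpsqrt( const double x ) { return std::sqrt( x ); } // C++: resolves to the double overload
#endif
```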

I will merge.

valassi changed the title from "[ggttgg] fix float support in ggttgg" to "[ggttgg] fix float support in ggttgg (+ fix sqrt/sqrtf/std::sqrt in cuda/c++ also in eemumu)" on Jun 15, 2021
valassi merged commit ed7c076 into madgraph5:master on Jun 15, 2021