[ggttgg] fix float support in ggttgg (+ fix sqrt/sqrtf/std::sqrt in cuda/c++ also in eemumu) #211
Conversation
perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.435680e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.623691 sec
2,196,798,887 cycles # 2.646 GHz
2,967,729,989 instructions # 1.35 insn per cycle
0.910385717 seconds time elapsed
perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.431257e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 5.664586 sec
14,242,402,487 cycles # 2.645 GHz
24,438,431,992 instructions # 1.72 insn per cycle
5.955651007 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.827445e+03 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 8.981804 sec
24,058,806,303 cycles # 2.677 GHz
73,925,907,974 instructions # 3.07 insn per cycle
8.991724619 seconds time elapsed
All is ok for CUDA, but C++ has some issues...

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.468617e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.605243 sec
2,198,121,835 cycles # 2.654 GHz
2,959,832,649 instructions # 1.35 insn per cycle
0.890934664 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.831296e+03 ) sec^-1
MeanMatrixElemValue = ( -nan +- -nan ) GeV^-4
TOTAL : 8.962130 sec
24,011,718,485 cycles # 2.678 GHz
73,861,287,531 instructions # 3.08 insn per cycle
8.970276172 seconds time elapsed
… SIMD

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.443390e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 6.297783 sec
14,185,425,916 cycles # 2.652 GHz
24,428,982,656 instructions # 1.72 insn per cycle
6.595729859 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825301e+03 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 8.992239 sec
24,087,433,131 cycles # 2.677 GHz
73,957,118,424 instructions # 3.07 insn per cycle
9.001373453 seconds time elapsed
…as in eemumu

All is ok, except that C++ and CUDA now give different results.

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.610975e+05 ) sec^-1
MeanMatrixElemValue = ( 4.059594e+00 +- 2.368052e+00 ) GeV^-4
TOTAL : 5.920932 sec
15,536,792,487 cycles # 2.654 GHz
28,689,538,755 instructions # 1.85 insn per cycle
6.207201648 seconds time elapsed

perf stat -d ../../../../../epoch2/cuda/gg_ttgg/SubProcesses/P1_Sigma_sm_gg_ttxgg/check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786471e+03 ) sec^-1
MeanMatrixElemValue = ( 4.060118e+00 +- 2.367901e+00 ) GeV^-4
TOTAL : 9.183867 sec
24,604,689,155 cycles # 2.677 GHz
73,872,471,813 instructions # 3.00 insn per cycle
9.193035302 seconds time elapsed

This is not due to a neppR mismatch (which explains float vs double differences instead). It seems to be due to intrinsic numeric instabilities?

./check.exe -v 1 8 1 | tail -20
Momenta:
1 7.500000e+02 0.000000e+00 0.000000e+00 7.500000e+02
2 7.500000e+02 0.000000e+00 0.000000e+00 -7.500000e+02
3 1.005777e+02 -2.778140e+01 -8.595747e+01 4.421965e+01
4 6.388790e+02 6.255574e+02 5.316766e+01 -1.183958e+02
5 6.289353e+02 -6.113242e+02 -9.114253e+01 1.163415e+02
6 1.316082e+02 1.354830e+01 1.239324e+02 -4.216534e+01
--------------------------------------------------------------------------------
Matrix element = 5.42301e-07 GeV^-4
--------------------------------------------------------------------------------
Momenta:
1 7.500000e+02 0.000000e+00 0.000000e+00 7.500000e+02
2 7.500000e+02 0.000000e+00 0.000000e+00 -7.500000e+02
3 5.907211e+02 3.267731e+01 2.475092e+02 -5.353716e+02
4 4.160566e+02 -3.375994e+02 -7.378165e+01 2.317025e+02
5 2.329562e+02 1.924018e+02 -1.046342e+02 7.938418e+01
6 2.602661e+02 1.125203e+02 -6.909338e+01 2.242849e+02
--------------------------------------------------------------------------------
Matrix element = 8.71459e-07 GeV^-4
--------------------------------------------------------------------------------

./gcheck.exe -v 1 8 1 | tail -20
Momenta:
1 7.500000e+02 0.000000e+00 0.000000e+00 7.500000e+02
2 7.500000e+02 0.000000e+00 0.000000e+00 -7.500000e+02
3 1.005777e+02 -2.778136e+01 -8.595747e+01 4.421965e+01
4 6.388790e+02 6.255573e+02 5.316774e+01 -1.183958e+02
5 6.289354e+02 -6.113242e+02 -9.114262e+01 1.163414e+02
6 1.316082e+02 1.354829e+01 1.239324e+02 -4.216533e+01
--------------------------------------------------------------------------------
Matrix element = 5.42298e-07 GeV^-4
--------------------------------------------------------------------------------
Momenta:
1 7.500000e+02 0.000000e+00 0.000000e+00 7.500000e+02
2 7.500000e+02 0.000000e+00 0.000000e+00 -7.500000e+02
3 5.907211e+02 3.267728e+01 2.475092e+02 -5.353716e+02
4 4.160566e+02 -3.375993e+02 -7.378172e+01 2.317025e+02
5 2.329562e+02 1.924017e+02 -1.046341e+02 7.938418e+01
6 2.602661e+02 1.125203e+02 -6.909334e+01 2.242850e+02
--------------------------------------------------------------------------------
Matrix element = 8.7146e-07 GeV^-4
--------------------------------------------------------------------------------

To be investigated... anyway, the port itself can be considered complete.
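For illustration of the suspected mechanism (this is an assumption about why the two backends disagree in the last digits, not an analysis of the actual kernels): in single precision the accumulation order alone can change the rounded result, so two mathematically equivalent evaluations need not agree bit for bit. A minimal, self-contained C++ example:

#include <cstdio>

// Minimal illustration (not code from this PR): with floats, the order in
// which terms are accumulated changes the rounded result, so two equivalent
// implementations (e.g. the CUDA and C++ backends) can differ slightly.
int main()
{
  const float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
  const float sum1 = ( a + c ) + b; // a + c rounds back to a, so sum1 = 0
  const float sum2 = ( a + b ) + c; // exact cancellation first, so sum2 = 1
  std::printf( "sum1 = %g, sum2 = %g\n", sum1, sum2 ); // prints 0 and 1
  return 0;
}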
perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.410098e+05 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 0.605451 sec
2,201,358,677 cycles # 2.653 GHz
2,950,126,260 instructions # 1.34 insn per cycle
0.891091212 seconds time elapsed

FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.826984e+03 ) sec^-1
MeanMatrixElemValue = ( 5.532387e+01 +- 5.501866e+01 ) GeV^-4
TOTAL : 8.983166 sec
24,063,706,157 cycles # 2.677 GHz
73,968,944,348 instructions # 3.07 insn per cycle
8.991664842 seconds time elapsed
This ensures the same physics results for float and double. Note that in double precision I get the same physics in CUDA and C++ (this is not exactly so in single precision...).

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.438062e+05 ) sec^-1
MeanMatrixElemValue = ( 4.063123e+00 +- 2.368970e+00 ) GeV^-4
TOTAL : 5.929722 sec
14,377,877,684 cycles # 2.653 GHz
24,406,140,862 instructions # 1.70 insn per cycle
6.229614368 seconds time elapsed

FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825369e+03 ) sec^-1
MeanMatrixElemValue = ( 4.063123e+00 +- 2.368970e+00 ) GeV^-4
TOTAL : 8.991304 sec
24,089,227,557 cycles # 2.677 GHz
73,968,938,757 instructions # 3.07 insn per cycle
8.999893583 seconds time elapsed
It's too much to read in detail, but I have an idea for testing it: we could add a build to the CI that compiles with -DMGONGPU_FPTYPE_FLOAT. This should prevent future compile problems. I suspect that the tests would fail, but at least we could add compilation without testing in float mode as a first step.
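For context, such a build flag presumably just flips the fptype typedef somewhere in the common headers; a minimal sketch, in which the header layout and the MGONGPU_FPTYPE_DOUBLE spelling are assumptions (only -DMGONGPU_FPTYPE_FLOAT appears in this thread):

// Sketch of what -DMGONGPU_FPTYPE_FLOAT is assumed to toggle
// (illustrative only; the actual header may differ).
#ifdef MGONGPU_FPTYPE_FLOAT
typedef float fptype;   // single-precision build
#else
typedef double fptype;  // default: double-precision build
#endif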
@@ -822,24 +1614,24 @@ __device__ void FFV1_1(const cxtype F2[], const cxtype V3[], const cxtype COUP,
P1[3]) - M1 * (M1 - cI * W1));
F1[2] = denom * cI * (F2[2] * (P1[0] * (-V3[2] + V3[5]) + (P1[1] * (V3[3] -
cI * (V3[4])) + (P1[2] * (+cI * (V3[3]) + V3[4]) + P1[3] * (-V3[2] +
V3[5])))) + (F2[3] * (P1[0] * (V3[3] + cI * (V3[4])) + (P1[1] * (-1.) *
(V3[2] + V3[5]) + (P1[2] * (-1.) * (+cI * (V3[2] + V3[5])) + P1[3] *
V3[5])))) + (F2[3] * (P1[0] * (V3[3] + cI * (V3[4])) + (P1[1] * (-one) *
This could also be done as ((fptype)-1.) if desired.
Yes, that is also an option I had considered. In the end I had so many 1., 2. and 1/2 literals all over the place that the named constant seemed the more readable form.
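To spell out the point under discussion (the declaration of one below is an assumption; only its use as (-one) appears in the diff above): with fptype = float, a bare double literal silently promotes the surrounding arithmetic to double, while a typed constant or a cast keeps it in fptype.

// Illustration only, assuming a float build.
typedef float fptype;                                          // as with -DMGONGPU_FPTYPE_FLOAT

fptype scaleLiteral( fptype x ) { return -1. * x; }            // -1. is a double: x is promoted, the result narrowed back
fptype scaleCast( fptype x )    { return ((fptype)-1.) * x; }  // the cast suggested above: stays in fptype

const fptype one = 1.;                                         // a named typed constant, as used via (-one) in the diff
fptype scaleConst( fptype x )   { return -one * x; }           // also stays in fptype, arguably more readable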
Adding CI tests for floats would certainly be useful at some point. I would start with eemumu, however, because there were many issues in the googletest infrastructure for floats, which I fixed in the meantime (for manual tests, but not in the CI). The only issue is that we should clean up those #defines and just allow a definition from outside. But then some guards must be added all over the place to make sure that exactly one of the options is defined, and not more. It has been on my mental to-do list for some time. Otherwise, for the ggttgg tests one would need to include the same changes as in eemumu, similarly to what I have done. I think that most of them are here.
Ah, actually I see that #148 was exactly about these two issues for eemumu.
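As for the guards mentioned above ("make sure that exactly one of the options is defined"), a minimal sketch of what they could look like; the macro names are assumptions carried over from the -DMGONGPU_FPTYPE_FLOAT flag discussed earlier:

// Sketch (macro names assumed): let the precision be chosen from the build
// line, but fail the build unless exactly one of the two options is defined.
#if defined(MGONGPU_FPTYPE_DOUBLE) && defined(MGONGPU_FPTYPE_FLOAT)
#error "Only one of MGONGPU_FPTYPE_DOUBLE or MGONGPU_FPTYPE_FLOAT may be defined"
#elif !defined(MGONGPU_FPTYPE_DOUBLE) && !defined(MGONGPU_FPTYPE_FLOAT)
#define MGONGPU_FPTYPE_DOUBLE 1 // default when nothing is passed from outside
#endif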
…shageboe

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.157971e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.364536e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.309761 sec
3,360,485,158 cycles # 2.629 GHz
4,797,157,032 instructions # 1.43 insn per cycle
1.611265391 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.303193e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.177774 sec
19,206,775,927 cycles # 2.674 GHz
48,584,293,198 instructions # 2.53 insn per cycle
7.188780540 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 614) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.905824e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 3.586663 sec
9,075,146,996 cycles # 2.528 GHz
16,500,290,915 instructions # 1.82 insn per cycle
3.597145062 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2572) (512y: 95) (512z: 0)
=========================================================================

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.549522e+09 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.277927e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371686e-02 +- 3.270219e-06 ) GeV^0
TOTAL : 1.017317 sec
2,897,857,903 cycles # 2.650 GHz
4,226,294,279 instructions # 1.46 insn per cycle
1.304468210 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.210885e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371707e-02 +- 3.270376e-06 ) GeV^0
TOTAL : 7.116317 sec
19,058,179,119 cycles # 2.676 GHz
47,729,322,065 instructions # 2.50 insn per cycle
7.127149489 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 578) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.869307e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371705e-02 +- 3.270339e-06 ) GeV^0
TOTAL : 2.640200 sec
6,751,569,345 cycles # 2.550 GHz
12,567,251,666 instructions # 1.86 insn per cycle
2.651457868 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2917) (512y: 81) (512z: 0)
=========================================================================
…td::sqrt

Peculiarly, however, in CUDA the float version is only 1.5x faster (and with many more CPU cycles?).
=========================================================================
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.404051e+05 ) sec^-1
MeanMatrixElemValue = ( 4.063123e+00 +- 2.368970e+00 ) GeV^-4
TOTAL : 0.610642 sec
2,230,873,353 cycles # 2.651 GHz
2,977,902,011 instructions # 1.33 insn per cycle
0.907521540 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
FP precision = DOUBLE (nan=0)
MeanMatrixElemValue = ( 4.063123e+00 +- 2.368970e+00 ) GeV^-4
TOTAL : 8.994013 sec
24,096,614,355 cycles # 2.678 GHz
73,974,244,492 instructions # 3.07 insn per cycle
9.002209325 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1251) (avx2: 0) (512y: 0) (512z: 0)
=========================================================================
=========================================================================
FP precision = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.611444e+05 ) sec^-1
MeanMatrixElemValue = ( 4.059594e+00 +- 2.368052e+00 ) GeV^-4
TOTAL : 6.359318 sec
15,575,819,415 cycles # 2.652 GHz
28,679,182,891 instructions # 1.84 insn per cycle
6.644244352 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
FP precision = FLOAT (nan=0)
MeanMatrixElemValue = ( 4.060118e+00 +- 2.367901e+00 ) GeV^-4
TOTAL : 9.177608 sec
24,591,918,482 cycles # 2.678 GHz
73,870,482,502 instructions # 3.00 insn per cycle
9.185536632 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 1133) (avx2: 0) (512y: 0) (512z: 0)
=========================================================================
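Since the PR title mentions fixing sqrt/sqrtf/std::sqrt in CUDA and C++, here is a hedged sketch of one way to keep the square root in the right precision on both sides; the helper name fpsqrt is illustrative (not necessarily what the PR actually does), and the fptype typedef and MGONGPU_FPTYPE_FLOAT macro from earlier are assumed:

#include <cmath>

// Sketch only: avoid a bare sqrt() promoting a float argument to double.
#ifdef __CUDACC__
__host__ __device__
#endif
inline fptype fpsqrt( const fptype x )
{
#ifdef __CUDA_ARCH__
  // device code: pick the single- or double-precision CUDA math function explicitly
#ifdef MGONGPU_FPTYPE_FLOAT
  return sqrtf( x );
#else
  return sqrt( x );
#endif
#else
  // host code: std::sqrt has float and double overloads and preserves the type
  return std::sqrt( x );
#endif
}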
No apparent change in performance or in register pressure.

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.157887e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.359792e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 1.080823 sec
3,363,679,269 cycles # 2.648 GHz
4,814,545,325 instructions # 1.43 insn per cycle
1.387019116 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.520431e+09 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.271315e+09 ) sec^-1
MeanMatrixElemValue = ( 1.371686e-02 +- 3.270219e-06 ) GeV^0
TOTAL : 0.897779 sec
2,900,559,004 cycles # 2.648 GHz
4,227,863,147 instructions # 1.46 insn per cycle
1.186573538 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
No change in performance.

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 4.439796e+05 ) sec^-1
MeanMatrixElemValue = ( 4.063123e+00 +- 2.368970e+00 ) GeV^-4
TOTAL : 5.562262 sec
14,405,502,552 cycles # 2.653 GHz
24,414,816,540 instructions # 1.69 insn per cycle
5.861623903 seconds time elapsed

perf stat -d ./check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = DOUBLE (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.825615e+03 ) sec^-1
MeanMatrixElemValue = ( 4.063123e+00 +- 2.368970e+00 ) GeV^-4
TOTAL : 8.989909 sec
24,094,848,239 cycles # 2.678 GHz
73,973,772,724 instructions # 3.07 insn per cycle
8.999129092 seconds time elapsed

perf stat -d ./gcheck.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 6.564200e+05 ) sec^-1
MeanMatrixElemValue = ( 4.059594e+00 +- 2.368052e+00 ) GeV^-4
TOTAL : 0.578372 sec
2,127,469,936 cycles # 2.649 GHz
2,844,920,010 instructions # 1.34 insn per cycle
0.864958378 seconds time elapsed

perf stat -d ./check.exe -p 64 256 1 |& egrep '(Process|fptype_sv|OMP threads|EvtsPerSec\[MECalc|MeanMatrix|FP precision|TOTAL :|EvtsPerSec\[Matrix|CUCOMPLEX|COMMON RANDOM|ERROR|instructions|cycles|elapsed)' | grep -v 'Performance counter stats'
FP precision = FLOAT (nan=0)
EvtsPerSec[MatrixElems] (3)= ( 1.786116e+03 ) sec^-1
MeanMatrixElemValue = ( 4.060118e+00 +- 2.367901e+00 ) GeV^-4
TOTAL : 9.185647 sec
24,612,863,893 cycles # 2.678 GHz
73,875,084,250 instructions # 3.00 insn per cycle
9.193727253 seconds time elapsed
OK, I have done a couple of changes following some hints from @hageboeck (thanks!).
I will merge.
Hi @roiser @oliviermattelaer @hageboeck @cvuosalo, this is a comprehensive PR to fix float support in ggttgg, using the same code and techniques as in eemumu (which may be debatable, but at least they are consistent).
Do you have any comments? Thanks
Andrea
PS: Strangely, I do not get exactly the same results in CUDA and C++, unlike in eemumu. Maybe this is a real numerical instability? To be checked, but the basics look ok.
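One hedged way to check whether the CUDA/C++ differences are just single-precision rounding noise (a sketch, not the existing googletest code; the function name and the choice of 100 epsilons are illustrative) would be to compare the two matrix elements with a relative tolerance scaled to the float machine epsilon:

#include <algorithm>
#include <cmath>
#include <limits>

// Sketch: do a CUDA and a C++ matrix element agree up to float rounding noise?
inline bool agreeWithinFloatNoise( const double meCuda, const double meCpp,
                                   const double nEps = 100. )
{
  const double eps = std::numeric_limits<float>::epsilon(); // ~1.2e-7
  const double ref = std::max( std::fabs( meCuda ), std::fabs( meCpp ) );
  return std::fabs( meCuda - meCpp ) <= nEps * eps * ref;
}

For example, the 5.42301e-07 and 5.42298e-07 values quoted earlier differ by roughly 6e-6 in relative terms, i.e. a few tens of float epsilons, which such a check would accept.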