WIP - Split kernels and more #242

Closed
wants to merge 17 commits into from

@roiser roiser commented Jul 29, 2021

This PR is about

  • splitting the sigmaKin kernel into smaller kernels
  • on top of that, trying other CUDA features (graphs, streams) to see whether they help

@roiser roiser changed the title from "WIP: Split kernels and more" to "WIP - Split kernels and more" Aug 3, 2021
roiser and others added 14 commits August 6, 2021 08:50
Interesting, this is the first time I see AVX512z doing better than AVX512y...
Unfortunately I have no perf to check this better

On cori04 [CPU: Intel(R) Xeon(R) Gold 6148 CPU] [GPU: 1x Tesla V100-SXM2-16GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 10.1.0)]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.520234e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.425600e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.542458 sec
real    0m0.663s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 10.1.0)]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.636039e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.453580e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.580860 sec
real    0m0.685s
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 1.728110e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     5.001435 sec
real    0m5.017s
=Symbols in CPPProcess.o= (~sse4:  638) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 3.367340e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.211082 sec
real    0m3.226s
=Symbols in CPPProcess.o= (~sse4: 3291) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 6.840681e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.201916 sec
real    0m2.217s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2792) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 6.926176e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.213485 sec
real    0m2.229s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2690) (512y:   51) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 7.523325e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.148971 sec
real    0m2.165s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1283) (512y:   64) (512z: 2125)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 1.565369e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     5.129722 sec
real    0m5.140s
=Symbols in CPPProcess.o= (~sse4:  584) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 5.567020e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270375e-06 )  GeV^0
TOTAL       :     2.318101 sec
real    0m2.329s
=Symbols in CPPProcess.o= (~sse4: 3974) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 1.227348e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     1.632430 sec
real    0m1.643s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3130) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 1.315622e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     1.571972 sec
real    0m1.583s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3027) (512y:   26) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 10.1.0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 80
EvtsPerSec[MECalcOnly] (3a) = ( 1.656728e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270340e-06 )  GeV^0
TOTAL       :     1.449660 sec
real    0m1.461s
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1724) (512y:   13) (512z: 2235)
=========================================================================
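
The remark that AVX512z beats AVX512y on this node can be checked directly from the throughputs in the log above. The following is a minimal sketch, not part of the PR: the dictionary and function names are illustrative, and the values are copied by hand from the DOUBLE C++ blocks for cori04.

```python
# Throughputs (EvtsPerSec[MECalcOnly], events/s) copied from the DOUBLE C++
# blocks of the cori04 log above. The names used here are illustrative only.
throughput = {
    "none": 1.728110e+06,
    "sse4": 3.367340e+06,
    "avx2": 6.840681e+06,
    "512y": 6.926176e+06,
    "512z": 7.523325e+06,
}


def speedups(values, baseline="none"):
    """Return each mode's throughput divided by the baseline mode's throughput."""
    base = values[baseline]
    return {mode: rate / base for mode, rate in values.items()}


vs_scalar = speedups(throughput)
z_over_y = throughput["512z"] / throughput["512y"]

for mode, factor in vs_scalar.items():
    print(f"{mode:>4}: {factor:.2f}x scalar")
print(f"512z / 512y = {z_over_y:.3f}")
```

With these numbers, 512z comes out roughly 9% faster than 512y in double precision, and a little over 4x faster than the scalar build.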
…Cori

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.243102e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.357723e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.300173 sec
     3,500,824,482      cycles                    #    2.587 GHz
     5,234,468,980      instructions              #    1.50  insn per cycle
       1.613028163 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.320422e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.204340 sec
    19,302,582,115      cycles                    #    2.673 GHz
    48,770,284,217      instructions              #    2.53  insn per cycle
       7.226044917 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.917645e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.690642 sec
     9,396,541,653      cycles                    #    2.534 GHz
    16,682,270,994      instructions              #    1.78  insn per cycle
       3.712107596 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
=========================================================================
…t aggressive inlining

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.212939e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.400295e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.849224 sec
     2,808,114,769      cycles                    #    2.588 GHz
     3,905,267,544      instructions              #    1.39  insn per cycle
       1.153374486 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.391700e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.261830e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.720050 sec
     2,519,873,643      cycles                    #    2.646 GHz
     3,695,253,338      instructions              #    1.47  insn per cycle
       1.015517484 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 6.941261e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.363094e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.811018 sec
     2,783,800,176      cycles                    #    2.651 GHz
     3,912,304,707      instructions              #    1.41  insn per cycle
       1.114209528 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 122
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.4.48 (gcc 9.2.0)] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=2, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 1.400312e+09                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 3.286396e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371686e-02 +- 3.270219e-06 )  GeV^0
TOTAL       :     0.719431 sec
     2,506,452,444      cycles                    #    2.650 GHz
     3,694,417,124      instructions              #    1.47  insn per cycle
       1.011940239 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 48
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.318427e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.235546 sec
    19,380,832,165      cycles                    #    2.672 GHz
    48,764,608,415      instructions              #    2.52  insn per cycle
       7.259182766 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.544157e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.928514 sec
    13,203,952,070      cycles                    #    2.669 GHz
    30,124,460,635      instructions              #    2.28  insn per cycle
       4.952199188 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.599620e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.760575 sec
     9,551,000,139      cycles                    #    2.526 GHz
    16,746,784,852      instructions              #    1.75  insn per cycle
       3.784228078 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.942522e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.677088 sec
     9,365,149,286      cycles                    #    2.533 GHz
    16,683,540,674      instructions              #    1.78  insn per cycle
       3.700388180 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2572) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 3.565285e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     4.173321 sec
     9,265,156,681      cycles                    #    2.210 GHz
    13,546,257,966      instructions              #    1.46  insn per cycle
       4.196664662 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1127) (512y:  205) (512z: 2045)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.208675e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     7.188579 sec
    19,265,327,616      cycles                    #    2.675 GHz
    47,812,080,167      instructions              #    2.48  insn per cycle
       7.205965330 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  578) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.552358e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270375e-06 )  GeV^0
TOTAL       :     3.421159 sec
     9,158,720,466      cycles                    #    2.670 GHz
    19,803,984,519      instructions              #    2.16  insn per cycle
       3.437778429 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3719) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.207503e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.769972 sec
     7,128,288,127      cycles                    #    2.560 GHz
    12,588,338,030      instructions              #    1.77  insn per cycle
       2.787861806 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3077) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 8.800518e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.708221 sec
     6,983,261,447      cycles                    #    2.566 GHz
    12,606,892,504      instructions              #    1.81  insn per cycle
       2.725243928 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2917) (512y:   81) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 7.041160e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270340e-06 )  GeV^0
TOTAL       :     2.853162 sec
     6,673,968,987      cycles                    #    2.328 GHz
    11,013,378,592      instructions              #    1.65  insn per cycle
       2.870540890 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 1559) (512y:  179) (512z: 2157)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.582173e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.828452 sec
    10,262,895,056      cycles                    #    2.668 GHz
    19,157,780,486      instructions              #    1.87  insn per cycle
       3.852100162 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  163) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 6.081384e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.487487 sec
     9,357,104,554      cycles                    #    2.669 GHz
    15,705,773,236      instructions              #    1.68  insn per cycle
       3.510681352 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  498) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.025374e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.995537 sec
     7,808,268,058      cycles                    #    2.590 GHz
    12,047,204,424      instructions              #    1.54  insn per cycle
       3.018613142 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  524) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.082475e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     2.992733 sec
     7,810,632,531      cycles                    #    2.593 GHz
    11,755,925,333      instructions              #    1.51  insn per cycle
       3.016263244 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  433) (512y:   15) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 7.769076e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     3.205406 sec
     7,722,855,175      cycles                    #    2.395 GHz
    10,916,378,460      instructions              #    1.41  insn per cycle
       3.228568931 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  130) (512y:   15) (512z:  318)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 4.914368e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371707e-02 +- 3.270376e-06 )  GeV^0
TOTAL       :     3.262435 sec
     8,752,968,444      cycles                    #    2.672 GHz
    18,407,763,149      instructions              #    2.10  insn per cycle
       3.278912088 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  180) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.229611e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270375e-06 )  GeV^0
TOTAL       :     2.507796 sec
     6,732,048,884      cycles                    #    2.670 GHz
    12,382,744,726      instructions              #    1.84  insn per cycle
       2.524398017 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  585) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.039683e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.283740 sec
     6,006,965,123      cycles                    #    2.617 GHz
    10,164,472,517      instructions              #    1.69  insn per cycle
       2.300369327 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  588) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 2.238065e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270339e-06 )  GeV^0
TOTAL       :     2.264588 sec
     5,969,249,872      cycles                    #    2.621 GHz
    10,046,255,732      instructions              #    1.68  insn per cycle
       2.281963503 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  499) (512y:   12) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=5, zero=0)
Internal loops fptype_sv    = VECTOR[16] ('512z': AVX512, 512bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MECalcOnly] (3a) = ( 1.669172e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371705e-02 +- 3.270340e-06 )  GeV^0
TOTAL       :     2.333696 sec
     5,849,649,158      cycles                    #    2.492 GHz
     9,622,400,849      instructions              #    1.64  insn per cycle
       2.350369337 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  198) (512y:   12) (512z:  343)
=========================================================================
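
To compare the inlineHel=0 and inlineHel=1 builds without reading every block by eye, the log can be parsed mechanically. The following is a minimal sketch that assumes the log format shown above. The function name `parse_blocks` and the embedded `SAMPLE` are illustrative and not part of the PR; the sample holds two blocks copied from the itscrd70 log, with the lines that the parser does not need left out.

```python
import re

# Two C++ blocks copied from the itscrd70 log above (DOUBLE, 'avx2'),
# reduced to the lines that the parser actually reads.
SAMPLE = """\
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 4.599620e+06                 )  sec^-1
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
EvtsPerSec[MECalcOnly] (3a) = ( 1.025374e+07                 )  sec^-1
"""


def parse_blocks(text):
    """Map (inlineHel, precision, simd) -> MECalcOnly throughput in events/s."""
    results = {}
    # Blocks are separated by lines made only of '-' or '=' characters.
    for block in re.split(r"^[-=]{10,}[ \t]*$", text, flags=re.MULTILINE):
        inline = re.search(r"\[inlineHel=(\d)\]", block)
        prec = re.search(r"FP precision\s*=\s*(\w+)", block)
        simd = re.search(r"fptype_sv\s*=.*?\('(\w+)'", block)
        rate = re.search(
            r"EvtsPerSec\[MECalcOnly\] \(3a\)\s*=\s*\(\s*([0-9.eE+-]+)", block
        )
        # Skip blocks (e.g. CUDA ones) that lack any of the four fields.
        if inline and prec and simd and rate:
            key = (int(inline.group(1)), prec.group(1), simd.group(1))
            results[key] = float(rate.group(1))
    return results


rates = parse_blocks(SAMPLE)
gain = rates[(1, "DOUBLE", "avx2")] / rates[(0, "DOUBLE", "avx2")]
print(f"inlineHel=1 / inlineHel=0 (DOUBLE, avx2) = {gain:.2f}")
```

For this pair of blocks the inlined build reaches a bit more than twice the MECalcOnly throughput of the non-inlined one. CUDA blocks have no `fptype_sv` line, so this sketch skips them.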
Interesting performance, even in inline mode.
This AMD CPU does not have AVX512 (the EPYC 7302 is a Zen 2 part; AVX512 only arrived with Zen 4)

On b7s01p3272.cern.ch [CPU: AMD EPYC 7302 16-Core Processor] [GPU: none]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 1.790062e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     5.262333 sec
    18,640,514,529      cycles                    #    3.271 GHz                      (60.00%)
     1,432,917,809      stalled-cycles-frontend   #    7.69% frontend cycles idle     (60.11%)
     1,003,671,602      stalled-cycles-backend    #    5.38% backend cycles idle      (60.13%)
    48,606,388,470      instructions              #    2.61  insn per cycle
                                                  #    0.03  stalled cycles per insn  (60.14%)
       5.267442975 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  614) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 3.400763e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     3.684801 sec
    13,394,877,121      cycles                    #    3.248 GHz                      (59.95%)
     1,443,046,459      stalled-cycles-frontend   #   10.77% frontend cycles idle     (59.36%)
       894,839,605      stalled-cycles-backend    #    6.68% backend cycles idle      (59.33%)
    30,509,899,644      instructions              #    2.28  insn per cycle
                                                  #    0.05  stalled cycles per insn  (59.24%)
       3.690049746 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3274) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 7.041320e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     2.590213 sec
     9,868,103,807      cycles                    #    3.266 GHz                      (60.15%)
     1,380,923,621      stalled-cycles-frontend   #   13.99% frontend cycles idle     (60.34%)
       833,710,896      stalled-cycles-backend    #    8.45% backend cycles idle      (60.43%)
    16,683,521,360      instructions              #    1.69  insn per cycle
                                                  #    0.08  stalled cycles per insn  (60.34%)
       2.597024209 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2746) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 1.529296e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371779e-02 +- 3.268970e-06 )  GeV^0
TOTAL       :     5.380939 sec
    18,311,633,178      cycles                    #    3.261 GHz                      (59.94%)
       945,832,161      stalled-cycles-frontend   #    5.17% frontend cycles idle     (60.08%)
       847,106,138      stalled-cycles-backend    #    4.63% backend cycles idle      (60.11%)
    46,706,082,317      instructions              #    2.55  insn per cycle
                                                  #    0.02  stalled cycles per insn  (60.11%)
       5.386563442 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  578) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 5.934684e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268970e-06 )  GeV^0
TOTAL       :     2.362167 sec
     8,475,411,126      cycles                    #    3.267 GHz                      (59.95%)
       926,585,890      stalled-cycles-frontend   #   10.93% frontend cycles idle     (60.01%)
       681,317,029      stalled-cycles-backend    #    8.04% backend cycles idle      (60.11%)
    18,786,518,201      instructions              #    2.22  insn per cycle
                                                  #    0.05  stalled cycles per insn  (60.19%)
       2.368856108 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 3719) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=0]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 1.147558e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371787e-02 +- 3.269413e-06 )  GeV^0
TOTAL       :     1.810575 sec
     6,641,203,529      cycles                    #    3.251 GHz                      (59.72%)
       940,209,521      stalled-cycles-frontend   #   14.16% frontend cycles idle     (60.14%)
       652,548,851      stalled-cycles-backend    #    9.83% backend cycles idle      (60.20%)
    11,491,557,470      instructions              #    1.73  insn per cycle
                                                  #    0.08  stalled cycles per insn  (60.24%)
       1.816615491 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 3077) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 5.227925e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     2.929974 sec
    11,004,676,798      cycles                    #    3.266 GHz                      (60.10%)
     1,386,171,583      stalled-cycles-frontend   #   12.60% frontend cycles idle     (60.11%)
       965,136,809      stalled-cycles-backend    #    8.77% backend cycles idle      (60.01%)
    18,906,467,688      instructions              #    1.72  insn per cycle
                                                  #    0.07  stalled cycles per insn  (59.99%)
       2.936587747 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  163) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[2] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 7.589393e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     2.572034 sec
     9,814,496,112      cycles                    #    3.263 GHz                      (60.01%)
     1,386,091,393      stalled-cycles-frontend   #   14.12% frontend cycles idle     (60.14%)
       860,847,649      stalled-cycles-backend    #    8.77% backend cycles idle      (60.19%)
    15,703,812,468      instructions              #    1.60  insn per cycle
                                                  #    0.09  stalled cycles per insn  (60.26%)
       2.577875993 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  498) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 1.514028e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.372113e-02 +- 3.270608e-06 )  GeV^0
TOTAL       :     2.107620 sec
     8,265,179,278      cycles                    #    3.255 GHz                      (60.44%)
     1,354,383,415      stalled-cycles-frontend   #   16.39% frontend cycles idle     (60.35%)
       825,027,663      stalled-cycles-backend    #    9.98% backend cycles idle      (60.31%)
    12,014,744,785      instructions              #    1.45  insn per cycle
                                                  #    0.11  stalled cycles per insn  (59.88%)
       2.112479856 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  524) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 5.783732e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371779e-02 +- 3.268970e-06 )  GeV^0
TOTAL       :     2.345963 sec
     8,413,201,674      cycles                    #    3.256 GHz                      (60.10%)
       902,628,888      stalled-cycles-frontend   #   10.73% frontend cycles idle     (60.15%)
       810,114,213      stalled-cycles-backend    #    9.63% backend cycles idle      (60.21%)
    17,318,842,828      instructions              #    2.06  insn per cycle
                                                  #    0.05  stalled cycles per insn  (60.17%)
       2.352630849 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  180) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=6, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('sse4': SSE4.2, 128bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 1.694465e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371780e-02 +- 3.268970e-06 )  GeV^0
TOTAL       :     1.653767 sec
     6,134,059,591      cycles                    #    3.252 GHz                      (60.03%)
       909,830,650      stalled-cycles-frontend   #   14.83% frontend cycles idle     (60.11%)
       664,626,397      stalled-cycles-backend    #   10.84% backend cycles idle      (60.21%)
    11,372,767,439      instructions              #    1.85  insn per cycle
                                                  #    0.08  stalled cycles per insn  (60.31%)
       1.659193081 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  585) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc 9.2.0] [inlineHel=1]
FP precision                = FLOAT (NaN/abnormal=4, zero=0)
Internal loops fptype_sv    = VECTOR[8] ('avx2': AVX2, 256bit) [cxtype_ref=YES]
Random number generation    = COMMON RANDOM (C++ code)
OMP threads / `nproc --all` = 1 / 64
EvtsPerSec[MECalcOnly] (3a) = ( 3.275830e+07                 )  sec^-1
MeanMatrixElemValue         = ( 1.371787e-02 +- 3.269413e-06 )  GeV^0
TOTAL       :     1.444105 sec
     5,451,930,107      cycles                    #    3.256 GHz                      (59.91%)
       934,593,213      stalled-cycles-frontend   #   17.14% frontend cycles idle     (59.88%)
       654,715,858      stalled-cycles-backend    #   12.01% backend cycles idle      (59.93%)
     9,123,578,136      instructions              #    1.67  insn per cycle
                                                  #    0.10  stalled cycles per insn  (59.94%)
       1.450689691 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2:  588) (512y:    0) (512z:    0)
=========================================================================