"xxx" function interface: further separation of data access and calculations? #175
…ons in xxx (issue madgraph5#175)

Performance seems the same as before. I only changed ixzxxx so far, which uses all four vector components. The other functions where only one component is needed may get a performance hit.

On itscrd03.cern.ch:
=========================================================================
Process = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.339762e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.714149 sec
2,509,108,138 cycles # 2.657 GHz
3,474,658,914 instructions # 1.38 insn per cycle
1.002530396 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.470064e+08 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 0.712408 sec
2,499,403,420 cycles # 2.658 GHz
3,469,331,063 instructions # 1.39 insn per cycle
0.999610147 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
=========================================================================
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.338363e+06 ) sec^-1
MeanMatrixElemValue = ( 1.872330e-02 +- 3.458847e-06 ) GeV^0
TOTAL : 7.058616 sec
18,898,411,510 cycles # 2.675 GHz
48,170,875,957 instructions # 2.55 insn per cycle
7.068277761 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 604) (avx2: 0) (512y: 0) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.894853e+06 ) sec^-1
MeanMatrixElemValue = ( 1.872330e-02 +- 3.458847e-06 ) GeV^0
TOTAL : 3.579694 sec
9,061,294,199 cycles # 2.527 GHz
16,402,712,569 instructions # 1.81 insn per cycle
3.589078581 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 0) (avx2: 2562) (512y: 95) (512z: 0)
-------------------------------------------------------------------------
Process = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147704e+06 ) sec^-1
MeanMatrixElemValue = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL : 7.828566 sec
20,973,799,401 cycles # 2.677 GHz
50,302,905,850 instructions # 2.40 insn per cycle
7.838230747 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4: 567) (avx2: 0) (512y: 0) (512z: 0)
=========================================================================
I have been continuing to experiment with a prototype for this in PR #189. There I have some code which "looks nice" and is fully functional, but for CUDA it is slower than before. I am using eemumu (in epoch1, because this is where I was working). In particular:
What has changed is that the decoding of the AOSOA is done in calculate_wavefunctions and that the many momenta are passed (by value or by reference, it does not seem to matter; the problem is that there are many of them) to the xxx functions. Clearly this strategy is no good for CUDA. I think that the correct strategy should be to do things differently in C++ and in CUDA:
Actually, this should go one step further in CUDA
Eventually, this should go many steps further:
First technicality about the interface
Second technicality on the interface
Plans to do:
I have transformed PR #189 into a DO-NOTHING PR. I have reverted all the changes I attempted there, but I merged it anyway to keep them in the history. This was probably a lot of wasted time (I learnt some things on the way, but at too much effort anyway). I will work on the other lines discussed above.
As discussed above, one of the main interests of these internal API changes is to eventually split the sigmaKin kernel into separate CUDA kernels. I have now opened issue #310 as a placeholder for this.
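For context, the general shape of that idea is sketched very roughly below, with hypothetical kernel and buffer names; the actual splitting is the subject of #310 and will certainly look different.

// Rough illustration only: splitting one monolithic kernel into a chain of
// smaller CUDA kernels that communicate through device buffers.
// All names (computeWavefunctions, computeMatrixElements, sigmaKinSplit) are
// hypothetical placeholders, not the real sigmaKin code.

__global__ void computeWavefunctions( const double* momenta, double* wavefunctions )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x; // one thread per event
  // ... decode the momenta of event ievt and fill its wavefunctions ...
}

__global__ void computeMatrixElements( const double* wavefunctions, double* matrixElements )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x; // one thread per event
  // ... combine the wavefunctions of event ievt into amplitudes and the matrix element ...
}

// Host side: one sigmaKin-like launch becomes a sequence of smaller launches
// sharing intermediate device buffers.
void sigmaKinSplit( const double* devMomenta, double* devWavefunctions,
                    double* devMatrixElements, int gpublocks, int gputhreads )
{
  computeWavefunctions<<<gpublocks, gputhreads>>>( devMomenta, devWavefunctions );
  computeMatrixElements<<<gpublocks, gputhreads>>>( devWavefunctions, devMatrixElements );
}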
I merged #313, which does the first bullet of the things I mentioned above. More specifically:
The next logical steps of this work will actually come in #310. This #175 could be closed, but I leave it open for the moment.
This is vaguely related to vectorisation issue #71 and the big PR #171.
One question I am considering is whether we should push the separation of data access and calculations even further.
Presently, the xxx functions
In practice, all these functions do two things
What I am proposing/questioning is that it may be better to have the xxx functions do only the calculation, while the AOSOA decoding should be done externally.
In practice
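As an illustration only, here is a very rough sketch of what such a split could look like. The names, the AOSOA layout constants and the signatures below are hypothetical placeholders, not the actual code in ee_mumu.

#include <complex>

typedef double fptype;                 // assuming the double-precision build
typedef std::complex<fptype> cxtype;   // placeholder complex type for this sketch

// Hypothetical AOSOA layout: momenta[npagM][npar][np4][neppM]
constexpr int np4 = 4;   // E, px, py, pz
constexpr int npar = 4;  // four external particles in eemumu
constexpr int neppM = 4; // events per AOSOA "page" (illustrative value)

// Decoding helper: the "data access" part that would live outside the xxx functions
inline const fptype& pIparIp4Ievt( const fptype* momenta, int ipar, int ip4, int ievt )
{
  const int ipagM = ievt / neppM; // AOSOA page
  const int ieppM = ievt % neppM; // event within the page
  return momenta[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}

// (a) present style: the xxx function both decodes the AOSOA and does the calculation
void ixzxxx_present( const fptype* momenta, int ipar, int ievt, cxtype fi[] )
{
  const fptype pvec0 = pIparIp4Ievt( momenta, ipar, 0, ievt );
  const fptype pvec1 = pIparIp4Ievt( momenta, ipar, 1, ievt );
  const fptype pvec2 = pIparIp4Ievt( momenta, ipar, 2, ievt );
  const fptype pvec3 = pIparIp4Ievt( momenta, ipar, 3, ievt );
  // ... compute the spinor wavefunction fi[] from pvec0..pvec3 ...
}

// (b) proposed style: calculation only; the caller decodes the AOSOA
// and passes the already-decoded momentum components
void ixzxxx_proposed( fptype pvec0, fptype pvec1, fptype pvec2, fptype pvec3, cxtype fi[] )
{
  // ... same computation, with no knowledge of the AOSOA memory layout ...
}

In style (b) the memory layout is known only to the caller, so changing the AOSOA structure would not touch the xxx functions at all.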
One drawback of the present class structure is that there are a lot of hidden couplings through hidden assumptions, e.g. rambo knows about the AOSOA structure too.
This newly suggested structure may eventually help in interfacing C++/CUDA to Fortran, or maybe not... to be reconsidered later.
In any case, one should understand whether the performance remains the same, degrades, or even improves through this refactoring. One should also study whether the various input fptype_sv should be passed by reference or by value. Hint: on this line I had the impression that passing by value was slightly faster, which I found a bit counterintuitive:
madgraph4gpu/epoch1/cuda/ee_mumu/src/HelAmps_sm.cc
Line 40 in 6f4916e
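For concreteness, the two alternatives being compared are of this form (a hypothetical, simplified signature; the real argument lists in HelAmps_sm.cc are longer and the function names below are placeholders):

#include <complex>

// Placeholder types for this sketch: in the scalar ("no SIMD") build fptype_sv
// is just a double and cxtype_sv a complex double; in the SIMD builds they are
// vector types spanning several events.
typedef double fptype_sv;
typedef std::complex<double> cxtype_sv;

// (1) decoded momentum component passed by const reference
void imzxxx_byref( const fptype_sv& pvec3, int nhel, int nsf, cxtype_sv fi[] );

// (2) decoded momentum component passed by value
// (here, passing by value appeared marginally faster, somewhat counterintuitively)
void imzxxx_byval( fptype_sv pvec3, int nhel, int nsf, cxtype_sv fi[] );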
Low priority. Just filing so I do not forget this...