"xxx" function interface: further separation of data access and calculations? #175

Open
valassi opened this issue Apr 25, 2021 · 4 comments
Labels: idea (Possible new development, may need further discussion), refactoring (Restructure classes and interfaces, improve modularity and encapsulation)

Comments


valassi commented Apr 25, 2021

This is vaguely related to vectorisation issue #71 and the big PR #171.

One question I am considering is whether we should push the separation of data access and calculations even further.

Presently, the xxx functions

  • take as input an fptype pointer allmomenta for all particles in all events, plus some parameters identifying which particle in which event is needed, and other parameters for the calculation such as mass, helicity, ingoing/outgoing
  • give as output the wavefunction for the given particle and event under the given calculation scenario

In practice, all these functions do two things

  • they decode the allmomenta AOSOA data structure using the pIparIp4Ievt and pIparIp4Ipag functions, extracting only the 4-momenta for the relevant particle and event
  • they then compute the wavefunction for that particle/event under the given scenario (mass, helicity, ingoing/outgoing)

What I am proposing/questioning is that it may be better to have the xxx functions do only the calculation, while the AOSOA decoding should be done externally.

In practice

  • the xxx interface would take as input the momenta for a given particle and a given event (or range of events, i.e. an "event page" for vectorization): since arrays of references are forbidden, however, this may require passing four separate fptype_sv (one each for energy, px, py, pz)
  • in other words, the xxx functions would know nothing of AOSOA structures
  • it would be the caller of the xxx functions, presently calculate_wavefunction, that encapsulates the knowledge of the AOSOA structure, possibly even through an external call (see the sketch below)
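
To make the proposal concrete, here is a minimal sketch of the two interface styles. Everything here is illustrative only: the names, placeholder type aliases and argument lists are assumptions, not the actual signatures in the code.

// Illustrative sketch only, not the actual repository API.
// Placeholder aliases stand in for the real fptype / fptype_sv / cxtype types.
using fptype = double;                  // scalar floating-point type
using fptype_sv = double;               // scalar, or SIMD vector over an "event page"
struct cxtype_sv { fptype_sv r, i; };   // complex scalar-or-vector

// Current style: the xxx function receives the full AOSOA buffer plus the indices
// needed to locate one particle in one event (or event page), and decodes it itself.
void ixzxxx_current( const fptype* allmomenta,   // AOSOA momenta for all particles, all events
                     const fptype fmass,         // particle mass
                     const int nhel,             // helicity
                     const int nsf,              // ingoing/outgoing flag
                     cxtype_sv fi[],             // output wavefunction
                     const int ipar );           // which external particle

// Proposed style: the caller decodes the AOSOA and passes only the momenta needed.
// Arrays of references are forbidden in C++, hence four separate fptype_sv arguments.
void ixzxxx_proposed( const fptype_sv& pvec0,    // energy (one event or one event page)
                      const fptype_sv& pvec1,    // px
                      const fptype_sv& pvec2,    // py
                      const fptype_sv& pvec3,    // pz
                      const fptype fmass,
                      const int nhel,
                      const int nsf,
                      cxtype_sv fi[] );          // output wavefunction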

One drawback of the present class structure is that there are a lot of hidden couplings through hidden assumptions, e.g. rambo also knows about the AOSOA structure.

This newly suggested structure may eventually help in interfacing C++/CUDA to Fortran, or maybe not... to be reconsidered later.

In any case, one should understand whether the performance remains the same, degrades, or even improves through this refactoring. One should also study whether the various input fptype_sv should be passed by reference or by value. Hint: in this line I had the impression that passing by value was slightly faster, which I found a bit counterintuitive:

// Return by value: it seems a tiny bit faster than returning a reference (both for scalar and vector), not clear why
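
For illustration, here is a minimal sketch of the two alternatives being compared, with an assumed AOSOA layout and hypothetical accessor names (the real layout constants and function names in the code differ):

// Hypothetical AOSOA accessors; illustrative layout [ipagM][ipar][ip4][ieppM] and sizes.
using fptype = double;
constexpr int npar = 4, np4 = 4, neppM = 4;   // illustrative values only

// Return by value: copies one momentum component out of the AOSOA.
inline fptype momentumByValue( const fptype* momenta, int ipar, int ip4, int ievt )
{
  const int ipagM = ievt / neppM, ieppM = ievt % neppM;
  return momenta[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}

// Return by const reference: hands back a reference into the AOSOA buffer itself.
inline const fptype& momentumByRef( const fptype* momenta, int ipar, int ip4, int ievt )
{
  const int ipagM = ievt / neppM, ieppM = ievt % neppM;
  return momenta[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}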

Low priority. Just filing so I do not forget this...

valassi added the idea and refactoring labels on Apr 25, 2021
valassi changed the title from ""xxx" function interface: further separation of data access and calculations" to ""xxx" function interface: further separation of data access and calculations?" on Apr 25, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 28, 2021
…ons in xxx (issue madgraph5#175)

Performance seems the same as before

I only changed ixzxxx so far, which uses all four vector components.
The other functions, where only one component is needed, may get a performance hit.

On itscrd03.cern.ch:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.339762e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.714149 sec
     2,509,108,138      cycles                    #    2.657 GHz
     3,474,658,914      instructions              #    1.38  insn per cycle
       1.002530396 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.470064e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.712408 sec
     2,499,403,420      cycles                    #    2.658 GHz
     3,469,331,063      instructions              #    1.39  insn per cycle
       0.999610147 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.338363e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.872330e-02 +- 3.458847e-06 )  GeV^0
TOTAL       :     7.058616 sec
    18,898,411,510      cycles                    #    2.675 GHz
    48,170,875,957      instructions              #    2.55  insn per cycle
       7.068277761 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  604) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.894853e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.872330e-02 +- 3.458847e-06 )  GeV^0
TOTAL       :     3.579694 sec
     9,061,294,199      cycles                    #    2.527 GHz
    16,402,712,569      instructions              #    1.81  insn per cycle
       3.589078581 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2562) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147704e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.828566 sec
    20,973,799,401      cycles                    #    2.677 GHz
    50,302,905,850      instructions              #    2.40  insn per cycle
       7.838230747 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
=========================================================================

valassi commented Dec 7, 2021

I have been continuing to experiment with a prototype for this in PR #189. There I have some code which "looks nice" and is fully functional, but for CUDA it is slower than it was. I am using eemumu (in epoch1, because this is where I was). In particular:

  • in commit 34963bf I had 1.36E9 evts/sec
  • in the following commit ac382b1 I lost 20% of the CUDA throughput (and registers increased from 120 to 130): I now have 1.15E9 evts/sec
  • in further commits I lost even a bit more in C++, and maybe in CUDA

What has changed is that the decoding of the AOSOA is done in calculate_wavefunctions and that the many momenta are passed (by value or by reference, it does not seem to matter: the problem is that there are many of them) to the xxx functions. Clearly this strategy is no good for CUDA.

I think that the correct strategy should be to do things differently in C++ and in CUDA:

  • in C++, one can decode all momenta externally in calculate_wavefunctions and pass them to the xxx function in the single C++ thread where they will be used (maybe a few can be done in parallel via SIMD, but it is a single function call in a single thread) [OK, there may also be OMP, but that should not be a big thing]
  • in CUDA, it is better to let the xxx functions do the decoding: this is because thousands of threads run xxx and then ffv functions on thousands of different events and their momenta (a sketch of this split follows below)
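
A hedged sketch of what this split could look like at the call site; 'ixzxxx_current', 'ixzxxx_proposed' and 'momentumByValue' refer to the hypothetical helpers sketched earlier in this thread, and none of this is the real API (in the real code the CUDA-side functions would be __device__ functions):

// Illustrative only: the caller hides the AOSOA decoding in C++, but hands the full
// buffer to the xxx function in CUDA so that each thread indexes its own event.
void calculate_wavefunctions_sketch( const fptype* allmomenta, cxtype_sv w_fi[] )
{
#ifdef __CUDACC__
  // CUDA: pass the full AOSOA; each of the thousands of threads indexes its own event
  // inside the xxx function, so nothing is materialised per thread before the call.
  ixzxxx_current( allmomenta, 0., +1, -1, w_fi, 0 /* ipar */ );
#else
  // C++: decode the AOSOA once here, for the single thread (or SIMD event page) being
  // processed, and pass only the four momentum components; xxx knows nothing of the AOSOA.
  const int ipar = 0, ievt = 0; // which particle / event this call handles
  const fptype_sv p0 = momentumByValue( allmomenta, ipar, 0, ievt );
  const fptype_sv p1 = momentumByValue( allmomenta, ipar, 1, ievt );
  const fptype_sv p2 = momentumByValue( allmomenta, ipar, 2, ievt );
  const fptype_sv p3 = momentumByValue( allmomenta, ipar, 3, ievt );
  ixzxxx_proposed( p0, p1, p2, p3, 0., +1, -1, w_fi );
#endif
}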

Actually, this should go one step further in CUDA

  • the input to xxx should be (as it was) the full momenta buffer AOSOA, with indexing done in xxx
  • the output from the final FFV should go to the full ME buffer, with the ME indexing also done in the ffv function, not in calculate_wavefunctions (see the sketch below)
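
A hedged CUDA-side sketch of what "ME indexing done in the ffv function" could mean; the types are placeholders, the function name and signature are assumptions, and the actual vertex math is not shown:

// Illustrative CUDA sketch only: placeholder types, hypothetical names, no real vertex math.
using fptype = double;
struct cxtype { fptype r, i; };

__device__ void FFV_sketch( const cxtype fi[], const cxtype fo[], const cxtype fv[],
                            fptype* allMEs ) // the *full* ME output buffer
{
  // One thread = one event: the event index is derived here from the thread coordinates,
  // instead of being computed in calculate_wavefunctions and passed in.
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  cxtype amp = { 0., 0. };
  // ... the actual vertex calculation from fi / fo / fv would fill 'amp' here ...
  allMEs[ievt] += amp.r; // ME indexing also done inside the ffv function
}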

Eventually, this should go many steps further:

  • each and every xxx output and ffv input/output should be a full array, with input/output indexing done in that function
  • and eventually still, each of these xxx/ffv calls should become a kernel of its own (see Stefan's WIP PR "Split kernels and more" #242, issue "cuda graphs" #12 on graphs, and maybe issue "Cuda streams" #11 on streams)
  • the general idea is that all these things should reduce register pressure, while my current approach of passing values to each xxx increases register pressure and slows down the code

First technicality about the interface

  • if we pass momenta directly in C++, we can remove ipagV from the xxx interface in C++
  • actually, we can even remove ipar from the xxx interface in C++

Second technicality on the interface

  • in CUDA, we cannot yet remove ipar from the interface of xxx, because presently the indexing of the big AOSOA needs to know which ipar we are talking about
  • eventually, we can remove ipar from the xxx interface in CUDA if we use an array structure (SOAOSOA? with ipar as the first dimension?) where the different external particles can be referenced separately in separate large arrays (see the sketch below)
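
A hedged sketch of the "ipar as first dimension" idea; the layout, sizes and helper name are hypothetical, not the current memory layout:

// Illustrative only: with momenta stored as [ipar][ipagM][ip4][ieppM], each external
// particle occupies one contiguous block, so the caller can pass just that slice and
// ipar can be dropped from the xxx interface.
using fptype = double;
constexpr int npagM = 1024, np4 = 4, neppM = 4; // illustrative sizes only

inline const fptype* particleSlice( const fptype* allmomenta, int ipar )
{
  return allmomenta + ipar * npagM * np4 * neppM; // start of this particle's large array
}

// e.g. ixzxxx( particleSlice( allmomenta, ipar ), ... )  // no ipar argument inside xxx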

Plans to do:

  • essentially, the first step is to complete this differentiation of C++ and CUDA, with the first technicality implemented
  • a second step will be to also test the second technicality, removing ipar from CUDA
  • a third step (or maybe a parallel second step) is to try out writing each xxx/ffv output to a large array in global memory, and see if the increased memory transfers are still a net gain because we reduce register pressure and increase occupancy...


valassi commented Dec 8, 2021

I have transformed PR #189 into a DO-NOTHING PR: I have reverted all the changes I attempted there, but merged them anyway to keep them in the history. This was probably a lot of wasted time (I learnt some things along the way, but it was too much effort anyway).

I will work on the other lines discussed above.


valassi commented Dec 9, 2021

As discussed above, one of the main interests of these internal API changes is to eventually split the sigmaKin kernel into separate CUDA kernels. I have now opened issue #310 as a placeholder for this.


valassi commented Dec 13, 2021

I merged #313 which does the first bullet of the things I mentioned above.

More specifically:
This is now complete thus far and can be merged. Main changes:

  • removed ipagV and ipar from the C++ internal APIs
  • started differentiating CUDA and C++ (the CUDA xxx functions deal with large arrays, while C++ deals with a single event or event vector)
  • renamed and moved the indexing functions to MemoryAccess.h
  • for the moment, I did keep the complex logic to deal with aligned, unaligned and arbitrary arrays (it can be simplified later on); I actually removed sanity checks and assumptions, and re-enabled alignment checks at every call, as the overhead is minimal (see the sketch below)
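
For reference, a per-call alignment check of the kind mentioned above can be as cheap as a modulo on the pointer value; a minimal sketch, where the helper name and the alignment constant are assumptions rather than the actual MemoryAccess.h code:

// Minimal sketch, illustrative only (not the actual MemoryAccess.h implementation).
#include <cstddef>
#include <cstdint>
constexpr std::size_t requiredAlignment = 64; // e.g. cache line / 512-bit SIMD alignment

inline bool isAligned( const void* ptr )
{
  return reinterpret_cast<std::uintptr_t>( ptr ) % requiredAlignment == 0; // cheap per-call check
}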

The next logical steps of this work will actually come in #310. This #175 could be closed, but I leave it open for the moment.
