"xxx" function interface: further separation of data access and calculations? #175

Open
valassi opened this issue Apr 25, 2021 · 4 comments
Labels: idea (Possible new development, may need further discussion), refactoring (Restructure classes and interfaces, improve modularity and encapsulation)

Comments


valassi commented Apr 25, 2021

This is vaguely related to vectorisation issue #71 and the big PR #171.

One question I am considering is whether we should push the separation of data access and calculations even further.

Presently, the xxx functions

  • take as input an fptype pointer allmomenta for all particles in all events, plus some parameters identifying which particle in which event is needed, and other parameters for the calculation such as mass, helicity, ingoing/outgoing
  • give as output the wavefunction for the given particle and event under the given calculation scenario

In practice, all these functions do two things

  • they decode the allmomenta AOSOA data structure using the pIparIp4Ievt and pIparIp4Ipag functions, extracting only the 4-momenta for the relevant particle and event
  • they then compute the wavefunction for that particle/event under the given scenario (mass, helicity, ingoing/outgoing)

What I am proposing/questioning is that it may be better to have the xxx functions do only the calculation, while the AOSOA decoding should be done externally.

In practice

  • the xxx interface would take as input the momenta for a given particle and a given event (or range of events, i.e. an "event page" for vectorization): since arrays of references are forbidden, however, this may require passing four separate fptype_sv (one each for energy, px, py, pz)
  • in other words, the xxx functions would know nothing of AOSOA structures
  • it would be the caller of the xxx functions, presently calculate_wavefunction, that encapsulates the knowledge of the AOSOA structure, possibly even through an external call (see the sketch below)
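
To make the proposal concrete, here is a minimal sketch of the two interface styles. Everything here is illustrative only: the names, placeholder type aliases and argument lists are assumptions, not the actual signatures in the code.

// Illustrative sketch only, not the actual repository API.
// Placeholder aliases stand in for the real fptype / fptype_sv / cxtype types.
using fptype = double;                  // scalar floating-point type
using fptype_sv = double;               // scalar, or SIMD vector over an "event page"
struct cxtype_sv { fptype_sv r, i; };   // complex scalar-or-vector

// Current style: the xxx function receives the full AOSOA buffer plus the indices
// needed to locate one particle in one event (or event page), and decodes it itself.
void ixzxxx_current( const fptype* allmomenta,   // AOSOA momenta for all particles, all events
                     const fptype fmass,         // particle mass
                     const int nhel,             // helicity
                     const int nsf,              // ingoing/outgoing flag
                     cxtype_sv fi[],             // output wavefunction
                     const int ipar );           // which external particle

// Proposed style: the caller decodes the AOSOA and passes only the momenta needed.
// Arrays of references are forbidden in C++, hence four separate fptype_sv arguments.
void ixzxxx_proposed( const fptype_sv& pvec0,    // energy (one event or one event page)
                      const fptype_sv& pvec1,    // px
                      const fptype_sv& pvec2,    // py
                      const fptype_sv& pvec3,    // pz
                      const fptype fmass,
                      const int nhel,
                      const int nsf,
                      cxtype_sv fi[] );          // output wavefunction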

One drawback of the present class structure is that there are a lot of hidden couplings through hidden assumptions, e.g. rambo also knows about the AOSOA structure.

This newly suggested structure may eventually help in interfacing C++/CUDA to Fortran, or maybe not... to be reconsidered later.

In any case, one should understand whether the performance remains the same, degrades, or even improves through this refactoring. One should also study whether the various input fptype_sv should be passed by reference or by value. Hint: in this line I had the impression that passing by value was slightly faster, which I found a bit counterintuitive:

// Return by value: it seems a tiny bit faster than returning a reference (both for scalar and vector), not clear why
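
For illustration, here is a minimal sketch of the two alternatives being compared, with an assumed AOSOA layout and hypothetical accessor names (the real layout constants and function names in the code differ):

// Hypothetical AOSOA accessors; illustrative layout [ipagM][ipar][ip4][ieppM] and sizes.
using fptype = double;
constexpr int npar = 4, np4 = 4, neppM = 4;   // illustrative values only

// Return by value: copies one momentum component out of the AOSOA.
inline fptype momentumByValue( const fptype* momenta, int ipar, int ip4, int ievt )
{
  const int ipagM = ievt / neppM, ieppM = ievt % neppM;
  return momenta[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}

// Return by const reference: hands back a reference into the AOSOA buffer itself.
inline const fptype& momentumByRef( const fptype* momenta, int ipar, int ip4, int ievt )
{
  const int ipagM = ievt / neppM, ieppM = ievt % neppM;
  return momenta[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}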

Low priority. Just filing so I do not forget this...

valassi added the idea and refactoring labels on Apr 25, 2021
valassi changed the title from ""xxx" function interface: further separation of data access and calculations" to ""xxx" function interface: further separation of data access and calculations?" on Apr 25, 2021
valassi added a commit to valassi/madgraph4gpu that referenced this issue Apr 28, 2021
…ons in xxx (issue madgraph5#175)

Performance seems the same as before

I only changed ixzxxx so far, which uses all four vector components.
The other functions, where only one component is needed, may get a performance hit.

On itscrd03.cern.ch:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.339762e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.714149 sec
     2,509,108,138      cycles                    #    2.657 GHz
     3,474,658,914      instructions              #    1.38  insn per cycle
       1.002530396 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
EvtsPerSec[MatrixElems] (3) = ( 7.470064e+08                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.712408 sec
     2,499,403,420      cycles                    #    2.658 GHz
     3,469,331,063      instructions              #    1.39  insn per cycle
       0.999610147 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 164
=========================================================================
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = SCALAR ('none': ~vector[1], no SIMD)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.338363e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.872330e-02 +- 3.458847e-06 )  GeV^0
TOTAL       :     7.058616 sec
    18,898,411,510      cycles                    #    2.675 GHz
    48,170,875,957      instructions              #    2.55  insn per cycle
       7.068277761 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  604) (avx2:    0) (512y:    0) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH1_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 4.894853e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.872330e-02 +- 3.458847e-06 )  GeV^0
TOTAL       :     3.579694 sec
     9,061,294,199      cycles                    #    2.527 GHz
    16,402,712,569      instructions              #    1.81  insn per cycle
       3.589078581 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:    0) (avx2: 2562) (512y:   95) (512z:    0)
-------------------------------------------------------------------------
Process                     = EPOCH2_EEMUMU_CPP [gcc (GCC) 9.2.0]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
OMP threads / `nproc --all` = 1 / 4
EvtsPerSec[MatrixElems] (3) = ( 1.147704e+06                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     7.828566 sec
    20,973,799,401      cycles                    #    2.677 GHz
    50,302,905,850      instructions              #    2.40  insn per cycle
       7.838230747 seconds time elapsed
=Symbols in CPPProcess.o= (~sse4:  567) (avx2:    0) (512y:    0) (512z:    0)
=========================================================================

valassi commented Dec 7, 2021

I have been continuing to experiment with a prototype for this in PR #189. There I have some code which "looks nice" and is fully functional, but for CUDA it is slower than it was. I am using eemumu (in epoch1, because this is where I was). In particular:

  • in commit 34963bf I had 1.36E9 evts/sec
  • in the following commit ac382b1 I lost 20% of the CUDA throughput (and registers increased from 120 to 130): I now have 1.15E9 evts/sec
  • in further commits I lost even a bit more in C++, and maybe in CUDA

What has changed is that the decoding of the AOSOA is done in calculate_wavefunctions and that the many momenta are passed (by value or by reference, it does not seem to matter: the problem is that there are many of them) to the xxx functions. Clearly this strategy is no good for CUDA.

I think that the correct strategy should be to do things differently in C++ and in CUDA:

  • in C++, one can decode all momenta externally in calculate_wavefunctions and pass them to the xxx function in the single C++ thread where they will be used (maybe a few can be done in parallel via SIMD, but it is a single function call in a single thread) [OK, there may also be OMP, but that should not be a big thing]
  • in CUDA, it is better to let the xxx functions do the decoding: this is because thousands of threads run xxx and then ffv functions on thousands of different events and their momenta (a sketch of this split follows below)
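
A hedged sketch of what this split could look like at the call site; 'ixzxxx_current', 'ixzxxx_proposed' and 'momentumByValue' refer to the hypothetical helpers sketched earlier in this thread, and none of this is the real API (in the real code the CUDA-side functions would be __device__ functions):

// Illustrative only: the caller hides the AOSOA decoding in C++, but hands the full
// buffer to the xxx function in CUDA so that each thread indexes its own event.
void calculate_wavefunctions_sketch( const fptype* allmomenta, cxtype_sv w_fi[] )
{
#ifdef __CUDACC__
  // CUDA: pass the full AOSOA; each of the thousands of threads indexes its own event
  // inside the xxx function, so nothing is materialised per thread before the call.
  ixzxxx_current( allmomenta, 0., +1, -1, w_fi, 0 /* ipar */ );
#else
  // C++: decode the AOSOA once here, for the single thread (or SIMD event page) being
  // processed, and pass only the four momentum components; xxx knows nothing of the AOSOA.
  const int ipar = 0, ievt = 0; // which particle / event this call handles
  const fptype_sv p0 = momentumByValue( allmomenta, ipar, 0, ievt );
  const fptype_sv p1 = momentumByValue( allmomenta, ipar, 1, ievt );
  const fptype_sv p2 = momentumByValue( allmomenta, ipar, 2, ievt );
  const fptype_sv p3 = momentumByValue( allmomenta, ipar, 3, ievt );
  ixzxxx_proposed( p0, p1, p2, p3, 0., +1, -1, w_fi );
#endif
}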

Actually, this should go one step further in CUDA

  • the input to xxx should be (as it was) the full momenta buffer AOSOA, with indexing done in xxx
  • the output from the final FFV should go to the full ME buffer, with the ME indexing also done in the ffv function, not in calculate_wavefunctions (see the sketch below)
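
A hedged CUDA-side sketch of what "ME indexing done in the ffv function" could mean; the types are placeholders, the function name and signature are assumptions, and the actual vertex math is not shown:

// Illustrative CUDA sketch only: placeholder types, hypothetical names, no real vertex math.
using fptype = double;
struct cxtype { fptype r, i; };

__device__ void FFV_sketch( const cxtype fi[], const cxtype fo[], const cxtype fv[],
                            fptype* allMEs ) // the *full* ME output buffer
{
  // One thread = one event: the event index is derived here from the thread coordinates,
  // instead of being computed in calculate_wavefunctions and passed in.
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  cxtype amp = { 0., 0. };
  // ... the actual vertex calculation from fi / fo / fv would fill 'amp' here ...
  allMEs[ievt] += amp.r; // ME indexing also done inside the ffv function
}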

Eventually, this should go many steps further:

  • each and every xxx output and ffv input/output should be a full array, with input/output indexing done in that function
  • and eventually still, each of these xxx/ffv calls should become a kernel of its own (see Stefan's WIP PR "Split kernels and more" #242, issue "cuda graphs" #12 on graphs, and maybe issue "Cuda streams" #11 on streams)
  • the general idea is that all these things should reduce register pressure, while my current approach of passing values to each xxx increases register pressure and slows down the code

First technicality about the interface

  • if we pass momenta directly in C++, we can remove ipagV from the xxx interface in C++
  • actually, we can even remove ipar from the xxx interface in C++

Second technicality on the interface

  • in CUDA, we cannot yet remove ipar from the interface of xxx, because presently the indexing of the big AOSOA needs to know which ipar we are talking about
  • eventually, we can remove ipar from the xxx interface in CUDA if we use an array structure (SOAOSOA? with ipar as the first dimension?) where the different external particles can be referenced separately in separate large arrays (see the sketch below)
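
A hedged sketch of the "ipar as first dimension" idea; the layout, sizes and helper name are hypothetical, not the current memory layout:

// Illustrative only: with momenta stored as [ipar][ipagM][ip4][ieppM], each external
// particle occupies one contiguous block, so the caller can pass just that slice and
// ipar can be dropped from the xxx interface.
using fptype = double;
constexpr int npagM = 1024, np4 = 4, neppM = 4; // illustrative sizes only

inline const fptype* particleSlice( const fptype* allmomenta, int ipar )
{
  return allmomenta + ipar * npagM * np4 * neppM; // start of this particle's large array
}

// e.g. ixzxxx( particleSlice( allmomenta, ipar ), ... )  // no ipar argument inside xxx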

Plans to do:

  • essentially, the first step is to complete this differentiation of C++ and CUDA, with the first technicality implemented
  • a second step will be to also test the second technicality, removing ipar from CUDA
  • a third step (or maybe a parallel second step) is to try out writing each xxx/ffv output to a large array in global memory, and see if the increased memory transfers are still a net gain because we reduce register pressure and increase occupancy...


valassi commented Dec 8, 2021

I have transformed PR #189 into a DO-NOTHING PR: I have reverted all the changes I attempted there, but merged them anyway to keep them in the history. This was probably a lot of wasted time (I learnt some things along the way, but it was too much effort anyway).

I will work on the other lines discussed above.


valassi commented Dec 9, 2021

As discussed above, one of the main interests of these internal API changes is to eventually split the sigmaKin kernel into separate CUDA kernels. I have now opened issue #310 as a placeholder for this.


valassi commented Dec 13, 2021

I merged #313 which does the first bullet of the things I mentioned above.

More specifically:
This is now complete thus far and can be merged. Main changes:

  • removed ipagV and ipar from the C++ internal APIs
  • started differentiating CUDA and C++ (the CUDA xxx functions deal with large arrays, while C++ deals with a single event or event vector)
  • renamed and moved the indexing functions to MemoryAccess.h
  • for the moment, I did keep the complex logic to deal with aligned, unaligned and arbitrary arrays (it can be simplified later on); I actually removed sanity checks and assumptions, and re-enabled alignment checks at every call, as the overhead is minimal (see the sketch below)
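
For reference, a per-call alignment check of the kind mentioned above can be as cheap as a modulo on the pointer value; a minimal sketch, where the helper name and the alignment constant are assumptions rather than the actual MemoryAccess.h code:

// Minimal sketch, illustrative only (not the actual MemoryAccess.h implementation).
#include <cstddef>
#include <cstdint>
constexpr std::size_t requiredAlignment = 64; // e.g. cache line / 512-bit SIMD alignment

inline bool isAligned( const void* ptr )
{
  return reinterpret_cast<std::uintptr_t>( ptr ) % requiredAlignment == 0; // cheap per-call check
}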

The next logical steps of this work will actually come in #310. This #175 could be closed, but I leave it open for the moment.
