AOS/SOA for input particle 4-momenta (and random numbers) #16

Closed
roiser opened this issue Aug 12, 2020 · 15 comments

roiser commented Aug 12, 2020

Done, including AOSOA (AV)

eemumu_AV/master

Steered by typedefs. We need to follow this idea through to the end.

Should go back and time AOS… actually AOSOA 5.0E8, AOS 4.8E8

@roiser roiser added the enhancement (A feature we want to develop) and upstream (Ready to be included in the MG5 code generator) labels Aug 12, 2020

valassi commented Aug 13, 2020

Actually I timed this yesterday, and AOSOA does seem to pay off, even if not by much.
See https://docs.google.com/document/d/1g2xwJ2FsSlxHvSUdPZjCyFW7zhsblMQ4g8UHlrkWyVw/edit#

Description   CUDA tput
BASELINE      5.00E8/s
SOA           4.65E8/s  (-7%)
AOS           4.85E8/s  (-3%)

I would say that we should keep some AOSOA-like structure as baseline

To do in any case

  • understand better what causes the performance improvement (memory access via coalescing? SIMD-vectorized computations?)
  • massive clean-up of the code; we should use the same array dimensions everywhere: the goal should be AOSOA[ngpublocks][nparticles][np4=4(E,px,py,pz)][ngputhreadsinblock]... also the random number arrays should use ngputhreads as the last dimension (it is now hardcoded as 32, the number of threads in a warp); a sketch of such an indexing is given below
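As a minimal sketch (illustrative names only, not the actual code) of the kind of indexing such an AOSOA layout implies:

// Hypothetical sketch of AOSOA indexing for the momenta buffer: events are
// grouped into "pages" of neppM events, and within a page the same
// (particle, component) is contiguous across events, so that consecutive
// threads in a warp read consecutive memory locations.
#include <cstddef>

constexpr int npar  = 4;  // particles per event (e.g. e+ e- -> mu+ mu-)
constexpr int np4   = 4;  // E, px, py, pz
constexpr int neppM = 32; // events per page (here: the number of threads in a warp)

inline std::size_t momentaIndex( std::size_t ievt, int ipar, int ip4 )
{
  const std::size_t ipagM = ievt / neppM; // page index
  const std::size_t ieppM = ievt % neppM; // event index within the page
  return ( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM;
}

// neppM=1 degenerates to a plain AOS[ievt][ipar][ip4];
// neppM=nevt degenerates to a SOA[ipar][ip4][ievt].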


valassi commented Aug 14, 2020

Some food for thought from Vincenzo, found by chance by googling "godbolt cuda soa":
https://indico.cern.ch/event/851670/contributions/3585184/


valassi commented Aug 14, 2020

And another interesting CERN talk found with the same google search for godbolt:
https://indico.cern.ch/event/932905/contributions/3920347/


valassi commented Aug 14, 2020

I would like to make some changes to the AOSOA, but first I'd like to understand a bit better if/why the layout makes a difference. I can think of two things: memory coalescing and instruction vectorization. So I am doing some research to understand which metrics are relevant, e.g. in the profiler.

About memory coalescing, googling "nvidia nsight coalesce" brought me here: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels. This is a very interesting article pointing to two metrics and showing how to focus on them in the tools. I added them here: 0429183

I then profiled the BASELINE ASA against AOS and SOA. At face value I confirm that SOA is similar, just ~2% slower than ASA, while AOS is quite a bit slower, around 7-10%. The profiles are very interesting.

First, looking at the metrics in the blog above, the number of memory requests in sigmaKin is indeed the same in ASA and AOS, but the number of sectors (transactions) is a factor 4 higher with AOS than with ASA. This is a clear indication that memory access (to the allmomenta array) is not coalesced in AOS and is better in ASA. The distinction between ASA and SOA is much more subtle and unclear: the numbers of requests and sectors are lower in SOA than in ASA, with a ratio between them that remains comparable. Note also that the ratio is around 6, while the optimal should be around 4?

This is ASA
[screenshot]

This is SOA
[screenshot]

This is AOS
[screenshot]

Second, SOA uses a higher number of registers than ASA (182 against 152), and this may be the reason for a penalty elsewhere, e.g. in the number of active warps. Maybe this explains the slightly lower throughput of SOA, maybe not. This is SOA compared to ASA
[screenshot]


valassi commented Aug 17, 2020

I have made many studies related to this and I will dump a few results and decisions here.

First, I cleanly separated in the code the ASA structure for random numbers (based on neppR) from that for momenta (based on neppM). In issue #23 I studied the impact of making a compile-time vs a runtime choice of the ASA parameters. The difference is small but visible, at the level of 5% in throughput for neppM, and visible in other profiler parameters.
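For illustration only (this is not the actual code, just the gist of the two options):

// Compile-time neppM: a constexpr lets the compiler constant-fold the
// divisions and modulos in the AOSOA indexing arithmetic
constexpr int neppM = 4;

// Runtime neppM: the value is only known at execution time (e.g. read from
// the command line), so the indexing arithmetic cannot be constant-folded
// int neppM = std::atoi( argv[1] );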


valassi commented Aug 17, 2020

Next, I studied whether AOS is any different from ASA with neppM=1. They are actually almost identical. There are differences below 1% in some metrics and I am not even sure why.

This is an overview
[screenshot]

This is the warp state
[screenshot]

This is the instruction mix
[screenshot]

The #registers are essentially the same.

Conclusion: I will drop AOS from the code to make it simpler. This can always be recovered by setting neppM=1 in the ASA option (and indeed it is interesting for some studies, see later).


valassi commented Aug 17, 2020

Then, I compared AOSOA (my default, with neppM=32) to SOA (which is essentially AOSOA with a much larger neppM=16384*32). The latter is worse in all relevant metrics and does give a lower throughput. I am not sure why, but the issue seems to come from a much larger number of registers (operations not vectorized??). Anyway, even from first principles this is not really a sound choice.

Overview
[screenshot]

Memory
[screenshot]

Compute workload
[screenshot]

Scheduler and stalls
[screenshot]

Instruction mix
[screenshot]

Conclusion: I will finally drop SOA as well and concentrate only on AOSOA. This will allow much cleaner code. The relevant parameter in any case is neppM (as small as 1 means AOS, as large as ndim means SOA), and various options can still be studied.

valassi added a commit that referenced this issue Aug 17, 2020
time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[32]
Momenta memory layout     = AOSOA[32]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.061890e-02 sec
MeanTimeInWaveFuncs       = 8.849083e-04 sec
StdDevTimeInWaveFuncs     = 2.232153e-05 sec
MinTimeInWaveFuncs        = 8.750640e-04 sec
MaxTimeInWaveFuncs        = 9.580780e-04 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 8.247658e+07 sec^-1
MatrixElemEventsPerSec    = 5.924772e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.371972e-02 GeV^0
StdErrMatrixElemValue     = 3.270361e-06 GeV^0
StdDevMatrixElemValue     = 8.202972e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.146479 sec
0a ProcInit : 0.000569 sec
0b MemAlloc : 0.075094 sec
0c GenCreat : 0.014556 sec
1a GenSeed  : 0.000012 sec
1b GenRnGen : 0.007995 sec
2a RamboIni : 0.000108 sec
2b RamboFin : 0.000058 sec
2c CpDTHwgt : 0.006855 sec
2d CpDTHmom : 0.069260 sec
3a SigmaKin : 0.000096 sec
3b CpDTHmes : 0.010523 sec
4a DumpLoop : 0.022509 sec
9a DumpAll  : 0.023724 sec
9b GenDestr : 0.000221 sec
9c MemFree  : 0.020944 sec
9d CudReset : 0.042465 sec
TOTAL       : 0.441471 sec
TOTAL(n-2)  : 0.252526 sec
***************************************

real    0m0.452s
user    0m0.174s
sys     0m0.276s

valassi commented Aug 17, 2020

Coming now to the optimal nepp size in the AOSOA. So far I have been using a default of 32, which I chose because it is the number of threads in a warp. I am instead getting the idea that 4 is a better size for our doubles (8 bytes), because the cache lines are 32 bytes and fit 4 doubles. (Probably 8 would be better for floats.)

To start with, at face value, ASA32 vs ASA04 (where neppR=neppM=32 or 4): the throughput is marginally better with ASA04. In particular, the FP64 pipeline seems a bit better utilized, and in the end that's the only thing we do: compute a lot of FP64, so we should do that as fast as possible.
[screenshot]

Note that the memory difference comes from a large increase in data pipe LSU wavefronts, but I am not sure that is good in itself. I would say 4 is better than 32 because it takes a bit less time overall, and FP64 utilization is better (both only at the level of 1%).

I also compared ASA04 to ASA08, but the latter is not better, and maybe slightly worse, for doubles.

Instead, ASA01, ie AOS, is very interesting. This is clearly worse than ASA32 (or ASA04). The big difference here is that the number of requests is the same, but the number of sectors is a factor 4 higher: this is an indication of non-coalesced memory access.
[screenshot]

For the record I also tried ASA02, but clearly this needs twice as many transactions as ASA04.
[screenshot]

Note that indeed for FLOAT the minimum to get fully coalesced access is ASA08 (8 floats of 4 bytes fill a 32-byte cache line). Using ASA04 there results in twice as many transactions as needed.

Conclusion: in other words, we need at least ASA04 (double) or ASA08 (float) to have coalesced memory access. We do not seem to gain anything by using a higher nepp. I will set the new defaults.
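As a quick numerical check of this conclusion (a sketch assuming 32-byte memory transactions, not the actual code):

// Sketch: minimum events-per-page for fully coalesced access,
// assuming 32-byte memory transactions (sectors)
#include <cstdio>

template<typename fptype>
constexpr int minNeppForCoalescing() { return 32 / sizeof( fptype ); }

int main()
{
  printf( "double: nepp >= %d\n", minNeppForCoalescing<double>() ); // 4
  printf( "float : nepp >= %d\n", minNeppForCoalescing<float>() );  // 8
  return 0;
}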


valassi commented Aug 17, 2020

The next point is to try to understand the number of requests. I would imagine that, to improve memory usage, we can do two things:

  • First, for a given number of requests, try to serve them with the smallest number of transactions. This is about reducing the transaction/request ratio, and it is what I did above. As long as we use at least nepp=4 for doubles and nepp=8 for floats, access is coalesced and we minimise the transactions needed for a given request.
  • The second is to try to understand the number of requests themselves.

To study this I used the two metrics described here: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/. I realised that they are also available on the command line, e.g. as

/usr/local/cuda-11.0/bin/ncu --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum ./gcheck.exe -p 16384 32 1
...
  gProc::sigmaKin(double const*, double*), 2020-Aug-17 17:09:35, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                      1,527,808
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                     10,154,052
    ---------------------------------------------------------------------- --------------- ------------------------------

The question is why 1527808 requests? And why 10154052 transactions by the way?

Using my current default nepp=4, I ran this for several configurations

-p blocks threads iterations    requests    sectors (transactions)
-p  16384      32          1   1,527,808   10,154,052
-p   2048      32          1     352,256    2,558,406
-p    256      32          1      70,144      528,896
-p     32      32          1       8,768       66,112
-p      2      32          1         548        4,132
-p      1      64          1         548        4,132
-p      1      32          1         274        2,066
-p      1       4          1         274          274

This is a very useful link: https://stackoverflow.com/questions/60535867. I actually found it by looking for the names of the old metrics (https://docs.nvidia.com/nsight-compute/2019.5/NsightComputeCli/index.html#nvprof-metric-comparison), because the new ones are still not well documented.

A few comments on the above:

  • The first point is that each request is a request at the warp level, which is 32 threads. Whether I use 4 threads or 32, the number of requests is the same, 274.
  • The second point is that a transaction is a 32-byte cache line. So, one coalesced request for doubles (8 bytes) at the warp level needs 8 pages of 4 doubles. However, other types of requests may need a different number of transactions. Here it seems that 274 is 256+18. These requests for a full warp need 2066 transactions, which is 256x8(=2048)+18. So, maybe this is 256 doubles, and 18 something else.
  • As soon as I request more warps (either with more threads than 32 per block, or with more blocks), the requests increase. Both for one 64-thread block, or two 32-thread blocks, I need two warps, and twice the number of requests. The number of transactions also doubles.
  • This linearity holds up to 256 blocks: "256x32" needs 256 times as many requests and transactions as "1x32", 70144=256x274 and 528896=256x2066. For larger numbers of blocks I cannot really understand the pattern anymore: the number of requests decreases with respect to what I would expect. To be understood further...


valassi commented Aug 18, 2020

It turns out that most of the 274 requests were due to the way I had implemented helicity filtering. I have improved helicity filtering (issue #24) and this drastically reduces the number of requests. For the default nepp=4:


./profile.sh -nogui -p 1 4 1
  gProc::sigmaKin(double const*, double*), 2020-Aug-18 09:25:37, Context 1, Stream 7
    Section: Command line profiler metrics
    ---------------------------------------------------------------------- --------------- ------------------------------
    l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum                                request                             16
    l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum                                  sector                             16
    ---------------------------------------------------------------------- --------------- ------------------------------

I redid my previous table with the new code:

-p blocks threads iterations    requests    sectors (transactions)
-p      1       4          1          16           16
-p      1      32          1          16          128
-p      1      64          1          32          256
-p      2      32          1          32          256
-p    256      32          1        4096        32768
-p   2048      32          1       32768       262134 -- 262141
-p  16384      32          1      262144      2097044 -- 2097063
-p   2048     256          1      262144      2097152

The number of requests follows a perfect linearity: with 16384/32 I need 16384 times as many requests (262144=16384*16) as with 1/32. In other words, each warp of 32 threads issues 16 requests, which is not surprising as it needs 16 doubles (the 4-momenta of 4 particles).

The number of transactions also seems to follow the same linearity. For a high number of blocks the number of transactions fluctuates (maybe the profiling tool is not very precise on this point??), but essentially with 16384/32 I also need 16384 times as many transactions as with 1/32 (as 16384*128=2097152... this is actually what I get with 2048/256, but the throughput is lower than with 16384/32).

In particular

  • With 1/4, i.e. only 4 threads, each request needs only one transaction: each request is for one double per thread, i.e. 8 bytes, but only 4 threads of the warp are active, so each warp gets exactly one page of 4 doubles in a single 32-byte transaction.
  • With 1/32, i.e. 32 threads, each request needs eight transactions (eight 32-byte pages are needed to serve the 32 doubles of one request).
  • NB one "request" is for instance the request for pz of particle 2, if I understand correctly

This looks much better.
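As a cross-check, a sketch of the arithmetic behind these numbers (illustrative only; it assumes 16 coalesced double loads per warp, 32 threads per warp and 32-byte sectors, as described above):

// Sketch: expected load requests and sectors for the momenta reads in sigmaKin
#include <cstdio>

int main()
{
  const int nblocks = 16384, nthreads = 32;
  const int warps = nblocks * nthreads / 32;                  // requests and sectors are counted per warp
  const int loadsPerWarp = 16;                                // 4 particles x (E,px,py,pz), one double per thread
  const int sectorsPerLoad = 32 * sizeof( double ) / 32;      // 32 threads x 8 bytes / 32-byte sector = 8
  const long long requests = (long long)warps * loadsPerWarp; // 262144
  const long long sectors  = requests * sectorsPerLoad;       // 2097152 (close to the measured values)
  printf( "requests=%lld sectors=%lld\n", requests, sectors );
  return 0;
}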


valassi commented Aug 18, 2020

I have now repeated some of the previous profile tests on ASA with nepp=4 (default now) vs 32 (old default) vs 1 (ie AOS).

I would say that ASA32 and ASA04 are indistinguishable, while ASA01, i.e. AOS, is different. Surprisingly, it now seems to have a slightly higher throughput?
[screenshots]

Anyway, the memory is clearly less optimized: it is used much more because the data is retrieved in too many transactions. So there is a 75% L1 hit rate with AOS/ASA01, while it is 0% with ASA04, simply because each page is retrieved 4 times, so 3 times out of 4 it is already in the cache... But it is much better to retrieve it only once.
[screenshot]

I would CONCLUDE this issue on AOS/SOA structures FOR MOMENTA

  • It does not seem to be the primary issue: our main issue is that we must optimise the internals of sigmaKin, not the retrieval of the momenta into sigmaKin.
  • That said, it seems reasonable to keep nepp=4 (for doubles, or nepp=8 for floats) as the baseline. This reduces the number of transactions and avoids relying on the L1 cache, by simply retrieving all relevant data in one go.
  • The fact that there seems to be a small improvement in throughput with AOS is surprising and could be studied further, but I would just ignore it. Also, it may work here for eemumu where there are exactly four particles, but it may give issues elsewhere. Actually, one particle's 4-momentum is one cache line with doubles (4 doubles), but it becomes half a cache line with floats (4 floats is 16 bytes), and with an odd number of particles it may give issues.
  • Having the code already set up as an AOSOA (with a variable nepp) may also be useful to exploit SIMD on the CPU.
  • The structure of memory layouts for intermediate results of the sigmaKin computation (the w[5][6]) instead remains something that we may need to study better. This is discussed in issue "Memory layout (shared/global/local, AOSOA...) for intermediate wavefunctions in ME calculations" #7. I would close this one now...

@valassi valassi closed this as completed Aug 18, 2020
@valassi valassi changed the title AOS/SOA structures AOS/SOA structures for input particle 4-momenta Aug 18, 2020
@valassi valassi changed the title AOS/SOA structures for input particle 4-momenta AOS/SOA structures for input particle 4-momenta (and random numbers) Aug 18, 2020
@valassi valassi changed the title AOS/SOA structures for input particle 4-momenta (and random numbers) AOS/SOA for input particle 4-momenta (and random numbers) Aug 18, 2020
@valassi valassi mentioned this issue Nov 25, 2020
@valassi valassi reopened this Dec 9, 2020

valassi commented Dec 9, 2020

Marking as closed in the two attached projects.

@valassi valassi closed this as completed Dec 9, 2020
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 11, 2021
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.237518e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.361609e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.725189 sec
     2,549,736,022      cycles                    #    2.650 GHz
     3,503,747,691      instructions              #    1.37  insn per cycle
       1.023542621 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,444 [-p 2048 256 1]
=========================================================================
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jun 11, 2021


On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.043865e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351422e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     0.723324 sec
     2,546,454,804      cycles                    #    2.656 GHz
     3,488,166,591      instructions              #    1.37  insn per cycle
       1.020129392 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,242,606 [-p 2048 256 1]
=========================================================================

valassi commented Jun 11, 2021

Hi @roiser @oliviermattelaer @hageboeck, I did a few additional tests of AOSOA in CUDA while writing the paper (to avoid making stupid statements). This updates the observations above. This is all in PR #209, just merged.

Two main observations:

  • I confirm that in CUDA eemumu the AOSOA for momenta only brings a very marginal improvement with respect to AOS. I do see that using an AOS moves around four times more memory (the number of sectors goes up by x4), but the ME throughput only decreases by maybe 5%. And this is eemumu, where memory access to momenta is much more important than in ggttgg...
  • Second point: the number of requests now needed per event (or better, per event page) is 40: we need the full 4-momenta of two particles and the pz momenta of two particles (thanks to the simplified ixx/oxx functions), all this times four good helicities. In my previous posts here I mentioned 16, but that was for a single helicity, and before the simplification in ixx/oxx.

So, all looks understood here...
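For the record, a sketch of that counting (only reproducing the numbers quoted above):

// Sketch: expected global load requests per warp (one page of 32 events):
// the full 4-momenta of 2 particles plus pz of 2 particles, per good helicity
#include <cstdio>

int main()
{
  const int nGoodHel = 4;                // good helicities in eemumu
  const int loadsPerHel = 2 * 4 + 2 * 1; // 8 + 2 = 10 doubles per helicity
  printf( "requests per warp = %d\n", nGoodHel * loadsPerHel ); // 40
  return 0;
}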

This is the current default
214b8c2

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.227665e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.352993e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.119368 sec
     3,329,955,135      cycles                    #    2.649 GHz
     4,782,880,833      instructions              #    1.44  insn per cycle
       1.414098430 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,035 [-p 2048 256 1]
=========================================================================

This is with an AOS for momenta (neppM=1... I still keep random numbers as AOSOA[8] to get the same physics results)
380f06c
You can see that the number of sectors (transaction roundtrips) has gone up by a factor 4 because memory access is not coalesced. Still, throughput only goes down from 7.20E8 to 7.00E8 (and there are huge fluctuations on this machine).

On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[1] == AOS
EvtsPerSec[MatrixElems] (3) = ( 6.997163e+08                 )  sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.272792e+09                 )  sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 )  GeV^0
TOTAL       :     1.208810 sec
     3,334,791,948      cycles                    #    2.652 GHz
     4,801,124,639      instructions              #    1.44  insn per cycle
       1.502558668 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 1,280 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 20,971,817 [-p 2048 256 1]
=========================================================================


valassi commented Jun 11, 2021

PS Uh! I am silly.

I forgot to say, I added a NEW metric, where I do NOT include the device to host copy.

In the numbers above, the difference between AOSOA and AOS is a bit bigger: 1.35E9 to 1.27E9. But still, only around 8%.

I will quote in the paper that it seems below 10%...


valassi commented Jun 11, 2021

PPS A few observations in PR #210 for single precision: not surprisingly,

  • with the optimal AOSOA[8], the number of transactions is half of that in double precision with the optimal AOSOA[4], for the same number of requests
  • with the suboptimal AOS, the number of transactions is 8 times as many as with AOSOA[8]... actually it is the same as for double precision with AOS

In other words:

  • with single and double precision, if AOS is used, the number of requests is the same in the two cases (approximately 40 times the number of events, divided by 32) and the number of transactions is also the same (approximately 40 times the number of events) - a single request is issued for 32 threads, resulting in 32 transactions
  • if the optimal AOSOA is used (4 for double, 8 for single), the number of transactions decreases by a factor 4 or 8, respectively, because one request for 32 threads can be handled with only 8 or 4 transactions, respectively, instead of 32
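In numbers (a sketch under the same assumptions as above: 32 threads per warp, 32-byte sectors):

// Sketch: sectors needed per warp-level load request, optimal AOSOA vs AOS
#include <cstdio>

template<typename fptype>
void report( const char* name )
{
  const int aosoa = 32 * sizeof( fptype ) / 32; // optimal AOSOA: 8 (double) or 4 (float) sectors per request
  const int aos   = 32;                         // AOS: each of the 32 threads hits a different 32-byte sector
  printf( "%s: AOSOA %d sectors/request, AOS %d sectors/request\n", name, aosoa, aos );
}

int main()
{
  report<double>( "double" ); // 8 vs 32
  report<float>( "float" );   // 4 vs 32
  return 0;
}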

valassi added a commit to valassi/madgraph4gpu that referenced this issue May 20, 2022
…failing

patching file Source/dsample.f
Hunk madgraph5#3 FAILED at 181.
Hunk madgraph5#4 succeeded at 197 (offset 2 lines).
Hunk madgraph5#5 FAILED at 211.
Hunk madgraph5#6 succeeded at 893 (offset 3 lines).
2 out of 6 hunks FAILED -- saving rejects to file Source/dsample.f.rej
patching file SubProcesses/addmothers.f
patching file SubProcesses/cuts.f
patching file SubProcesses/makefile
Hunk madgraph5#3 FAILED at 61.
Hunk madgraph5#4 succeeded at 94 (offset 6 lines).
Hunk madgraph5#5 succeeded at 122 (offset 6 lines).
1 out of 5 hunks FAILED -- saving rejects to file SubProcesses/makefile.rej
patching file SubProcesses/reweight.f
Hunk #1 FAILED at 1782.
Hunk #2 succeeded at 1827 (offset 27 lines).
Hunk madgraph5#3 succeeded at 1841 (offset 27 lines).
Hunk madgraph5#4 succeeded at 1963 (offset 27 lines).
1 out of 4 hunks FAILED -- saving rejects to file SubProcesses/reweight.f.rej
patching file auto_dsig.f
Hunk madgraph5#6 FAILED at 301.
Hunk madgraph5#10 succeeded at 773 with fuzz 2 (offset 4 lines).
Hunk madgraph5#11 succeeded at 912 (offset 16 lines).
Hunk madgraph5#12 succeeded at 958 (offset 16 lines).
Hunk madgraph5#13 succeeded at 971 (offset 16 lines).
Hunk madgraph5#14 succeeded at 987 (offset 16 lines).
Hunk madgraph5#15 succeeded at 1006 (offset 16 lines).
Hunk madgraph5#16 succeeded at 1019 (offset 16 lines).
1 out of 16 hunks FAILED -- saving rejects to file auto_dsig.f.rej
patching file driver.f
patching file matrix1.f
patching file auto_dsig1.f
Hunk #2 succeeded at 220 (offset 7 lines).
Hunk madgraph5#3 succeeded at 290 (offset 7 lines).
Hunk madgraph5#4 succeeded at 453 (offset 8 lines).
Hunk madgraph5#5 succeeded at 464 (offset 8 lines).
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 14, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue Dec 14, 2022
valassi added a commit to valassi/madgraph4gpu that referenced this issue May 17, 2024
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected

STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt

The new issue madgraph5#845 is the following
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0  0x7f2a1a623860 in ???
+#1  0x7f2a1a622a05 in ???
+#2  0x7f2a1a254def in ???
+madgraph5#3  0x7f2a1ae20acc in ???
+madgraph5#4  0x7f2a1acc4575 in ???
+madgraph5#5  0x7f2a1ae1d4c9 in ???
+madgraph5#6  0x7f2a1ae2570d in ???
+madgraph5#7  0x7f2a1ae2afa1 in ???
+madgraph5#8  0x43008b in ???
+madgraph5#9  0x431c10 in ???
+madgraph5#10  0x432d47 in ???
+madgraph5#11  0x433b1e in ???
+madgraph5#12  0x44a921 in ???
+madgraph5#13  0x42ebbf in ???
+madgraph5#14  0x40371e in ???
+madgraph5#15  0x7f2a1a23feaf in ???
+madgraph5#16  0x7f2a1a23ff5f in ???
+madgraph5#17  0x403844 in ???
+madgraph5#18  0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed