AOS/SOA for input particle 4-momenta (and random numbers) #16
Actually I have timed this yesterday, and AOSOA does seem to pay off, even if not by much.
I would say that we should keep some AOSOA-like structure as the baseline in any case.
Some food for thought from Vincenzo, found by chance via a Google search for "godbolt cuda soa".
And another interesting talk at CERN turned up by the same Google search for godbolt.
I would like to make some changes to the AOSOA, but first I'd like to understand a bit better if/why the layout makes a difference. I can think of two things: memory coalescing and instruction vectorization. So I am doing some research to understand which metrics are relevant, e.g. in the profiler. About memory coalescing, googling "nvidia nsight coalesce" brought me here: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels. This is a very interesting article pointing to two metrics and showing how to focus on them in the tools. I added them here: 0429183

I have then profiled the BASELINE ASA against AOS and SOA. At face value I confirm that SOA is similar, just ~2% slower than ASA, while AOS is quite a bit slower, around 7-10%. The profiles are very interesting.

First, looking at the metrics in the blog above, the number of memory requests in sigmaKin is indeed the same in ASA and AOS, but the number of sectors (transactions) is a factor 4 higher with AOS than with ASA. This is a clear indication that the memory (the allmomenta memory) is not coalesced in AOS and is better coalesced in ASA. The distinction between ASA and SOA is much more subtle and unclear: the numbers of requests and sectors are lower in SOA than in ASA, with a ratio between them that remains comparable. Note also that the ratio is around 6, while the optimal should be around 4?

Second, SOA has a higher number of registers than ASA (182 against 152) and this may be the reason for a penalty elsewhere, e.g. in the number of active warps. Maybe this explains the slightly lower throughput of SOA, maybe not. This is SOA compared to ASA
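As an aside, here is a minimal, hypothetical CUDA sketch (not the actual sigmaKin code; names and sizes are illustrative) of why the layout changes the number of sectors per request: with AOS, consecutive threads of a warp read addresses 16 doubles apart, so one warp-wide load touches up to 32 different 32-byte sectors, while with an AOSOA page of 32 events the same load reads 32 consecutive doubles, i.e. 256 bytes in 8 sectors.

```cuda
// Hypothetical illustration of coalesced (AOSOA) vs uncoalesced (AOS) reads
// of one momentum component per event; not the actual sigmaKin code.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int npar = 4; // external particles (e+ e- mu+ mu-)
constexpr int np4  = 4; // momentum components (E, px, py, pz)

// AOS: momenta[ievt][ipar][ip4] -> the stride between events is 16 doubles
// (128 bytes), so a warp reading one component touches 32 distinct sectors.
__global__ void readAOS( const double* momenta, double* out, int ipar, int ip4 )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  out[ievt] = momenta[( ievt * npar + ipar ) * np4 + ip4];
}

// AOSOA with pages of neppM events: momenta[ipagM][ipar][ip4][ieppM] -> for
// neppM=32 the 32 threads of a warp read 32 consecutive doubles (8 sectors).
template<int neppM>
__global__ void readAOSOA( const double* momenta, double* out, int ipar, int ip4 )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  const int ipagM = ievt / neppM;
  const int ieppM = ievt % neppM;
  out[ievt] = momenta[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}

int main()
{
  const int nevt = 16384 * 32;
  double *d_momenta, *d_out;
  cudaMalloc( &d_momenta, nevt * npar * np4 * sizeof( double ) );
  cudaMalloc( &d_out, nevt * sizeof( double ) );
  cudaMemset( d_momenta, 0, nevt * npar * np4 * sizeof( double ) );
  readAOS<<<nevt / 32, 32>>>( d_momenta, d_out, 0, 0 );       // expect ~4x more sectors
  readAOSOA<32><<<nevt / 32, 32>>>( d_momenta, d_out, 0, 0 ); // expect coalesced accesses
  cudaDeviceSynchronize();
  cudaFree( d_momenta );
  cudaFree( d_out );
  printf( "done\n" );
  return 0;
}
```

Profiling these two kernels with the two metrics from the blog post should reproduce the ~4x ratio in sectors at equal requests.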
I have made many studies related to this and I will dump a few results and decisions here. First, I cleanly separated in the code the ASA structure for random numbers (based on neppR) and for momenta (based on neppM). In issue #23 I studied the impact of making a compile-time vs a run-time choice for the ASA parameters. The difference is small but visible, at the level of 5% throughput for neppM, and visible in other profiler parameters.
Next, I studied whether AOS is any different from ASA with neppM=1. They are actually almost identical. There are differences below 1% in some metrics and I am not even sure why. The number of registers is essentially the same. Conclusion: I will drop AOS from the code to make it simpler. This can always be recovered by setting neppM=1 in the ASA option (and indeed it is interesting for some studies, see later).
Then, I compared AOSOA (in my default with neppM=32) to SOA (which is essentially AOSOA with a much larger neppM=16384*32). The latter is worse in all relevant metrics and does give a lower throughput. I am not sure why, but the issues seem to come from a much larger number of registers (operations not vectorized??). Anyway, even from first principles this is not really a sound choice. Conclusion: I will finally drop SOA as well. I will then concentrate only on AOSOA. This will allow much cleaner code. The relevant parameter in any case is neppM (neppM=1 gives AOS, neppM equal to the total number of events gives SOA) and various options can still be studied.
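To make the "neppM interpolates between AOS and SOA" statement concrete, here is a small hedged sketch of the index mapping (a hypothetical helper, not the repository code) and of its two degenerate cases:

```cpp
// Hedged sketch of the AOSOA index mapping (hypothetical helper, not the
// actual code): events are grouped in pages of neppM events; within a page
// the layout is SOA, across pages it is AOS.
#include <cassert>
#include <cstddef>

const int npar = 4; // external particles
const int np4  = 4; // momentum components (E, px, py, pz)

std::size_t aosoaIndex( std::size_t ievt, int ipar, int ip4, std::size_t neppM )
{
  const std::size_t ipagM = ievt / neppM; // page index
  const std::size_t ieppM = ievt % neppM; // event index within the page
  return ( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM;
}

int main()
{
  const std::size_t nevt = 16384 * 32;
  // neppM = 1 is pure AOS: the 16 doubles of one event are contiguous...
  assert( aosoaIndex( 0, 0, 1, 1 ) == 1 );
  // ...and the next event starts 16 doubles later.
  assert( aosoaIndex( 1, 0, 0, 1 ) == npar * np4 );
  // neppM = nevt is pure SOA: the same component of consecutive events is contiguous...
  assert( aosoaIndex( 1, 0, 0, nevt ) == 1 );
  // ...and the next component starts nevt doubles later.
  assert( aosoaIndex( 0, 0, 1, nevt ) == nevt );
  return 0;
}
```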
```
time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[32]
Momenta memory layout     = AOSOA[32]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.061890e-02 sec
MeanTimeInWaveFuncs       = 8.849083e-04 sec
StdDevTimeInWaveFuncs     = 2.232153e-05 sec
MinTimeInWaveFuncs        = 8.750640e-04 sec
MaxTimeInWaveFuncs        = 9.580780e-04 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 8.247658e+07 sec^-1
MatrixElemEventsPerSec    = 5.924772e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.371972e-02 GeV^0
StdErrMatrixElemValue     = 3.270361e-06 GeV^0
StdDevMatrixElemValue     = 8.202972e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.146479 sec
0a ProcInit : 0.000569 sec
0b MemAlloc : 0.075094 sec
0c GenCreat : 0.014556 sec
1a GenSeed  : 0.000012 sec
1b GenRnGen : 0.007995 sec
2a RamboIni : 0.000108 sec
2b RamboFin : 0.000058 sec
2c CpDTHwgt : 0.006855 sec
2d CpDTHmom : 0.069260 sec
3a SigmaKin : 0.000096 sec
3b CpDTHmes : 0.010523 sec
4a DumpLoop : 0.022509 sec
9a DumpAll  : 0.023724 sec
9b GenDestr : 0.000221 sec
9c MemFree  : 0.020944 sec
9d CudReset : 0.042465 sec
TOTAL      : 0.441471 sec
TOTAL(n-2) : 0.252526 sec
***************************************
real    0m0.452s
user    0m0.174s
sys     0m0.276s
```
The next point is to try to understand the number of requests. I would imagine that to improve memory usage we can do two things:
To study this I used the two metrics described here: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/. I actually realised that they are also available on the command line, e.g. along the lines of the sketch below.
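For the record, something like this (a hedged sketch: the metric names are the two from the blog post; the exact CLI invocation and the kernel-filter flag may differ across Nsight Compute versions, e.g. the older nv-nsight-cu-cli):

```sh
ncu --kernel-name sigmaKin \
    --metrics l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum \
    ./gcheck.exe -p 2048 256 1
```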
The question is: why 1527808 requests? And why 10154052 transactions, by the way? Using my current default nepp=4, I ran this for several configurations
This is a very useful link: https://stackoverflow.com/questions/60535867. I actually found it by looking for the names of the old metrics (https://docs.nvidia.com/nsight-compute/2019.5/NsightComputeCli/index.html#nvprof-metric-comparison) because the new ones are still not well documented. A few comments on the above:
It turns out that most of the 274 requests were due to the way I had implemented helicity filtering. I have improved the helicity filtering (issue #24) and this now drastically reduces the number of requests. For the default nepp=4,
I redo my previous table with the new code
The number of requests follows perfect linearity: with 16384/32 I need 16384 times as many requests (262144 = 16384*16) as with 1/32. In other words, each warp of 32 threads issues 16 requests, which is not surprising as each thread needs 16 doubles (the 4-momenta of 4 particles). The number of transactions also seems to follow the same linearity. For a high number of blocks the number of transactions fluctuates (maybe the profiling tool is not very precise on this point??), but essentially with 16384/32 I also need 16384 times as many transactions as with 1/32 (as 16384*128=2097152... this is actually what I get with 2048/256, but the throughput is smaller than with 16384/32). In particular
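As a sanity check of that arithmetic, a standalone sketch (assuming the CUDA warp size of 32, 32-byte sectors, and 16 doubles per event from 4 particles x 4 components; these are the "perfect coalescing" numbers, not the fluctuating measured ones):

```cpp
// Back-of-the-envelope check of the request/sector counts quoted above.
#include <cassert>

int main()
{
  const int warpSize        = 32;     // threads (events) per warp
  const int doublesPerEvent = 4 * 4;  // 4 particles x 4 momentum components
  const int sectorBytes     = 32;     // memory transaction granularity
  // One "request" is one warp-wide load of one double per thread.
  const int requestsPerWarp   = doublesPerEvent;                     // 16
  const int bytesPerRequest   = warpSize * (int)sizeof( double );    // 256
  // With a coalesced layout those 256 bytes fall into exactly 8 sectors.
  const int sectorsPerRequest = bytesPerRequest / sectorBytes;       // 8
  const int sectorsPerWarp    = requestsPerWarp * sectorsPerRequest; // 128
  const long nWarps = 16384;          // "-p 16384 32": one warp per block
  assert( nWarps * requestsPerWarp == 262144 );  // requests, as measured
  assert( nWarps * sectorsPerWarp  == 2097152 ); // sectors (transactions), ideal
  return 0;
}
```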
This looks much better.
I have now repeated some of the previous profile tests on ASA with nepp=4 (the current default) vs 32 (the old default) vs 1 (i.e. AOS). I would say that ASA32 and ASA04 are indistinguishable, while ASA01, i.e. AOS, is different. Surprisingly, it now seems to have a slightly higher throughput? Anyway, the memory is clearly less optimized: much more of it is moved, because the data is retrieved in too many transactions. So there is a 75% L1 hit rate with AOS/ASA01, while it is 0% with ASA04, simply because each 32-byte sector (holding 4 doubles) is requested 4 times, so 3 times out of 4 it is already in the cache... But it is much better to just retrieve it only once. I would CONCLUDE this issue on AOS/SOA structures FOR MOMENTA.
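A hedged back-of-the-envelope for that 75% figure (assuming 32-byte sectors, i.e. 4 doubles per sector, and that with AOS the 4 doubles sharing a sector belong to the same event):

```cpp
// Sketch of the expected L1 hit rate with AOS (neppM=1): each sector is
// requested once per double it holds; only the first access misses.
#include <cassert>

int main()
{
  const int sectorBytes      = 32;
  const int doublesPerSector = sectorBytes / (int)sizeof( double ); // 4
  const double l1HitRate     = 1.0 - 1.0 / doublesPerSector;        // 0.75
  assert( l1HitRate == 0.75 );
  return 0;
}
```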
Marking as closed in the two attached projects.
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
```
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.237518e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.361609e+09 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 0.725189 sec
2,549,736,022 cycles        # 2.650 GHz
3,503,747,691 instructions  # 1.37 insn per cycle
1.023542621 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,444 [-p 2048 256 1]
=========================================================================
```
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
```
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.043865e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351422e+09 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 0.723324 sec
2,546,454,804 cycles        # 2.656 GHz
3,488,166,591 instructions  # 1.37 insn per cycle
1.020129392 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,242,606 [-p 2048 256 1]
=========================================================================
```
Hi @roiser @oliviermattelaer @hageboeck I did a few additional tests of AOSOA in CUDA while writing the paper (to avoid making stupid statements). This updates the observations above. This is all in PR #209, just merged. Two main observations:
So, all looks understood here... This is the current default
This is with an AOS for momenta (neppM=1... I still keep random numbers as AOSOA[8] to get the same physics results)
PS Uh! I am silly. I forgot to say, I added a NEW metric, where I do NOT include the device-to-host copy. In the numbers above, the difference between AOSOA and AOS is a bit bigger: 1.35E9 to 1.27E9. But still, only around 8%. I will quote in the paper that it seems below 10%...
PPS A few observations in PR #210 for single precision: not surprisingly,
In other words:
…failing
```
patching file Source/dsample.f
Hunk madgraph5#3 FAILED at 181.
Hunk madgraph5#4 succeeded at 197 (offset 2 lines).
Hunk madgraph5#5 FAILED at 211.
Hunk madgraph5#6 succeeded at 893 (offset 3 lines).
2 out of 6 hunks FAILED -- saving rejects to file Source/dsample.f.rej
patching file SubProcesses/addmothers.f
patching file SubProcesses/cuts.f
patching file SubProcesses/makefile
Hunk madgraph5#3 FAILED at 61.
Hunk madgraph5#4 succeeded at 94 (offset 6 lines).
Hunk madgraph5#5 succeeded at 122 (offset 6 lines).
1 out of 5 hunks FAILED -- saving rejects to file SubProcesses/makefile.rej
patching file SubProcesses/reweight.f
Hunk #1 FAILED at 1782.
Hunk #2 succeeded at 1827 (offset 27 lines).
Hunk madgraph5#3 succeeded at 1841 (offset 27 lines).
Hunk madgraph5#4 succeeded at 1963 (offset 27 lines).
1 out of 4 hunks FAILED -- saving rejects to file SubProcesses/reweight.f.rej
patching file auto_dsig.f
Hunk madgraph5#6 FAILED at 301.
Hunk madgraph5#10 succeeded at 773 with fuzz 2 (offset 4 lines).
Hunk madgraph5#11 succeeded at 912 (offset 16 lines).
Hunk madgraph5#12 succeeded at 958 (offset 16 lines).
Hunk madgraph5#13 succeeded at 971 (offset 16 lines).
Hunk madgraph5#14 succeeded at 987 (offset 16 lines).
Hunk madgraph5#15 succeeded at 1006 (offset 16 lines).
Hunk madgraph5#16 succeeded at 1019 (offset 16 lines).
1 out of 16 hunks FAILED -- saving rejects to file auto_dsig.f.rej
patching file driver.f
patching file matrix1.f
patching file auto_dsig1.f
Hunk #2 succeeded at 220 (offset 7 lines).
Hunk madgraph5#3 succeeded at 290 (offset 7 lines).
Hunk madgraph5#4 succeeded at 453 (offset 8 lines).
Hunk madgraph5#5 succeeded at 464 (offset 8 lines).
```
…madgraph5#16, which is now in the way
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected
```
STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
```
The new issue madgraph5#845 is the following
```
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0 0x7f2a1a623860 in ???
+#1 0x7f2a1a622a05 in ???
+#2 0x7f2a1a254def in ???
+madgraph5#3 0x7f2a1ae20acc in ???
+madgraph5#4 0x7f2a1acc4575 in ???
+madgraph5#5 0x7f2a1ae1d4c9 in ???
+madgraph5#6 0x7f2a1ae2570d in ???
+madgraph5#7 0x7f2a1ae2afa1 in ???
+madgraph5#8 0x43008b in ???
+madgraph5#9 0x431c10 in ???
+madgraph5#10 0x432d47 in ???
+madgraph5#11 0x433b1e in ???
+madgraph5#12 0x44a921 in ???
+madgraph5#13 0x42ebbf in ???
+madgraph5#14 0x40371e in ???
+madgraph5#15 0x7f2a1a23feaf in ???
+madgraph5#16 0x7f2a1a23ff5f in ???
+madgraph5#17 0x403844 in ???
+madgraph5#18 0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
```
Done, including AOSOA (AV)
eemumu_AV/master
Steered by typedefs. We need to go to the end with this idea.
Should go back and time AOS… actually AOSOA 5.0E8, AOS 4.8E8