AOS/SOA for input particle 4-momenta (and random numbers) #16
Actually I have timed this yesterday, and AOSOA does seem to pay off, even if not by much.
I would say that we should keep some AOSOA-like structure as the baseline in any case.
Some food for thought from Vincenzo, found by chance via a Google search for "godbolt cuda soa".
And another interesting talk at CERN turned up by the same Google search for godbolt.
I would like to make some changes to the AOSOA, but first I'd like to understand a bit better if/why the layout makes a difference. I can think of two things: memory coalescing and instruction vectorization. So I am doing some research to understand which metrics are relevant, e.g. in the profiler. About memory coalescing, googling "nvidia nsight coalesce" brought me here: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels. This is a very interesting article pointing to two metrics and showing how to focus on them in the tools. I added them here: 0429183

I have then profiled the BASELINE ASA against AOS and SOA. At face value I confirm that SOA is similar, just ~2% slower than ASA, while AOS is quite a bit slower, around 7-10%. The profiles are very interesting.

First, looking at the metrics in the blog above, the number of memory requests in sigmaKin is indeed the same in ASA and AOS, but the number of sectors (transactions) is a factor 4 higher with AOS than with ASA. This is a clear indication that the memory (the allmomenta memory) is not coalesced in AOS and is better coalesced in ASA. The distinction between ASA and SOA is much more subtle and unclear: the numbers of requests and sectors are lower in SOA than in ASA, with a ratio between them that remains comparable. Note also that the ratio is around 6, while the optimal should be around 4?

Second, SOA has a higher number of registers than ASA (182 against 152) and this may be the reason for a penalty elsewhere, e.g. in the number of active warps. Maybe this explains the slightly lower throughput of SOA, maybe not. This is SOA compared to ASA
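As an aside, here is a minimal, hypothetical CUDA sketch (not the actual sigmaKin code; names and sizes are illustrative) of why the layout changes the number of sectors per request: with AOS, consecutive threads of a warp read addresses 16 doubles apart, so one warp-wide load touches up to 32 different 32-byte sectors, while with an AOSOA page of 32 events the same load reads 32 consecutive doubles, i.e. 256 bytes in 8 sectors.

```cuda
// Hypothetical illustration of coalesced (AOSOA) vs uncoalesced (AOS) reads
// of one momentum component per event; not the actual sigmaKin code.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int npar = 4; // external particles (e+ e- mu+ mu-)
constexpr int np4  = 4; // momentum components (E, px, py, pz)

// AOS: momenta[ievt][ipar][ip4] -> the stride between events is 16 doubles
// (128 bytes), so a warp reading one component touches 32 distinct sectors.
__global__ void readAOS( const double* momenta, double* out, int ipar, int ip4 )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  out[ievt] = momenta[( ievt * npar + ipar ) * np4 + ip4];
}

// AOSOA with pages of neppM events: momenta[ipagM][ipar][ip4][ieppM] -> for
// neppM=32 the 32 threads of a warp read 32 consecutive doubles (8 sectors).
template<int neppM>
__global__ void readAOSOA( const double* momenta, double* out, int ipar, int ip4 )
{
  const int ievt = blockDim.x * blockIdx.x + threadIdx.x;
  const int ipagM = ievt / neppM;
  const int ieppM = ievt % neppM;
  out[ievt] = momenta[( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM];
}

int main()
{
  const int nevt = 16384 * 32;
  double *d_momenta, *d_out;
  cudaMalloc( &d_momenta, nevt * npar * np4 * sizeof( double ) );
  cudaMalloc( &d_out, nevt * sizeof( double ) );
  cudaMemset( d_momenta, 0, nevt * npar * np4 * sizeof( double ) );
  readAOS<<<nevt / 32, 32>>>( d_momenta, d_out, 0, 0 );       // expect ~4x more sectors
  readAOSOA<32><<<nevt / 32, 32>>>( d_momenta, d_out, 0, 0 ); // expect coalesced accesses
  cudaDeviceSynchronize();
  cudaFree( d_momenta );
  cudaFree( d_out );
  printf( "done\n" );
  return 0;
}
```

Profiling these two kernels with the two metrics from the blog post should reproduce the ~4x ratio in sectors at equal requests.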
I have made many studies related to this and I will dump a few results and decisions here. First, I cleanly separated in the code the ASA structure for random numbers (based on neppR) and for momenta (based on neppM). In issue #23 I studied the impact of making a compile-time vs a run-time choice for the ASA parameters. The difference is small but visible, at the level of 5% throughput for neppM, and visible in other profiler parameters.
Next, I studied whether AOS is any different from ASA with neppM=1. They are actually almost identical. There are differences below 1% in some metrics and I am not even sure why. The number of registers is essentially the same. Conclusion: I will drop AOS from the code to make it simpler. This can always be recovered by setting neppM=1 in the ASA option (and indeed it is interesting for some studies, see later).
Then, I compared AOSOA (in my default with neppM=32) to SOA (which is essentially AOSOA with a much larger neppM=16384*32). The latter is worse in all relevant metrics and does give a lower throughput. I am not sure why, but the issues seem to come from a much larger number of registers (operations not vectorized??). Anyway, even from first principles this is not really a sound choice. Conclusion: I will finally drop SOA as well. I will then concentrate only on AOSOA. This will allow much cleaner code. The relevant parameter in any case is neppM (neppM=1 gives AOS, neppM equal to the total number of events gives SOA) and various options can still be studied.
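To make the "neppM interpolates between AOS and SOA" statement concrete, here is a small hedged sketch of the index mapping (a hypothetical helper, not the repository code) and of its two degenerate cases:

```cpp
// Hedged sketch of the AOSOA index mapping (hypothetical helper, not the
// actual code): events are grouped in pages of neppM events; within a page
// the layout is SOA, across pages it is AOS.
#include <cassert>
#include <cstddef>

const int npar = 4; // external particles
const int np4  = 4; // momentum components (E, px, py, pz)

std::size_t aosoaIndex( std::size_t ievt, int ipar, int ip4, std::size_t neppM )
{
  const std::size_t ipagM = ievt / neppM; // page index
  const std::size_t ieppM = ievt % neppM; // event index within the page
  return ( ( ipagM * npar + ipar ) * np4 + ip4 ) * neppM + ieppM;
}

int main()
{
  const std::size_t nevt = 16384 * 32;
  // neppM = 1 is pure AOS: the 16 doubles of one event are contiguous...
  assert( aosoaIndex( 0, 0, 1, 1 ) == 1 );
  // ...and the next event starts 16 doubles later.
  assert( aosoaIndex( 1, 0, 0, 1 ) == npar * np4 );
  // neppM = nevt is pure SOA: the same component of consecutive events is contiguous...
  assert( aosoaIndex( 1, 0, 0, nevt ) == 1 );
  // ...and the next component starts nevt doubles later.
  assert( aosoaIndex( 0, 0, 1, nevt ) == nevt );
  return 0;
}
```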
```
time ./gcheck.exe -p 16384 32 12
***************************************
NumIterations             = 12
NumThreadsPerBlock        = 32
NumBlocksPerGrid          = 16384
---------------------------------------
FP precision              = DOUBLE (nan=0)
Complex type              = THRUST::COMPLEX
RanNumb memory layout     = AOSOA[32]
Momenta memory layout     = AOSOA[32]
Wavefunction GPU memory   = LOCAL
Curand generation         = DEVICE (CUDA code)
---------------------------------------
NumberOfEntries           = 12
TotalTimeInWaveFuncs      = 1.061890e-02 sec
MeanTimeInWaveFuncs       = 8.849083e-04 sec
StdDevTimeInWaveFuncs     = 2.232153e-05 sec
MinTimeInWaveFuncs        = 8.750640e-04 sec
MaxTimeInWaveFuncs        = 9.580780e-04 sec
---------------------------------------
TotalEventsComputed       = 6291456
RamboEventsPerSec         = 8.247658e+07 sec^-1
MatrixElemEventsPerSec    = 5.924772e+08 sec^-1
***************************************
NumMatrixElements(notNan) = 6291456
MeanMatrixElemValue       = 1.371972e-02 GeV^0
StdErrMatrixElemValue     = 3.270361e-06 GeV^0
StdDevMatrixElemValue     = 8.202972e-03 GeV^0
MinMatrixElemValue        = 6.071582e-03 GeV^0
MaxMatrixElemValue        = 3.374925e-02 GeV^0
***************************************
00 CudaFree : 0.146479 sec
0a ProcInit : 0.000569 sec
0b MemAlloc : 0.075094 sec
0c GenCreat : 0.014556 sec
1a GenSeed  : 0.000012 sec
1b GenRnGen : 0.007995 sec
2a RamboIni : 0.000108 sec
2b RamboFin : 0.000058 sec
2c CpDTHwgt : 0.006855 sec
2d CpDTHmom : 0.069260 sec
3a SigmaKin : 0.000096 sec
3b CpDTHmes : 0.010523 sec
4a DumpLoop : 0.022509 sec
9a DumpAll  : 0.023724 sec
9b GenDestr : 0.000221 sec
9c MemFree  : 0.020944 sec
9d CudReset : 0.042465 sec
TOTAL      : 0.441471 sec
TOTAL(n-2) : 0.252526 sec
***************************************
real    0m0.452s
user    0m0.174s
sys     0m0.276s
```
The next point is to try to understand the number of requests. I would imagine that to improve memory usage we can do two things:
To study this I used the two metrics described here: https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/. I actually realised that they are also available on the command line, e.g. along the lines of the sketch below.
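For the record, something like this (a hedged sketch: the metric names are the two from the blog post; the exact CLI invocation and the kernel-filter flag may differ across Nsight Compute versions, e.g. the older nv-nsight-cu-cli):

```sh
ncu --kernel-name sigmaKin \
    --metrics l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum \
    ./gcheck.exe -p 2048 256 1
```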
The question is: why 1527808 requests? And why 10154052 transactions, by the way? Using my current default nepp=4, I ran this for several configurations
This is a very useful link: https://stackoverflow.com/questions/60535867. I actually found it by looking for the names of the old metrics (https://docs.nvidia.com/nsight-compute/2019.5/NsightComputeCli/index.html#nvprof-metric-comparison) because the new ones are still not well documented. A few comments on the above:
It turns out that most of the 274 requests were due to the way I had implemented helicity filtering. I have improved the helicity filtering (issue #24) and this now drastically reduces the number of requests. For the default nepp=4,
I redo my previous table with the new code
The number of requests follows perfect linearity: with 16384/32 I need 16384 times as many requests (262144 = 16384*16) as with 1/32. In other words, each warp of 32 threads issues 16 requests, which is not surprising as each thread needs 16 doubles (the 4-momenta of 4 particles). The number of transactions also seems to follow the same linearity. For a high number of blocks the number of transactions fluctuates (maybe the profiling tool is not very precise on this point??), but essentially with 16384/32 I also need 16384 times as many transactions as with 1/32 (as 16384*128=2097152... this is actually what I get with 2048/256, but the throughput is smaller than with 16384/32). In particular
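As a sanity check of that arithmetic, a standalone sketch (assuming the CUDA warp size of 32, 32-byte sectors, and 16 doubles per event from 4 particles x 4 components; these are the "perfect coalescing" numbers, not the fluctuating measured ones):

```cpp
// Back-of-the-envelope check of the request/sector counts quoted above.
#include <cassert>

int main()
{
  const int warpSize        = 32;     // threads (events) per warp
  const int doublesPerEvent = 4 * 4;  // 4 particles x 4 momentum components
  const int sectorBytes     = 32;     // memory transaction granularity
  // One "request" is one warp-wide load of one double per thread.
  const int requestsPerWarp   = doublesPerEvent;                     // 16
  const int bytesPerRequest   = warpSize * (int)sizeof( double );    // 256
  // With a coalesced layout those 256 bytes fall into exactly 8 sectors.
  const int sectorsPerRequest = bytesPerRequest / sectorBytes;       // 8
  const int sectorsPerWarp    = requestsPerWarp * sectorsPerRequest; // 128
  const long nWarps = 16384;          // "-p 16384 32": one warp per block
  assert( nWarps * requestsPerWarp == 262144 );  // requests, as measured
  assert( nWarps * sectorsPerWarp  == 2097152 ); // sectors (transactions), ideal
  return 0;
}
```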
This looks much better.
I have now repeated some of the previous profile tests on ASA with nepp=4 (the current default) vs 32 (the old default) vs 1 (i.e. AOS). I would say that ASA32 and ASA04 are indistinguishable, while ASA01, i.e. AOS, is different. Surprisingly, it now seems to have a slightly higher throughput? Anyway, the memory is clearly less optimized: much more of it is moved, because the data is retrieved in too many transactions. So there is a 75% L1 hit rate with AOS/ASA01, while it is 0% with ASA04, simply because each 32-byte sector (holding 4 doubles) is requested 4 times, so 3 times out of 4 it is already in the cache... But it is much better to just retrieve it only once. I would CONCLUDE this issue on AOS/SOA structures FOR MOMENTA.
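A hedged back-of-the-envelope for that 75% figure (assuming 32-byte sectors, i.e. 4 doubles per sector, and that with AOS the 4 doubles sharing a sector belong to the same event):

```cpp
// Sketch of the expected L1 hit rate with AOS (neppM=1): each sector is
// requested once per double it holds; only the first access misses.
#include <cassert>

int main()
{
  const int sectorBytes      = 32;
  const int doublesPerSector = sectorBytes / (int)sizeof( double ); // 4
  const double l1HitRate     = 1.0 - 1.0 / doublesPerSector;        // 0.75
  assert( l1HitRate == 0.75 );
  return 0;
}
```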
Marking as closed in the two attached projects.
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
```
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.237518e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.361609e+09 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 0.725189 sec
2,549,736,022 cycles        # 2.650 GHz
3,503,747,691 instructions  # 1.37 insn per cycle
1.023542621 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,243,444 [-p 2048 256 1]
=========================================================================
```
On itscrd70.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: NVIDIA Tesla V100S-PCIE-32GB]:
```
=========================================================================
Process                     = EPOCH1_EEMUMU_CUDA [nvcc 11.0.221]
FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
RanNumb memory layout       = AOSOA[8] [HARDCODED FOR REPRODUCIBILITY]
Momenta memory layout       = AOSOA[4]
EvtsPerSec[MatrixElems] (3) = ( 7.043865e+08 ) sec^-1
EvtsPerSec[MECalcOnly] (3a) = ( 1.351422e+09 ) sec^-1
MeanMatrixElemValue         = ( 1.371706e-02 +- 3.270315e-06 ) GeV^0
TOTAL                       : 0.723324 sec
2,546,454,804 cycles        # 2.656 GHz
3,488,166,591 instructions  # 1.37 insn per cycle
1.020129392 seconds time elapsed
==PROF== Profiling "sigmaKin": launch__registers_per_thread 120
==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
==PROF== Profiling "sigmaKin": requests 40 sectors 80 [-p 1 8 1]
==PROF== Profiling "sigmaKin": requests 40 sectors 320 [-p 1 32 1]
==PROF== Profiling "sigmaKin": requests 655,360 sectors 5,242,606 [-p 2048 256 1]
=========================================================================
```
Hi @roiser @oliviermattelaer @hageboeck I did a few additional tests of AOSOA in CUDA while writing the paper (to avoid making stupid statements). This updates the observations above. This is all in PR #209, just merged. Two main observations:
So, all looks understood here... This is the current default
This is with an AOS for momenta (neppM=1... I still keep random numbers as AOSOA[8] to get the same physics results)
PS Uh! I am silly. I forgot to say, I added a NEW metric, where I do NOT include the device-to-host copy. In the numbers above, the difference between AOSOA and AOS is a bit bigger: 1.35E9 to 1.27E9. But still, only around 8%. I will quote in the paper that it seems below 10%...
PPS A few observations in PR #210 for single precision: not surprisingly,
In other words:
…failing
```
patching file Source/dsample.f
Hunk madgraph5#3 FAILED at 181.
Hunk madgraph5#4 succeeded at 197 (offset 2 lines).
Hunk madgraph5#5 FAILED at 211.
Hunk madgraph5#6 succeeded at 893 (offset 3 lines).
2 out of 6 hunks FAILED -- saving rejects to file Source/dsample.f.rej
patching file SubProcesses/addmothers.f
patching file SubProcesses/cuts.f
patching file SubProcesses/makefile
Hunk madgraph5#3 FAILED at 61.
Hunk madgraph5#4 succeeded at 94 (offset 6 lines).
Hunk madgraph5#5 succeeded at 122 (offset 6 lines).
1 out of 5 hunks FAILED -- saving rejects to file SubProcesses/makefile.rej
patching file SubProcesses/reweight.f
Hunk #1 FAILED at 1782.
Hunk #2 succeeded at 1827 (offset 27 lines).
Hunk madgraph5#3 succeeded at 1841 (offset 27 lines).
Hunk madgraph5#4 succeeded at 1963 (offset 27 lines).
1 out of 4 hunks FAILED -- saving rejects to file SubProcesses/reweight.f.rej
patching file auto_dsig.f
Hunk madgraph5#6 FAILED at 301.
Hunk madgraph5#10 succeeded at 773 with fuzz 2 (offset 4 lines).
Hunk madgraph5#11 succeeded at 912 (offset 16 lines).
Hunk madgraph5#12 succeeded at 958 (offset 16 lines).
Hunk madgraph5#13 succeeded at 971 (offset 16 lines).
Hunk madgraph5#14 succeeded at 987 (offset 16 lines).
Hunk madgraph5#15 succeeded at 1006 (offset 16 lines).
Hunk madgraph5#16 succeeded at 1019 (offset 16 lines).
1 out of 16 hunks FAILED -- saving rejects to file auto_dsig.f.rej
patching file driver.f
patching file matrix1.f
patching file auto_dsig1.f
Hunk #2 succeeded at 220 (offset 7 lines).
Hunk madgraph5#3 succeeded at 290 (offset 7 lines).
Hunk madgraph5#4 succeeded at 453 (offset 8 lines).
Hunk madgraph5#5 succeeded at 464 (offset 8 lines).
```
…madgraph5#16, which is now in the way
…#845 in log_gqttq_mad_f_inl0_hrd0.txt, the rest as expected
```
STARTED  AT Thu May 16 01:24:16 AM CEST 2024
(SM tests)
ENDED(1) AT Thu May 16 05:58:45 AM CEST 2024 [Status=0]
(BSM tests)
ENDED(1) AT Thu May 16 06:07:42 AM CEST 2024 [Status=0]

24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
18 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
 1 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
24 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
 0 /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
```
The new issue madgraph5#845 is the following
```
+Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
+
+Backtrace for this error:
+#0 0x7f2a1a623860 in ???
+#1 0x7f2a1a622a05 in ???
+#2 0x7f2a1a254def in ???
+madgraph5#3 0x7f2a1ae20acc in ???
+madgraph5#4 0x7f2a1acc4575 in ???
+madgraph5#5 0x7f2a1ae1d4c9 in ???
+madgraph5#6 0x7f2a1ae2570d in ???
+madgraph5#7 0x7f2a1ae2afa1 in ???
+madgraph5#8 0x43008b in ???
+madgraph5#9 0x431c10 in ???
+madgraph5#10 0x432d47 in ???
+madgraph5#11 0x433b1e in ???
+madgraph5#12 0x44a921 in ???
+madgraph5#13 0x42ebbf in ???
+madgraph5#14 0x40371e in ???
+madgraph5#15 0x7f2a1a23feaf in ???
+madgraph5#16 0x7f2a1a23ff5f in ???
+madgraph5#17 0x403844 in ???
+madgraph5#18 0xffffffffffffffff in ???
+./madX.sh: line 379: 3004240 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
+ERROR! ' ./build.512z_f_inl0_hrd0/madevent_cpp < /tmp/avalassi/input_gqttq_x10_cudacpp > /tmp/avalassi/output_gqttq_x10_cudacpp' failed
```
Done, including AOSOA (AV)
eemumu_AV/master
Steered by typedefs. We need to go to the end with this idea.
Should go back and time AOS… actually AOSOA 5.0E8, AOS 4.8E8