API(4) Move HelAmps to the latest MemoryAccess classes #322

valassi · 2021-12-24T09:44:12Z

I have decided to split off, as a separate PR, what was initially "the second part" of the apirambo PR #321.

Essentially, in apirambo I have the new MemoryAccess for rambo and the old MemoryAccess for MEs. This new PR is about porting the new MemoryAccess to MEs, and specifically to HelAmps, too.
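A minimal C++ sketch may help picture what a momenta MemoryAccess class does. It assumes an AOSOA momenta buffer of doubles with layout [npagM][npar][np4][neppM]. MemoryAccessMomentaSketch, index() and decodeRecordIp4IparConst are hypothetical names, and ieventAccessRecordConst only borrows a name from this PR's commit messages without reproducing its real signature.

```cpp
#include <cassert>

// Hypothetical sketch, not the madgraph4gpu API: momenta stored as an AOSOA with layout
// [npagM][npar][np4][neppM], i.e. pages of neppM events in which the neppM values of one
// (ipar, ip4) component are contiguous.
template<int NPAR, int NEPPM>
struct MemoryAccessMomentaSketch
{
  static constexpr int npar = NPAR;   // number of particles
  static constexpr int np4 = 4;       // E, px, py, pz
  static constexpr int neppM = NEPPM; // events per page, encapsulated in the access class

  // Flat AOSOA index of component ip4 of particle ipar in event ievt
  static int index( int ievt, int ip4, int ipar )
  {
    assert( ievt >= 0 && ip4 >= 0 && ip4 < np4 && ipar >= 0 && ipar < npar );
    return ( ievt / neppM ) * npar * np4 * neppM // start of the page containing ievt
           + ipar * np4 * neppM + ip4 * neppM    // start of the (ipar, ip4) block in the page
           + ievt % neppM;                       // position of ievt within the page
  }

  // Pointer to the "record" of event ievt: its page start plus its position in the page
  static const double* ieventAccessRecordConst( const double* buffer, int ievt )
  {
    return buffer + index( ievt, 0, 0 );
  }

  // Read component (ip4, ipar) from a record returned by ieventAccessRecordConst
  static double decodeRecordIp4IparConst( const double* record, int ip4, int ipar )
  {
    return record[ipar * np4 * neppM + ip4 * neppM];
  }
};
```

Keeping neppM as a compile-time constant of the access class is one way to do the encapsulation this PR aims for, instead of a global setting in mgOnGpuConfig.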

This is WIP:

I have only prototyped one function, imzxxx; all other functions must still be ported
With the new memory access, the SIMD/512y performance in eemumu is 10% worse; this is still to be understood
The functional and performance tests, however, succeed; all segfaults and failures are fixed

What remains to be done:

fix the performance in eemumu
port all other ixxx/oxxx functions
rerun the throughput test in eemumu and check all is ok
remove neppM from mgOnGpuConfig; now it will be enough to encapsulate it in MemoryAccessMomenta
strip off p4type from MemoryAccess.h and rename it as MemoryAccessVectors.h (just the fptype/fptype_v conversions)
clean up and remove all unused code (e.g. Memory.h, possibly more)
backport to code generation
rerun tests for ggtt and ggttgg, check performance
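The MemoryAccessVectors.h item refers to converting between scalar fptype arrays and SIMD fptype_v vectors. A minimal sketch, assuming fptype = double and a GCC/Clang vector-extension fptype_v of neppV = 2 elements (the real width depends on the SIMD build, e.g. 4 doubles for 256-bit builds such as avx2 or 512y). The names fptypevFromArbitraryArray and fptypevToArbitraryArray are hypothetical. Loading by value via std::memcpy works for arbitrary (possibly unaligned) addresses, whereas dereferencing a reinterpret_cast of the array, as a helper like fptypevFromAlignedArray presumably wraps, requires suitable alignment.

```cpp
#include <cassert>
#include <cstring>

// Assumptions for this sketch: fptype is double and fptype_v is a GCC/Clang vector-extension
// type of neppV elements (2 doubles here; the real width depends on the SIMD build).
typedef double fptype;
constexpr int neppV = 2;
typedef fptype fptype_v __attribute__( ( vector_size( neppV * sizeof( fptype ) ) ) );
static_assert( sizeof( fptype_v ) == neppV * sizeof( fptype ), "fptype_v must hold neppV fptype values" );

// Hypothetical helper: load neppV consecutive fptype values starting at an arbitrary,
// possibly unaligned address. Copying into a value (std::memcpy) has no alignment
// requirement, unlike dereferencing a reinterpret_cast<const fptype_v*> of the array.
inline fptype_v fptypevFromArbitraryArray( const fptype* src )
{
  assert( src != nullptr );
  fptype_v out;
  std::memcpy( &out, src, sizeof( out ) );
  return out;
}

// Hypothetical helper: store the neppV elements of v to an arbitrary address
inline void fptypevToArbitraryArray( const fptype_v& v, fptype* dst )
{
  assert( dst != nullptr );
  std::memcpy( dst, &v, sizeof( v ) );
}
```

Returning by value lets the same interface serve both aligned and unaligned buffers, at the possible cost of an extra copy that the compiler may or may not optimize away.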

ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/10.2.0-c44b3/x86_64-centos7/bin/g++ -O3 -std=c++17 -I. -I../../src -I../../../../../tools -I../../../../../tools -I../../../../../test/googletest/googletest/include -DUSE_NVTX -Wall -Wshadow -Wextra -ffast-math -DMGONGPU_FPTYPE_DOUBLE -I/usr/local/cuda-11.1/include/ -c testxxx.cc -o testxxx.o
In file included from testxxx.cc:5:
../../src/HelAmps_sm.h:89:8: warning: inline function ‘void MG5_sm::imzxxx(const fptype_sv*, int, int, cxtype_sv*, int) [with M_ACCESS = KernelAccessMomenta<false>; fptype_sv = double; cxtype_sv = std::complex<double>]’ used but never defined
89 | void imzxxx( const fptype_sv* momenta,
/cvmfs/sft.cern.ch/lcg/releases/binutils/2.34-990b2/x86_64-centos7/bin/ld: ./testxxx.o: in function `SIGMA_SM_EPEM_MUPMUM_CPU_testxxx_Test::TestBody()':
testxxx.cc:(.text+0xdd76): undefined reference to `void MG5_sm::imzxxx<KernelAccessMomenta<false> >(double const*, int, int, std::complex<double>*, int)'

…t test fails at runtime
testxxx.cc:187: Failure
The difference between cxreal( wf[iw6] ) and expReal is 1000, which exceeds std::abs( expReal * toleranceXXXs ), where
cxreal( wf[iw6] ) evaluates to -500,
expReal evaluates to 500, and
std::abs( expReal * toleranceXXXs ) evaluates to 4.9999999999999999e-13.
itest=12: imzxxx#1 against ixxxxx

Revert "[apihel] experiment with noinline keyword - gives build warnings, performance slightly worse than expected". This reverts commit 63ca571.
Revert "[apihel] attempt another fix, define INLINE as inline in all cases - builds, but affects performance". This reverts commit 4f9089e.
Revert "[apihel] first fix to move XXX function implementation to cc file - testxxx build still fails". This reverts commit f594600.
Revert "[apihel] try again to move XXX function implementation to cc file - testxxx build fails". This reverts commit a570168.

valassi · 2022-01-06T10:01:37Z

This is now complete.

This PR was split off from what was initially "the second part" of the apirambo PR #321. Essentially, in apirambo I have the new MemoryAccessMomenta for rambo and the old MemoryAccessMomenta for MEs in HelAmps. This new PR ports the new MemoryAccessMomenta to HelAmps, too. Note that new BufferMEs and MemoryAccessMEs will be created in the upcoming apimes PR.
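Porting to HelAmps essentially means making each function generic over the momenta access pattern; the build log in this PR shows imzxxx instantiated with M_ACCESS = KernelAccessMomenta<false>. A minimal sketch of that style, assuming two hypothetical access policies (AccessAOS, AccessAOSOA) and a toy function lightconePlus that only reads E and pz; it is not one of the real ixxx/oxxx functions.

```cpp
#include <cassert>

// Hypothetical access policies (illustrative names): each decodes component ip4 of
// particle ipar from a pointer to one event's "record"; they differ only in the stride.
struct AccessAOS // record = one event as a contiguous [npar][np4] array (like neppM = 1)
{
  static constexpr int np4 = 4;
  static double kernelAccessIp4IparConst( const double* record, int ip4, int ipar )
  {
    return record[ipar * np4 + ip4];
  }
};

template<int NEPPM>
struct AccessAOSOA // record points into a page laid out as [npar][np4][neppM]
{
  static constexpr int np4 = 4;
  static double kernelAccessIp4IparConst( const double* record, int ip4, int ipar )
  {
    return record[( ipar * np4 + ip4 ) * NEPPM];
  }
};

// Toy "HelAmps-like" function, generic over the memory access policy M_ACCESS:
// it only reads E (ip4 = 0) and pz (ip4 = 3) and returns E + pz.
template<class M_ACCESS>
inline double lightconePlus( const double* momentaRecord, int ipar )
{
  assert( momentaRecord != nullptr && ipar >= 0 );
  const double e = M_ACCESS::kernelAccessIp4IparConst( momentaRecord, 0, ipar );
  const double pz = M_ACCESS::kernelAccessIp4IparConst( momentaRecord, 3, ipar );
  return e + pz;
}
```

Because such a function is a template, its definition must be visible in every translation unit that instantiates it. This is presumably why the attempts to move the XXX function implementations to the .cc file (later reverted) led to "used but never defined" and undefined-reference errors when building testxxx.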

With respect to the previous WIP:

I had only prototyped one function, imzxxx; now all other functions have been ported too. Note that imzxxx was a bad choice of prototype: it had a functional bug and was always reading the same first event, but since the initial momenta are the same for all events, the bug had gone unnoticed (the average ME was ok).
With the new memory access, the SIMD/512y performance in eemumu was 10% worse. After fixing the functional bug, this hit disappeared and 'magically' eemumu even became 5-10% faster. However, I have just run the ggtt and ggttgg tests, and ggtt performance seems worse (to be understood), especially with hardcoded parameters (?!), and only for avx2 or avx512 builds, while none/sse4 are faster (?!). For ggttgg, all is ok.
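Why a reader that always returns the first event can pass such checks is easy to reproduce in a toy. This sketch assumes a single particle in an AOSOA layout [npagM][np4][neppM] and compares the average energy read through a buggy accessor (no event offset) with a fixed one; all names are hypothetical, and the average energy only stands in, by analogy, for the average ME.

```cpp
#include <cassert>

// Toy illustration with hypothetical names: one particle, AOSOA layout [npagM][np4][neppM];
// each accessor returns the energy (ip4 = 0) of event ievt.
constexpr int np4 = 4;
constexpr int neppM = 4;

inline int aosoaIndex( int ievt, int ip4 )
{
  return ( ievt / neppM ) * np4 * neppM + ip4 * neppM + ievt % neppM;
}

// BUGGY: the event offset is never applied, so event 0 is read for every ievt
inline double energyBuggy( const double* buffer, int /*ievt*/ ) { return buffer[aosoaIndex( 0, 0 )]; }

// FIXED: locate the page containing ievt and the position of ievt within that page
inline double energyFixed( const double* buffer, int ievt ) { return buffer[aosoaIndex( ievt, 0 )]; }

// Average over nevt events of the energy returned by the given accessor
inline double averageEnergy( double ( *energy )( const double*, int ), const double* buffer, int nevt )
{
  assert( nevt > 0 );
  double sum = 0.;
  for( int ievt = 0; ievt < nevt; ievt++ ) sum += energy( buffer, ievt );
  return sum / nevt;
}
```

With identical momenta in all events both accessors give the same average, so only inputs that differ from event to event expose the bug, as happened here once oxzxxx was tried.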

As for the other points that I had listed as still to be done:

fix the performance in eemumu: done, with the caveats above for ggtt
port all other ixxx/oxxx functions: done
rerun the throughput test in eemumu and check all is ok: done
remove neppM from mgOnGpuConfig; now it will be enough to encapsulate it in MemoryAccessMomenta: done
strip off p4type from MemoryAccess.h and rename it as MemoryAccessVectors.h (just the fptype/fptype_v conversions): done
clean up and remove all unused code (e.g. Memory.h, possibly more): done, but note that removing Memory.h will need the apimes PR
backport to code generation: done
rerun tests for ggtt and ggttgg, check performance: done

In summary, still pending:

apimes PR (upcoming "API5"): new Buffer and MemoryAccess for MEs, also get rid of Memory.h
better understand the impact of these changes on ggtt performance

valassi · 2022-01-06T17:04:43Z

This completes the "API4" step described in #323

valassi · 2022-01-10T16:10:26Z

This was presented at the meeting today
https://indico.cern.ch/event/1103955/
I am merging this now

valassi added 21 commits December 24, 2021 07:13

[apihel] add __host__ to __device__ functions for MEs

ae643fb

[apihel] improve namespaces in testxxx.cc, WIP on new MemoryAccess, builds ok

ab8db15

[apihel] first successful proof of concept for new MemoryAccess in HelAmps - none check/gcheck builds (Still to do: testxxx port and SIMD vector types)

2c0baf2

[apihel] temporarely add back also the old imzxxx to debug

fd68ee5

[apihel] add optional debug printouts to MemoryAccessMomenta

8e598db

[apihel] improve debug printouts in testxxx

f14912e

[apihel] more debug printouts for testxxx, fix some comments in MemoryAccess classes

b49e602

[apihel] BUG FIX in testxxx - a call to ieventAccessRecordConst was missing (was always testing event 0)

5ad503c

[apihel] cleanup - disable all debug printouts, runTest.exe is ok

04995dc

[apihel] WIP on fptye_sv in MemeoryAccess: first of all, imzxxx now takes fptype* as input, not fptype_sv*

e1e3398

[apihel] comment out debug printouts instead of using "#if 0"

d25a0a4

[apihel] add partial hack for fptype_sv in MemoryAccessMomenta kernel const - check.exe ok, runTest segfaults

6cc2ec3

[apihel] add fptypevFromAlignedArray to wrap reinterpret cast

0b0a5d5

[apihel] return by value in kernelAccessIp4IparConst (prepare to support unaligned/arbitrary arrays)

0a0fadb

[apihel] WIP on unaligned/arbitrary access - testxxx now fails (no segfault), check is slow 4.40E6 512y?

650bcd0

[apihel] use HostBufferMomenta in testxxx.cc - runTest still fails

073d877

(NB runTest still segfaults if the checks are skipped)

[apihel] reenable a "if constexpr" that I had forgotten - check.exe still slow

03ecc91

[apihel] BUG FIX in testxxx: with vector new MemoryAccess, locate ipagM*neppM, not ievt! - runTest now ok

83fdd80

[apihel] first successful throughput test of imzxxx prototype - tests ok, cuda perf ok, SIMD 10% slower... (NB cuda performance is definitely not affected, one test even was 2% faster than the previous reference)

2e9e26e

valassi self-assigned this Dec 24, 2021

valassi marked this pull request as draft December 24, 2021 09:44

This was referenced Dec 24, 2021

API (3) Rambo kernel launchers #321

Merged

Workplan for January 2022 #323

Closed

valassi added 5 commits January 5, 2022 09:41

[apihel] remove unused code in HelAmps_sm.cc and rerun first test in 2022 - eemumu performance fluctuations (Note in particular a 2% fluctuation in the cuda results)

13bd625

[apihel] test disabling the main changes of apihel over apirambo - recover apirambo performance (Note there are still reproducible 2% fluctuations in both cuda and c++, but today I get them in apirambo too)

8fcc6a7

[apihel] reenable the memory access changes and lose performance again

3664b52

[apihel] try out oxzxxx: logical bug, imzxxx worked by chance as all events have same initial momenta (Hence the MEs are all 0 and the tests fail)

662bdda

[apihel] WIP on fixing memory access - still MEs are 0

97fa331

valassi added 21 commits January 6, 2022 09:28

[apihel] more WIP on backport to code generation and regeneration of eemumu auto - complete CPPProcess.cc

ecb2ca8

[apihel] more WIP on backport to code generation and regeneration of eemumu auto - complete CPPProcess.h

e54f638

[apihel] COMPLETE BACKPORT of HelAmps.h/cc - eemumu manual and auto are identical (NB moved XXX function implementation earlier on in HelAmps.h manual too)

f37bce3

[apihel] rerun performance test for eemumu manual (double) - all ok

79b3922

[apihel] rerun performance test for eemumu manual (float) - all ok, a bit faster as observed for double

b276b35

[apihel] rerun performance test for eemumu manual (double, common/curhst random) - all ok

294a7a6

[apihel] rerun performance test for eemumu manual (double, rmbhst rambo) - all ok

1e2d2d3

[apihel] rerun performance test for eemumu manual (double, hardcoded cIPC) - all ok

6490968

[apihel] regenerate ggtt auto

a0809e6

[apihel] bug fix in syncManu script

de07ca7

[apihel] resync ggtt manual to ggtt auto

9b0cf4d

[apihel] manually delete MemoryAccess.h in ggtt manual (not handled by syncManu)

54e09f4

[apihel] test performance of new ggtt double - all ok (slightly worse on 512y?)

30f70f0

[apihel] rerun a second time ggtt - confirms previous performance

5dcb821

[apihel] regenerate ggttgg auto and sync to manu including new/deleted files

6b10827

[apihel] test also float performance of ggtt - as before, none/sse4 slightly faster, 512y/z slightly slower

897ffec

[apihel] test also hrd1 performance of ggtt - none/sse4 slightly faster, 512y/z MUCH slower??

579e7bb

[apihel] test performance of new ggttgg double - all ok (unlike ggtt, all looks the same as before?)

77c3d4a

[apihel] test performance of new ggttgg float - all ok, ~same perf, note ME values have changed?!

af6ee70

[apihel] (** COMPLETE APIHEL **) test performance of ggttgg double with hrd1 - all ok, ~same perf

0218277

valassi changed the title from "WIP API(4) Move HelAmps to the latest MemorAccess classes" to "API(4) Move HelAmps to the latest MemoryAccess classes" Jan 6, 2022

valassi marked this pull request as ready for review January 6, 2022 10:01

valassi mentioned this pull request Jan 6, 2022

API(5) MatrixElement kernel launcher (and ME buffers and mememory access) #324

Merged

Merge remote-tracking branch 'upstream/master' into apihel

5738378

valassi merged commit 2c36ef9 into madgraph5:master Jan 10, 2022

valassi mentioned this pull request Jan 14, 2022

separate buffers for different particles (SOAOSOA?) #309

Closed
