extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests) #794

valassi · 2023-11-10T06:47:02Z

This is a WIP PR for extending the CI testsuite.

I keep this in a PR so that the CI can run (I have disabled on:push triggers)

…d for push/manual, disabled for PRs)

Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds. I will add those in a more complex workflow with one codegen job and several build/test jobs.

…eat the build/test jobs twice (for FPTYPE=d,f)

This must be cleaned up
- the cache cleanup job must be split up (codegen cache cleanup once, build cache cleanup once per build type)
- the Process+fptype tag must become a more general build tag for caches (eventually add inl, hrdcod)

…also affected by madgraph5#696

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/SubProcesses/P1_Sigma_loop_sm_no_b_mass_gg_ttx> make HRDCOD=1 OMPFLAGS=-fopenmp AVX=512y FPTYPE=d HELINL=0 HRDCOD=1 RNDGEN=hasCurand
Building in BUILDDIR=. for tag=512y_d_inl0_hrd1_hasCurand (USEBUILDDIR is not set)
make -C ../../src -f cudacpp_src.mk
make[1]: Entering directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src'
AVX=512y ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -O3 -std=c++17 -I. -fPIC -Wall -Wshadow -Wextra -ffast-math -fopenmp -march=skylake-avx512 -mprefer-vector-width=256 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HARDCODE_PARAM -c Parameters_loop_sm_no_b_mass.cc -o Parameters_loop_sm_no_b_mass.o
In file included from Parameters_loop_sm_no_b_mass.cc:15:
Parameters_loop_sm_no_b_mass.h: In function ‘const Parameters_loop_sm_no_b_mass_dependentCouplings::DependentCouplings_sv Parameters_loop_sm_no_b_mass_dependentCouplings::computeDependentCouplings_fromG(const fptype_sv&)’:
Parameters_loop_sm_no_b_mass.h:291:46: error: ‘COND’ was not declared in this scope
291 | const fptype_sv mdl_GWcft_UV_t_1EPS_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF );
    | ^~~~
Parameters_loop_sm_no_b_mass.h:300:138: error: ‘reglog’ was not declared in this scope
300 | const fptype_sv mdl_G_UVt_FIN_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF * reglog( mdl_MT__exp__2 / mdl_MU_R__exp__2 ) );
    | ^~~~~~
make[1]: *** [cudacpp_src.mk:241: Parameters_loop_sm_no_b_mass.o] Error 1
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src'
make: *** [makefile:520: ../../lib/libmg5amc_common.so] Error 2

…xt (for debugging madgraph5#701)

cp dump_SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU_MadgraphTest.CompareMomentaAndME_0.txt ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/dump_CPUTest.Sigma_sm_no_b_mass_gd_ttxwmu.txt

This is necessary because runTest was failing otherwise
pushd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
make cleanall; HRDCOD=1 make -j ./runTest.exe

Before this succeeds however, it is necessary to rebuild

…p_ttW: results have changed and seem more correct...

INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
[ FAILED ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x7ac410 (10 ms)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU/MadgraphTest (10 ms total)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest
[ RUN ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_no_b_mass_gd_ttxwmu.txt
MadgraphTest.h:299: Failure
The difference between testDriver->getMatrixElement( ievt ) and referenceData[iiter].MEs[ievt] is 1.4553189634594381e-10, which exceeds toleranceMEs * referenceData[iiter].MEs[ievt], where
testDriver->getMatrixElement( ievt ) evaluates to 1.4553189634594381e-10,
referenceData[iiter].MEs[ievt] evaluates to 0, and
toleranceMEs * referenceData[iiter].MEs[ievt] evaluates to 0.
Google Test trace:
MadgraphTest.h:278: In comparing event 0 from iteration 0
0 7.500000000000000e+02 0.000000000000000e+00 0.000000000000000e+00 7.500000000000000e+02
ref0 7.500000000000000e+02 0.000000000000000e+00 0.000000000000000e+00 7.500000000000000e+02
1 7.500000000000000e+02 0.000000000000000e+00 0.000000000000000e+00 -7.500000000000000e+02
ref1 7.500000000000000e+02 0.000000000000000e+00 0.000000000000000e+00 -7.500000000000000e+02
2 2.045233209356228e+02 6.877986897204741e+01 -1.905381248013139e+02 2.818406336784427e+01
ref2 2.045233209356227e+02 6.877986897204741e+01 -1.905381248013139e+02 2.818406336784428e+01
3 5.474933604313479e+02 -4.596225360107567e+02 3.030720946352406e+01 2.959350894402092e+02
ref3 5.474933604313477e+02 -4.596225360107564e+02 3.030720946352398e+01 2.959350894402091e+02
4 5.014688717565998e+02 4.188441856206845e+02 2.572754903817052e+02 -9.924666020293013e+01
ref4 5.014688717565996e+02 4.188441856206844e+02 2.572754903817050e+02 -9.924666020293004e+01
5 2.465144468764298e+02 -2.800151858197540e+01 -9.704457504391526e+01 -2.248724926051235e+02
ref5 2.465144468764297e+02 -2.800151858197538e+01 -9.704457504391526e+01 -2.248724926051234e+02
ME 1.455318963459438e-10
r.ME 0.000000000000000e+00
[ FAILED ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x7c5f20 (37 ms)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest (37 ms total)

valassi · 2023-11-10T06:49:48Z

One thing TODO?

Add the branch name to the cache name? There may be two different PRs with different generated code, and one should not delete all caches (store one with the PR name, retrieve any?)... However, remember to delete the cache completely at the end? Not sure, it seems a bit complex (but one can use the pull_request type closed, https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request)
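The idea of storing a cache under a per-PR name while still allowing a more generic lookup can be sketched as below. This is only an illustration: the function names, the `buildcache-` prefix and the key layout are hypothetical and are not taken from the actual workflows; the process name and PR tags are just examples from this thread.

```shell
#!/usr/bin/env bash
# Hypothetical naming scheme for build caches (an assumption, not the real one).

# Prefix shared by all caches of one process and one fptype:
# a "retrieve any" lookup could match on this prefix alone.
cache_restore_prefix() {
  local proc="$1" fptype="$2"
  echo "buildcache-${proc}-${fptype}-"
}

# Full key: the shared prefix plus a per-PR tag, so that two PRs with
# different generated code never overwrite (or delete) each other's cache.
cache_key() {
  local proc="$1" fptype="$2" tag="$3"
  echo "$(cache_restore_prefix "${proc}" "${fptype}")${tag}"
}

# Two PRs building the same process get two distinct keys
cache_key nobm_pp_ttW.mad d PR794
cache_key nobm_pp_ttW.mad d PR857
# ...but both keys start with the same prefix
cache_restore_prefix nobm_pp_ttW.mad d
```

With such a layout, cleaning up at the end of a PR would only need to delete the keys ending in that PR's tag.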

valassi · 2023-11-24T14:06:04Z

Another thing TODO

Run these tests on AVX512 nodes (private runners)... things like the FPE in #783 ("Bypass color choice for channelId==0 (floating point exceptions in check.exe for ggttg, gqttq, nobm_pp_ttW)") went undetected (and are now about to be fixed)

valassi · 2024-05-17T07:30:49Z

I have just merged upstream/master into this WIP branch.

TODO:

I realise that some of the stuff in this WIP branch must be removed: there was some testing of FPE tests as a separate option, but by now FPE handling is completely default with no environment variables, so all this stuff must be removed.

valassi · 2024-06-26T14:42:52Z

I have just merged upstream/master into this WIP branch.

TODO:

I realise that some of the stuff in this WIP branch must be removed: there was some testing of FPE tests
as a separate option, but by now FPE handling is completely default with no environment variables,
so all this stuff must be removed.

I have again merged upstream/master. And I have now also removed the FPE specific stuff

valassi · 2024-06-26T21:12:32Z

This is almost ready for review. The tmad tests (#871) are working and are providing very useful results (e.g. they show rotxxx crashes).

A couple of things to complete before considering this ready for review:

- Implement a mechanism to bypass known issues (wip, almost done, will complete tomorrow)
- Investigate why the tests take a long time even if I specified only 32 events... why is the vecsize used 16384?... (maybe I need to reset it, quite simply)

The latest CI run gave these errors
https://github.com/madgraph5/madgraph4gpu/actions/runs/9686490186

Most of these are rotxxx crashes
Example https://github.com/madgraph5/madgraph4gpu/actions/runs/9686490186/job/26729084480#step:12:182

…sm to bypass known issues in tmad tests

Currently the following 12 (4 processes x 3 fptypes) issues are bypassed
- "No cross section in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#826)" for susy_gg_t1t1
- "SIGFPE crash in rotxxx in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#855)" for gq_ttq, pp_tt012j, nobm_pp_ttW
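A bypass of this kind can be sketched as a lookup from the process name to a known-issue message. The two messages and the four process names are the ones quoted in the commit message above; the function names `known_issue` and `run_or_bypass` and the overall structure are hypothetical and only illustrate the idea, not the actual code in testsuite_oneprocess.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a known-issue lookup (names and structure are assumptions).

# Print the known-issue message for a process and return 0,
# or print nothing and return 1 if no issue is known for it.
known_issue() {
  local proc="$1"   # e.g. "gq_ttq.mad"
  case "${proc%.mad}" in
    susy_gg_t1t1)
      echo "No cross section in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#826)" ;;
    gq_ttq | pp_tt012j | nobm_pp_ttW)
      echo "SIGFPE crash in rotxxx in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#855)" ;;
    *)
      return 1 ;;
  esac
}

# Decide whether a tmad test should be skipped or executed
run_or_bypass() {
  local proc="$1" msg
  if msg="$(known_issue "${proc}")"; then
    echo "BYPASS: ${msg}"
  else
    echo "RUN: ${proc}"   # placeholder for launching the real test
  fi
}

run_or_bypass gq_ttq.mad
run_or_bypass gg_ttg.mad
```

Keeping the list in a single function makes it easy to re-enable the tests later by emptying the `case` branches.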

valassi · 2024-06-27T09:36:35Z

Hi @oliviermattelaer this is now ready, can you please review?

I have extended my new CI tests and in particular I added 'tmad' tests that compare xsec and lhe files in madevent.

Note: the current status as of this commit is that all tests pass
b89e093
https://github.com/madgraph5/madgraph4gpu/actions/runs/9694056395
But this is only because I have explicitly bypassed a few known issues: 9 rotxxx crashes #855 and 3 zero cross sections #826.

I will now reenable those tests, which means that the CI will explicitly fail on them. I think this is very useful as it allows us to see if any of the new changes we are developing (like your 'fix_826' branch PR #852 or my volatile patches PR #857) fix some of these issues.

I would merge this with high priority. Thanks!
Andrea

PS snapshot of completed tests (note, thanks to ccache build caches, the tests complete in 6 minutes, which is reasonable; note also that I fixed the number of events, so now vecsize used is 32 and I only use 32 events in madevent)

oliviermattelaer · 2024-06-27T09:47:23Z

Sure, this can be merged then (but if we allow tests that do not pass, we should also add my new CI test, but that is likely waiting for your review)

valassi · 2024-06-27T09:57:37Z

Ok as mentioned I have reenabled the 12 failing tests (rotxxx and zero cross section).
It is expected that there are 12 failing tests (until we fix them!)

This is now ready to be merged, I would do this ASAP.

valassi · 2024-06-27T09:59:42Z

Sure, this can be merged then (but if we allow tests that do not pass, we should also add my new CI test, but that is likely waiting for your review)

Thanks Olivier! Merging NOW.

Can you remind me which PR I should review about your CI please?

…and valgrind fixes madgraph5#869) into susy

Fix conflicts in MG5aMC/mg5amcnlo (keep the latest gpucpp_826 version including the recent gpucpp changes)

valassi added 13 commits November 10, 2023 07:06

[actions] in .github/workflows/testsuite, split codegen and build/tes…

4570514

…t into two separate jobs, and add a codegen cache (which is really a compulsory build artifact)

[actions] in .github/workflows/testsuite, reeenable more physics proc…

c9002d4

…esses

[nobm] distinguish between loop_sm-no_b_mass and sm-no_b_mass for tt,…

f9dd8ec

… ttW and ttZ production

[actions/nobm] in .github/workflows/testsuite, add nobm_pp_ttW to the…

ac27efd

… list of physics processes (test madgraph5#783?)

[nobm/actions] in CODEGEN, add many missing ref files from nobm_pp_ttW

5129e79

CUDACPP_RUNTEST_DUMPEVENTS=1 ./runTest.exe ; mv dump_CPUTest* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/

[floating_type_interface] in .github/workflows/testsuite_oneprocess.s…

7586844

…h, add some debug printouts about comparison of generated code

[floating_type_interface] in .github/workflows/testsuite_allprocesses…

a13af0f

….yml, disable on:push triggers to avoid launching two jobs instead of one

[floating_type_interface] in .github/workflows/testsuite_oneprocess.s…

bdb7ae9

…h, fix a bash bug and disable comparisons to the existing repo

valassi self-assigned this Nov 10, 2023

valassi marked this pull request as draft November 10, 2023 06:47

valassi changed the title ~~WIP: extend testuite CI (split codegen from build/test and execute the latter both for float and double)~~ WIP: extend testsuite CI (split codegen from build/test and execute the latter both for float and double) Nov 10, 2023

[actions] in .github/workflows/testsuite_allprocesses, add inputs ena…

cfd261c

…bleFPE (which already exists in testsuite_oneprocess)

valassi force-pushed the actions branch from 2cf10e1 to cfd261c Compare November 10, 2023 12:28

valassi added 2 commits November 10, 2023 13:41

[actions] in .github/workflows/testsuite_oneprocesses, set fail-fast=…

6dae07a

…false (e.g. do not stop double jobs if float has failed)

Merge remote-tracking branch 'upstream/master' into valassi/actions

0332333

Merge remote-tracking branch 'upstream/master' into actions

4fc294a

Fix conflicts: .github/workflows/testsuite_allprocesses.yml .github/workflows/testsuite_oneprocess.yml epochX/cudacpp/CODEGEN/generateAndCompare.sh

valassi added 3 commits May 17, 2024 09:48

[actions] update to github actions v4 (see madgraph5#848) moving from…

1fc4bc6

… Node 16 to Node 20

Merge remote-tracking branch 'upstream/master' into actions

46be26f

Fix conflicts: .github/workflows/testsuite_oneprocess.yml

[actions] in .github/workflows/testsuite* remove FPE-specific CI config

b9b5975

[actions] in .github/workflows/testsuite_oneprocess.yml fix a potenti…

bf226e0

…al issue in codegen caches: restore only codegen caches from the same run_id

valassi mentioned this pull request Jun 26, 2024

add PR number to github cache for the CI #799

Closed

valassi added 8 commits June 26, 2024 22:22

[actions] in .github/workflows/testsuite_oneprocess.yml create a new …

8248df9

…variable steps.split.outputs.prnum for the buildcache name

[actions] in .github/workflows/testsuite_oneprocess.yml use prnum ins…

4918296

…tead of github ref_name for buildcache names

[actions] in .github/workflows/testsuite_oneprocess.yml add prefix PR…

1b780ce

… to the prnum

[actions] in .github/workflows/testsuite_oneprocess.yml, replace obso…

523b080

…lete set-output by new GITHUB_OUTPUT mechanism
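The change named in this commit and the three before it can be sketched as follows. In GitHub Actions, a step now publishes an output by appending a `name=value` line to the file named by `$GITHUB_OUTPUT`, instead of echoing the deprecated `::set-output name=...::value` command. The helper name `write_prnum` is hypothetical, and the sketch assumes a ref name of the form `794/merge`, which is the form GitHub documents for pull request events; the actual step in testsuite_oneprocess.yml may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the actual workflow step): derive a "PR<number>" tag
# from a ref name such as "794/merge" and publish it as a step output.
write_prnum() {
  local ref_name="$1"      # e.g. the value of github.ref_name
  local output_file="$2"   # e.g. "$GITHUB_OUTPUT" inside a GitHub Actions step
  local prnum="PR${ref_name%%/*}"   # keep only the text before the first "/"
  # New mechanism: append "name=value" to the file named by GITHUB_OUTPUT
  # (old, deprecated form: echo "::set-output name=prnum::${prnum}")
  echo "prnum=${prnum}" >> "${output_file}"
}

# Local demonstration, with a temporary file standing in for $GITHUB_OUTPUT
demo_output="$(mktemp)"
write_prnum "794/merge" "${demo_output}"
cat "${demo_output}"
rm -f "${demo_output}"
```

A later step would then read the value through an expression such as `steps.split.outputs.prnum`, which is the variable named in the earlier commit.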

[actions] in .github/workflows/testsuite_allprocesses.yml add also gg…

a88ae72

…_ttg (just a hack to trigger the CI again)

[actions] in .github/workflows/testsuite* updatecopyright year range

f9500d5

[actions] in .github/workflows/testsuite_allprocesses.yml reenable al…

e10474b

…l processes

[actions] in .github/workflows/testsuite_oneprocess.sh, rename channe…

f2e3395

…l as ICONFIG for tmad tests, and add the option to use iconfig != 1

valassi mentioned this pull request Jun 26, 2024

Fix SIGFPE crash (855) in rotxxx by adding volatile in aloha_functions.f #857

Merged

valassi added 4 commits June 27, 2024 10:47

[actions] in tmad/madX.sh, add a comment (should check that nevt is a…

ee592ad

… multiple of NLOOP?) and update copyright year range

[actions] in .github/workflows/testsuite_oneprocess.sh, export CUDACP…

93a81e3

…P_RUNTIME_VECSIZEUSED=32 in tmad tests

[actions] in .github/workflows/testsuite_oneprocess.sh increase the x…

b89e093

…sec tolerance from 3E-14 to 3E-13 (else fails for heft_gg_bb/d)
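The effect of relaxing the tolerance can be illustrated with a relative comparison of two cross sections. This is a sketch under stated assumptions: the function name `xsec_within_tolerance`, the choice of a relative difference with respect to the first value, and the numbers used below are all hypothetical; they are not the actual comparison in testsuite_oneprocess.sh nor the heft_gg_bb values.

```shell
#!/usr/bin/env bash
# Hypothetical relative comparison of two cross sections (an assumption,
# not the actual implementation used by the tmad tests).
# Returns 0 if |xsec_ref - xsec_new| <= tol * |xsec_ref|, and 1 otherwise.
xsec_within_tolerance() {
  local xsec_ref="$1" xsec_new="$2" tol="$3"
  awk -v a="${xsec_ref}" -v b="${xsec_new}" -v t="${tol}" 'BEGIN {
    a += 0; b += 0; t += 0          # force numeric interpretation
    d = a - b; if (d < 0) d = -d    # absolute difference
    r = (a < 0) ? -a : a            # absolute value of the reference
    exit !(d <= t * r)              # exit status 0 means "within tolerance"
  }'
}

# Illustrative numbers only: a relative difference of about 1E-13
for tol in 3E-14 3E-13; do
  if xsec_within_tolerance 1.0 1.0000000000001 "${tol}"; then
    echo "tolerance ${tol}: pass"
  else
    echo "tolerance ${tol}: fail"
  fi
done
```

With these illustrative inputs the comparison fails at 3E-14 and passes at 3E-13, which is the kind of borderline case that a slightly looser tolerance absorbs.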

valassi changed the title ~~WIP: extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests)~~ extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests) Jun 27, 2024

valassi marked this pull request as ready for review June 27, 2024 09:31

valassi requested a review from a team as a code owner June 27, 2024 09:31

valassi requested a review from oliviermattelaer June 27, 2024 09:32

valassi added 2 commits June 27, 2024 11:37

[actions] ** COMPLETE ACTIONS ** reenable the 12 known issues: the CI…

7466525

… will now fail on rotxx crashes madgraph5#855 and on zero cross section madgraph5#826

[actions] ** COMPLETE ACTIONS (again) ** bug fix in .github/workflows…

1704ae3

…/testsuite_oneprocess.sh

valassi merged commit a9871fc into madgraph5:master Jun 27, 2024
157 of 169 checks passed

valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jun 27, 2024

Merge remote-tracking branch 'upstream/master' (new CI madgraph5#794 …

f912941

…and valgrind fixes madgraph5#869) into tmad

valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jun 27, 2024

[tmad] regenerate all processes after merging upstream/master (new CI m…

f7b9e04

…adgraph5#794 and valgrind fixes madgraph5#869): no change in the code

valassi mentioned this pull request Jun 28, 2024

No cross section in SUSY gg_t1t1 log file #826

Closed
