extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests) #794
…d for push/manual, disabled for PRs) Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds. I will add those in a more complex workflow with one codegen job and several build/test jobs.
…t into two separate jobs, and add a codegen cache (which is really a compulsory build artifact)
…eat the build/test jobs twice (for FPTYPE=d,f)
This must be cleaned up
- the cache cleanup job must be split up (codegen cache cleanup once, build cache cleanup once per build type)
- the Process+fptype tag must become a more general build tag for caches (eventually add inl, hrdcod)
…also affected by madgraph5#696
[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/SubProcesses/P1_Sigma_loop_sm_no_b_mass_gg_ttx> make HRDCOD=1 OMPFLAGS=-fopenmp AVX=512y FPTYPE=d HELINL=0 HRDCOD=1 RNDGEN=hasCurand
Building in BUILDDIR=. for tag=512y_d_inl0_hrd1_hasCurand (USEBUILDDIR is not set)
make -C ../../src -f cudacpp_src.mk
make[1]: Entering directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src'
AVX=512y ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++ -O3 -std=c++17 -I. -fPIC -Wall -Wshadow -Wextra -ffast-math -fopenmp -march=skylake-avx512 -mprefer-vector-width=256 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HARDCODE_PARAM -c Parameters_loop_sm_no_b_mass.cc -o Parameters_loop_sm_no_b_mass.o
In file included from Parameters_loop_sm_no_b_mass.cc:15:
Parameters_loop_sm_no_b_mass.h: In function ‘const Parameters_loop_sm_no_b_mass_dependentCouplings::DependentCouplings_sv Parameters_loop_sm_no_b_mass_dependentCouplings::computeDependentCouplings_fromG(const fptype_sv&)’:
Parameters_loop_sm_no_b_mass.h:291:46: error: ‘COND’ was not declared in this scope
  291 | const fptype_sv mdl_GWcft_UV_t_1EPS_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF );
      | ^~~~
Parameters_loop_sm_no_b_mass.h:300:138: error: ‘reglog’ was not declared in this scope
  300 | const fptype_sv mdl_G_UVt_FIN_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF * reglog( mdl_MT__exp__2 / mdl_MU_R__exp__2 ) );
      | ^~~~~~
make[1]: *** [cudacpp_src.mk:241: Parameters_loop_sm_no_b_mass.o] Error 1
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src'
make: *** [makefile:520: ../../lib/libmg5amc_common.so] Error 2
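The tag printed in the build log above (tag=512y_d_inl0_hrd1_hasCurand) appears to be composed from the AVX, FPTYPE, HELINL, HRDCOD and RNDGEN make variables. As a rough sketch of how a "more general build tag for caches" could be derived from the same variables, the following may help; the function names build_tag and cache_key and the "build-" prefix are hypothetical and are not taken from the workflow files.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repository): compose a build tag from
# the make variables that appear in the build log.
build_tag() {
  local avx="$1" fptype="$2" helinl="$3" hrdcod="$4" rndgen="$5"
  echo "${avx}_${fptype}_inl${helinl}_hrd${hrdcod}_${rndgen}"
}

# Hypothetical cache key: a fixed prefix, the process name and the build tag.
cache_key() {
  local proc="$1"
  shift
  echo "build-${proc}-$(build_tag "$@")"
}

# Same settings as in the log: AVX=512y FPTYPE=d HELINL=0 HRDCOD=1 RNDGEN=hasCurand
build_tag 512y d 0 1 hasCurand
cache_key nobm_gg_tt.sa 512y d 0 1 hasCurand
```

Keying the cache on the full tag rather than on process plus fptype alone would keep builds with different inl or hrdcod settings from sharing a cache entry.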
… ttW and ttZ production
… list of physics processes (test madgraph5#783?)
…xt (for debugging madgraph5#701)
cp dump_SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU_MadgraphTest.CompareMomentaAndME_0.txt ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/dump_CPUTest.Sigma_sm_no_b_mass_gd_ttxwmu.txt
This is necessary because runTest was failing otherwise
pushd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
make cleanall; HRDCOD=1 make -j
./runTest.exe
Before this succeeds however, it is necessary to rebuild
…p_ttW: results have changed and seem more correct...
INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
[ FAILED ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x7ac410 (10 ms)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU/MadgraphTest (10 ms total)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest
[ RUN ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_no_b_mass_gd_ttxwmu.txt
MadgraphTest.h:299: Failure
The difference between testDriver->getMatrixElement( ievt ) and referenceData[iiter].MEs[ievt] is 1.4553189634594381e-10, which exceeds toleranceMEs * referenceData[iiter].MEs[ievt], where
testDriver->getMatrixElement( ievt ) evaluates to 1.4553189634594381e-10,
referenceData[iiter].MEs[ievt] evaluates to 0, and
toleranceMEs * referenceData[iiter].MEs[ievt] evaluates to 0.
Google Test trace:
MadgraphTest.h:278: In comparing event 0 from iteration 0
0     7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
1     7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  -7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  -7.500000000000000e+02
2     2.045233209356228e+02  6.877986897204741e+01  -1.905381248013139e+02  2.818406336784427e+01
ref2  2.045233209356227e+02  6.877986897204741e+01  -1.905381248013139e+02  2.818406336784428e+01
3     5.474933604313479e+02  -4.596225360107567e+02  3.030720946352406e+01  2.959350894402092e+02
ref3  5.474933604313477e+02  -4.596225360107564e+02  3.030720946352398e+01  2.959350894402091e+02
4     5.014688717565998e+02  4.188441856206845e+02  2.572754903817052e+02  -9.924666020293013e+01
ref4  5.014688717565996e+02  4.188441856206844e+02  2.572754903817050e+02  -9.924666020293004e+01
5     2.465144468764298e+02  -2.800151858197540e+01  -9.704457504391526e+01  -2.248724926051235e+02
ref5  2.465144468764297e+02  -2.800151858197538e+01  -9.704457504391526e+01  -2.248724926051234e+02
ME    1.455318963459438e-10
r.ME  0.000000000000000e+00
[ FAILED ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x7c5f20 (37 ms)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest (37 ms total)
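The failure message above reports a purely relative tolerance: with a reference ME of 0 the allowed difference is also 0, so any nonzero ME fails regardless of the tolerance value. A minimal sketch of that arithmetic follows; the helper rel_close and the tolerance 1e-6 are hypothetical (the actual value of toleranceMEs is not shown in the log), and only the ME values are taken from the output.

```shell
#!/usr/bin/env bash
# rel_close VALUE REF TOL
# Succeeds (exit status 0) if |VALUE - REF| <= TOL * |REF|, i.e. a purely
# relative tolerance like the one reported in the failure message.
rel_close() {
  awk -v v="$1" -v r="$2" -v t="$3" 'BEGIN {
    d = v - r; if (d < 0) d = -d   # absolute difference
    a = r;     if (a < 0) a = -a   # absolute value of the reference
    exit !(d <= t * a)
  }'
}

tol=1e-6   # hypothetical tolerance, not the value used in MadgraphTest.h

# Old reference file: r.ME is 0, so the allowed difference is 0.
if rel_close 1.4553189634594381e-10 0 "${tol}"; then
  echo "old reference: pass"
else
  echo "old reference: FAIL"
fi

# Regenerated reference file: r.ME agrees with the new ME.
if rel_close 1.4553189634594381e-10 1.455318963459438e-10 "${tol}"; then
  echo "new reference: pass"
else
  echo "new reference: FAIL"
fi
```

This is consistent with the approach taken here of regenerating the reference dump rather than loosening the tolerance.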
CUDACPP_RUNTEST_DUMPEVENTS=1 ./runTest.exe ; mv dump_CPUTest* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/
…h, add some debug printouts about comparison of generated code
….yml, disable on:push triggers to avoid launching two jobs instead of one
…h, fix a bash bug and disable comparisons to the existing repo
One thing TODO?
…bleFPE (which already exists in testsuite_oneprocess)
…false (e.g. do not stop double jobs if float has failed)
Another thing TODO
Fix conflicts:
.github/workflows/testsuite_allprocesses.yml
.github/workflows/testsuite_oneprocess.yml
epochX/cudacpp/CODEGEN/generateAndCompare.sh
I have just merged upstream/master into this WIP branch. TODO:
… Node 16 to Node 20
Fix conflicts: .github/workflows/testsuite_oneprocess.yml
I have again merged upstream/master. I have now also removed the FPE-specific stuff.
…al issue in codegen caches: restore only codegen caches from the same run_id
…variable steps.split.outputs.prnum for the buildcache name
…tead of github ref_name for buildcache names
…lete set-output by new GITHUB_OUTPUT mechanism
…_ttg (just a hack to trigger the CI again)
…l as ICONFIG for tmad tests, and add the option to use iconfig != 1
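One of the commits above replaces set-output with the GITHUB_OUTPUT mechanism, in which a step appends name=value lines to the file named by the GITHUB_OUTPUT environment variable. As a hedged sketch of what such a step script might look like: the output name prnum mirrors steps.split.outputs.prnum mentioned above, while the helper write_prnum_output and the parsing of a ref name of the form "794/merge" are assumptions, not the workflow's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the PR number from a ref name such as
# "794/merge" and append it as a step output to the given file.
write_prnum_output() {
  local ref_name="$1" output_file="$2"
  local prnum="${ref_name%%/*}"   # strip everything from the first '/'
  # Deprecated form, for comparison:
  #   echo "::set-output name=prnum::${prnum}"
  echo "prnum=${prnum}" >> "${output_file}"
}

# In a workflow step, GITHUB_REF_NAME and GITHUB_OUTPUT are set by the runner;
# the fallbacks below only make the sketch runnable outside GitHub Actions.
out="${GITHUB_OUTPUT:-$(mktemp)}"
write_prnum_output "${GITHUB_REF_NAME:-794/merge}" "${out}"
cat "${out}"
```

A later step would then read the value as `${{ steps.split.outputs.prnum }}`, assuming the step id is split.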
This is almost ready for review. The tmad tests (#871) are working and are providing very useful results (e.g. they show rotxxx crashes). A couple of things remain to be completed before considering this ready for review.
The latest CI run gave these errors. Most of these are rotxxx crashes.
… multiple of NLOOP?) and update copyright year range
…P_RUNTIME_VECSIZEUSED=32 in tmad tests
…sm to bypass known issues in tmad tests
Currently the following 12 (4 processes x 3 fptypes) issues are bypassed
- "No cross section in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#826)" for susy_gg_t1t1
- "SIGFPE crash in rotxxx in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#855)" for gq_ttq, pp_tt012j, nobm_pp_ttW
…sec tolerance from 3E-14 to 3E-13 (else fails for heft_gg_bb/d)
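The bypass of known issues described in the commits above covers 4 processes x 3 fptypes = 12 cases. A minimal sketch of such a lookup is given below; the function name known_issue and the loop around it are hypothetical, and gg_tt.mad is used only as an illustrative process with no known issue. The process names and messages are the ones quoted in the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: print the known-issue message and return 0 if the
# (process, fptype) pair is bypassed, return 1 otherwise.
known_issue() {
  local proc="$1" fptype="$2"   # e.g. proc=gq_ttq.mad fptype=d
  case "${fptype}" in
    d|f|m) ;;
    *) return 1 ;;
  esac
  case "${proc%.mad}" in
    susy_gg_t1t1)
      echo "No cross section in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#826)" ;;
    gq_ttq|pp_tt012j|nobm_pp_ttW)
      echo "SIGFPE crash in rotxxx in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#855)" ;;
    *)
      return 1 ;;
  esac
}

count=0
for proc in susy_gg_t1t1.mad gq_ttq.mad pp_tt012j.mad nobm_pp_ttW.mad gg_tt.mad; do
  for fptype in d f m; do
    if msg=$(known_issue "${proc}" "${fptype}"); then
      echo "BYPASS: ${msg}"
      count=$((count + 1))
    fi
  done
done
echo "bypassed ${count} cases"
```

Keeping the bypass list in one function would make it straightforward to re-enable a test by deleting its entry once the corresponding issue is fixed.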
Hi @oliviermattelaer, this is now ready, can you please review?
I have extended my new CI tests, and in particular I added 'tmad' tests that compare xsec and lhe files in madevent.
Note: the current status as of this commit is that all tests pass. I will now re-enable the tests that are currently bypassed for known issues, which means that the CI will explicitly fail on them. I think this is very useful, as it allows us to see whether any of the new changes we are developing (like your 'fix_826' branch PR #852 or my volatile patches PR #857) fix some of these issues.
I would merge this with high priority. Thanks!
PS: snapshot of completed tests (note: thanks to ccache build caches, the tests complete in 6 minutes, which is reasonable; note also that I fixed the number of events, so the vecsize used is now 32 and I only use 32 events in madevent)
… will now fail on rotxxx crashes madgraph5#855 and on zero cross section madgraph5#826
…/testsuite_oneprocess.sh
Sure, this can be merged then (but if we allow tests that do not pass, we should also add my new CI test, though that is likely waiting for your review).
Thanks Olivier! Merging NOW. Can you remind me which PR I should review about your CI, please?
…and valgrind fixes madgraph5#869) into tmad
…adgraph5#794 and valgrind fixes madgraph5#869): no change in the code
…and valgrind fixes madgraph5#869) into susy Fix conflicts in MG5aMC/mg5amcnlo (keep the latest gpucpp_826 version including the recent gpucpp changes)
This is a WIP PR for extending the CI testsuite.
I keep this in a PR so that the CI can run (I have disabled on:push triggers).