extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests) #794

Merged: valassi merged 48 commits into madgraph5:master on Jun 27, 2024


valassi
Member

@valassi valassi commented Nov 10, 2023

This is a WIP PR for extending the CI testsuite.

I keep this in a PR so that the CI can run (I have disabled on:push triggers)

…d for push/manual, disabled for PRs)

Note: the FPE crashes in madgraph5#783 are not shown here because they need FPTYPE=f builds.
I will add those in a more complex workflow with one codegen job and several build/test jobs.
…t into two separate jobs, and add a codegen cache (which is really a compulsory build artifact)
…eat the build/test jobs twice (for FPTYPE=d,f)

This must be cleaned up
- the cache cleanup job must be split up (codegen cache cleanup once, build cache cleanup once per build type)
- the Process+fptype tag must become a more general build tag for caches (eventually add inl, hrdcod)
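
As an illustration of what a more general build tag for caches could look like, here is a minimal sketch that composes a tag in the same format as the one printed by the make output in this PR (`512y_d_inl0_hrd1_hasCurand`). The function name `build_tag` and its parameters are hypothetical and are not taken from the workflow scripts.

```python
def build_tag(avx: str, fptype: str, helinl: int, hrdcod: int, rndgen: str) -> str:
    """Compose a build tag from the build options (hypothetical helper).

    The layout mirrors the tag printed by the makefile,
    e.g. tag=512y_d_inl0_hrd1_hasCurand.
    """
    return f"{avx}_{fptype}_inl{helinl}_hrd{hrdcod}_{rndgen}"


# Options taken from the make HRDCOD=1 log: AVX=512y FPTYPE=d HELINL=0 HRDCOD=1 RNDGEN=hasCurand
print(build_tag("512y", "d", 0, 1, "hasCurand"))
```

A cache key derived from such a tag would distinguish builds that differ only in inl or hrdcod, which the Process+fptype tag alone cannot do.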
…also affected by madgraph5#696

[avalassi@itscrd80 gcc11.2/cvmfs] /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/SubProcesses/P1_Sigma_loop_sm_no_b_mass_gg_ttx> make HRDCOD=1
OMPFLAGS=-fopenmp
AVX=512y
FPTYPE=d
HELINL=0
HRDCOD=1
RNDGEN=hasCurand
Building in BUILDDIR=. for tag=512y_d_inl0_hrd1_hasCurand (USEBUILDDIR is not set)
make -C ../../src  -f cudacpp_src.mk
make[1]: Entering directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src'
AVX=512y
ccache /cvmfs/sft.cern.ch/lcg/releases/gcc/11.2.0-ad950/x86_64-centos8/bin/g++  -O3  -std=c++17 -I.  -fPIC -Wall -Wshadow -Wextra -ffast-math  -fopenmp -march=skylake-avx512 -mprefer-vector-width=256  -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_HARDCODE_PARAM -c Parameters_loop_sm_no_b_mass.cc -o Parameters_loop_sm_no_b_mass.o
In file included from Parameters_loop_sm_no_b_mass.cc:15:
Parameters_loop_sm_no_b_mass.h: In function ‘const Parameters_loop_sm_no_b_mass_dependentCouplings::DependentCouplings_sv Parameters_loop_sm_no_b_mass_dependentCouplings::computeDependentCouplings_fromG(const fptype_sv&)’:
Parameters_loop_sm_no_b_mass.h:291:46: error: ‘COND’ was not declared in this scope
  291 |       const fptype_sv mdl_GWcft_UV_t_1EPS_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF );
      |                                              ^~~~
Parameters_loop_sm_no_b_mass.h:300:138: error: ‘reglog’ was not declared in this scope
  300 |       const fptype_sv mdl_G_UVt_FIN_ = COND( mdl_MT, 0., -( ( mdl_G__exp__2 ) / ( 2. * 48. * ( ( M_PI ) * ( M_PI ) ) ) ) * 4. * mdl_TF * reglog( mdl_MT__exp__2 / mdl_MU_R__exp__2 ) );
      |                                                                                                                                          ^~~~~~
make[1]: *** [cudacpp_src.mk:241: Parameters_loop_sm_no_b_mass.o] Error 1
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/nobm_gg_tt.sa/src'
make: *** [makefile:520: ../../lib/libmg5amc_common.so] Error 2
…xt (for debugging madgraph5#701)

  cp dump_SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU_MadgraphTest.CompareMomentaAndME_0.txt ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/dump_CPUTest.Sigma_sm_no_b_mass_gd_ttxwmu.txt

This is necessary because runTest was failing otherwise
 pushd nobm_pp_ttW.mad/SubProcesses/P1_gd_ttxwmu
 make cleanall; HRDCOD=1 make -j
 ./runTest.exe
Before this succeeds however, it is necessary to rebuild
…p_ttW: results have changed and seem more correct...

INFO: The application is built for skylake-avx512 (AVX512VL) and the host supports it
[  FAILED  ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x7ac410 (10 ms)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_CPU/MadgraphTest (10 ms total)

[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest
[ RUN      ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest.CompareMomentaAndME/0
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_no_b_mass_gd_ttxwmu.txt
MadgraphTest.h:299: Failure
The difference between testDriver->getMatrixElement( ievt ) and referenceData[iiter].MEs[ievt] is 1.4553189634594381e-10, which exceeds toleranceMEs * referenceData[iiter].MEs[ievt], where
testDriver->getMatrixElement( ievt ) evaluates to 1.4553189634594381e-10,
referenceData[iiter].MEs[ievt] evaluates to 0, and
toleranceMEs * referenceData[iiter].MEs[ievt] evaluates to 0.
Google Test trace:
MadgraphTest.h:278: In comparing event 0 from iteration 0
   0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  2.045233209356228e+02  6.877986897204741e+01 -1.905381248013139e+02  2.818406336784427e+01
ref2  2.045233209356227e+02  6.877986897204741e+01 -1.905381248013139e+02  2.818406336784428e+01

   3  5.474933604313479e+02 -4.596225360107567e+02  3.030720946352406e+01  2.959350894402092e+02
ref3  5.474933604313477e+02 -4.596225360107564e+02  3.030720946352398e+01  2.959350894402091e+02

   4  5.014688717565998e+02  4.188441856206845e+02  2.572754903817052e+02 -9.924666020293013e+01
ref4  5.014688717565996e+02  4.188441856206844e+02  2.572754903817050e+02 -9.924666020293004e+01

   5  2.465144468764298e+02 -2.800151858197540e+01 -9.704457504391526e+01 -2.248724926051235e+02
ref5  2.465144468764297e+02 -2.800151858197538e+01 -9.704457504391526e+01 -2.248724926051234e+02

  ME  1.455318963459438e-10
r.ME  0.000000000000000e+00

[  FAILED  ] SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest.CompareMomentaAndME/0, where GetParam() = 0x7c5f20 (37 ms)
[----------] 1 test from SIGMA_SM_NO_B_MASS_GD_TTXWMU_GPU/MadgraphTest (37 ms total)
CUDACPP_RUNTEST_DUMPEVENTS=1 ./runTest.exe ; mv dump_CPUTest* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/
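
The failure above is a relative-tolerance check against a reference ME of 0: the allowed difference is toleranceMEs * reference, which is then also 0, so any non-zero computed ME fails regardless of the tolerance. The sketch below reproduces that arithmetic in Python. It is not the code in MadgraphTest.h; the function name and the tolerance value are assumptions chosen for illustration.

```python
def within_tolerance(me: float, ref: float, tolerance: float) -> bool:
    """Relative check as described in the test log: |me - ref| <= tolerance * ref."""
    return abs(me - ref) <= tolerance * ref


TOLERANCE_MES = 1e-6  # hypothetical value, the real toleranceMEs is not shown in the log

me = 1.4553189634594381e-10  # ME computed by the test driver (from the log)
stale_ref = 0.0              # ME stored in the old reference file (from the log)

# With a zero reference the allowed difference is zero, so the check fails.
print(within_tolerance(me, stale_ref, TOLERANCE_MES))

# After regenerating the reference file, the reference matches the computed ME.
print(within_tolerance(me, me, TOLERANCE_MES))
```

This is why regenerating the reference dump with CUDACPP_RUNTEST_DUMPEVENTS=1 is the fix here, rather than loosening the tolerance.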
…h, add some debug printouts about comparison of generated code
….yml, disable on:push triggers to avoid launching two jobs instead of one
…h, fix a bash bug and disable comparisons to the existing repo
@valassi valassi self-assigned this Nov 10, 2023
@valassi valassi marked this pull request as draft November 10, 2023 06:47
@valassi
Member Author

valassi commented Nov 10, 2023

One thing TODO?

@valassi valassi changed the title from "WIP: extend testuite CI (split codegen from build/test and execute the latter both for float and double)" to "WIP: extend testsuite CI (split codegen from build/test and execute the latter both for float and double)" Nov 10, 2023
…bleFPE (which already exists in testsuite_oneprocess)
@valassi
Member Author

valassi commented Nov 24, 2023

Another thing TODO

Fix conflicts:
	.github/workflows/testsuite_allprocesses.yml
	.github/workflows/testsuite_oneprocess.yml
	epochX/cudacpp/CODEGEN/generateAndCompare.sh
@valassi
Member Author

valassi commented May 17, 2024

I have just merged upstream/master into this WIP branch.

TODO:

  • I realise that some of the stuff in this WIP branch must be removed: there was some testing of FPE tests as a separate option, but FPE handling is now enabled by default, with no environment variables, so all this stuff must be removed.

@valassi
Member Author

valassi commented Jun 26, 2024

I have just merged upstream/master into this WIP branch.

TODO:

I realise that some of the stuff in this WIP branch must be removed: there was some testing of FPE tests
as a separate option, but FPE handling is now enabled by default, with no environment variables,
so all this stuff must be removed.

I have again merged upstream/master. I have now also removed the FPE-specific stuff.

…al issue in codegen caches: restore only codegen caches from the same run_id
@valassi
Member Author

valassi commented Jun 26, 2024

This is almost ready for review. The tmad tests (#871) are working and are providing very useful results (e.g. they show rotxxx crashes).

A couple of things to complete before considering this ready for review:

  • Implement a mechanism to bypass known issues (WIP, almost done, will complete tomorrow)
  • Investigate why the tests take a long time even though I specified only 32 events. Why is the vector size used 16384? (Maybe I simply need to reset it.)

The latest CI run gave these errors
https://github.com/madgraph5/madgraph4gpu/actions/runs/9686490186
image

Most of these are rotxxx crashes
Example https://github.com/madgraph5/madgraph4gpu/actions/runs/9686490186/job/26729084480#step:12:182
image

valassi added 4 commits June 27, 2024 10:47
… multiple of NLOOP?) and update copyright year range
…sm to bypass known issues in tmad tests

Currently the following 12 (4 processes x 3 fptypes) issues are bypassed
- "No cross section in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#826)" for susy_gg_t1t1
- "SIGFPE crash in rotxxx in ${proc%.mad} for FPTYPE=d,f,m (madgraph5#855)" for gq_ttq, pp_tt012j, nobm_pp_ttW
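
To make the count explicit, the 12 bypassed tests are the 4 affected processes times the 3 fptypes. The sketch below enumerates them in Python. It is only an illustration of the bookkeeping: the actual bypass mechanism lives in the CI scripts, and the names `KNOWN_ISSUES` and `bypassed_tests` are hypothetical.

```python
from itertools import product

FPTYPES = ("d", "f", "m")

# Process -> known issue, as listed in the commit message above.
KNOWN_ISSUES = {
    "susy_gg_t1t1": "No cross section (madgraph5#826)",
    "gq_ttq": "SIGFPE crash in rotxxx (madgraph5#855)",
    "pp_tt012j": "SIGFPE crash in rotxxx (madgraph5#855)",
    "nobm_pp_ttW": "SIGFPE crash in rotxxx (madgraph5#855)",
}


def bypassed_tests() -> list[tuple[str, str, str]]:
    """Return one (process, fptype, issue) entry per bypassed test."""
    return [
        (proc, fptype, issue)
        for (proc, issue), fptype in product(KNOWN_ISSUES.items(), FPTYPES)
    ]


tests = bypassed_tests()
print(len(tests))                                  # 4 processes x 3 fptypes
print(sum("#855" in issue for _, _, issue in tests))  # rotxxx crashes
print(sum("#826" in issue for _, _, issue in tests))  # zero cross sections
```

The per-issue counts correspond to the 9 rotxxx crashes (#855) and 3 zero cross sections (#826) mentioned in the review request.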
…sec tolerance from 3E-14 to 3E-13 (else fails for heft_gg_bb/d)
@valassi valassi changed the title from "WIP: extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests)" to "extend testsuite CI (split codegen from build/test, execute tests for many fptypes, add tmad tests)" Jun 27, 2024
@valassi valassi marked this pull request as ready for review June 27, 2024 09:31
@valassi valassi requested a review from a team as a code owner June 27, 2024 09:31
@valassi valassi requested a review from oliviermattelaer June 27, 2024 09:32
@valassi
Member Author

valassi commented Jun 27, 2024

Hi @oliviermattelaer this is now ready, can you please review?

I have extended my new CI tests and in particular I added 'tmad' tests that compare xsec and lhe files in madevent.

Note: the current status as of this commit is that all tests pass
b89e093
https://github.com/madgraph5/madgraph4gpu/actions/runs/9694056395
But this is only because I have explicitly bypassed a few known issues: 9 rotxxx crashes #855 and 3 zero cross sections #826.

I will now re-enable those tests, which means that the CI will explicitly fail on them. I think this is very useful as it allows us to see if any of the new changes we are developing (like your 'fix_826' branch PR #852 or my volatile patches PR #857) fix some of these issues.

I would merge this with high priority. Thanks!
Andrea

PS snapshot of completed tests (note, thanks to ccache build caches, the tests complete in 6 minutes, which is reasonable; note also that I fixed the number of events, so now vecsize used is 32 and I only use 32 events in madevent)

image

@oliviermattelaer
Member

Sure, this can be merged then (but if we allow tests that do not pass, we should also add my new CI test, though that is likely waiting for your review)

@valassi
Member Author

valassi commented Jun 27, 2024

OK, as mentioned, I have re-enabled the 12 failing tests (rotxxx and zero cross section).
It is expected that there are 12 failing tests (until we fix them!)
image

This is now ready to be merged, I would do this ASAP.

@valassi
Member Author

valassi commented Jun 27, 2024

Sure, this can be merged then (but if we allow tests that do not pass, we should also add my new CI test, though that is likely waiting for your review)

Thanks Olivier! Merging NOW.

Can you remind me which PR I should review about your CI please?

@valassi valassi merged commit a9871fc into madgraph5:master Jun 27, 2024
157 of 169 checks passed
valassi added a commit to valassi/madgraph4gpu that referenced this pull request Jun 27, 2024
…and valgrind fixes madgraph5#869) into susy

Fix conflicts in MG5aMC/mg5amcnlo (keep the latest gpucpp_826 version including the recent gpucpp changes)
Development

Successfully merging this pull request may close these issues.

- Add tmad tests (xsec and LHE file comparison) to the CI
- add PR number to github cache for the CI