selected color comparison fails for HIP on LUMI #931

Closed
valassi opened this issue Jul 22, 2024 · 2 comments
valassi commented Jul 22, 2024

New issue: the selected color comparison fails for HIP on LUMI, whereas the same test succeeds for CUDA.

[valassia@nid007961 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx > ./build.hip_d_inl0_hrd0/runTest_hip.exe 
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
[==========] Running 4 tests from 4 test suites.
[----------] Global test environment set-up.
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX
[ RUN      ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx
[       OK ] SIGMA_SM_GG_TTX_GPU_XXX.testxxx (0 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_XXX (0 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc
[       OK ] SIGMA_SM_GG_TTX_GPU_MISC.testmisc (12 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MISC (12 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_NOMULTICHANNEL
[ RUN      ] SIGMA_SM_GG_TTX_GPU_NOMULTICHANNEL.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt
./MadgraphTest.h:328: Failure
Expected equality of these values:
  testDriver->getSelectedColor( ievt )
    Which is: -1094795586
  referenceData[iiter].SelCols[ievt]
    Which is: 0
Google Test trace:
./MadgraphTest.h:296: In comparing event 0 from iteration 0
   0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  7.499999999999994e+02  5.849331413473446e+02 -3.138365726669762e+02 -3.490842674916364e+02
ref2  7.500000000000001e+02  5.849331413473452e+02 -3.138365726669762e+02 -3.490842674916370e+02

   3  7.499999999999998e+02 -5.849331413473448e+02  3.138365726669760e+02  3.490842674916368e+02
ref3  7.499999999999999e+02 -5.849331413473452e+02  3.138365726669762e+02  3.490842674916369e+02

  ME  6.797636301916550e-01
r.ME  6.797636301916554e-01
  ChanId1
r.ChanId1
  SelHel1
r.SelHel1
  SelCol-1094795586
r.SelCol0

[  FAILED  ] SIGMA_SM_GG_TTX_GPU_NOMULTICHANNEL.compareMomAndME (18 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_NOMULTICHANNEL (18 ms total)

[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MULTICHANNEL
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MULTICHANNEL.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt2
./MadgraphTest.h:328: Failure
Expected equality of these values:
  testDriver->getSelectedColor( ievt )
    Which is: 1
  referenceData[iiter].SelCols[ievt]
    Which is: 2
Google Test trace:
./MadgraphTest.h:296: In comparing event 64 from iteration 0
   0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  7.499999999999999e+02  3.967977363399760e+02  1.264105709968620e+02 -6.237563017523095e+02
ref2  7.499999999999998e+02  3.967977363399760e+02  1.264105709968621e+02 -6.237563017523095e+02

   3  7.500000000000000e+02 -3.967977363399760e+02 -1.264105709968621e+02  6.237563017523095e+02
ref3  7.499999999999999e+02 -3.967977363399760e+02 -1.264105709968621e+02  6.237563017523095e+02

  ME  2.911980883026979e+00
r.ME  2.911980883026980e+00
  ChanId3
r.ChanId3
  SelHel1
r.SelHel1
  SelCol1
r.SelCol2

[  FAILED  ] SIGMA_SM_GG_TTX_GPU_MULTICHANNEL.compareMomAndME (9 ms)
[----------] 1 test from SIGMA_SM_GG_TTX_GPU_MULTICHANNEL (9 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 4 test suites ran. (40 ms total)
[  PASSED  ] 2 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] SIGMA_SM_GG_TTX_GPU_NOMULTICHANNEL.compareMomAndME
[  FAILED  ] SIGMA_SM_GG_TTX_GPU_MULTICHANNEL.compareMomAndME

 2 FAILED TESTS
INFO: No Floating Point Exceptions have been reported
INFO: No Floating Point Exceptions have been reported
@valassi valassi self-assigned this Jul 22, 2024

valassi commented Jul 22, 2024

OK, this is fixed. It will also be added to #882.

There were three issues, in the order in which I saw them:

  • fixed the debug printout: r.SelHel1 is now printed as r.SelHel 1
  • fixed color selection for channelId==0: SelCol-1094795586 now becomes SelCol 0
  • main issue: the random numbers for helicity and color selection were not computed, so they were 0 for CUDA and a random negative number for HIP

Now regenerating all processes and recomputing all reference logs

In practice, I think that this is because CUDA arrays were initialised to 0 while HIP arrays were not (I will not touch this, as it is actually useful for debugging...)
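
As a side note, both garbage values in the logs are consistent with memory whose bytes are all 0xBE: read as a 32-bit signed integer that pattern is -1094795586 (the SelCol value), and read as an IEEE-754 double it is about -1.8e-6, which prints as -0.000002 with six decimals (the rndcol value). The sketch below only checks that arithmetic. The helper names are hypothetical, and whether the HIP runtime or the allocator on LUMI actually fills buffers with 0xBE is an assumption that this thread does not confirm.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not part of madgraph4gpu): return a value of type T
// whose bytes are all set to `fill`, mimicking an uninitialised buffer that
// happens to contain a repeated fill byte.
template<typename T>
T filledWith( unsigned char fill )
{
  static_assert( std::is_trivially_copyable<T>::value, "T must be trivially copyable" );
  T value;
  std::memset( &value, fill, sizeof( T ) );
  return value;
}

// Print the 0xBE pattern interpreted as an int32 and as a double.
// Assumes two's complement integers and IEEE-754 doubles; the pattern is the
// same byte repeated, so endianness does not matter.
void printFilledValues()
{
  const std::int32_t selcol = filledWith<std::int32_t>( 0xBE );
  const double rndcol = filledWith<double>( 0xBE );
  assert( rndcol < 0 ); // the sign bit of 0xBE... is set
  std::printf( "SelCol%d\n", static_cast<int>( selcol ) );
  std::printf( "rndcol=%f\n", rndcol );
}
```

Calling printFilledValues() is a quick way to compare against the SelCol and rndcol values in the failing HIP logs.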


valassi commented Jul 22, 2024

Mentioned in #882, closing.

@valassi valassi closed this as completed Jul 22, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
…neldId==0 (fix nomultichannel test on LUMI madgraph5#931)

Also add debug printouts for color selection (commented out)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
…for madgraph5#931 on LUMI

The problem is that rndCol is always 0 and in the HIP case it is even negative...

CUDA:
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MULTICHANNEL.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt2
sigmaKin: iconfig=1 icolC=0 icolamp=1 targetamp=4.420031
sigmaKin: iconfig=1 icolC=1 icolamp=1 targetamp=35.558544
sigmaKin: ievt=   0 rndcol=0.000000 icolC=0 target/total=0.124303
sigmaKin: ievt=0 icol=1
sigmaKin: iconfig=3 icolC=0 icolamp=0 targetamp=0.000000
sigmaKin: iconfig=3 icolC=1 icolamp=1 targetamp=20.371895
sigmaKin: ievt=   0 rndcol=0.000000 icolC=0 target/total=0.000000
sigmaKin: ievt=   0 rndcol=0.000000 icolC=1 target/total=1.000000
sigmaKin: ievt=0 icol=2
[       OK ] SIGMA_SM_GG_TTX_GPU_MULTICHANNEL.compareMomAndME (23 ms)

HIP/LUMI
[ RUN      ] SIGMA_SM_GG_TTX_GPU_MULTICHANNEL.compareMomAndME
INFO: Opening reference file ../../test/ref/dump_CPUTest.Sigma_sm_gg_ttx.txt2
sigmaKin: iconfig=1 icolC=0 icolamp=1 targetamp=4.420031
sigmaKin: iconfig=1 icolC=1 icolamp=1 targetamp=35.558544
sigmaKin: ievt=   0 rndcol=-0.000002 icolC=0 target/total=0.124303
sigmaKin: ievt=0 icol=1
./MadgraphTest.h:328: Failure
Expected equality of these values:
  testDriver->getSelectedColor( ievt )
    Which is: 1
  referenceData[iiter].SelCols[ievt]
    Which is: 2
Google Test trace:
./MadgraphTest.h:296: In comparing event 64 from iteration 0
   0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02
ref0  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00  7.500000000000000e+02

   1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02
ref1  7.500000000000000e+02  0.000000000000000e+00  0.000000000000000e+00 -7.500000000000000e+02

   2  7.499999999999999e+02  3.967977363399760e+02  1.264105709968620e+02 -6.237563017523095e+02
ref2  7.499999999999998e+02  3.967977363399760e+02  1.264105709968621e+02 -6.237563017523095e+02

   3  7.500000000000000e+02 -3.967977363399760e+02 -1.264105709968621e+02  6.237563017523095e+02
ref3  7.499999999999999e+02 -3.967977363399760e+02 -1.264105709968621e+02  6.237563017523095e+02

  ME  2.911980883026979e+00
r.ME  2.911980883026980e+00
  ChanId       3
r.ChanId       3
  SelHel       1
r.SelHel       1
  SelCol       1
r.SelCol       2

sigmaKin: iconfig=3 icolC=0 icolamp=0 targetamp=0.000000
sigmaKin: iconfig=3 icolC=1 icolamp=1 targetamp=20.371895
sigmaKin: ievt=   0 rndcol=-0.000002 icolC=0 target/total=0.000000
sigmaKin: ievt=0 icol=1
[  FAILED  ] SIGMA_SM_GG_TTX_GPU_MULTICHANNEL.compareMomAndME (11 ms)
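
The debug printouts above suggest why a zero and a slightly negative rndcol give different colors for iconfig=3. The following is a minimal sketch of a cumulative-threshold selection, assuming (from the printed target/total ratios) that targetamp holds cumulative values and that a color is chosen as soon as rndcol is below target/total. The function name, its signature and the fallback branch are hypothetical and are not taken from CPPProcess.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of color selection (not the actual madgraph4gpu code).
// targetamp[icolC] is assumed to be the cumulative amplitude up to color icolC,
// so targetamp.back() is the total. Returns a 1-based color index.
int selectColor( const std::vector<double>& targetamp, double rndcol )
{
  assert( !targetamp.empty() && targetamp.back() > 0 );
  const double total = targetamp.back();
  for( std::size_t icolC = 0; icolC < targetamp.size(); icolC++ )
  {
    // Pick the first color whose cumulative fraction exceeds the random number
    if( rndcol < targetamp[icolC] / total ) return static_cast<int>( icolC ) + 1;
  }
  // Assumption: fall back to the last color if rndcol is >= 1
  return static_cast<int>( targetamp.size() );
}
```

With the iconfig=3 values from the logs, targetamp = {0.000000, 20.371895}: rndcol=0 fails the strict comparison 0 < 0 for the first color and selects icol=2, while a slightly negative rndcol passes it and selects icol=1 even though that color has zero amplitude. This matches the difference between the CUDA and HIP printouts.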
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
…random numbers for helicities and colors (fix madgraph5#931)

NB tests now fail, the txt/txt2 reference logs must be regenerated...
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
Revert "[june24] in gg_tt.mad CPPProcess.cc, temporarely add debug printouts for madgraph5#931 on LUMI"
This reverts commit 16836f1.
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
…ding selhel, selcol, channel madgraph5#925 madgraph5#924 after fixing madgraph5#931

CUDACPP_RUNTEST_DUMPEVENTS=1 ./runTest_cuda.exe
\cp ../../test/ref/dump* ../../../CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/test/ref/

The tests now pass on LUMI using the reference logs created on itscrd90
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
…helicity selection causing test failures on LUMI/HIP
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024
…ding selhel, selcol, channel madgraph5#925 madgraph5#924 after fixing madgraph5#931

./CODEGEN/recreateRefs.sh
valassi added a commit to valassi/madgraph4gpu that referenced this issue Jul 22, 2024