master_june24: Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3 #888

Closed
valassi opened this issue Jul 5, 2024 · 4 comments

@valassi
Member

valassi commented Jul 5, 2024

Another issue introduced in #830 and being reviewed in #882.

In WIP PR #882 for master_june24, I tried to use NB_WARP=512 and WARP_SIZE=16384, i.e. VECSIZE_MEMMAX=16384. This is bede049.

In the CI tmad tests (which use VECSIZE_USED=32) I still get the crash of #885, but I also get the following:
https://github.com/madgraph5/madgraph4gpu/actions/runs/9806731881/job/27079146521

*** (1) EXECUTE MADEVENT_FORTRAN (create results.dat) ***
At line 412 of file auto_dsig1.f
Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3

Error termination. Backtrace:
#0  0x7f74b5a23960 in ???
#1  0x7f74b5a244d9 in ???
#2  0x55edd8ae6fd9 in dsig1_vec_
#3  0x55edd8ae7de8 in dsigproc_vec_
#4  0x55edd8ae88e3 in dsig_vec_
#5  0x55edd8afec68 in sample_full_
#6  0x55edd8ae4cbd in MAIN__
#7  0x55edd8abc69e in main
ERROR! ' ./madevent_fortran < /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/input_gg_tt_ > /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/output_gg_tt_' failed

For reference, with the previous values NB_WARP=1, WARP_SIZE=16384, VECSIZE_MEMMAX=16384 (and always VECSIZE_USED=32), this was 64a7c0d
And I was getting no such 'Fortran runtime error in symconf'
https://github.com/madgraph5/madgraph4gpu/actions/runs/9797840410/job/27055291574#step:12:77

*** (2-none) EXECUTE MADEVENT_CPP xQUICK (create events.lhe) ***

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7ff0a5423960 in ???
#1  0x7ff0a5422ac5 in ???
#2  0x7ff0a504251f in ???
#3  0x556bad8564aa in dsig1_vec_
#4  0x556bad857509 in dsigproc_vec_
#5  0x556bad8582b2 in dsig_vec_
#6  0x556bad86e5de in sample_full_
#7  0x556bad853d2a in MAIN__
#8  0x556bad82b6de in main
.github/workflows/testsuite_oneprocess.sh: line 289:  3672 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/input_gg_tt_none > /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/output_gg_tt_none' failed
@roiser
Member

roiser commented Jul 5, 2024

Hi, I just looked at the tests I did at the time. I set e.g.

set vector_size 32
set nb_warp 256

which then gave me a vector width of 8192. Note that this was when testing with configs passed into bin/mg5_aMC.

@oliviermattelaer
Member

NB_WARP=512 and WARP_SIZE=16384
means an actual grid size of 8,388,608, so this is clearly extreme (and not in the spirit of keeping warp_size small).
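
As a quick check of the arithmetic, here is a minimal sketch in Python. The function name is hypothetical, and it assumes the total vector size is simply the product of the two settings, which is what the numbers quoted in this thread suggest.

```python
def vector_size(nb_warp: int, warp_size: int) -> int:
    """Total events per vector, assumed to be nb_warp * warp_size."""
    return nb_warp * warp_size


# Settings reported in this issue.
print(vector_size(512, 16384))  # prints 8388608
# Settings from the earlier tests (nb_warp 256, vector_size 32).
print(vector_size(256, 32))     # prints 8192
# Previous values (NB_WARP=1, WARP_SIZE=16384).
print(vector_size(1, 16384))    # prints 16384
```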

So I will ignore that issue for the moment.

But we need to investigate the error

At line 412 of file auto_dsig1.f
Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3

This seems to indicate that we use symconf in a wrong way. Do you have a dedicated issue for this?

@oliviermattelaer oliviermattelaer self-assigned this Jul 16, 2024
@oliviermattelaer
Member

So, concerning the segfault line: I have investigated the reported line and, long story short, I do not see any potential issue there.
So I have to bet that this is memory corruption due to an inconsistency between the nb_warp/warp_size used on the Fortran side and the values used to compile the C++.

At least I would stop focusing on this issue (and I propose that we close it for the moment). Do you agree, @valassi?

Here are the details of my investigation (please ignore this, it is not really relevant, but it is for @valassi, who wanted to see if/how the Fortran code implements the assignment of the channelID array).

Here is the definition of symconf:

maxconfigs.inc:2:      PARAMETER(LMAXCONFIGS=3)
./auto_dsig.f:49:      INTEGER SYMCONF(0:LMAXCONFIGS)
./auto_dsig.f:50:      COMMON /TO_SYMCONF/ SYMCONF

So this array should indeed be accessed with an index no greater than 3.
The problematic line is:

CHANNELS(IVEC) = CONFSUB(1,SYMCONF(ICONF_VEC(CURR_WARP)))

So the issue should be in the assignment of iconf_vec.
This is an array declared in dsample.f and filled element by element in that file:

./dsample.f:41:      integer imirror_vec(NB_WARP), iproc, ICONF_VEC(NB_WARP)
./dsample.f:222:               call select_grouping(imirror_vec(iwarp), iproc, iconf_vec(iwarp), all_wgt, iwarp)

The definition of a single element (ICONF=iconf_vec(iwarp)) is done in auto_dsig.f, and the allowed values are any value coming out of this loop:

DO J=1,SYMCONF(0) 
   ....
   IF (...)   ICONF=J

So, checking the assignment of SYMCONF(0): this is a runtime variable that depends on the content of ../symfact.dat.
This file is read in auto_dsig.f (around line 508).

If the file is found, the code does:

         DO WHILE(.TRUE.)
           READ(LUN,*,ERR=10,END=10) XDUM, ICONF
           IF(ICONF.EQ.-MAPCONFIG(ICONFIG))THEN
             IPROC=IPROC+1
             SYMCONF(IPROC)=INT(XDUM)
           ENDIF
         ENDDO
  10     SYMCONF(0)=IPROC

Given this symfact.dat:

                1            1
                2            1
                3           -2

symconf(0) is either 1 (for G1) or 2 (for G2).
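
The read loop above can be modelled in a few lines of Python to reproduce these two values. This is only a sketch: the function name is hypothetical, MAPCONFIG is assumed to be the identity mapping, and the initial state (IPROC=1, SYMCONF(1)=ICONFIG) is taken from the "file not found" snippet quoted further down.

```python
def read_symconf(rows, iconfig, mapconfig):
    """Python model of the SYMCONF fill loop in auto_dsig.f.

    rows      : (xdum, iconf) pairs, one per line of symfact.dat
    iconfig   : configuration number of the current channel (G<iconfig>)
    mapconfig : dict standing in for the Fortran MAPCONFIG array (assumed)
    Returns a list laid out like SYMCONF, with the count in slot 0.
    """
    symconf = [0, iconfig]           # IPROC=1, SYMCONF(1)=ICONFIG
    for xdum, iconf in rows:
        if iconf == -mapconfig[iconfig]:
            symconf.append(int(xdum))  # IPROC=IPROC+1; SYMCONF(IPROC)=INT(XDUM)
    symconf[0] = len(symconf) - 1    # SYMCONF(0)=IPROC
    return symconf


rows = [(1, 1), (2, 1), (3, -2)]     # the symfact.dat content shown above
mapconfig = {1: 1, 2: 2, 3: 3}       # assumption: identity mapping

print(read_symconf(rows, 1, mapconfig))  # prints [1, 1]
print(read_symconf(rows, 2, mapconfig))  # prints [2, 2, 3]
```

Under these assumptions the largest index ever stored is 2, well inside the declared bound of 3.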

If the file is not found:

          IPROC=1
          SYMCONF(IPROC)=ICONFIG
          OPEN(UNIT=LUN,FILE='../symfact.dat',STATUS='OLD',ERR=20)
  -> 20     SYMCONF(0)=IPROC
           WRITE(*,*)'Error opening symfact.dat. No permutations used.'

So symconf(0)=1 in that case, as expected (and in this case G3 does make sense).
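
To make the failure mode concrete, here is a hedged Python sketch of the lookup CONFSUB(1,SYMCONF(ICONF_VEC(CURR_WARP))) with an explicit range check on the index. The function name and the sample values are hypothetical; this is a debugging aid one could imagine, not code from the repository.

```python
def channel_for_warp(confsub, symconf, iconf):
    """Model of CONFSUB(1, SYMCONF(iconf)) with an explicit range check.

    confsub : dict mapping a configuration number to a channel id
              (stands in for the first row of the Fortran CONFSUB array)
    symconf : list laid out like SYMCONF(0:LMAXCONFIGS), count in slot 0
    iconf   : value taken from ICONF_VEC for the current warp
    """
    nsym = symconf[0]
    if not 1 <= iconf <= nsym:
        raise IndexError(f"iconf={iconf} is outside 1..{nsym}")
    return confsub[symconf[iconf]]


# Hypothetical values for illustration only.
symconf = [2, 2, 3, 0]
confsub = {1: 1, 2: 2, 3: 3}

print(channel_for_warp(confsub, symconf, 2))  # prints 3
try:
    channel_for_warp(confsub, symconf, 32765)
except IndexError as err:
    print(err)  # prints: iconf=32765 is outside 1..2
```

A value such as 32765 cannot come out of the loop over 1..SYMCONF(0), which is consistent with the guess that the entry of iconf_vec was never assigned or was overwritten.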

@valassi
Member Author

valassi commented Jul 19, 2024

> So concerning the segfault line I have investigated the reported line and long story short do not see any potential issue here. So I have to bet that this is memory corruption due to an inconsistent use of nb_warp/warp_size on the fortran size and the one used to compile c++.

> At least I would stop focus on this issue (and propose that we close it for the moment), Do you agree @valassi

Thanks Olivier :-)

Yes, I agree. Most likely this is related to the nb_warp_used crashes, fixed (with a patch to be improved later) in #885.

So ok for me to close this, thanks.

> So checking the assignment of SYMCONF(0), this is a runtime variable that depends on the content of ../symfact.dat,
> the reading of such file is done in auto_dsig.f (around line 508)

Thanks also for this explanation. I think it is useful also for another issue, i.e. making sure that madevent tests several channels. I opened #927 as a placeholder and added your post as a link.

Closing. Fixed by #882 (probably via #885).

@valassi valassi closed this as completed Jul 19, 2024
@valassi valassi assigned oliviermattelaer and unassigned valassi Jul 19, 2024