master_june24: Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3 #888

valassi · 2024-07-05T11:25:50Z

Another issue introduced in #830 and being reviewed in #882.

In WIP PR #882 for master_june24, I tried to use NB_WARP=512 and WARP_SIZE=16384, i.e. VECSIZE_MEMMAX=16384. This is commit bede049.

In the CI tmad tests (which use VECSIZE_USED=32) I still get the crash of #885, but I also get the following:
https://github.com/madgraph5/madgraph4gpu/actions/runs/9806731881/job/27079146521

*** (1) EXECUTE MADEVENT_FORTRAN (create results.dat) ***
At line 412 of file auto_dsig1.f
Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3

Error termination. Backtrace:
#0  0x7f74b5a23960 in ???
#1  0x7f74b5a244d9 in ???
#2  0x55edd8ae6fd9 in dsig1_vec_
#3  0x55edd8ae7de8 in dsigproc_vec_
#4  0x55edd8ae88e3 in dsig_vec_
#5  0x55edd8afec68 in sample_full_
#6  0x55edd8ae4cbd in MAIN__
#7  0x55edd8abc69e in main
ERROR! ' ./madevent_fortran < /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/input_gg_tt_ > /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/output_gg_tt_' failed

For reference, with the previous values NB_WARP=1, WARP_SIZE=16384, VECSIZE_MEMMAX=16384 (and always VECSIZE_USED=32), i.e. commit 64a7c0d, I was not getting any such 'Fortran runtime error in symconf':
https://github.com/madgraph5/madgraph4gpu/actions/runs/9797840410/job/27055291574#step:12:77

*** (2-none) EXECUTE MADEVENT_CPP xQUICK (create events.lhe) ***

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7ff0a5423960 in ???
#1  0x7ff0a5422ac5 in ???
#2  0x7ff0a504251f in ???
#3  0x556bad8564aa in dsig1_vec_
#4  0x556bad857509 in dsigproc_vec_
#5  0x556bad8582b2 in dsig_vec_
#6  0x556bad86e5de in sample_full_
#7  0x556bad853d2a in MAIN__
#8  0x556bad82b6de in main
.github/workflows/testsuite_oneprocess.sh: line 289:  3672 Floating point exception(core dumped) $timecmd $cmd < ${tmpin} > ${tmp}
ERROR! ' ./build.none_d_inl0_hrd0/madevent_cpp < /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/input_gg_tt_none > /home/runner/work/madgraph4gpu/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/output_gg_tt_none' failed


roiser · 2024-07-05T11:33:16Z

Hi, I just looked at the tests I did at the time. I set, e.g.:

set vector_size 32
set nb_warp 256

which gave me a vector width of 8192. Note that this was when testing with configs passed into bin/mg5_aMC.

oliviermattelaer · 2024-07-16T12:41:50Z

NB_WARP=512 and WARP_SIZE=16384 means an actual grid size of 8,388,608, so this is clearly extreme (and not in the spirit of keeping warp_size small).
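Spelling out this arithmetic, under the reading used here that the number of events per iteration is the product NB_WARP × WARP_SIZE (how WARP_SIZE relates to VECSIZE_MEMMAX is what #887 is meant to clarify), and comparing with the vector_size 32 / nb_warp 256 setup quoted by roiser:

$$
\begin{aligned}
512 \times 16384 &= 2^{9} \times 2^{14} = 2^{23} = 8\,388\,608,\\
256 \times 32 &= 2^{8} \times 2^{5} = 2^{13} = 8192,
\end{aligned}
$$

i.e. a factor $2^{10} = 1024$ more events than in roiser's tests.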

So I will ignore that issue for the moment.

But we do need to investigate this error:

At line 412 of file auto_dsig1.f
Fortran runtime error: Index '32765' of dimension 1 of array 'symconf' above upper bound of 3

This seems to indicate that we are using symconf incorrectly. Do you have a dedicated issue for this?

oliviermattelaer · 2024-07-16T14:50:48Z

Concerning the crash, I have investigated the reported line and, long story short, I do not see any potential issue there.
So my bet is that this is memory corruption due to inconsistent nb_warp/warp_size values on the Fortran side and in the C++ build.

At least, I would stop focusing on this issue (and propose that we close it for the moment). Do you agree, @valassi?

Here are the details of my investigation (please ignore them, they are not really relevant; they are for @valassi, who wanted to see if/how the Fortran code implements the assignment of the channelID array).

Here is the definition of symconf:

maxconfigs.inc:2:      PARAMETER(LMAXCONFIGS=3)
./auto_dsig.f:49:      INTEGER SYMCONF(0:LMAXCONFIGS)
./auto_dsig.f:50:      COMMON /TO_SYMCONF/ SYMCONF

So this array should indeed only be accessed with indices from 0 to 3 (i.e. at most LMAXCONFIGS).
The problematic line is:

CHANNELS(IVEC) = CONFSUB(1,SYMCONF(ICONF_VEC(CURR_WARP)))

So the issue should be in the assignment of iconf_vec. This array is declared in dsample.f and filled element by element there:

./dsample.f:41:      integer imirror_vec(NB_WARP), iproc, ICONF_VEC(NB_WARP)
./dsample.f:222:               call select_grouping(imirror_vec(iwarp), iproc, iconf_vec(iwarp), all_wgt, iwarp)

The value of a single element (ICONF=iconf_vec(iwarp)) is set in auto_dsig.f, and the allowed values are only those produced by this loop:

DO J=1,SYMCONF(0) 
   ....
   IF (...)   ICONF=J
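To find out which level of the indirection CONFSUB(1,SYMCONF(ICONF_VEC(CURR_WARP))) goes out of range, each index can be checked before it is used. The following is a minimal debugging sketch, not code from madgraph4gpu: the module iconf_guard and the function check_iconf_chain are hypothetical names, and the bounds it enforces are the ones stated above (ICONF_VEC has NB_WARP elements, SYMCONF is declared 0:LMAXCONFIGS, and ICONF comes from the DO J=1,SYMCONF(0) loop).

```fortran
! Hypothetical debugging helper (not part of madgraph4gpu): validates each
! level of CONFSUB(1,SYMCONF(ICONF_VEC(CURR_WARP))) before it is used.
module iconf_guard
  implicit none
  private
  public :: check_iconf_chain
contains
  ! Returns 0 if every index in the chain is in range, otherwise a code
  ! identifying the first bad level:
  !   1 = CURR_WARP outside 1..NB_WARP (bounds of ICONF_VEC)
  !   2 = SYMCONF(0) outside 1..LMAXCONFIGS
  !   3 = ICONF_VEC(CURR_WARP) outside 1..SYMCONF(0) (range of the DO J loop)
  integer function check_iconf_chain(curr_warp, iconf_vec, symconf) result(ierr)
    integer, intent(in) :: curr_warp
    integer, intent(in) :: iconf_vec(:)   ! assumed to have NB_WARP elements
    integer, intent(in) :: symconf(0:)    ! assumed declared as 0:LMAXCONFIGS
    integer :: iconf, lmaxconfigs

    lmaxconfigs = ubound(symconf, 1)
    ierr = 0
    if (curr_warp < 1 .or. curr_warp > size(iconf_vec)) then
      ierr = 1
      return
    end if
    if (symconf(0) < 1 .or. symconf(0) > lmaxconfigs) then
      ierr = 2
      return
    end if
    iconf = iconf_vec(curr_warp)
    if (iconf < 1 .or. iconf > symconf(0)) then
      ierr = 3
      return
    end if
  end function check_iconf_chain
end module iconf_guard
```

With the value reported in this issue, ICONF_VEC(CURR_WARP)=32765, such a check would return 3 instead of letting the SYMCONF lookup run past its upper bound of 3. That would suggest the element was never assigned by select_grouping or was corrupted, consistent with the memory-corruption hypothesis above.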

So, checking the assignment of SYMCONF(0): this is a runtime variable that depends on the content of ../symfact.dat, and that file is read in auto_dsig.f (around line 508).

If the file is found, the code does:

         DO WHILE(.TRUE.)
           READ(LUN,*,ERR=10,END=10) XDUM, ICONF
           IF(ICONF.EQ.-MAPCONFIG(ICONFIG))THEN
             IPROC=IPROC+1
             SYMCONF(IPROC)=INT(XDUM)
           ENDIF
         ENDDO
  10     SYMCONF(0)=IPROC

Given the symfact.dat:

                1            1
                2            1
                3           -2

So SYMCONF(0) is either 1 (for G1) or 2 (for G2).

If the file is not found:

          IPROC=1
          SYMCONF(IPROC)=ICONFIG
          OPEN(UNIT=LUN,FILE='../symfact.dat',STATUS='OLD',ERR=20)
  -> 20     SYMCONF(0)=IPROC
           WRITE(*,*)'Error opening symfact.dat. No permutations used.'

So SYMCONF(0)=1 in that case, as expected (and in this case G3 does make sense).
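Both branches can be replayed outside madevent to confirm the SYMCONF(0) values quoted above. This is a hypothetical sketch, not the generated auto_dsig.f code: the module symconf_replay and the subroutine build_symconf are illustrative names, the two symfact.dat columns are passed as in-memory arrays instead of being read from unit LUN, and the caller supplies MAPCONFIG(ICONFIG) directly (the test assumes MAPCONFIG(ICONFIG)=ICONFIG for this gg_tt example, which the thread does not state).

```fortran
! Hypothetical replay of the SYMCONF-building logic quoted above, using an
! in-memory copy of symfact.dat instead of reading logical unit LUN.
module symconf_replay
  implicit none
  private
  public :: build_symconf
contains
  subroutine build_symconf(iconfig, mapped, xdum, iconf_col, file_found, symconf)
    integer, intent(in)          :: iconfig       ! channel being integrated (e.g. 1 for G1)
    integer, intent(in)          :: mapped        ! value of MAPCONFIG(ICONFIG), supplied by the caller
    double precision, intent(in) :: xdum(:)       ! first column of symfact.dat
    integer, intent(in)          :: iconf_col(:)  ! second column of symfact.dat
    logical, intent(in)          :: file_found    ! .false. mimics the ERR=20 branch
    integer, intent(out)         :: symconf(0:)   ! SYMCONF(0:LMAXCONFIGS)
    integer :: iproc, k

    symconf = 0
    ! Done before the OPEN in both branches: the channel itself is entry 1.
    iproc = 1
    symconf(iproc) = iconfig
    if (file_found) then
      ! Same test as IF(ICONF.EQ.-MAPCONFIG(ICONFIG)). Like the quoted loop,
      ! this does not guard against more matches than LMAXCONFIGS.
      do k = 1, size(iconf_col)
        if (iconf_col(k) == -mapped) then
          iproc = iproc + 1
          symconf(iproc) = int(xdum(k))
        end if
      end do
    end if
    ! Label 10 (file read) and label 20 (file missing) both end here.
    symconf(0) = iproc
  end subroutine build_symconf
end module symconf_replay
```

Under that mapping assumption, G1 gives SYMCONF(0)=1 (no line has -1 in the second column), G2 gives SYMCONF(0)=2 with SYMCONF(2)=3 (from the line "3 -2"), and file_found=.false. gives SYMCONF(0)=1, so in every case ICONF stays within 1..2, well below the bound of 3.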

valassi · 2024-07-19T11:49:16Z

> Concerning the crash, I have investigated the reported line and, long story short, I do not see any potential issue there. So my bet is that this is memory corruption due to inconsistent nb_warp/warp_size values on the Fortran side and in the C++ build.

> At least, I would stop focusing on this issue (and propose that we close it for the moment). Do you agree, @valassi?

Thanks Olivier :-)

Yes, I agree. This is most likely related to the nb_warp_used crashes, fixed (with a patch to be improved later) in #885.

So ok for me to close this, thanks.

> So, checking the assignment of SYMCONF(0): this is a runtime variable that depends on the content of ../symfact.dat, and that file is read in auto_dsig.f (around line 508).

Thanks also for this explanation. I think it is also useful for another issue, i.e. making sure that some madevent tests use several channels. I opened #927 as a placeholder and added your post as a link.

Closing. Fixed by #882 (probably via #885).

valassi self-assigned this Jul 5, 2024

valassi mentioned this issue Jul 5, 2024

Merge of master into master_june24 and channelid fixes/reimplementation #882

Merged

valassi mentioned this issue Jul 5, 2024

master_june24: document WARP_SIZE in vector.inc (clarified: WARP_SIZE vs VECSIZE_MEMMAX) #887

Open

oliviermattelaer self-assigned this Jul 16, 2024

oliviermattelaer removed their assignment Jul 16, 2024

oliviermattelaer added this to the warp milestone Jul 18, 2024

valassi mentioned this issue Jul 19, 2024

(after master_june24) ensure that some madevent tests use several channels #927

Open

valassi closed this as completed Jul 19, 2024

valassi assigned oliviermattelaer and unassigned valassi Jul 19, 2024
