NPROCS .gt. NELEM segfaults in CAM #1083

Closed
erichlf opened this issue Sep 30, 2016 · 3 comments

Comments

@erichlf
Member

erichlf commented Sep 30, 2016

When using more MPI tasks than the number of elements, CAM segfaults:

cesm.exe 0000000002216020 ll_mod_mp_llfree_ 69 ll_mod.F90
cesm.exe 00000000020552E2 metagraph_mod_mp_ 469 metagraph_mod.F90
cesm.exe 0000000001E5F7B5 prim_driver_mod_m 291 prim_driver_mod.F90
cesm.exe 0000000001B017B8 dyn_comp_mp_dyn_i 179 dyn_comp.F90
cesm.exe 0000000000790C4E restart_dynamics_ 352 restart_dynamics.F90
cesm.exe 0000000000661902 cam_restart_mp_ca 397 cam_restart.F90
cesm.exe 000000000061E5B1 cam_comp_mp_cam_i 167 cam_comp.F90
cesm.exe 0000000000611C09 atm_comp_mct_mp_a 262 atm_comp_mct.F90
cesm.exe 0000000000551DDB component_mod_mp_ 230 component_mod.F90
cesm.exe 00000000005411A3 cesm_comp_mod_mp_ 1145 cesm_comp_mod.F90
cesm.exe 000000000054CD11 MAIN__ 102 cesm_driver.F90

@mt5555
Contributor

mt5555 commented Sep 30, 2016

1st question, maybe @gold2718 knows this from memory: will the dycore routines be called by all MPI tasks (so the dycore has to handle the case where a task has 0 elements), or will only a subset of MPI tasks make dycore calls? If the latter, then the dycore will need its own subcommunicator because it has a few global reductions.

Edit: answering my own question: dycore calls are only done by MPI tasks with id < par%nprocs.

The only way I can see this working is if the "par" struct is associated with a dycore subcommunicator that has at most nelem MPI tasks.

If the code is set up to work this way, the first place to look for a bug would be the code which initializes "par". It should create a subcommunicator with at most 'nelem' MPI tasks.
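
As a rough illustration (not the actual CAM/HOMME code; mpicom, nelem, and the MPI module are assumed to be in scope, and the variable names are mine), one standard way to build a subcommunicator containing at most nelem tasks is to give the surplus tasks MPI_UNDEFINED as their color, so they get MPI_COMM_NULL back and simply skip the dycore:

integer :: ierr, my_rank, color, dycore_comm
call MPI_Comm_rank(mpicom, my_rank, ierr)
if (my_rank < nelem) then
   color = 0                  ! these tasks join the dycore subcommunicator
else
   color = MPI_UNDEFINED      ! these tasks get MPI_COMM_NULL back
end if
call MPI_Comm_split(mpicom, color, my_rank, dycore_comm, ierr)
! tasks left with dycore_comm == MPI_COMM_NULL must skip all dycore calls,
! including its global reductions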

@erichlf
Member Author

erichlf commented Oct 13, 2016

I have been looking over the code in components/cam/src/dynamics/se/dyn_comp.F90, which is the code that seems to be at issue here. In particular, there are several if (iam < par%nprocs) then checks, which seem crazy if you assume that iam is the global rank and par%nprocs is the global number of processors. However, if iam is the global rank and par%nprocs is a local number of processors, this would make a lot of sense.

While trying to figure out these details, I was led to components/homme/src/share/parallel_mod.F90 and components/cam/src/util/spmd_utils.F90. parallel_mod.F90 is where par%nprocs is defined, and it turns out that it does indeed refer to a local communicator. On the other hand, iam comes from spmd_utils.F90, and that appears to be a local rank. So if (iam < par%nprocs) then doesn't seem to make sense in this case.

Now, if I could get hold of the global rank, I believe this would fix the problem. Does anyone know how to get hold of the global rank? Also, please feel free to verify that those variables are indeed local MPI communicator variables.
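
To be concrete about what I mean by global vs. local rank, here is a minimal sketch using MPI directly rather than the CAM wrappers (illustrative only; the variable names are mine):

integer :: ierr, global_rank, sub_rank
call MPI_Comm_rank(MPI_COMM_WORLD, global_rank, ierr)   ! rank among all tasks in the job
call MPI_Comm_rank(par%comm, sub_rank, ierr)            ! rank within the dycore subcommunicator only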

@mt5555
Contributor

mt5555 commented Oct 13, 2016

I believe iam will be the rank in the CAM communicator, while par%nprocs is the number of procs in the dycore subcommunicator. So the code if (iam < par%nprocs) looks correct.

In parallel_mod.F90, this block of code is splitting the atmosphere communicator, creating a subcommunicator for the dycore, stored in par%comm.

if(present(npes_in)) then
   npes_homme=npes_in
else
   npes_homme=npes_cam
end if
call MPI_comm_size(mpicom,npes_cam,ierr)
call MPI_comm_rank(mpicom,iam_cam,ierr)
color = iam_cam/npes_homme
call mpi_comm_split(mpicom, color, iam_cam, par%comm, ierr)

So I suspect the only problem is that the computation of npes_homme is bad. If npes_in is missing, then npes_cam is used uninitialized (clearly a bug), since MPI_comm_size is only called afterwards. In the missing case, npes_homme needs to be set to min(npes_cam, nelem), after npes_cam is computed. If npes_in is specified, then this would be a good place to check that npes_in <= nelem and abort if not.
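
A possible reordering along those lines (a sketch only, not tested against the actual parallel_mod.F90; it assumes nelem is available at this point and uses abortmp as a placeholder for whatever error handler is appropriate):

call MPI_comm_size(mpicom, npes_cam, ierr)    ! size must be known before it is used below
call MPI_comm_rank(mpicom, iam_cam, ierr)
if (present(npes_in)) then
   if (npes_in > nelem) call abortmp('npes_in exceeds the number of elements')
   npes_homme = npes_in
else
   npes_homme = min(npes_cam, nelem)          ! never more dycore tasks than elements
end if
color = iam_cam/npes_homme
call mpi_comm_split(mpicom, color, iam_cam, par%comm, ierr)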

mt5555 pushed a commit that referenced this issue Nov 10, 2016
There has been an existing issue (#1083) where we get a segfault when requesting more MPI processes than the number of elements in CAM. To mitigate this issue, I have added an abort when this happens, so that we at least don't segfault and instead get a clear error message telling us what went wrong.

[BFB]

fixes #1083
mt5555 closed this as completed in f617102 Nov 17, 2016
mt5555 added a commit that referenced this issue Nov 17, 2016
agsalin pushed a commit that referenced this issue Apr 13, 2017
add --mpilib option to create_test

Add an mpilib option to create_test; this option will override the default but not any mpilib
explicitly added in an _M testname modifier.
Test suite: scripts_regression_tests & create_test cime_developer --mpilib mpi-serial
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1083

User interface changes?: adds --mpilib optional argument to create_test

Code review:
jgfouca pushed a commit that referenced this issue Mar 14, 2018
rljacob pushed a commit that referenced this issue Apr 16, 2021
rljacob pushed a commit that referenced this issue May 6, 2021