NPROCS .gt. NELEM segfaults in CAM #1083
Comments
First question, maybe @gold2718 knows this from memory: will the dycore routines be called by all MPI tasks (so the dycore has to handle the case where a task has 0 elements), or will only a subset of MPI tasks make dycore calls? If the latter, then the dycore will need its own subcommunicator because it has a few global reductions.

Edit: answering my own question: dycore calls are only made by MPI tasks with id < par%nprocs. The only way I can see this working is if the "par" struct is associated with a dycore subcommunicator that has at most nelem MPI tasks. If the code is set up to work this way, the first place to look for a bug would be the code that initializes "par". It should create a subcommunicator with at most nelem MPI tasks.
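A minimal, self-contained sketch of that guard pattern, assuming a plain MPI Fortran environment; the `nelem` value and `dycore_nprocs` variable are illustrative stand-ins for the real element count and `par%nprocs`, and a print statement takes the place of the actual dycore call:

```fortran
program guard_sketch
   ! Sketch only: demonstrates the "only ranks below par%nprocs call the
   ! dycore" pattern from the comment above.
   use mpi
   implicit none
   integer :: iam, npes, ierr
   integer :: dycore_nprocs          ! plays the role of par%nprocs
   integer, parameter :: nelem = 6   ! pretend element count

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, iam, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)

   dycore_nprocs = min(npes, nelem)  ! at most one task per element

   if (iam < dycore_nprocs) then
      print *, 'rank', iam, 'would call the dycore'
   end if                            ! ranks >= dycore_nprocs do no dycore work

   call MPI_Finalize(ierr)
end program guard_sketch
```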
I have been looking over the code in components/cam/src/dynamics/se/dyn_comp.F90, which is the code that seems to be at issue here. While trying to figure out the details I was led to components/homme/src/share/parallel_mod.F90 and components/cam/src/util/spmd_utils.F90. parallel_mod.F90 is where par%nprocs is defined, and it turns out that this is indeed a local communicator. On the other hand, iam comes from spmd_utils.F90 and appears to be a local rank. Now, if I could get hold of the global rank, I believe that would fix the problem. Does anyone know how to get hold of the global rank? Also, please feel free to verify that those variables are indeed local MPI communicator variables.
I believe iam will be the rank in the CAM communicator, while par%nprocs is the number of procs in the dycore subcommunicator, so the guard if (iam < par%nprocs) looks correct. In parallel_mod.F90, this block of code splits the atmosphere communicator, creating a subcommunicator for the dycore that is stored in par%comm.
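For illustration, a minimal sketch of such a split. The names parallel_t, mpicom_atm, and the fields used here are stand-ins modeled on the discussion, not the literal code in parallel_mod.F90:

```fortran
module dycore_comm_sketch
   ! Sketch only: parallel_t is a stand-in for HOMME's type in parallel_mod.F90.
   use mpi
   implicit none

   type :: parallel_t
      integer :: comm   = MPI_COMM_NULL
      integer :: nprocs = 0
      integer :: rank   = -1
   end type parallel_t

contains

   subroutine split_dycore_comm(mpicom_atm, iam, npes_homme, par)
      integer, intent(in)  :: mpicom_atm   ! full CAM/atmosphere communicator
      integer, intent(in)  :: iam          ! rank in mpicom_atm
      integer, intent(in)  :: npes_homme   ! dycore size, expected <= nelem
      type(parallel_t), intent(out) :: par
      integer :: color, ierr

      ! Ranks below npes_homme join the dycore subcommunicator; the rest
      ! pass MPI_UNDEFINED and get MPI_COMM_NULL back from the split.
      if (iam < npes_homme) then
         color = 0
      else
         color = MPI_UNDEFINED
      end if
      call MPI_Comm_split(mpicom_atm, color, iam, par%comm, ierr)

      if (par%comm /= MPI_COMM_NULL) then
         call MPI_Comm_size(par%comm, par%nprocs, ierr)
         call MPI_Comm_rank(par%comm, par%rank, ierr)
      end if
   end subroutine split_dycore_comm

end module dycore_comm_sketch
```

Ranks that pass MPI_UNDEFINED receive MPI_COMM_NULL, which matches the behavior the comment above relies on: only tasks inside par%comm participate in the dycore's global reductions.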
So I suspect the only problem is that the computation of npes_homme is bad. If npes_in is missing, then npes_cam is used uninitialized (clearly a bug). In the missing case, it needs to be set to min(npes_cam, nelem), after npes_cam is computed. If npes_in is specified, then this would be a good place to check that npes_in <= nelem and abort if not.
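A sketch of that suggested logic, using the names from the comment above; endrun is assumed here to be CAM's abort routine from cam_abortutils:

```fortran
subroutine set_npes_homme(npes_cam, nelem, npes_in, npes_homme)
   ! Sketch only: compute the dycore task count, never exceeding nelem.
   use cam_abortutils, only: endrun
   implicit none
   integer, intent(in)           :: npes_cam   ! tasks in the CAM communicator
   integer, intent(in)           :: nelem      ! number of SE elements
   integer, intent(in), optional :: npes_in    ! optional namelist override
   integer, intent(out)          :: npes_homme ! tasks for the dycore

   if (present(npes_in)) then
      if (npes_in > nelem) then
         ! Abort with a clear message rather than segfault later.
         call endrun('set_npes_homme: npes_in exceeds the number of elements (nelem)')
      end if
      npes_homme = npes_in
   else
      ! Default: use the CAM decomposition, but never more tasks than elements.
      npes_homme = min(npes_cam, nelem)
   end if
end subroutine set_npes_homme
```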
There has been an existing issue (#1083) where we get a segfault when requesting more MPI processes than the number of elements in CAM. To mitigate this, I have added an abort when this happens so that we at least don't segfault and instead get a clear error message telling us what went wrong. [BFB] Fixes #1083
add --mpilib option to create_test

Add an mpilib option to create_test; this option will override the default, but not any mpilib explicitly added in an _M testname modifier.

Test suite: scripts_regression_tests & create_test cime_developer --mpilib mpi-serial
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes #1083
User interface changes?: adds --mpilib optional argument to create_test
Code review:
When using more MPI tasks than the number of elements, CAM segfaults.
cesm.exe 0000000002216020 ll_mod_mp_llfree_ 69 ll_mod.F90
cesm.exe 00000000020552E2 metagraph_mod_mp_ 469 metagraph_mod.F90
cesm.exe 0000000001E5F7B5 prim_driver_mod_m 291 prim_driver_mod.F90
cesm.exe 0000000001B017B8 dyn_comp_mp_dyn_i 179 dyn_comp.F90
cesm.exe 0000000000790C4E restart_dynamics_ 352 restart_dynamics.F90
cesm.exe 0000000000661902 cam_restart_mp_ca 397 cam_restart.F90
cesm.exe 000000000061E5B1 cam_comp_mp_cam_i 167 cam_comp.F90
cesm.exe 0000000000611C09 atm_comp_mct_mp_a 262 atm_comp_mct.F90
cesm.exe 0000000000551DDB component_mod_mp_ 230 component_mod.F90
cesm.exe 00000000005411A3 cesm_comp_mod_mp_ 1145 cesm_comp_mod.F90
cesm.exe 000000000054CD11 MAIN__ 102 cesm_driver.F90