nprocs .gt. nelems #1121

Merged
merged 5 commits into from
Nov 17, 2016

Conversation

erichlf
Member

erichlf commented Nov 3, 2016

There is an existing issue (#1083): requesting more MPI processes than there are elements in CAM causes a segfault. To mitigate this, I have added an abort for that case, so that instead of a segfault we get a clear error message that says what went wrong.

Eventually, the plan is to track down the underlying issue, since we are supposed to be able to use more MPI processes than elements in CAM.
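
The check described above can be sketched as a small predicate. This is only an illustration under stated assumptions, not the code merged in this PR: the module name `nprocs_guard` and the function `too_many_tasks` are hypothetical, and the two integer arguments stand in for the process count and the element count.

```fortran
module nprocs_guard
  implicit none
  private
  public :: too_many_tasks

contains

  ! Hypothetical helper: .true. when more MPI tasks are requested than
  ! there are elements. nprocs and nelem are illustrative stand-ins for
  ! the values the model would pass in.
  pure logical function too_many_tasks(nprocs, nelem)
    integer, intent(in) :: nprocs, nelem
    too_many_tasks = nprocs > nelem
  end function too_many_tasks

end module nprocs_guard
```

A caller would test `too_many_tasks(nprocs, nelem)` once the element count is known and abort with a descriptive message when it returns `.true.`, instead of continuing into code that segfaults.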

Erich L Foster added 3 commits October 24, 2016 10:18
Saving my place. Right now I still have not determined a place where nelem is
set and won't cause a segfault.
This fixes the segfault when requesting more processes than elements. Now we
abort the simulation and print an error message.

[BFB]
My vi replaced existing tabs with spaces and I didn't notice. This is to fix the
tabulation to be aligned appropriately.

[BFB]
@mt5555
Contributor

mt5555 commented Nov 3, 2016

two requests:

  1. Please add to this PR the bugfix we found in parallel_mod.F90 (npes_cam used before it is set).
  2. The error message needs to inform users about the dyn_npes namelist variable. Something like
    "To run with more MPI tasks than elements in CAM, set namelist variable dyn_npes"

Added a mention of dyn_npes in the error message when using more pes than
elements. Also, fixed an issue where npes_cam was being used before it was set.

[BFB]
@erichlf
Member Author

erichlf commented Nov 3, 2016

Okay, I have fixed those issues.

@mt5555
Contributor

mt5555 commented Nov 3, 2016

error message looks to be more than 128 characters, overflowing the errmsg variable.
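
To illustrate the length concern, here is a minimal sketch of what happens when text is assigned to a fixed-length character variable. It assumes the buffer is declared as `character(len=128)`, based on the 128 characters mentioned above; the module and routine names are hypothetical, and how the PR actually filled `errmsg` is not shown in this thread.

```fortran
module errmsg_len
  implicit none
  private
  public :: errmsg_max, fits_errmsg, store_errmsg

  ! Assumed buffer length, taken from the 128 characters mentioned above.
  integer, parameter :: errmsg_max = 128

contains

  ! .true. if msg, ignoring trailing blanks, fits in the buffer.
  pure logical function fits_errmsg(msg)
    character(len=*), intent(in) :: msg
    fits_errmsg = len_trim(msg) <= errmsg_max
  end function fits_errmsg

  ! Plain assignment into a fixed-length character variable: characters
  ! past errmsg_max are silently dropped, shorter text is blank-padded.
  pure function store_errmsg(msg) result(buf)
    character(len=*), intent(in) :: msg
    character(len=errmsg_max) :: buf
    buf = msg
  end function store_errmsg

end module errmsg_len
```

With plain assignment the tail of an over-long message is lost without any warning, so checking `fits_errmsg` first, or shortening the message, avoids a cut-off hint to the user.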

@mt5555
Contributor

mt5555 commented Nov 6, 2016

Also: format statement integer sizes (I5) are too small for nelem and par%nprocs.

How about we simplify everything and just use
call abortmp('Error: too many MPI tasks. set dyn_npes <= nelem')
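
The I5 remark can be illustrated with a short sketch. In Fortran, an `Iw` edit descriptor fills the field with asterisks when the value needs more than `w` characters, so any count above 99999 would be unreadable with I5. The module and function names below are hypothetical and exist only for this example; `I0` is shown as one standard alternative that uses the minimum width needed.

```fortran
module int_width
  implicit none
  private
  public :: fmt_i5, fmt_i0

contains

  ! Formats n in a fixed five-character field (I5). Values that need
  ! more than five characters come back as asterisks.
  function fmt_i5(n) result(s)
    integer, intent(in) :: n
    character(len=5) :: s
    write(s, '(I5)') n
  end function fmt_i5

  ! Formats n with I0, which uses exactly as many characters as needed.
  function fmt_i0(n) result(s)
    integer, intent(in) :: n
    character(len=:), allocatable :: s
    character(len=32) :: buf
    write(buf, '(I0)') n
    s = trim(buf)
  end function fmt_i0

end module int_width
```

Dropping the numbers from the message altogether, as proposed above, sidesteps both the field-width and the buffer-length problems.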

Made the nprocs gt nelem error message simpler and switched from endrun to
mpiabort.

[BFB]
@erichlf
Copy link
Member Author

erichlf commented Nov 7, 2016

@mt5555 Done.

mt5555 pushed a commit that referenced this pull request Nov 10, 2016
There has been an existing issue (#1083) where we get a segfault when
requesting more mpi processes than number of elements in cam. To
mitigate this issue I have added an abort when this happens so that we
at least don't segfault and get a nice error message which tells us
what went wrong.

[BFB]

fixes #1083
@mt5555 mt5555 merged commit e2fb2eb into master Nov 17, 2016
mt5555 added a commit that referenced this pull request Nov 17, 2016
There has been an existing issue (fixes #1083) where we get a segfault when
requesting more mpi processes than number of elements in cam. To
mitigate this issue I have added an abort when this happens so that we
at least don't segfault and get a nice error message which tells us
what went wrong.
@erichlf erichlf deleted the erichlf/cam/NPROCS-gt-NELEMS branch July 25, 2017 17:02
Labels
Atmosphere, BFB (PR leaves answers BFB), bug fix PR