
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 6. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. #234

Open
WHEREISSHE opened this issue May 22, 2022 · 4 comments

Comments

@WHEREISSHE

Hi there. When I run Exec/RegTests/EB_FlamePastCylinder, the make step goes fine, but something does not work properly when I run ./PeleLM3d.gnu.MPI.ex inputs.3d-regt . The run aborts with:

amrex::Abort::0::MLMG failed !!!
SIGABRT
See Backtrace.0 file for details
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

@esclapez
Contributor

This indicates that one of the linear solvers did not manage to converge, probably because the tolerances are too tight. Could you increase the linear solver verbosity:

mac_proj.verbose  = 2
nodal_proj.verbose  = 2

and re-try? The solver most likely hangs slightly above the required tolerance. Once you've identified which solver is responsible for the problem, you can relax its tolerance slightly.
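As a sketch, the two keys can either be added to inputs.3d-regt or passed as command-line overrides, which AMReX's ParmParse also accepts; the run command below just mirrors the one from the report above:

# Option 1: add to inputs.3d-regt
mac_proj.verbose   = 2
nodal_proj.verbose = 2

# Option 2: append key=value overrides to the run command
./PeleLM3d.gnu.MPI.ex inputs.3d-regt mac_proj.verbose=2 nodal_proj.verbose=2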

@WHEREISSHE
Author

Thank you! I followed your instructions, but it still failed, with the message:

MLMG: Failed to converge after 100 iterations. resid, resid/bnorm = 3.084014821e-09, 1.680522099e-12
amrex::Abort::0::MLMG failed !!!
SIGABRT

Should I tune other parameters? More specifically, how can I find suitable parameters to tune?

@WHEREISSHE
Author

It seemed to work properly when I increased the tolerance to 1.0e-8, but I still have no idea whether this value is suitable. Actually, I am wondering how to choose good values for the tolerance and the verbosity. Thanks.

@esclapez
Contributor

esclapez commented Jun 3, 2022

So, if you keep the verbosity at 2, the standard output will get significantly longer, but you will be able to keep track of the linear solvers' behavior. When it comes to tolerances, the one you mostly want to control is the relative one:

mac_proj.rtol = 1e-10
nodal_proj.rtol = 1e-10

And in my experience, having to go above 1e-9 might indicate that something is wrong in the setup, unless you have added multiple levels and have very fine grids. From the message you pasted above,

MLMG: Failed to converge after 100 iterations. resid, resid/bnorm = 3.084014821e-09, 1.680522099e-12

the relative residual hung at ~1e-12, so going to 1e-10 should relax the constraint enough for the solver to move forward.
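For completeness, a minimal sketch of the inputs changes discussed in this thread (the 1e-10 values are the ones suggested above; the verbosity can be dialed back down once the solvers converge again):

# relax the relative tolerance of the projection solvers
mac_proj.rtol   = 1.0e-10
nodal_proj.rtol = 1.0e-10

# keep the verbosity up while diagnosing, then lower it again
mac_proj.verbose   = 2
nodal_proj.verbose = 2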
