For MPI-based ensemble management in gmxapi 0.0.7, race conditions could develop in various ways, because assumptions that seem natural in one use case may be far from general.
MPI barriers before and/or after simulations could mitigate some race conditions and even avoid (or at least clarify) some MPI errors (stemming from untagged mismatched MPI calls), and the barriers shouldn't introduce much cost. Still, they should probably be optional.
A barrier before starting a simulation ensures that no simulation launches until the MPI environment is in the expected state everywhere.
A barrier after a simulation ensures that no process tries to access artifacts or make additional MPI calls until all simulations in the ensemble are complete.
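A minimal sketch, assuming mpi4py, of where those two optional barriers would sit relative to a simulation phase; `run_simulation_phase()` is a hypothetical placeholder for one ensemble member's work:

```python
from mpi4py import MPI

def run_simulation_phase(rank: int) -> None:
    """Hypothetical stand-in for one ensemble member's simulation work."""

comm = MPI.COMM_WORLD

comm.Barrier()  # before: no member launches until every rank reaches this point
run_simulation_phase(comm.Get_rank())
comm.Barrier()  # after: no rank touches artifacts or makes further MPI calls
                # until every member has finished
```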
However, gmxapi already creates a new communicator for each Session in which to run the simulations, so it should not be susceptible to problems from mismatched MPI calls across multiple simulation phases. Note that a wider-than-necessary communicator causes a warning, but gmx.context.py::_get_mpi_ensemble_communicator() does split the communicator to an appropriate size to avoid unmatched MPI calls.
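For reference, sizing a sub-communicator like that can be done with a plain `Comm.Split()`. This is only a rough mpi4py illustration in the spirit of `_get_mpi_ensemble_communicator()`, not its actual implementation (`ensemble_size` is an assumed value):

```python
from mpi4py import MPI

ensemble_size = 4  # assumed number of ensemble members
world = MPI.COMM_WORLD

# Ranks beyond the ensemble pass MPI.UNDEFINED and receive MPI.COMM_NULL,
# so they participate in no collective calls on the ensemble communicator
# and cannot produce unmatched MPI calls.
color = 0 if world.Get_rank() < ensemble_size else MPI.UNDEFINED
ensemble_comm = world.Split(color, key=world.Get_rank())
```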
We can document:
Ensembles using MPI or ensemble facilities (like ensemble_reduce) will hang unless all simulations in the ensemble make matching calls.
If users need to synchronize the start or completion of an ensemble simulation phase, they can use their own barrier.
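To illustrate the first point, here is a hypothetical sketch of the matching-call requirement, with an mpi4py `Allreduce` standing in for `ensemble_reduce`: if any member of the ensemble communicator skips the collective, the others block indefinitely.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD  # stand-in for the ensemble communicator

local = np.array([float(comm.Get_rank())])
reduced = np.empty(1)

# Every member must reach this call; a member that returns early or raises
# before this line leaves the rest of the ensemble hanging here.
comm.Allreduce(local, reduced, op=MPI.SUM)
```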
Yes, actually, we avoid some of the problems this is intended to solve just by using a sub-communicator for the simulations. In other cases, I now recall that we had talked about more idiomatic synchronization through special WorkElements.
This whole issue is so different between gmxapi 0.0.7 and gmxapi 0.1 that I'm not sure what, if anything, needs to be done in 0.0.7. Maybe it is just a documentation issue.
eirrgang changed the title from "MPI barrier to synchronize ensembles" to "Avoid race conditions from unsynchronized ensembles" on Jul 18, 2019.