
Avoid race conditions from unsynchronized ensembles #231

Open
eirrgang opened this issue Jul 17, 2019 · 3 comments
Labels
enhancement gmxapi pertains to this repository and the Python module support


eirrgang commented Jul 17, 2019

For MPI-based ensemble management in gmxapi 0.0.7, there are various ways race conditions could develop because seemingly natural assumptions in one use-case may be very non-general.

MPI barriers before and/or after simulations could mitigate some race conditions and even avoid (or at least clarify) some MPI errors (stemming from untagged mismatched MPI calls), and the barriers shouldn't introduce much cost. Still, they should probably be optional.

  1. A barrier before starting a simulation ensures that no simulation launches until the MPI environment is in the expected state everywhere.
  2. A barrier after a simulation ensures that no process tries to access artifacts or issue further MPI calls until all simulations in the ensemble are complete.

However, gmxapi already creates a new communicator for each Session in which to run the simulations, so it should not be susceptible to problems from mismatched MPI calls across multiple simulation phases. Note that a wider-than-necessary communicator causes a warning, but gmx.context.py::_get_mpi_ensemble_communicator() does split the comm to an appropriate size to avoid unmatched MPI calls.

We can document:

  1. Ensembles using MPI or ensemble facilities (like ensemble_reduce) will hang if not all simulations in the ensemble make matching calls.
  2. If users need to synchronize the start or completion of an ensemble simulation phase, they can use their own barrier.
@eirrgang eirrgang added enhancement gmxapi pertains to this repository and the Python module support labels Jul 17, 2019
@eirrgang eirrgang self-assigned this Jul 17, 2019
@peterkasson
Collaborator

I think we want to be very careful with such an idea. This throws away a lot of performance benefits. It's a Bad Thing to have barriers in the general case.

@peterkasson
Collaborator

PS a barrier before the simulations start might be reasonable, but we'd want to examine the idea carefully.

@eirrgang
Collaborator Author

Yes, actually, we avoid some of the problems this is intended to solve simply by using a sub-communicator for the simulations. In other cases, I now recall that we had talked about more idiomatic synchronization through special WorkElements.

This whole issue is so different between gmxapi 0.0.7 and gmxapi 0.1 that I'm not sure what, if anything, needs to be done in 0.0.7. Maybe it is just a documentation issue.

@eirrgang eirrgang changed the title MPI barrier to synchronize ensembles Avoid race conditions from unsynchronized ensembles Jul 18, 2019