
Avoid race conditions from unsynchronized ensembles #231

Open
eirrgang opened this issue Jul 17, 2019 · 3 comments
Labels
enhancement gmxapi pertains to this repository and the Python module support


eirrgang commented Jul 17, 2019

For MPI-based ensemble management in gmxapi 0.0.7, there are various ways race conditions could develop because seemingly natural assumptions in one use-case may be very non-general.

MPI barriers before and/or after simulations could mitigate some race conditions and even avoid (or at least clarify) some MPI errors (stemming from untagged mismatched MPI calls), and the barriers shouldn't introduce much cost. Still, they should probably be optional.

  1. A barrier before starting a simulation ensures that no simulation launches until the MPI environment is in the expected state everywhere.
  2. A barrier after a simulation ensures that no process tries to access artifacts or issue further MPI calls until all simulations in the ensemble are complete.

However, gmxapi already creates a new communicator for each Session in which to run the simulations, so it should not be susceptible to problems from mismatched MPI calls across multiple simulation phases. Note that a wider-than-necessary communicator causes a warning, but gmx.context.py::_get_mpi_ensemble_communicator() does split the comm to an appropriate size to avoid unmatched MPI calls.

We can document:

  1. Ensembles using MPI or ensemble facilities (like ensemble_reduce) will hang if not all simulations in the ensemble make matching calls.
  2. If users need to synchronize the start or completion of an ensemble simulation phase, they can use their own barrier.
@eirrgang eirrgang added enhancement gmxapi pertains to this repository and the Python module support labels Jul 17, 2019
@eirrgang eirrgang self-assigned this Jul 17, 2019
@peterkasson
Collaborator

I think we want to be very careful with such an idea. This throws away a lot of performance benefits. It's a Bad Thing to have barriers in the general case.

@peterkasson
Collaborator

PS a barrier before the simulations start might be reasonable, but we'd want to examine the idea carefully.

@eirrgang
Collaborator Author

Yes, actually, we avoid some of the problems this is intended to solve simply by using a sub-communicator for the simulations. In other cases, I now recall that we had talked about more idiomatic synchronization through special WorkElements.

This whole issue is so different between gmxapi 0.0.7 and gmxapi 0.1 that I'm not sure what, if anything, needs to be done in 0.0.7. Maybe it is just a documentation issue.

@eirrgang eirrgang changed the title MPI barrier to synchronize ensembles Avoid race conditions from unsynchronized ensembles Jul 18, 2019