
[WIP] MPI update #328

Merged
merged 47 commits from mpi_update into master on Dec 20, 2023

Conversation

quaquel
Owner

@quaquel quaquel commented Dec 8, 2023

This pull request makes the MPIEvaluator feature complete. It adds

  • centralized logging, analogous to the MultiprocessingEvaluator and the IpyparallelEvaluator
  • handling of WorkingDirectoryModels
  • a fix for the use of initializer functions for worker processes
  • support for chunksize as a kwarg on perform_experiments (see the sketch after this list)
  • various performance enhancements
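
A minimal usage sketch, assuming MPIEvaluator is importable from the ema_workbench top level and that chunksize is forwarded to the workers as described above; the toy model function and parameter names are made up for illustration:

```python
from ema_workbench import (
    Model,
    RealParameter,
    ScalarOutcome,
    MPIEvaluator,
    ema_logging,
)


def toy_model(x1=0.5):
    # purely illustrative stand-in for e.g. the lake model
    return {"y": x1**2}


model = Model("toy", function=toy_model)
model.uncertainties = [RealParameter("x1", 0, 1)]
model.outcomes = [ScalarOutcome("y")]

# centralized logging, analogous to the MultiprocessingEvaluator
ema_logging.log_to_stderr(ema_logging.INFO)

if __name__ == "__main__":
    with MPIEvaluator(model) as evaluator:
        # chunksize controls how many experiments are sent to a worker at once
        results = evaluator.perform_experiments(scenarios=1000, chunksize=25)
```

On a cluster such as DelftBlue this would typically be launched through mpiexec or srun with mpi4py available; the exact launch line depends on the MPI implementation (MPICH vs OpenMPI) and the SLURM setup, which is what the example slurm file in this PR covers.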

Some more thoughts

  • The code has already been tested on DelftBlue using the new example_mpi_lake_model and associated slurm file.
  • DelftBlue sometimes seems to fail to spawn all workers. This results in a deadlock in the logging setup, because creating an MPI intercommunicator is a blocking operation and the intercommunicator is only used to centralize logging. There are two possible fixes: add a timeout and fail the process, or make centralized logging a user-specifiable choice (see the logging sketch below this description).
  • The handling of WorkingDirectoryModels is implemented using the same structure as the MultiprocessingEvaluator, but still needs to be tested.

fixes #261
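
As an aside on the logging point above (not code from this PR): a plain logging.Filter can tag every record with the MPI rank, which works regardless of whether records are centralized over an intercommunicator. A minimal sketch, assuming mpi4py is installed:

```python
import logging

from mpi4py import MPI


class RankFilter(logging.Filter):
    """Attach the MPI rank of the current process to every log record."""

    def __init__(self):
        super().__init__()
        self.rank = MPI.COMM_WORLD.Get_rank()

    def filter(self, record):
        record.rank = self.rank
        return True


handler = logging.StreamHandler()
handler.addFilter(RankFilter())
handler.setFormatter(logging.Formatter("[rank %(rank)s] %(levelname)s %(message)s"))

logger = logging.getLogger("EMA")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("worker ready")  # prints e.g. "[rank 3] INFO worker ready"
```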

columns with the same value for all entries don't matter for feature scoring, so they can be ignored.
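
A tiny illustration of that point (not the workbench's own code; pandas assumed, with x standing in for the experiments DataFrame):

```python
import pandas as pd


def drop_constant_columns(x: pd.DataFrame) -> pd.DataFrame:
    """Drop columns with a single unique value; they carry no information for feature scoring."""
    return x.loc[:, x.nunique(dropna=False) > 1]
```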
bunch of things happening at the same time
* code reorganization, moving mpi-specific code into a separate module
* Return of initializer
* update to logging (hacky) to include rank information in all log messages
In preparation for future updates to the mpi code, all evaluator code has been reorganized. All evaluators go into their respective module, and code shared among two or more goes into a new util package.

All relevant modules have also been renamed to future_{style of parallelization}
attempted import error fix

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@quaquel quaquel added this to the 2.5.0 milestone Dec 8, 2023
@coveralls

coveralls commented Dec 8, 2023

Coverage Status

Changes unknown when pulling 7482825 on mpi_update into master.

@EwoutH
Collaborator

EwoutH commented Dec 8, 2023

Looks like an impressive effort. If you want me to review at some point, let me know!

@quaquel
Owner Author

quaquel commented Dec 8, 2023

No rush. This is just in preparation for our meeting next week. I hope to have tested WorkingDirectory models by then as well.

Most of the commits were just stupid tests on DelftBlue. I learned a lot about MPICH and OpenMPI as well as about how slurm interacts with MPI.

@EwoutH
Collaborator

EwoutH commented Dec 8, 2023

Most of the commits were just stupid tests on DelftBlue. I learned a lot about MPICH and OpenMPI as well as about how slurm interacts with MPI.

That really describes my experience in Q1! There are so many hidden assumptions everywhere. So much trial and error.

Even for me, and I really like trial and error.

@EwoutH EwoutH marked this pull request as draft December 13, 2023 20:56
@quaquel quaquel marked this pull request as ready for review December 20, 2023 12:29
Collaborator

@EwoutH EwoutH left a comment


TODO: Test working directories

@quaquel quaquel merged commit a968cc8 into master Dec 20, 2023
@quaquel quaquel deleted the mpi_update branch December 20, 2023 12:32