Distributed Ensemble (MPI Support) #1090
Conversation
Force-pushed from 3a1c2ce to 1987d9a
I've created a simple test case that will either run on a local node (if worker count <= device count) or across multiple nodes (I could probably extend this to ensure it's 1 worker per node, but that would require some MPI comms to set up the test).

The issue with MPI testing is that MPI_Finalize() can only be called once per process, so the MPI ensemble can only be tested once per test-suite execution. Perhaps an argument for Pete's CMake test magic, as I understand that runs the test suite once per individual test. An alternative would be to add a backdoor that tells Ensemble not to finalize when it detects tests (and add some internal finalize equivalent to ensure sync). Requires discussion/agreement. The simplest option would be to provide a … The only possible use-case I can see for a distributed ensemble calling …

Changes for the error test. Add the following:

```c++
if (FLAMEGPU->getStepCounter() == 1 && counter % 13 == 0) {
    throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
}
```

Add this to the actual test body, and adjust the error handling through `ensemble.Config().error_level = CUDAEnsemble::EnsembleConfig::Fast;`.
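For context, a minimal sketch of how such an error test could be assembled, following the `error_level` naming used in the comment above. The model setup, property names, and step function here are illustrative assumptions rather than the PR's actual test code, and exact enum scoping may differ between FLAMEGPU versions:

```c++
#include "flamegpu/flamegpu.h"

// Hypothetical host step function which fails deterministically for some runs,
// mirroring the VersionMismatch throw shown above. "counter" is assumed to be a
// per-run environment property set via the RunPlanVector.
FLAMEGPU_STEP_FUNCTION(throw_on_thirteenth) {
    const int counter = FLAMEGPU->environment.getProperty<int>("counter");
    if (FLAMEGPU->getStepCounter() == 1 && counter % 13 == 0) {
        throw flamegpu::exception::VersionMismatch("Counter - %d", counter);
    }
}

int main(int argc, const char **argv) {
    flamegpu::ModelDescription model("mpi_error_test");
    model.newAgent("agent");  // trivial agent so the model is valid
    model.Environment().newProperty<int>("counter", 0);
    model.addStepFunction(throw_on_thirteenth);

    // 100 runs of 2 steps, each with a distinct "counter" so roughly 1 in 13 runs fails.
    flamegpu::RunPlanVector plans(model, 100);
    plans.setSteps(2);
    for (unsigned int i = 0; i < plans.size(); ++i)
        plans[i].setProperty<int>("counter", static_cast<int>(i));

    flamegpu::CUDAEnsemble ensemble(model, argc, argv);
    // Fast error handling: the ensemble aborts as soon as a runner reports an
    // error, rather than completing the remaining runs.
    ensemble.Config().error_level = flamegpu::CUDAEnsemble::EnsembleConfig::Fast;
    ensemble.simulate(plans);
    return 0;
}
```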
Force-pushed from 1987d9a to fce684e
https://mpitutorial.com/tutorials/running-an-mpi-cluster-within-a-lan/ Setting up MPI to run across mav+waimu seems a bit involved; probably better to try Bede. I would hope the fact that it works on a single node is evidence that it will work, though.
Force-pushed from fce684e to 775e1f4
Force-pushed from 0ad57e6 to 423af91
Happy for this to be tested on Bede and merged whilst I'm on leave. Functionality should be complete; we may just want to test it on Bede and refine how we wish to test it (e.g. make it ctest-exclusive and include an error handling test).
Force-pushed from 423af91 to 6da1d88
I'll review this and test it on Bede while you're on leave, and try to figure out a decent way to test it (and maybe move mpi_finalize to cleanup or similar, though again that would mean it can only be tested once).
Force-pushed from 6da1d88 to 1dcfaff
As discussed with @ptheywood (on Slack), will move … This will require adjustments to the documentation and tests.
Need to add MPI to the requirements list in the README if the option is enabled.
Also need to consider how this will behave with telemetry: a flag indicating MPI use, the number of ranks?, how to do the list of devices from each node, etc. (this is a note mostly for me when I review this in the near future)
I can throw in a telemetry envelope at the final barrier if desired, so rank 0 receives all GPU names.
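For illustration, a minimal sketch of one way such an envelope could be gathered onto rank 0 with MPI_Gather/MPI_Gatherv. The helper name and the use of plain strings are assumptions for illustration; the PR's own implementation is the assembleGPUsString method mentioned in a later commit:

```c++
#include <mpi.h>
#include <string>
#include <vector>

// Gather each rank's GPU-name string onto rank 0 (hypothetical helper).
std::string gatherGPUNames(const std::string &local_names, MPI_Comm comm) {
    int rank = 0, world_size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &world_size);

    // Each rank first reports how many characters it will send.
    const int send_len = static_cast<int>(local_names.size());
    std::vector<int> lengths(world_size, 0);
    MPI_Gather(&send_len, 1, MPI_INT, lengths.data(), 1, MPI_INT, 0, comm);

    // Rank 0 computes displacements and receives the concatenated strings.
    std::vector<int> displs(world_size, 0);
    int total = 0;
    if (rank == 0) {
        for (int i = 0; i < world_size; ++i) {
            displs[i] = total;
            total += lengths[i];
        }
    }
    std::vector<char> recv_buf(rank == 0 ? total : 0);
    MPI_Gatherv(local_names.data(), send_len, MPI_CHAR,
                recv_buf.data(), lengths.data(), displs.data(), MPI_CHAR, 0, comm);

    // Only rank 0 has a meaningful result; other ranks return an empty string.
    return rank == 0 ? std::string(recv_buf.begin(), recv_buf.end()) : std::string();
}
```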
Force-pushed from 74c99b0 to 1189af9
I've now added MPI to the README requirements and ensured all tests pass both with local MPI and without MPI.
Force-pushed from ae2d823 to 928cef9
…king further reported errors. Current changes *should* now report late errors on rank 0's local runners, but need more code to handle non-rank 0 late errors.
Fixed 1 lint issue, another requires some restructuring (to reduce the LoC in the main ensemble fn)
Moves MPI commands into a separate file. Still need to lint and split the header into h+cpp. May also move the local error process into a util fn.
A single simple Python test case (it calls no MPI commands externally to make it more precise)
Co-authored-by: Peter Heywood <p.heywood@sheffield.ac.uk>
…/Size (they aren't getters). Also marks them as static methods, as they do not do anything to the instance.
… detail::MPIEnsemble
…rations MPI ensembles can use multiple MPI ranks per node, evenly(ish) distributing GPUs across the ranks per shared-memory system. If more MPI ranks are used on a node than GPUs, the additional ranks will do nothing and a warning is reported. I.e. any number of MPI ranks can be launched, but only the sensible amount will be used. If the user specifies device indices, they will be load balanced; otherwise all visible devices within the node will be balanced. Only one rank per node sends the device string back for telemetry, the others send back an empty string (while the assembleGPUsString method is expecting a message from each rank in the world). If no valid CUDA devices are provided, an exception is raised. Device allocation is implemented in a static method so it can be tested programmatically, without launching the test N times with different MPI configurations (see the allocation sketch after this commit list).
…be called once per process
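As a rough illustration of the evenly(ish) split described in the device-allocation commit above, here is a minimal sketch under the assumption that GPUs are handed out contiguously to the ranks sharing a node. The function name and signature are hypothetical, not the PR's actual static method:

```c++
#include <algorithm>
#include <set>
#include <vector>

// Hypothetical stand-in for the static device-allocation method: split the GPUs
// visible on a node evenly(ish) and contiguously across the MPI ranks sharing it.
std::set<int> devicesForLocalRank(int local_rank, int local_world_size,
                                  const std::vector<int> &node_devices) {
    const int device_count = static_cast<int>(node_devices.size());
    // Each rank gets floor(D/R) devices; the first D%R ranks take one extra.
    const int base = device_count / local_world_size;
    const int extra = device_count % local_world_size;
    const int count = base + (local_rank < extra ? 1 : 0);
    const int offset = local_rank * base + std::min(local_rank, extra);
    std::set<int> assigned;
    for (int i = 0; i < count; ++i)
        assigned.insert(node_devices[offset + i]);
    // Empty when there are more ranks than GPUs: that rank idles and the caller
    // reports a warning, matching the behaviour described above.
    return assigned;
}

// e.g. 3 ranks sharing devices {0, 1}:
//   rank 0 -> {0}, rank 1 -> {1}, rank 2 -> {} (warning, does nothing)
```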
Force-pushed from 9a55497 to 9e2d0cd
The implementation of MPI Ensembles within this PR is designed for each `CUDAEnsemble` to have exclusive access to all the GPUs available to it (or specified with the `devices` config). Ideally a user will launch 1 MPI worker per node, however it could be 1 worker per GPU per node. It would be possible to use MPI shared-memory groups to identify workers on the same node and negotiate division of GPUs, and/or for some workers to become idle, however this has not been implemented.

Full notes of identified edge cases are in the below todo list.
- `CUDAEnsemble` now returns a map of `RunLog`, with those not handled empty of data (checking step counter > 0 is a hack; this presumably also affects failed runs under normal error modes), keyed on the index of the `RunPlan` (it was initially assumed that the vector would be parallel with the `RunPlanVector` used as input). [Agreed with Paul 2023-07-26]
  - Rank 0 gets all the logs, others get empty
  - All runners get all logs
- Catch and handle local MPI runs inside `CUDAEnsemble`? (we want users to avoid using multiple MPI runners that have access to the same GPU)
- Should we expose world_size/rank to `HostAPI`? (I don't think it's necessary)
- … (`CudaEnsemble::Config().mpi=false;`)
- Do we need to handle a race condition with the RTC cache?

Closes #1073
Closes #1114
Edit for visibility (by @ptheywood) - needs to be clear in the next release notes that this includes a breaking change to the return type of `CUDAEnsemble::getLogs` from a `std::vector` to a `std::map`.
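To make the breaking change above concrete, a minimal sketch of consuming the map-based return value. The key type, the `RunLog` accessor, and the mpirun flags shown are assumptions based on this description rather than confirmed details of the PR:

```c++
#include <iostream>
#include "flamegpu/flamegpu.h"

// Launched with one MPI worker per node, e.g. with OpenMPI:
//   mpirun -np <num_nodes> --map-by ppr:1:node ./mpi_ensemble_example
// (binary name and mapping flags are illustrative).
int main(int argc, const char **argv) {
    flamegpu::ModelDescription model("mpi_ensemble_example");
    model.newAgent("agent");  // trivial model purely for illustration

    flamegpu::RunPlanVector plans(model, 128);
    plans.setSteps(10);

    flamegpu::CUDAEnsemble ensemble(model, argc, argv);
    ensemble.simulate(plans);

    // Breaking change: getLogs() now returns a std::map keyed by the run's index
    // in the RunPlanVector, rather than a std::vector parallel to it. Under MPI,
    // entries for runs this rank did not execute may be empty.
    for (const auto &[run_index, log] : ensemble.getLogs()) {
        std::cout << "run " << run_index << ": "
                  << log.getStepLog().size() << " step log frames\n";
    }
    return 0;
}
```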