Make CPU-GPU memory copy highly asynchronous #1082
Merged
This PR adds the ability to make the CPU-to-GPU and GPU-to-CPU memory copies of the MPI communication data highly asynchronous. Up to one slice of computation can be overlapped with an active memory copy. This is achieved by switching the GPU stream, queuing up the memory copy on the side stream, and then switching the stream back again.
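Conceptually, the pattern looks like the following minimal sketch. This is not the PR's actual code: the stream, event, and buffer names (`compute_stream`, `copy_stream`, `copy_done`, `dev_halo`) are illustrative assumptions, and a pinned host buffer is assumed so the copy can truly overlap with compute.

```cuda
// sketch_async_copy.cu — illustrative only, not the PR's implementation.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void compute_slice(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // stand-in for one slice of real work
}

int main() {
    const int n = 1 << 20;
    const size_t nbytes = n * sizeof(float);

    float *dev_halo, *dev_interior, *host_buf;
    cudaMalloc(&dev_halo, nbytes);
    cudaMalloc(&dev_interior, nbytes);
    cudaMallocHost(&host_buf, nbytes);  // pinned memory: required for a truly async copy

    cudaStream_t compute_stream, copy_stream;
    cudaStreamCreate(&compute_stream);
    cudaStreamCreate(&copy_stream);

    cudaEvent_t copy_done;
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

    // "Switch the stream": queue the GPU-to-CPU copy of the MPI buffer
    // on the side stream, then record an event marking its completion.
    cudaMemcpyAsync(host_buf, dev_halo, nbytes, cudaMemcpyDeviceToHost, copy_stream);
    cudaEventRecord(copy_done, copy_stream);

    // Back on the main stream: one slice of computation overlaps the copy.
    compute_slice<<<(n + 255) / 256, 256, 0, compute_stream>>>(dev_interior, n);

    // Block only when the host buffer is actually needed, e.g. just before MPI_Isend.
    cudaEventSynchronize(copy_done);
    // MPI_Isend(host_buf, ...) would go here in the real code.

    cudaFreeHost(host_buf);
    cudaFree(dev_halo);
    cudaFree(dev_interior);
    printf("done\n");
    return 0;
}
```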
Test:
Note that even with the async memcopy, the buffer-on-CPU version is not as fast as buffer-on-GPU. This is because the MPI data transfer between ranks on the same node is quite slow, with a profile suggesting only about 5 GB/s. Furthermore, this transfer is performed inside MPI_Isend rather than after it or in MPI_Wait, resulting in blocking behavior (see the sketch below). Of the three HPC platforms available to me, this one had the least broken CPU-CPU and GPU-GPU MPI data transfer.
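For reference, a hedged sketch of the nonblocking send pattern the overlap relies on (buffer names and counts are illustrative). On a well-behaved MPI implementation, MPI_Isend returns immediately and the transfer completes in the background by MPI_Wait; on the platform described above, the intra-node transfer instead happens synchronously inside MPI_Isend, so the overlap window is lost.

```cpp
// mpi_overlap_sketch.cpp — illustrative only; build with mpicxx.
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double buf[1024] = {0};
    MPI_Request req;
    if (rank == 0 && size > 1) {
        // Expected: this call returns immediately and the transfer proceeds
        // in the background. Observed here: the transfer runs synchronously
        // inside MPI_Isend, so this call blocks.
        MPI_Isend(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        // ... overlap window: the next slice of computation should run here ...
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```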
Buffer on CPU, with async memcopy:
Buffer on CPU, no async memcopy:
Buffer on GPU: