
Add MPI HALOEXCHANGE kernels #332

Merged · 98 commits merged into develop on Dec 21, 2023
Conversation

MrBurmark
Member

@MrBurmark MrBurmark commented Jun 9, 2023

Summary

Add halo-exchange kernels with MPI communication.

This PR adds halo-exchange kernels that perform actual MPI communication.
It includes MPI versions of the HALOEXCHANGE kernel and the HALOEXCHANGE_FUSED kernel.
This is a simple version of what is done in Comb.
The MPI_HALOEXCHANGE kernels do the same packing and unpacking as the non-MPI HALOEXCHANGE kernels, and they actually communicate the data via MPI. I made some changes to the non-MPI kernels to ensure consistency with the MPI kernels.
In all of the HALOEXCHANGE kernels each rank owns an NxNxN structured block. In the MPI versions the number of ranks is factored and the blocks of the MPI processes are connected into a larger 3D block. The number of processes on each side of this block can be set manually via --mpi_3d_division I J K. By default a division is created by factoring the number of MPI processes and trying to make a relatively cubic grid. If the number of MPI processes is prime, the grid will be like a pencil, and if it has only two prime factors it will be like a flat plane. Care should be taken when choosing the number of MPI processes and the MPI 3D division to ensure a realistic distribution of messages.
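For illustration, here is a minimal sketch of the kind of near-cubic factoring described above (the helper names `prime_factors` and `make_3d_division` are hypothetical and this is not the actual RAJAPerf code): factor the rank count into primes and greedily multiply the largest factors onto whichever dimension is currently smallest.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Hypothetical helper: return the prime factors of n.
std::vector<int> prime_factors(int n)
{
  std::vector<int> factors;
  for (int p = 2; p * p <= n; ++p) {
    while (n % p == 0) { factors.push_back(p); n /= p; }
  }
  if (n > 1) { factors.push_back(n); }
  return factors;
}

// Build an I x J x K division of num_ranks that is as close to cubic as the
// factorization allows: hand out the largest prime factors first, always to
// the currently smallest dimension.
std::array<int, 3> make_3d_division(int num_ranks)
{
  std::array<int, 3> division{1, 1, 1};
  std::vector<int> factors = prime_factors(num_ranks);
  std::sort(factors.rbegin(), factors.rend()); // largest first
  for (int f : factors) {
    *std::min_element(division.begin(), division.end()) *= f;
  }
  return division;
}
```

With this scheme 12 ranks gives a 3x2x2 division, a prime rank count gives a pencil, and a rank count with only two prime factors gives a flat plane, matching the behavior described above.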

  • This PR is a feature
  • It does the following:
    • Adds MPI communication kernels at my own request

MrBurmark added 20 commits May 31, 2023 14:03
Change corresponding pack and unpack extents to be on
opposite sides of the domain so they make sense with
periodic boundary conditions.
Also refactor extent generation to be easier to understand.
Only Base_Seq is implemented
Add support for separate buffers where, for example, you pack into
one buffer, copy that into the MPI buffer, and then call MPI,
in addition to a single-buffer mode where you pack directly into the
one buffer and then call MPI (see the sketch after this commit list).
Note that this is incorrect for periodic boundaries
Remove all variants of MPI_HALOEXCHANGE when the number of ranks is not
a perfect cube of an integer.
Restore periodic boundaries.
Note that this may result in ranks communicating mostly with themselves
if the number of MPI ranks does not have 3 prime factors.
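To make the buffer modes in the commits above concrete, here is a hedged sketch of a single neighbor exchange (the names `exchange_one_neighbor`, `pack_idx`, `unpack_idx`, and `separate_buffers` are invented for illustration; the actual kernels run the pack/unpack loops through RAJA variants and, in the FUSED case, fuse them):

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical sketch of one halo exchange with a single neighbor.
// pack_idx / unpack_idx are the (equal-length) index lists of the send and
// receive extents; the receive extent lies on the opposite side of the block.
void exchange_one_neighbor(double* var,
                           const std::vector<int>& pack_idx,
                           const std::vector<int>& unpack_idx,
                           int send_to_rank, int recv_from_rank,
                           bool separate_buffers)
{
  const int len = static_cast<int>(pack_idx.size());
  std::vector<double> pack_buf(len), recv_buf(len);

  // Pack: gather strided domain values into a contiguous buffer.
  for (int i = 0; i < len; ++i) { pack_buf[i] = var[pack_idx[i]]; }

  std::vector<double> send_buf;
  double* send_ptr = pack_buf.data();
  if (separate_buffers) {
    // Separate-buffer mode: copy the packed data into a dedicated MPI buffer
    // before handing it to MPI.
    send_buf = pack_buf;
    send_ptr = send_buf.data();
  }

  MPI_Request reqs[2];
  MPI_Irecv(recv_buf.data(), len, MPI_DOUBLE, recv_from_rank, 0,
            MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(send_ptr, len, MPI_DOUBLE, send_to_rank, 0,
            MPI_COMM_WORLD, &reqs[1]);
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  std::vector<double> unpack_src;
  double* unpack_ptr = recv_buf.data();
  if (separate_buffers) {
    // Mirror copy on the receive side before unpacking.
    unpack_src = recv_buf;
    unpack_ptr = unpack_src.data();
  }

  // Unpack: scatter the received values into the ghost-cell extent.
  for (int i = 0; i < len; ++i) { var[unpack_idx[i]] = unpack_ptr[i]; }
}
```

In single-buffer mode MPI is called directly on the packed buffer; in separate-buffer mode the packed data is first copied into a dedicated MPI buffer, modeling, for example, packing in device memory and staging through a host buffer before the MPI call. The unpack extent sits on the opposite side of the block from the pack extent, which is what makes the exchange consistent with periodic boundaries.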
@MrBurmark MrBurmark requested review from rhornung67 and pearce8 June 9, 2023 20:17
@MrBurmark
Member Author

Do we do any MPI builds in the CI?

@rhornung67
Member

Not yet. We could add in this PR or wait until CI stuff is synced up between projects. We're in the process of moving to a new Spack. Then, we can switch over to newer compiler versions.

@MrBurmark
Member Author

To keep things simple let's worry about it in a different PR.

@rhornung67 rhornung67 mentioned this pull request Jul 10, 2023
@MrBurmark MrBurmark force-pushed the feature/burmark1/MPI_HALOEXCHANGE branch from d3700a0 to 2b5c5a9 on October 9, 2023 19:45
@MrBurmark
Member Author

@adrienbernede @rhornung67 Do either of you know if other projects with MPI tests actually enable MPI in CI testing?

@adrienbernede
Member

@MrBurmark Serac, MFEM, Umpire do. Now, Umpire does not test with MPI on every machine...

@MrBurmark
Member Author

Ok, I'll try adding MPI tests on every machine.

Move ruby mpi CI job into list of extras
@MrBurmark
Member Author

@adrienbernede I am seeing failures on lassen due to what appears to be invalid arguments to mpirun. Have you seen this before?

      Start  3: blt_mpi_smoke
 3/10 Test  #3: blt_mpi_smoke ....................***Failed    0.10 sec
/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/bin/mpirun: Error: unknown option "-np"

@adrienbernede
Member

@MrBurmark, yes, it means that MPIEXEC_NUMPROC_FLAG is not set correctly...
I can have a look tomorrow.


@pearce8 pearce8 left a comment


Thanks for working through the Caliper instrumentation, all looks great now.

* Allow controlled overlapping in flux allocations

* Update to RAJA@develop (grab improved MPI test handling in CI)

* From RAJA: From RSC: fix MPIEXEC flag with spectrum-mpi

* Fix typo

* From RAJA: From RSC: Import the full logic instead of overriding

* From RAJA: From RSC: Fix syntax

* From RAJA: From RSC: Fix missing import

* Remove extra layer of allocation nesting on lassen

* Passing flag to sub allocation

* Attempt to fix quotes usage

* Move to RADIUSS Shared CI 2023.12.0 release branch

* Simplify build_and_test script further, remove alloc option

* From RAJA: From RSC: Do not change flag for spectrum-mpi (fixed error elsewhere) but change it for cray-mpich

* From RAJA: From RSC: Fix typo

* From RAJA: From RSC: Fix syntax
@adrienbernede
Member

@MrBurmark @rhornung67 I solved both the lassen and tioga issues. The CI should pass once corona is back.

Comment on lines +230 to +235
echo "~~~~~~~~~ Run Command: ~~~~~~~~~~~~~~~~~~~~~"
echo "ctest --output-on-failure -T test 2>&1 | tee tests_output.txt"
echo "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~"
date
ctest --output-on-failure -T test 2>&1 | tee tests_output.txt
date
Member


@MrBurmark I don’t know what --smpiargs="-disable_gpu_hooks" is used for. Tests are running without it. Attempts to define it in lalloc failed.

Member Author


That is something you need on lassen if you're not running through lrun/mpi.

@MrBurmark
Member Author

I'm working on the merge conflicts and I'll push an updated branch shortly.

@adrienbernede
Member

I'm finally seeing green!
Waiting for LLNL/RAJA#1584 to be merged.

@MrBurmark MrBurmark force-pushed the feature/burmark1/MPI_HALOEXCHANGE branch from b7de33a to ae949e0 on December 21, 2023 18:41
@MrBurmark MrBurmark merged commit 6ef8308 into develop Dec 21, 2023
19 checks passed
@MrBurmark MrBurmark deleted the feature/burmark1/MPI_HALOEXCHANGE branch December 21, 2023 21:24