Add IReduce! and IAllreduce! #827

Open · wants to merge 4 commits into master from nonblocking_reduction

Conversation

@Keluaa (Contributor) commented Mar 13, 2024

Added basic nonblocking reductions, along with some tests.
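
A minimal usage sketch, assuming the new functions mirror the existing Allreduce!/Reduce! signatures and return an MPI.Request (buffer names and the operator are illustrative):

using MPI
MPI.Init()

comm = MPI.COMM_WORLD
sendbuf = fill(Float64(MPI.Comm_rank(comm)), 4)
recvbuf = similar(sendbuf)

req = MPI.IAllreduce!(sendbuf, recvbuf, +, comm)  # start the nonblocking all-reduce
# ... overlap other work with the reduction here ...
MPI.Wait(req)                                     # recvbuf now holds the element-wise sums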

@vchuravy force-pushed the nonblocking_reduction branch from 36530c4 to fafe00e on June 23, 2024 21:49
@Keluaa (Contributor, Author) commented Jun 24, 2024

Mmm... only the CUDA tests fail. I feared this was a Julia-side issue, but no, it is simply that MPI_Iallreduce is not CUDA-aware in OpenMPI.

This is a known issue (open-mpi/ompi#9845), and it also happens with MPI_Ireduce (open-mpi/ompi#12045).
I was also able to reproduce it in C.

It is very surprising to me that ROCm support apparently covers all non-blocking ops, while CUDA support does not.

What would be the best course of action? Merge anyway and let users stumble upon an unhelpful segfault? Or would a warning (when OpenMPI is used and CUDA is loaded) be enough?

@vchuravy (Member)

Yeah, we don't currently have a good mechanism to declare which operations can and cannot take GPU memory.
It seems even worse for OpenMPI, since the set of supported operations depends on whether or not UCX is used.

We certainly need to branch in the tests, but I don't think we have prior art for this.

@simonbyrne any ideas?

@simonbyrne (Member) commented Jun 25, 2024

Unfortunately it is probably implementation (and configuration) dependent, so I don't think we can provide a complete solution. My best suggestion would be to make it so the test suite can soft fail and report which operations are supported.

If you want something easy that does work, the simplest option is to use the regular blocking operation spawned on a separate thread:

task = Threads.@spawn MPI.Allreduce(....)
# other work
wait(task)

If your other work involves MPI ops, you will also need to call MPI.Init(threadlevel=:multiple).
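
A slightly fuller sketch of this workaround (variable names are illustrative; Julia must be started with more than one thread):

using MPI
MPI.Init(threadlevel=:multiple)  # needed if the main thread also issues MPI calls

comm = MPI.COMM_WORLD
buf = rand(Float64, 16)

task = Threads.@spawn MPI.Allreduce(buf, +, comm)  # blocking call on a worker thread
# ... other work (possibly including MPI calls) on the main thread ...
result = fetch(task)                               # the reduced array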

@PetrKryslUCSD mentioned this pull request on Nov 18, 2024
@Keluaa force-pushed the nonblocking_reduction branch from fafe00e to 311629f on November 18, 2024 13:34
@Keluaa (Contributor, Author) commented Nov 18, 2024

Thank you @PetrKryslUCSD for reminding me about this PR.

My original solution to detect support for IAllreduce was unsatisfactory, so I put it aside and then forgot about it.

The main problem is that MPI raises a SIGSEGV when IAllreduce/IReduce is called with a CUDA array, so we cannot use threads or other Julia tricks to detect the issue.
I also don't know if it is possible to disable the tests when OpenMPI + UCX + CUDA is used, as I do not know how to detect this combination.

My proposed solution is to call those problematic functions at the beginning of the test suite in a separate process, and to set env vars appropriately when they work.
The rest of the test suite can then rely on those env vars from anywhere.

Since some RMA functions and other non-blocking collectives suffer from the same problem, this approach could easily be extended to detect whether they work, should we ever add them to MPI.jl.
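
A rough sketch of what such a probe could look like (the env var name and the IAllreduce! signature here are assumptions, not necessarily what this PR implements):

probe = """
using MPI, CUDA
MPI.Init()
buf = CUDA.ones(Float64, 4)
MPI.Wait(MPI.IAllreduce!(buf, +, MPI.COMM_WORLD))  # segfaults if the op is not CUDA-aware
"""
# Run the probe in a throwaway single-rank process so a SIGSEGV cannot kill the test suite.
ok = success(`$(Base.julia_cmd()) -e $probe`)
ENV["JULIA_MPI_TEST_IALLREDUCE_CUDA"] = string(ok)  # hypothetical flag checked by the GPU tests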

@Keluaa (Contributor, Author) commented Nov 18, 2024

The CUDA and AMDGPU tests on Julia 1.10 are all passing; however, the detection marks IAllreduce and IReduce as non-functional for both, which seems correct judging from the segfault messages, even though it was working with AMDGPU a few months ago.
On my local cluster, with ROCm 6.0.2, OpenMPI 4.1.2 and an MI250 GPU, those two functions work correctly.
I don't know where the issue could come from...

The tests failing with Julia nightly on CPU and for Julia 1.6-1.8 with CUDA seem unrelated to this PR.

@Keluaa (Contributor, Author) commented Nov 21, 2024

I just realized that the AMDGPU tests have JULIA_MPI_TEST_EXCLUDE set to "test_allreduce.jl,test_reduce.jl,test_scan.jl", so those tests simply never ran in the first place...

Is this related to the fact that those operations are unsupported on ROCm? If so, we could detect this with my method and exclude those tests automatically.
