
General way to test whether CUDA-aware MPI is available? #886

Closed
glwagner opened this issue Nov 12, 2024 · 7 comments

@glwagner Contributor

glwagner commented Nov 12, 2024

I think it would be useful to have a (more) general way to test whether CUDA-aware MPI is available (e.g., making has_cuda() work for more than just Open MPI). This would allow us to distinguish a configuration issue from other errors when complex applications fail on a cluster. I'm curious whether this is impossible or simply a lot of work (for example, if each MPI installation requires some bespoke method).

Perhaps a practical solution would involve an empirical test, i.e., running some simple piece of code that should work in most circumstances if CUDA-aware MPI is available, and reporting whether it errors or not.
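
For reference, the query that exists today is MPI.has_cuda(); as far as I know it is only meaningful with Open MPI, so a `false` result is inconclusive elsewhere. A minimal sketch of that check:

```julia
using MPI

MPI.Init()

# MPI.has_cuda() asks the underlying library whether it was built with
# CUDA support; as far as I know only Open MPI reports this reliably,
# so `false` here is inconclusive for other implementations.
if MPI.has_cuda()
    println("MPI reports CUDA support")
else
    println("MPI does not report CUDA support (may be inconclusive)")
end
```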

@luraess Contributor

luraess commented Nov 12, 2024

Currently there is https://juliaparallel.org/MPI.jl/stable/usage/#CUDA-aware-MPI-support and, IIRC, only Open MPI allows checking via MPI.has_cuda(). If you run the code snippet there, it will fail on non-CUDA-aware implementations.
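
The snippet there does roughly the following (a sketch from memory, not the verbatim docs code): pass CuArrays straight into MPI.Sendrecv! and see whether the ring exchange completes.

```julia
# Sketch of the docs-style check: pass GPU buffers (CuArrays) directly
# to MPI. On a non-CUDA-aware build this typically errors or crashes
# instead of completing the ring exchange.
using MPI, CUDA

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

dst = mod(rank + 1, nprocs)
src = mod(rank - 1, nprocs)

send_buf = CUDA.fill(Float64(rank), 4)  # device buffer
recv_buf = CUDA.zeros(Float64, 4)       # device buffer

MPI.Sendrecv!(send_buf, recv_buf, comm; dest=dst, source=src)
println("rank $rank received $(Array(recv_buf))")

MPI.Finalize()
```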

@glwagner Contributor Author

Yes, that is helpful! I guess I am wondering whether it makes sense to wrap this code:

https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2

in a function. We could then use it to provide more information, like "the all-to-all test fails, so you probably don't have CUDA-aware MPI".
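
Roughly something like this (the function name is made up), though a hard crash from a non-CUDA-aware library would of course not be caught by the try/catch:

```julia
# Hypothetical helper (not part of MPI.jl): wrap the GPU ring exchange
# in a try/catch and report the outcome. Note a non-CUDA-aware library
# may segfault instead of throwing, in which case this cannot help.
using MPI, CUDA

function probe_cuda_aware_mpi(comm = MPI.COMM_WORLD)
    rank   = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)
    dst = mod(rank + 1, nprocs)
    src = mod(rank - 1, nprocs)

    send_buf = CUDA.fill(Float64(rank), 4)
    recv_buf = CUDA.zeros(Float64, 4)
    try
        MPI.Sendrecv!(send_buf, recv_buf, comm; dest=dst, source=src)
        return true
    catch err
        @warn "GPU Sendrecv! failed; MPI may not be CUDA-aware" exception = err
        return false
    end
end
```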

@luraess Contributor

luraess commented Nov 12, 2024

That could be a helper tool. However, it may not fully cover the case where, e.g., you have a non-functioning GPU-aware implementation or installation.

@glwagner Contributor Author

Hmm, yes, if it is non-comprehensive then perhaps it belongs downstream instead of here. We can prototype it in Oceananigans.

So, to make sure I understand: are you saying that the test can pass even when the MPI implementation is non-functioning?

@luraess Contributor

luraess commented Nov 12, 2024

> the test can pass even when the MPI implementation is non-functioning?

It's the other way around: the test may fail even though CUDA-aware MPI is supported. But as a downstream check, it could indeed be useful.

@glwagner Contributor Author

glwagner commented Nov 12, 2024

At the very least, if MPI.Sendrecv fails we can conclude that Sendrecv does not work, right?

This seems like a good path for us: we simply run tiny tests of all the MPI.jl functions we need before doing something more complicated and expensive, and then throw a specific error reporting what worked and what didn't, to help users debug their configuration.
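
A rough sketch of what I have in mind (the function name is made up), checking the host path first and then the device path so the error message points at the right layer:

```julia
# Hypothetical downstream checks (e.g. in Oceananigans): run tiny
# versions of the MPI operations we rely on, and fail with a specific
# message if one of them doesn't work.
using MPI, CUDA

function check_communication(comm = MPI.COMM_WORLD)
    rank   = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)
    dst = mod(rank + 1, nprocs)
    src = mod(rank - 1, nprocs)

    # 1. Host buffers: verifies that MPI itself is functional.
    host_send = fill(Float64(rank), 2)
    host_recv = zeros(Float64, 2)
    try
        MPI.Sendrecv!(host_send, host_recv, comm; dest=dst, source=src)
    catch err
        error("Host MPI.Sendrecv! failed; check your MPI installation. Cause: $err")
    end

    # 2. Device buffers: exercises CUDA-aware transport (may crash
    #    rather than throw if the library is not CUDA-aware).
    dev_send = CUDA.fill(Float64(rank), 2)
    dev_recv = CUDA.zeros(Float64, 2)
    try
        MPI.Sendrecv!(dev_send, dev_recv, comm; dest=dst, source=src)
    catch err
        error("GPU MPI.Sendrecv! failed; MPI is probably not CUDA-aware. Cause: $err")
    end

    return nothing
end
```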

I recently ran into an issue where something didn't work because of incorrect library linking on a cluster (e.g., by the vendor that installed MPI). It took me almost a week to figure out what was wrong! So I'm searching for ways to speed this up for other systems and users; I suspect that MPI usage is going to increase quite a bit in the near future.

@glwagner Contributor Author

I'll close this, but feel free to re-open if you think users would benefit from helper functions implemented directly in MPI.jl or the CUDA extension.
