
General way to test whether CUDA-aware MPI is available? #886

Closed
glwagner opened this issue Nov 12, 2024 · 7 comments

@glwagner Contributor

glwagner commented Nov 12, 2024

I think it would be useful to have a (more) general way to test whether CUDA-aware MPI is available (e.g., making has_cuda() work for more than just Open MPI). This would allow us to distinguish a configuration issue from other errors when complex applications fail on a cluster. I'm curious whether this is impossible or simply a lot of work (for example, if each MPI installation requires some bespoke method).

Perhaps a practical solution would involve an empirical test, i.e., running some simple piece of code that should work in most circumstances if CUDA-aware MPI is available, and reporting whether it errors or not.
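
For reference, the query that exists today is MPI.has_cuda(); as far as I know it is only meaningful with Open MPI, so a `false` result is inconclusive elsewhere. A minimal sketch of that check:

```julia
using MPI

MPI.Init()

# MPI.has_cuda() asks the underlying library whether it was built with
# CUDA support; as far as I know only Open MPI reports this reliably,
# so `false` here is inconclusive for other implementations.
if MPI.has_cuda()
    println("MPI reports CUDA support")
else
    println("MPI does not report CUDA support (may be inconclusive)")
end
```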

@luraess Contributor

luraess commented Nov 12, 2024

Currently there is https://juliaparallel.org/MPI.jl/stable/usage/#CUDA-aware-MPI-support and, IIRC, only Open MPI allows checking via MPI.has_cuda(). If you run the code snippet there, it will fail on non-CUDA-aware implementations.
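
The snippet there does roughly the following (a sketch from memory, not the verbatim docs code): pass CuArrays straight into MPI.Sendrecv! and see whether the ring exchange completes.

```julia
# Sketch of the docs-style check: pass GPU buffers (CuArrays) directly
# to MPI. On a non-CUDA-aware build this typically errors or crashes
# instead of completing the ring exchange.
using MPI, CUDA

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

dst = mod(rank + 1, nprocs)
src = mod(rank - 1, nprocs)

send_buf = CUDA.fill(Float64(rank), 4)  # device buffer
recv_buf = CUDA.zeros(Float64, 4)       # device buffer

MPI.Sendrecv!(send_buf, recv_buf, comm; dest=dst, source=src)
println("rank $rank received $(Array(recv_buf))")

MPI.Finalize()
```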

@glwagner Contributor Author

Yes, that is helpful! I guess I am wondering whether it makes sense to wrap this code:

https://gist.github.com/luraess/0063e90cb08eb2208b7fe204bbd90ed2

in a function. We could then use it to provide more information, like "the all-to-all test fails, so you probably don't have CUDA-aware MPI".
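
Roughly something like this (the function name is made up), though a hard crash from a non-CUDA-aware library would of course not be caught by the try/catch:

```julia
# Hypothetical helper (not part of MPI.jl): wrap the GPU ring exchange
# in a try/catch and report the outcome. Note a non-CUDA-aware library
# may segfault instead of throwing, in which case this cannot help.
using MPI, CUDA

function probe_cuda_aware_mpi(comm = MPI.COMM_WORLD)
    rank   = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)
    dst = mod(rank + 1, nprocs)
    src = mod(rank - 1, nprocs)

    send_buf = CUDA.fill(Float64(rank), 4)
    recv_buf = CUDA.zeros(Float64, 4)
    try
        MPI.Sendrecv!(send_buf, recv_buf, comm; dest=dst, source=src)
        return true
    catch err
        @warn "GPU Sendrecv! failed; MPI may not be CUDA-aware" exception = err
        return false
    end
end
```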

@luraess Contributor

luraess commented Nov 12, 2024

That could be a helper tool. However, it may not fully cover the case where, e.g., you have a non-functioning GPU-aware implementation or installation.

@glwagner Contributor Author

Hmm, yes, if it is non-comprehensive then perhaps it belongs downstream instead of here. We can prototype it in Oceananigans.

So, to make sure I understand: are you saying that the test can pass even when the MPI implementation is non-functioning?

@luraess Contributor

luraess commented Nov 12, 2024

> the test can pass even when the MPI implementation is non-functioning?

It's the other way around: the test may fail even though CUDA-aware MPI is supported. But as a downstream check, it could indeed be useful.

@glwagner Contributor Author

glwagner commented Nov 12, 2024

At the very least, if MPI.Sendrecv fails we can conclude that Sendrecv does not work, right?

This seems like a good path for us: we simply run tiny tests of all the MPI.jl functions we need before doing something more complicated and expensive, and then throw a specific error reporting what worked and what didn't, to help users debug their configuration.
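
A rough sketch of what I have in mind (the function name is made up), checking the host path first and then the device path so the error message points at the right layer:

```julia
# Hypothetical downstream checks (e.g. in Oceananigans): run tiny
# versions of the MPI operations we rely on, and fail with a specific
# message if one of them doesn't work.
using MPI, CUDA

function check_communication(comm = MPI.COMM_WORLD)
    rank   = MPI.Comm_rank(comm)
    nprocs = MPI.Comm_size(comm)
    dst = mod(rank + 1, nprocs)
    src = mod(rank - 1, nprocs)

    # 1. Host buffers: verifies that MPI itself is functional.
    host_send = fill(Float64(rank), 2)
    host_recv = zeros(Float64, 2)
    try
        MPI.Sendrecv!(host_send, host_recv, comm; dest=dst, source=src)
    catch err
        error("Host MPI.Sendrecv! failed; check your MPI installation. Cause: $err")
    end

    # 2. Device buffers: exercises CUDA-aware transport (may crash
    #    rather than throw if the library is not CUDA-aware).
    dev_send = CUDA.fill(Float64(rank), 2)
    dev_recv = CUDA.zeros(Float64, 2)
    try
        MPI.Sendrecv!(dev_send, dev_recv, comm; dest=dst, source=src)
    catch err
        error("GPU MPI.Sendrecv! failed; MPI is probably not CUDA-aware. Cause: $err")
    end

    return nothing
end
```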

I recently ran into an issue where something didn't work because of incorrect library linking on a cluster (e.g., by the vendor that installed MPI). It took me almost a week to figure out what was wrong! So I'm searching for ways to speed this up for other systems and users; I suspect that MPI usage is going to increase quite a bit in the near future.

@glwagner Contributor Author

I'll close this, but feel free to re-open if you think users would benefit from helper functions implemented directly in MPI.jl or the CUDA extension.
