CUDA-aware MPI build on fresh Ubuntu 24.04 LTS, MPIX_Query_cuda_support() returns zero
#13130
I am unable to replicate: as long as I run your example on a machine with CUDA devices attached, it works just as expected. If I run on a machine without GPUs, then it fails at runtime. Can you run the following and post the output?
Thanks for responding so quickly --- here's the output you asked for:
I've also tried to add
I think these lines from
It seems that most of the components can find CUDA. Finally, both machines were running with CUDA GPUs inside. Here's some relevant output for that:
Admittedly, there's a mismatch in CUDA versions between the driver and compiler. Not sure if that's the issue.
You are correct: most of the components built as DSOs found CUDA. However, coll:cuda decided to be built statically and failed. Something is weird with your build, because in the end the CUDA accelerator module has not been built; that's why you don't see it in the output I asked you for, and why it fails when you try to force-load it. If you go into the build directory, then:
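Something along these lines should show whether the component library actually got produced (assuming a v5.x source tree, where the accelerator framework lives under opal/mca/accelerator):

cd opal/mca/accelerator/cuda
make V=1                              # rebuild verbosely to surface any compile/link errors
find . -name 'mca_accelerator_cuda*'  # the DSO should appear under .libs if it built dynamically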
There is a Makefile. Here's the output now:
The CUDA accelerator component is built, but not loaded.
FYI I'm still getting the same behaviour from the compile/run commands for the test program.
Everything seems to be in place, but the CUDA accelerator component is not loaded.
Everything looks normal. Let's make sure launching an app does not screw up the environment:
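For example, comparing the shell environment with what processes see under the launcher (the file names here are just illustrative):

env | sort > env.shell
/opt/openmpi/bin/mpirun -np 1 env | sort > env.mpirun
diff env.shell env.mpirun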
Btw I've not added the OpenMPI install directory to my PATH.
That might be a reason, though not PATH but LD_LIBRARY_PATH. Most of the components are built statically into libmpi.so, with a few exceptions, and the CUDA-based components are among those exceptions. But I'm slightly skeptical, because ompi_info managed to find the CUDA shared library, and all the processes are local, so they should inherit the mpirun environment. But just in case, you can try:

export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
/opt/openmpi/bin/mpirun -np 1 -x LD_LIBRARY_PATH ./mpi_check

I'm running out of ideas, unfortunately.
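One more thing that might be worth trying: force the CUDA accelerator and ask the MCA base to report component load errors (a sketch; both parameters exist in 5.x, but their exact behaviour varies a little between releases):

/opt/openmpi/bin/mpirun -np 1 \
    --mca accelerator cuda \
    --mca mca_base_component_show_load_errors all \
    ./mpi_check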
Unfortunately still the same behaviour :( Thank you so much for taking the time. You said you were unable to reproduce this error --- could you tell me what setup you used on your end to produce a working CUDA-aware OpenMPI build on Ubuntu 24.04 LTS? If there's a Docker container that has a working installation that I could run my code in, that would work too. I'm also really puzzled that the output of simply running ./mpi_check without mpirun is different.
#12334 (comment) might be related. I encountered the same issue, and it was resolved by adding the configure flag mentioned in that comment.
Environment: OpenMPI v5.0.3, Ubuntu 20.04.6 LTS.
Hey thanks for your advice! I tried this, but still no luck :/ My exact installation instructions this time are below (still on Ubuntu 24.04 LTS, but this time with OpenMPI v5.0.3):

# clean up old install
cd /path/to/openmpi-<version>/build
sudo make uninstall
make clean
cd ../..
rm -rf openmpi-<version>
# also ensure there's no existing openmpi install
sudo apt-get remove --purge 'openmpi*'  # quoted so the shell doesn't expand the glob locally
which mpicc # should return nothing
which mpirun # should return nothing
ls /usr/local/lib/openmpi/ # should return nothing, or complain about nonexistent dir
ls /usr/local/lib # should have nothing to do with OpenMPI
# install CUDA toolkit from https://developer.nvidia.com/cuda-downloads
nvidia-smi # should print out some basic info about the CUDA driver. In my case: NVIDIA-SMI 550.144.03, CUDA Version: 12.4
nvcc --version # should print out basic info about NVIDIA compiler. In my case: Cuda compilation tools, release 12.8, V12.8.9
# set up env variables appropriately
echo "export LD_LIBRARY_PATH=/usr/local/lib:/usr/local/lib/openmpi:\$LD_LIBRARY_PATH" >> ~/.bashrc
source ~/.bashrc
# just in case, reboot
# check the paths are ok:
echo $LD_LIBRARY_PATH
# output for me: /usr/local/lib:/usr/local/lib/openmpi:/usr/local/cuda/lib64
# rebuild from scratch
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz
tar xf openmpi-5.0.3.tar.gz
cd openmpi-5.0.3
mkdir build
cd build
../configure --with-cuda=/usr/local/cuda --with-cuda-libdir=/usr/local/cuda/lib64/stubs | tee config.out
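# (optional sanity check, not part of the original steps: look for CUDA detection
#  in the configure log; the exact wording of these lines varies between releases)
grep -i cuda config.out | head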
make -j$(nproc) all | tee make.out
sudo make install | tee install.out
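# (optional sanity check, not part of the original steps: with the default prefix,
#  the CUDA accelerator DSO, if it was built dynamically, should land here)
ls /usr/local/lib/openmpi/ | grep -i cuda
ompi_info | grep -i accelerator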
# just in case, reboot (again)
vim mpi_check.c
# paste the contents of `mpi_check.c` from above
mpicc mpi_check.c -o mpi_check
mpirun -n 1 ./mpi_check # passes compile time check, fails run time check
./mpi_check # passes both checks

I've also tried this with OpenMPI v5.0.7 and v5.0.6, with and without an install prefix, and with and without the configure flag suggested above. Maybe I'm not removing old installs properly? If anyone has any other ideas, or a Docker container with a working installation, I would really appreciate it :)
For anyone stumbling across this with the same issue, I've made a Dockerfile which appears to do the trick. It does mean that you will have to run your code in a container (requiring the NVIDIA Container Toolkit to be able to attach GPUs, etc.), but at least it builds CUDA-aware MPI correctly.

FROM nvidia/cuda:12.8.1-devel-ubuntu22.04
# Prevent interactive prompts during package installation
ENV DEBIAN_FRONTEND=noninteractive
ENV PATH=${PATH}:/usr/local/cuda/bin
# You seemingly shouldn't need to install CUDA in a CUDA base image, but without it
# `libcuda.so.1` is missing: the devel image ships only the toolkit, and the driver
# library is normally injected by the NVIDIA Container Toolkit at run time, not build time.
RUN apt-get update && apt-get install -y \
build-essential \
wget \
git \
python3 \
python3-pip \
pkg-config \
libevent-dev \
file \
cuda \
&& rm -rf /var/lib/apt/lists/*
# Install OpenMPI with CUDA support
RUN wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.3.tar.gz && \
tar xzf openmpi-5.0.3.tar.gz && \
cd openmpi-5.0.3 && \
./configure --prefix=/usr/local \
--with-cuda=/usr/local/cuda \
--with-cuda-libdir=/usr/local/cuda/lib64/stubs && \
make -j$(nproc) && \
make install && \
ldconfig && \
cd .. && \
rm -rf openmpi-5.0.3 openmpi-5.0.3.tar.gz
# Set up environment variables for OpenMPI
ENV PATH=/usr/local/bin:$PATH
ENV LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
ENV OMPI_DIR=/usr/local
ENV OMPI_VERSION=5.0.3
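# mpirun refuses to run as root by default; these overrides allow it inside the container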
ENV OMPI_ALLOW_RUN_AS_ROOT=1
ENV OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1
# Create a test directory
WORKDIR /mpi_check
COPY mpi_check.c .
# Compile the test program
RUN mpicc mpi_check.c -o mpi_check
# Add a simple test to verify the installation
RUN which mpirun && \
mpirun --version && \
ldd $(which mpirun) && \
file $(which mpirun)
# Set the entrypoint
ENTRYPOINT ["/bin/bash"]

(Edit: This uses Ubuntu 22.04 + MPI v5.0.3, but I've since tried this with Ubuntu 24.04 + MPI v5.0.7 and it also works fine.)

Installation steps:
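Roughly (the image tag here is made up; the host needs the NVIDIA Container Toolkit for --gpus to work):

docker build -t cuda-aware-mpi .
docker run --rm -it --gpus all cuda-aware-mpi
# then, inside the container:
mpirun -n 1 ./mpi_check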
Result:
I have no idea why this works, but it does. I hope this helps someone else. The Ubuntu 24.04 build really ought to be fixed, but in the meantime, use this workaround.
@niklebedenko once you have manually built and installed the
At first glance it does not make sense why the Also, since we are all running out of ideas, can you please confirm
works as expected (e.g. the same output as when run without it).
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.7
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Obtained from https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.7.tar.gz
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
N/A
Please describe the system on which you are running
Details of the problem
I'm really struggling to run CUDA-aware MPI on just one node. I want to do this so that I can test my code locally before deploying to a cluster. I've reproduced this on a fresh install of Ubuntu 24.04 on two different machines.
Here are my install steps:
Now, I build a very simple test program:
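A minimal test program in this spirit (the compile-time macro and the MPIX_Query_cuda_support() run-time query are Open MPI's standard mpi-ext.h extensions; my exact listing may have differed slightly):

cat > mpi_check.c <<'EOF'
#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   /* Open MPI extensions: defines MPIX_CUDA_AWARE_SUPPORT */
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* Compile-time check: was the MPI library we compiled against built with CUDA support? */
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("compile time check: CUDA-aware support is present\n");
#else
    printf("compile time check: CUDA-aware support is NOT present\n");
#endif

    /* Run-time check: is CUDA support actually available in this run? */
#if defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("run time check: MPIX_Query_cuda_support() = %d\n",
           MPIX_Query_cuda_support());
#endif

    MPI_Finalize();
    return 0;
}
EOF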
This was built with:
Then, we get this output:
However, if I just run ./mpi_check, i.e. no mpirun, I get this output:

There are no other MPI installations; this was reproduced on two independent machines.
Perhaps I'm missing a step, or missing some configuration, but I've tried lots of variations of each of the above commands to no avail, and (I think?) I've followed the install instructions in the documentation correctly. So I believe it is a bug.
If I'm missing something, please let me know. Also please let me know if you'd like the
config.out
andmake.out
log files.The text was updated successfully, but these errors were encountered: