CUDA-aware Ireduce and Iallreduce operations for GPU tensors segfault #9845

Open

jmerizia opened this issue Jan 7, 2022 · 11 comments

jmerizia commented Jan 7, 2022

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2, v4.1.1, v4.1.0, and v4.0.7 tested

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

n/a

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04
  • Computer hardware: AWS p2.xlarge (Nvidia K80 GPU)
  • Network type: n/a (single node)

Details of the problem

When calling either Ireduce or Iallreduce on PyTorch GPU tensors, a segfault occurs. I haven't exhaustively tested all of the ops, but Reduce, Allreduce, Isend/Irecv, and Ibcast all work when tested the same way. I haven't tested CuPy arrays, but that might be worthwhile (a sketch of a CuPy variant follows the script below); Numba GPU arrays are also affected. This behavior was discovered by @leofang in mpi4py/mpi4py#164 (comment) while testing mpi4py.

Here is a minimal script that can be used to demonstrate this behavior. The errors are only present when running on GPU:

# mpirun -np 2 python repro.py gpu Ireduce
from mpi4py import MPI
import torch
import sys

if len(sys.argv) < 3:
    print('Usage: python repro.py [cpu|gpu] [MPI function to test]')
    sys.exit(1)

use_gpu = sys.argv[1] == 'gpu'
func_name = sys.argv[2]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
if use_gpu:
    device = torch.device('cuda:' + str(rank % torch.cuda.device_count()))
else:
    device = torch.device('cpu')

def test_Iallreduce():
    sendbuf = torch.ones(1, device=device)
    recvbuf = torch.empty_like(sendbuf)
    torch.cuda.synchronize()
    req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    assert recvbuf[0] == size

def test_Ireduce():
    buf = torch.ones(1, device=device)
    if rank == 0:
        sendbuf = MPI.IN_PLACE
        recvbuf = buf
    else:
        sendbuf = buf
        recvbuf = None
    torch.cuda.synchronize()
    req = comm.Ireduce(sendbuf, recvbuf, root=0, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    if rank == 0:
        assert buf[0] == size

eval('test_' + func_name + '()')
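
For reference, a CuPy version of the Iallreduce test (untested on my end; it assumes CuPy is installed and relies on mpi4py's CUDA array support) would look roughly like this:

# Hypothetical CuPy variant of test_Iallreduce (untested; assumes cupy is installed)
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
size = comm.Get_size()

sendbuf = cp.ones(1, dtype=cp.float32)       # device buffer, like the torch GPU tensor above
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()   # make sure the buffers are ready before MPI touches them
req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)
req.wait()
assert recvbuf[0] == size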

Software/Hardware Versions:

  • OpenMPI 4.1.2, 4.1.1, 4.1.0, and 4.0.7 (built w/ --with-cuda flag)
  • mpi4py 3.1.3 (built against above MPI version)
  • CUDA 11.0
  • Python 3.6 (also tested under 3.8)
  • Nvidia K80 GPU (also tested with V100)
  • OS Ubuntu 18.04 (also tested in containerized environment)
  • torch 1.10.1 (w/ GPU support)

You can reproduce my environment setup with the following commands:

wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.2.tar.gz
tar xvf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2
./configure --with-cuda --prefix=/opt/openmpi-4.1.2
sudo make -j4 all install
export PATH=/opt/openmpi-4.1.2/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-4.1.2/lib:$LD_LIBRARY_PATH
env MPICC=/opt/openmpi-4.1.2/bin/mpicc pip install mpi4py
pip install torch numpy
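
As a quick sanity check (not part of the original setup steps), you can confirm from Python that mpi4py picked up the intended Open MPI build:

# Sanity check (my addition): confirm which MPI library mpi4py is actually using.
from mpi4py import MPI
print(MPI.Get_library_version())   # should mention "Open MPI v4.1.2" for the build above
print(MPI.Get_version())           # MPI standard version, e.g. (3, 1)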

Here is the error message from running Ireduce:

[<host>:25864] *** Process received signal ***
[<host>:25864] Signal: Segmentation fault (11)
[<host>:25864] Signal code: Invalid permissions (2)
[<host>:25864] Failing at address: 0x1201220000
[<host>:25864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f00efcf3040]
[<host>:25864] [ 1] /opt/openmpi-4.1.2/lib/openmpi/mca_op_avx.so(+0xc079)[0x7f00e41c0079]
[<host>:25864] [ 2] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(+0x7385)[0x7f00d3330385]
[<host>:25864] [ 3] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1f3)[0x7f00d3330033]
[<host>:25864] [ 4] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x8e)[0x7f00d332e84e]
[<host>:25864] [ 5] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f00edefba3c]
[<host>:25864] [ 6] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x7f00edf025a5]
[<host>:25864] [ 7] /opt/openmpi-4.1.2/lib/libmpi.so.40(ompi_request_default_wait+0x1f9)[0x7f00ee4eafa9]
[<host>:25864] [ 8] /opt/openmpi-4.1.2/lib/libmpi.so.40(PMPI_Wait+0x52)[0x7f00ee532e02]
[<host>:25864] [ 9] /home/ubuntu/venv/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0xa81e2)[0x7f00ee8911e2]
[<host>:25864] [10] python[0x50a865]
[<host>:25864] [11] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [12] python[0x509989]
[<host>:25864] [13] python[0x50a6bd]
[<host>:25864] [14] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [15] python[0x507f94]
[<host>:25864] [16] python(PyRun_StringFlags+0xaf)[0x63500f]
[<host>:25864] [17] python[0x600911]
[<host>:25864] [18] python[0x50a4ef]
[<host>:25864] [19] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [20] python[0x507f94]
[<host>:25864] [21] python(PyEval_EvalCode+0x23)[0x50b0d3]
[<host>:25864] [22] python[0x634dc2]
[<host>:25864] [23] python(PyRun_FileExFlags+0x97)[0x634e77]
[<host>:25864] [24] python(PyRun_SimpleFileExFlags+0x17f)[0x63862f]
[<host>:25864] [25] python(Py_Main+0x591)[0x6391d1]
[<host>:25864] [26] python(main+0xe0)[0x4b0d30]
[<host>:25864] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f00efcd5bf7]
[<host>:25864] [28] python(_start+0x2a)[0x5b2a5a]
[<host>:25864] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node <host> exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I appreciate any guidance!

@jsquyres
Member

@open-mpi/cuda Can someone look at this issue? Thanks.

@leofang

leofang commented Jan 26, 2022

@Akshay-Venkatesh @bureddy mentioned offline that the cuda collective component doesn't provide an implementation for Ireduce/Iallreduce, nor does hcoll.

@jsquyres
Member

Does this same problem occur with the equivalent test program written in C?

@jorab

jorab commented Mar 11, 2022

Well, I don't know about an equivalent program in C, but it does happen for the (presumably) analogous programs in C++. The OSU micro-benchmark tests for Ireduce/Iallreduce (osu_ireduce and osu_iallreduce) break for me in a very similar way with both Open MPI 4.0.7 and 4.1.2 on RHEL 8, compiled with the system GCC. All other collective tests in the OSU suite work when run on NVIDIA SuperPOD GPUs with the CUDA collective component. Open MPI was configured with '--with-cuda=/path/to/CUDA' and produced the following output when segfaulting (osu_iallreduce only, with OMPI_MCA_coll_base_verbose=80 set):

$ srun --mpi=pmix --gpu-bind map_gpu:0,1,2,3,4,5,6,7 -m cyclic:block --cpu-bind v,map_cpu:$(seq -s, 0 16 127) $OSU_ROOT/get_local_rank $OSU_ROOT/mpi/collective/osu_iallreduce -i 100 -d cuda -c
... < snip > ...
[node052:2141894] coll:base:comm_select: component not available: sm
[node052:2141894] coll:base:comm_select: component disqualified: sm (priority -1 < 0)
[node052:2141894] coll:base:comm_select: component not available: sync
[node052:2141894] coll:base:comm_select: component disqualified: sync (priority -1 < 0)
[node052:2141894] coll:base:comm_select: component not available: tuned
[node052:2141894] coll:base:comm_select: component disqualified: tuned (priority -1 < 0)
[node052:2141894] coll:base:comm_select: component available: cuda, priority: 78
[node052:2141894] coll:base:comm_select: selecting       basic, priority  10, Enabled
[node052:2141894] coll:base:comm_select: selecting      libnbc, priority  10, Enabled
[node052:2141894] coll:base:comm_select: selecting        self, priority  75, Enabled
[node052:2141894] coll:base:comm_select: selecting        cuda, priority  78, Enabled
[node052:2141897:0:2141897] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141896:0:2141896] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141895:0:2141895] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141894:0:2141894] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141892:0:2141892] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141891:0:2141891] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141893:0:2141893] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141890:0:2141890] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
==== backtrace (tid:2141897) ====
 0 0x0000000000012c20 .annobin_sigaction.c()  sigaction.c:0
 1 0x00000000000dc7a5 ompi_op_base_2buff_sum_float()  op_base_functions.c:0
 2 0x0000000000006bcc NBC_Start_round()  nbc.c:0
 3 0x0000000000006423 NBC_Progress()  ???:0
 4 0x0000000000004ef1 ompi_coll_libnbc_progress()  ???:0
 5 0x0000000000033ddc opal_progress()  ???:0
 6 0x0000000000051c5d ompi_request_default_wait()  ???:0
 7 0x0000000000096052 MPI_Wait()  ???:0
 8 0x00000000004026ba main()  /proj/nsc/users/raber/MPI/OSU/osu-micro-benchmarks-5.9/mpi/collective/osu_iallreduce.c:126
 9 0x0000000000023493 __libc_start_main()  ???:0
10 0x0000000000402dee _start()  ???:0
=================================
[node052:2141897] *** Process received signal ***
[node052:2141897] Signal: Segmentation fault (11)
[node052:2141897] Signal code:  (-6)
[node052:2141897] Failing at address: 0x3f50020aec9
[node052:2141897] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x1555454d7c20]
[node052:2141897] [ 1] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libmpi.so.40(+0xdc7a5)[0x1555460f07a5]
[node052:2141897] [ 2] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/openmpi/mca_coll_libnbc.so(+0x6bcc)[0x1554f5fe0bcc]
[node052:2141897] [ 3] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1d3)[0x1554f5fe0423]
[node052:2141897] [ 4] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x91)[0x1554f5fdeef1]
[node052:2141897] [ 5] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libopen-pal.so.40(opal_progress+0x2c)[0x155544764ddc]
[node052:2141897] [ 6] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libmpi.so.40(ompi_request_default_wait+0x3d)[0x155546065c5d]
[node052:2141897] [ 7] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libmpi.so.40(PMPI_Wait+0x52)[0x1555460aa052]
[node052:2141897] [ 8] /software/sse/manual/OSU/5.9/g83/ompi407_ucx112mt_cu11.4/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x4026ba]
[node052:2141897] [ 9] /lib64/libc.so.6(__libc_start_main+0xf3)[0x155545123493]
[node052:2141897] [10] /software/sse/manual/OSU/5.9/g83/ompi407_ucx112mt_cu11.4/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x402dee]
[node052:2141897] *** End of error message ***
... < end snip > ...

@jsquyres
Member

@Akshay-Venkatesh Can you help?

@Akshay-Venkatesh
Contributor

@Akshay-Venkatesh @bureddy mentioned offline that the cuda collective component doesn't provide an implementation for Ireduce/Iallreduce, nor does hcoll.

@jorab @jsquyres As @leofang mentioned, running osu_iallreduce or any non-blocking MPI collective operation that involves a reduction is not supported over CUDA buffers.
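
Until that support exists, one possible workaround (a sketch only, reusing the mpi4py/torch setup from the reproducer above; not an official recommendation) is to stage the non-blocking reduction through host buffers:

# Workaround sketch: stage the non-blocking reduction through host (CPU) buffers,
# which work in the reproducer above, then copy the result back to the GPU.
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
device = torch.device('cuda:0')

sendbuf_gpu = torch.ones(1, device=device)
sendbuf_cpu = sendbuf_gpu.cpu().numpy()                    # device -> host copy
recvbuf_cpu = torch.empty_like(sendbuf_gpu).cpu().numpy()  # host-side receive buffer

req = comm.Iallreduce(sendbuf_cpu, recvbuf_cpu, op=MPI.SUM)
# ... other work can overlap here ...
req.wait()

recvbuf_gpu = torch.from_numpy(recvbuf_cpu).to(device)     # host -> device copy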

@leofang

leofang commented Mar 12, 2022

That said, a segfault is not acceptable... Couldn't we return an error code to indicate "not supported"?

@jsquyres
Member

That said, a segfault is not acceptable... Couldn't we return an error code to indicate "not supported"?

Agreed. @Akshay-Venkatesh Can we do better?

@Akshay-Venkatesh
Contributor

Will discuss internally and get back early this week. Hope that works.

@Akshay-Venkatesh
Contributor

@jsquyres When are the next 4.x/5.x releases planned? I don't think targeting 4.1.4 or 5.0.0 is realistic, but we may have resources beyond that point. If we need better handling for the current problem (i.e., reporting "not supported" instead of segfaulting), we would need to add CUDA detection in the NBC components that get picked up to run the collective. That also seems like non-trivial work and would have to be aimed at the post-4.1.3/5.0.0 time frame.

@jsquyres
Member

v4.1.3 is quite possibly going to be released late next week. There's no date for v4.1.4 yet.

I don't recall the exact timeline for v5.0.0, but it's (currently) within the next few months.

You might want to have some discussions with other Open MPI community members before sprinkling more CUDA code throughout the Open MPI code base (e.g., are you going to need to edit all the NBC collectives? What about the blocking collectives?). There may be some architectural issues at stake here; better to get buy-in before you invest a lot of time/effort.
