CUDA-aware Ireduce and Iallreduce operations for GPU tensors segfault #9845

Open

jmerizia opened this issue Jan 7, 2022 · 11 comments

jmerizia commented Jan 7, 2022

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.2, v4.1.1, v4.1.0, and v4.0.7 tested

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

tarball

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

n/a

Please describe the system on which you are running

  • Operating system/version: Ubuntu 18.04
  • Computer hardware: AWS p2.xlarge (Nvidia K80 GPU)
  • Network type: n/a (single node)

Details of the problem

When calling either Ireduce or Iallreduce on PyTorch GPU tensors, a segfault occurs. I haven't exhaustively tested all of the ops, but Reduce, Allreduce, Isend/Irecv, and Ibcast all work when tested the same way. I haven't tested CuPy arrays, but that might be worthwhile (a sketch of a CuPy variant follows the script below); Numba GPU arrays are also affected. This behavior was discovered by @leofang in mpi4py/mpi4py#164 (comment) while testing mpi4py.

Here is a minimal script that can be used to demonstrate this behavior. The errors are only present when running on GPU:

# mpirun -np 2 python repro.py gpu Ireduce
from mpi4py import MPI
import torch
import sys

if len(sys.argv) < 3:
    print('Usage: python repro.py [cpu|gpu] [MPI function to test]')
    sys.exit(1)

use_gpu = sys.argv[1] == 'gpu'
func_name = sys.argv[2]

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
if use_gpu:
    device = torch.device('cuda:' + str(rank % torch.cuda.device_count()))
else:
    device = torch.device('cpu')

def test_Iallreduce():
    sendbuf = torch.ones(1, device=device)
    recvbuf = torch.empty_like(sendbuf)
    torch.cuda.synchronize()
    req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    assert recvbuf[0] == size

def test_Ireduce():
    buf = torch.ones(1, device=device)
    if rank == 0:
        sendbuf = MPI.IN_PLACE
        recvbuf = buf
    else:
        sendbuf = buf
        recvbuf = None
    torch.cuda.synchronize()
    req = comm.Ireduce(sendbuf, recvbuf, root=0, op=MPI.SUM)  # also fails with MPI.MAX
    req.wait()
    if rank == 0:
        assert buf[0] == size

eval('test_' + func_name + '()')
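
For reference, a CuPy version of the Iallreduce test (untested on my end; it assumes CuPy is installed and relies on mpi4py's CUDA array support) would look roughly like this:

# Hypothetical CuPy variant of test_Iallreduce (untested; assumes cupy is installed)
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
size = comm.Get_size()

sendbuf = cp.ones(1, dtype=cp.float32)       # device buffer, like the torch GPU tensor above
recvbuf = cp.empty_like(sendbuf)
cp.cuda.get_current_stream().synchronize()   # make sure the buffers are ready before MPI touches them
req = comm.Iallreduce(sendbuf, recvbuf, op=MPI.SUM)
req.wait()
assert recvbuf[0] == size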

Software/Hardware Versions:

  • OpenMPI 4.1.2, 4.1.1, 4.1.0, and 4.0.7 (built w/ --with-cuda flag)
  • mpi4py 3.1.3 (built against above MPI version)
  • CUDA 11.0
  • Python 3.6 (also tested under 3.8)
  • Nvidia K80 GPU (also tested with V100)
  • OS Ubuntu 18.04 (also tested in containerized environment)
  • torch 1.10.1 (w/ GPU support)

You can reproduce my environment setup with the following commands:

wget https://www.open-mpi.org/software/ompi/v4.1/downloads/openmpi-4.1.2.tar.gz
tar xvf openmpi-4.1.2.tar.gz
cd openmpi-4.1.2
./configure --with-cuda --prefix=/opt/openmpi-4.1.2
sudo make -j4 all install
export PATH=/opt/openmpi-4.1.2/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-4.1.2/lib:$LD_LIBRARY_PATH
env MPICC=/opt/openmpi-4.1.2/bin/mpicc pip install mpi4py
pip install torch numpy
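
As a quick sanity check (not part of the original setup steps), you can confirm from Python that mpi4py picked up the intended Open MPI build:

# Sanity check (my addition): confirm which MPI library mpi4py is actually using.
from mpi4py import MPI
print(MPI.Get_library_version())   # should mention "Open MPI v4.1.2" for the build above
print(MPI.Get_version())           # MPI standard version, e.g. (3, 1)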

Here is the error message from running Ireduce:

[<host>:25864] *** Process received signal ***
[<host>:25864] Signal: Segmentation fault (11)
[<host>:25864] Signal code: Invalid permissions (2)
[<host>:25864] Failing at address: 0x1201220000
[<host>:25864] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f00efcf3040]
[<host>:25864] [ 1] /opt/openmpi-4.1.2/lib/openmpi/mca_op_avx.so(+0xc079)[0x7f00e41c0079]
[<host>:25864] [ 2] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(+0x7385)[0x7f00d3330385]
[<host>:25864] [ 3] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1f3)[0x7f00d3330033]
[<host>:25864] [ 4] /opt/openmpi-4.1.2/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x8e)[0x7f00d332e84e]
[<host>:25864] [ 5] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f00edefba3c]
[<host>:25864] [ 6] /opt/openmpi-4.1.2/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xc5)[0x7f00edf025a5]
[<host>:25864] [ 7] /opt/openmpi-4.1.2/lib/libmpi.so.40(ompi_request_default_wait+0x1f9)[0x7f00ee4eafa9]
[<host>:25864] [ 8] /opt/openmpi-4.1.2/lib/libmpi.so.40(PMPI_Wait+0x52)[0x7f00ee532e02]
[<host>:25864] [ 9] /home/ubuntu/venv/lib/python3.6/site-packages/mpi4py/MPI.cpython-36m-x86_64-linux-gnu.so(+0xa81e2)[0x7f00ee8911e2]
[<host>:25864] [10] python[0x50a865]
[<host>:25864] [11] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [12] python[0x509989]
[<host>:25864] [13] python[0x50a6bd]
[<host>:25864] [14] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [15] python[0x507f94]
[<host>:25864] [16] python(PyRun_StringFlags+0xaf)[0x63500f]
[<host>:25864] [17] python[0x600911]
[<host>:25864] [18] python[0x50a4ef]
[<host>:25864] [19] python(_PyEval_EvalFrameDefault+0x444)[0x50c274]
[<host>:25864] [20] python[0x507f94]
[<host>:25864] [21] python(PyEval_EvalCode+0x23)[0x50b0d3]
[<host>:25864] [22] python[0x634dc2]
[<host>:25864] [23] python(PyRun_FileExFlags+0x97)[0x634e77]
[<host>:25864] [24] python(PyRun_SimpleFileExFlags+0x17f)[0x63862f]
[<host>:25864] [25] python(Py_Main+0x591)[0x6391d1]
[<host>:25864] [26] python(main+0xe0)[0x4b0d30]
[<host>:25864] [27] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f00efcd5bf7]
[<host>:25864] [28] python(_start+0x2a)[0x5b2a5a]
[<host>:25864] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node <host> exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I appreciate any guidance!

@jsquyres
Member

@open-mpi/cuda Can someone look at this issue? Thanks.

@leofang

leofang commented Jan 26, 2022

@Akshay-Venkatesh @bureddy mentioned offline that the cuda collective component doesn't provide an implementation for Ireduce/Iallreduce, nor does hcoll.

@jsquyres
Member

Does this same problem occur with the equivalent test program written in C?

@jorab

jorab commented Mar 11, 2022

Well, I don't know about an equivalent program in C, but it does happen for the (presumably) analogous programs in C++. The OSU micro-benchmark tests for Ireduce/Iallreduce (osu_ireduce and osu_iallreduce) break for me in a very similar way with both Open MPI 4.0.7 and 4.1.2 on RHEL 8, compiled with the system GCC. All other collective tests in the OSU suite work when run on NVIDIA SuperPOD GPUs with the CUDA collective component. Open MPI was configured with '--with-cuda=/path/to/CUDA' and produced the following output when segfaulting (osu_iallreduce only, with OMPI_MCA_coll_base_verbose=80 set):

$ srun --mpi=pmix --gpu-bind map_gpu:0,1,2,3,4,5,6,7 -m cyclic:block --cpu-bind v,map_cpu:$(seq -s, 0 16 127) $OSU_ROOT/get_local_rank $OSU_ROOT/mpi/collective/osu_iallreduce -i 100 -d cuda -c
... < snip > ...
[node052:2141894] coll:base:comm_select: component not available: sm
[node052:2141894] coll:base:comm_select: component disqualified: sm (priority -1 < 0)
[node052:2141894] coll:base:comm_select: component not available: sync
[node052:2141894] coll:base:comm_select: component disqualified: sync (priority -1 < 0)
[node052:2141894] coll:base:comm_select: component not available: tuned
[node052:2141894] coll:base:comm_select: component disqualified: tuned (priority -1 < 0)
[node052:2141894] coll:base:comm_select: component available: cuda, priority: 78
[node052:2141894] coll:base:comm_select: selecting       basic, priority  10, Enabled
[node052:2141894] coll:base:comm_select: selecting      libnbc, priority  10, Enabled
[node052:2141894] coll:base:comm_select: selecting        self, priority  75, Enabled
[node052:2141894] coll:base:comm_select: selecting        cuda, priority  78, Enabled
[node052:2141897:0:2141897] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141896:0:2141896] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141895:0:2141895] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141894:0:2141894] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141892:0:2141892] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141891:0:2141891] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141893:0:2141893] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
[node052:2141890:0:2141890] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x15551b200000)
==== backtrace (tid:2141897) ====
 0 0x0000000000012c20 .annobin_sigaction.c()  sigaction.c:0
 1 0x00000000000dc7a5 ompi_op_base_2buff_sum_float()  op_base_functions.c:0
 2 0x0000000000006bcc NBC_Start_round()  nbc.c:0
 3 0x0000000000006423 NBC_Progress()  ???:0
 4 0x0000000000004ef1 ompi_coll_libnbc_progress()  ???:0
 5 0x0000000000033ddc opal_progress()  ???:0
 6 0x0000000000051c5d ompi_request_default_wait()  ???:0
 7 0x0000000000096052 MPI_Wait()  ???:0
 8 0x00000000004026ba main()  /proj/nsc/users/raber/MPI/OSU/osu-micro-benchmarks-5.9/mpi/collective/osu_iallreduce.c:126
 9 0x0000000000023493 __libc_start_main()  ???:0
10 0x0000000000402dee _start()  ???:0
=================================
[node052:2141897] *** Process received signal ***
[node052:2141897] Signal: Segmentation fault (11)
[node052:2141897] Signal code:  (-6)
[node052:2141897] Failing at address: 0x3f50020aec9
[node052:2141897] [ 0] /lib64/libpthread.so.0(+0x12c20)[0x1555454d7c20]
[node052:2141897] [ 1] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libmpi.so.40(+0xdc7a5)[0x1555460f07a5]
[node052:2141897] [ 2] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/openmpi/mca_coll_libnbc.so(+0x6bcc)[0x1554f5fe0bcc]
[node052:2141897] [ 3] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/openmpi/mca_coll_libnbc.so(NBC_Progress+0x1d3)[0x1554f5fe0423]
[node052:2141897] [ 4] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_progress+0x91)[0x1554f5fdeef1]
[node052:2141897] [ 5] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libopen-pal.so.40(opal_progress+0x2c)[0x155544764ddc]
[node052:2141897] [ 6] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libmpi.so.40(ompi_request_default_wait+0x3d)[0x155546065c5d]
[node052:2141897] [ 7] /software/sse/manual/OpenMPI/4.0.7/g83/ofed54/ucx1.12.0-mt/cu11.4/nsc1/lib/libmpi.so.40(PMPI_Wait+0x52)[0x1555460aa052]
[node052:2141897] [ 8] /software/sse/manual/OSU/5.9/g83/ompi407_ucx112mt_cu11.4/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x4026ba]
[node052:2141897] [ 9] /lib64/libc.so.6(__libc_start_main+0xf3)[0x155545123493]
[node052:2141897] [10] /software/sse/manual/OSU/5.9/g83/ompi407_ucx112mt_cu11.4/libexec/osu-micro-benchmarks/mpi/collective/osu_iallreduce[0x402dee]
[node052:2141897] *** End of error message ***
... < end snip > ...

@jsquyres
Member

@Akshay-Venkatesh Can you help?

@Akshay-Venkatesh
Contributor

@Akshay-Venkatesh @bureddy mentioned offline that the cuda collective component doesn't provide an implementation for Ireduce/Iallreduce, nor does hcoll.

@jorab @jsquyres As @leofang mentioned, running osu_iallreduce or any non-blocking MPI collective operation that involves a reduction is not supported over CUDA buffers.
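
Until that support exists, one possible workaround (a sketch only, reusing the mpi4py/torch setup from the reproducer above; not an official recommendation) is to stage the non-blocking reduction through host buffers:

# Workaround sketch: stage the non-blocking reduction through host (CPU) buffers,
# which work in the reproducer above, then copy the result back to the GPU.
import torch
from mpi4py import MPI

comm = MPI.COMM_WORLD
device = torch.device('cuda:0')

sendbuf_gpu = torch.ones(1, device=device)
sendbuf_cpu = sendbuf_gpu.cpu().numpy()                    # device -> host copy
recvbuf_cpu = torch.empty_like(sendbuf_gpu).cpu().numpy()  # host-side receive buffer

req = comm.Iallreduce(sendbuf_cpu, recvbuf_cpu, op=MPI.SUM)
# ... other work can overlap here ...
req.wait()

recvbuf_gpu = torch.from_numpy(recvbuf_cpu).to(device)     # host -> device copy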

@leofang

leofang commented Mar 12, 2022

That said, a segfault is not acceptable... Couldn't we return an error code to indicate "not supported"?

@jsquyres
Member

That said, a segfault is not acceptable... Couldn't we return an error code to indicate "not supported"?

Agreed. @Akshay-Venkatesh Can we do better?

@Akshay-Venkatesh
Contributor

Will discuss internally and get back early this week. Hope that works.

@Akshay-Venkatesh
Contributor

@jsquyres When are the next 4.x/5.x releases planned? I don't think targeting 4.1.4 or 5.0.0 is realistic, but we may have resources beyond that point. If we need better handling for the current problem (i.e., reporting "not supported" instead of segfaulting), we would need to add CUDA detection in the NBC components that get picked up to run the collective. That also seems like non-trivial work and would have to be aimed at the post-4.1.3/5.0.0 time frame.

@jsquyres
Member

v4.1.3 is quite possibly going to be released late next week. There's no date for v4.1.4 yet.

I don't recall the exact timeline for v5.0.0, but it's (currently) within the next few months.

You might want to have some discussions with other Open MPI community members before sprinkling more CUDA code throughout the Open MPI code base (e.g., are you going to need to edit all the NBC collectives? What about the blocking collectives?). There may be some architectural issues at stake here; better to get buy-in before you invest a lot of time/effort.
