[5.0.0rc10/main] CUDA-aware MPI is broken when using the ob1 PML #11399
Comments
I had the same problem on a very similar configuration (no slingshot). See here
I rebuilt from main (commit ff1f1b7) against UCX 1.12.1 and it works with
I don't immediately have a minimal reproducer. Unfortunately, it looks like
Ok, I have a minimal-ish reproducer using the
From this directory, you can simply
Then run:
I get:
I can also reproduce this issue when built with OFI rather than UCX (and UCX explicitly disabled):
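For reference, a sketch of how UCX is typically excluded, either at build time or at run time (install paths and the benchmark binary are placeholders):

```shell
# Build Open MPI without UCX support at all:
./configure --without-ucx --with-cuda=/usr/local/cuda

# Or, if UCX is built in, exclude the ucx pml at run time so ob1 is used:
mpirun -np 2 --mca pml ^ucx ./osu_bw D D
```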
Is something going wrong inside the progress engine for messages that use CUDA device buffers?
How are you forcing the use of the ucx pml in your environment? Do you set an environment variable or similar? I see many output lines from ob1 and the sm btl, which should not really be involved in code that uses GPU buffers. I usually do something like
for GPU code
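As a sketch, a typical command line that forces the ucx pml looks like the following (rank count and benchmark binary are placeholders):

```shell
mpirun -np 2 --mca pml ucx --mca osc ucx ./osu_bw D D
```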
I am not forcing it to use UCX at all. I assume it will use it automatically if needed. It crashes in the same way when I do not build with UCX support. I've run addr2line to get the line numbers where it crashes, and for some reason it is using the
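For reference, a sketch of the kind of addr2line invocation used to map backtrace offsets to source lines, assuming the library was built with debug symbols (library path and offset are placeholders):

```shell
# -e selects the library from the backtrace, -f also prints the function name
addr2line -f -e /path/to/libmpi.so 0x<offset-from-backtrace>
```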
The cluster I'm running on uses SLURM, so I have to use srun to launch things properly. Can I force using the UCX pml via an environment variable?
I would highly recommend that you force the use of ucx. I use ucx 1.13.x with GPUs literally on a daily basis, but you have to tell Open MPI that it should use ucx, either with the command line that I showed or by setting the corresponding environment variables.
If you see ob1 or sm btl error messages, you know it did not use the correct components; they should not be used with GPU code.
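Since the job is launched with srun, the same selection can be made through Open MPI's MCA environment variables (any MCA parameter can be set as OMPI_MCA_<name>); a sketch, with the benchmark and rank count as placeholders:

```shell
export OMPI_MCA_pml=ucx
export OMPI_MCA_osc=ucx
srun -n 2 ./osu_bw D D
```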
According to the Open MPI FAQ, "the Open MPI library will automatically detect that the pointer being passed in is a CUDA device memory pointer and do the right thing" and "CUDA-aware support is available in the sm, smcuda, tcp, and openib BTLs" (https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-support). Is this no longer the case? If so, I hope the documentation will be fixed to reflect this.
I think for 5.0 you need to look at this documentation: I'll let somebody else answer your last question in detail, since I personally only use Open MPI with UCX for GPU code. The openib component is definitely gone in the 5.0 release. The one additional remark I have is to make sure that your UCX library has also been compiled with support for the GPUs that you want to use.
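One way to check whether the installed UCX was built with CUDA support is to inspect its build configuration and transports, for example:

```shell
# Shows the configure options UCX was built with (look for --with-cuda):
ucx_info -v

# Lists available transports; a CUDA-enabled build includes cuda_copy/cuda_ipc:
ucx_info -d | grep -i cuda
```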
@edgargabriel Thanks for your suggestions. I tried forcing the
However, I still get a segmentation fault if I force the
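For completeness, forcing the ob1 pml (the path that crashes here) is usually done the same way, e.g.:

```shell
mpirun -np 2 --mca pml ob1 ./osu_bw D D
```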
No, the ob1 pml is supposed to work with CUDA buffers. I've tested the path myself with both ob1 and mtl ofi outside of ucx, and it should work. There should be copying mechanisms in place to handle this.
Are you able to reproduce this with the reproducer I linked? I realize it's a heavy lift to build, but I haven't been able to reproduce this with the OSU benchmarks. Presumably there is some kind of race condition or issue with handling in-flight messages in the progress engine.
No, I have not made an attempt to reproduce yet and probably won't have the time to do so for a few weeks. Were you able to reproduce it using the pml/cm & mtl/ofi path?
Also, do you have a core dump at hand?
No, I haven't tried building with libfabric. I can try this, though. Aside from building with OFI, what env vars or flags do I need to set for both of these?
No core dump, only the backtrace that I copied above. It tries to dereference a device pointer in
@wckzhang I've re-built with libfabric and set
With (I've also set
I've re-built libfabric 1.17.0 from source using
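A sketch of how the pml/cm plus mtl/ofi path is typically selected, with the libfabric provider pinned through libfabric's own FI_PROVIDER variable (the provider name here is an assumption and depends on the fabric):

```shell
export FI_PROVIDER=verbs          # or tcp, shm, ... depending on the system
mpirun -np 2 --mca pml cm --mca mtl ofi ./osu_bw D D
```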
@wckzhang I've uploaded the core dumps for both ob1 and cm+ofi crashes here: https://cloudstor.aarnet.edu.au/plus/s/Qsc6m9liCJXpi9P |
I will post a patch soon that modifies the btls
fixed with #11564
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
5.0.0rc10 and main (commit dad058e)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From tarball and from git clone
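A sketch of the kind of configure line used for such a build; the install prefixes for CUDA and UCX are placeholders:

```shell
./configure --prefix=$HOME/opt/ompi-5.0.0rc10 \
            --with-cuda=/usr/local/cuda \
            --with-ucx=$HOME/opt/ucx-1.14
make -j install
```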
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
Details of the problem
I have built against UCX 1.14rc2 with CUDA support, which works correctly with Open MPI 4.1.4.
However, running the osu_bw benchmark with device buffers (osu_bw D D) with either 5.0.0rc10 or main immediately causes a segmentation fault with the following output:
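For reference, a typical way such a run is launched (binary path and rank count are placeholders):

```shell
mpirun -np 2 ./osu_bw D D

# or, on the Slurm cluster mentioned above:
srun -n 2 ./osu_bw D D
```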