
[5.0.0rc10/main] CUDA-aware MPI is broken when using the ob1 PML #11399

Closed
BenWibking opened this issue Feb 9, 2023 · 23 comments

@BenWibking

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

5.0.0rc10 and main (commit dad058e)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From tarball and from git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 415d7044c478b0910c9fbb0f36af700b9483c493 3rd-party/openpmix (v1.1.3-3769-g415d7044)
 dc6ccf65b3356ae7c70bc3a37b4249f03d43966e 3rd-party/prrte (psrvr-v2.0.0rc1-4569-gdc6ccf65b3)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: 4 NVIDIA A100 GPUs (40 GB HBM2 RAM each) connected via NVLINK and 1 64-core AMD EPYC 7763 ("Milan")
  • Network type: Slingshot-10

Details of the problem

I have built against UCX 1.14rc2 with CUDA support, which works correctly with Open MPI 4.1.4.

However, running the osu_bw benchmark with device buffers (osu_bw D D) with either 5.0.0rc10 or main immediately causes a segmentation fault with the following output:


[gpua015.delta.internal.ncsa.edu:967381] shmem: mmap: an error occurred while determining whether or not /tmp/spmix_appdir_69033_1389475.0/shared_mem_cuda_pool.gpua015 could be created.
[gpua015.delta.internal.ncsa.edu:967381] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728
# OSU MPI-CUDA Bandwidth Test v7.0
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[gpua015:967381:0:967381] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f2d13200000)
==== backtrace (tid: 967381) ====
 0  /projects/cvz/bwibking/ucx-1.14/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f2d4328cf34]
 1  /projects/cvz/bwibking/ucx-1.14/lib/libucs.so.0(+0x2f0f7) [0x7f2d4328d0f7]
 2  /projects/cvz/bwibking/ucx-1.14/lib/libucs.so.0(+0x2f3c6) [0x7f2d4328d3c6]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7f2d446f2b20]
 4  /lib64/libc.so.6(+0x16065c) [0x7f2d4447b65c]
 5  /projects/cvz/bwibking/ompi5_main_20230209/lib/libopen-pal.so.0(mca_btl_sm_sendi+0x33d) [0x7f2d43a952dd]
 6  /projects/cvz/bwibking/ompi5_main_20230209/lib/libmpi.so.0(+0x22d900) [0x7f2d454de900]
 7  /projects/cvz/bwibking/ompi5_main_20230209/lib/libmpi.so.0(mca_pml_ob1_isend+0x419) [0x7f2d454df869]
 8  /projects/cvz/bwibking/ompi5_main_20230209/lib/libmpi.so.0(MPI_Isend+0x134) [0x7f2d45379454]
 9  omb-gpu/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw() [0x402ce1]
10  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f2d4433e493]
11  omb-gpu/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw() [0x4037be]
=================================
srun: error: gpua015: task 0: Segmentation fault (core dumped)
@amirsojoodi

amirsojoodi commented Feb 14, 2023

I had the same problem on a very similar configuration (no Slingshot). See here.
In summary, the problem was with newer UCX versions. I downgraded UCX to 1.12.1, which resolved the situation. You can test this quickly by switching the PML from UCX to ob1: --mca pml ob1

@jsquyres jsquyres modified the milestones: v5.1.0, v5.0.0 Feb 14, 2023
@BenWibking
Author

BenWibking commented Feb 15, 2023

I rebuilt from main (commit ff1f1b7) against UCX 1.12.1 and it works with osu_bw D D, but it fails inside MPI_Testall with my application code (which works with Open MPI 4.1.4, MPICH, and Cray MPI on multiple systems):

[gpua093:1670235:0:1670235] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fb576e3b900)
[gpua093:1670236:0:1670236] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc47e80b900)
==== backtrace (tid:1670236) ====
 0  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fcc359f0d14]
 1  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2df27) [0x7fcc359f0f27]
 2  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2e1fe) [0x7fcc359f11fe]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7fcc3716eb20]
==== backtrace (tid:1670237) ====
 0  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f7c2c176d14]
 1  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2df27) [0x7f7c2c176f27]
 2  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2e1fe) [0x7f7c2c1771fe]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7f7c2d8f4b20]
 4  /lib64/libc.so.6(+0x160805) [0x7f7c2cac8805]
 5  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xce2af) [0x7f7c2c4742af]
 6  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd) [0x7f7c33d3b11d]
 7  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121) [0x7f7c33d32dc1]
 8  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87) [0x7f7c2c475327]
 9  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xcf5ec) [0x7f7c2c4755ec]
10  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34) [0x7f7c2c3cf354]
11  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f) [0x7f7c33b9c24f]
12  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6) [0x7f7c33be54f6]

I don't immediately have a minimal reproducer. Unfortunately, it looks like MPI_Testall is not used in any of the OSU benchmarks. My code has a typical MPI_Irecv, MPI_Isend, MPI_Waitall pattern (roughly as sketched below).
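
For reference, the communication pattern is roughly the following (a minimal sketch with placeholder counts and neighbor ranks, not the actual application code; the buffers come from cudaMalloc, so they are device pointers):

/* Minimal sketch of the exchange pattern (hypothetical, for illustration).
 * d_sendbuf and d_recvbuf are device pointers obtained from cudaMalloc,
 * so a CUDA-aware MPI is required to pass them directly. */
#include <mpi.h>

static void exchange(void *d_sendbuf, void *d_recvbuf, int nbytes,
                     int neighbor, MPI_Comm comm)
{
    MPI_Request reqs[2];
    int done = 0;

    MPI_Irecv(d_recvbuf, nbytes, MPI_BYTE, neighbor, 0, comm, &reqs[0]);
    MPI_Isend(d_sendbuf, nbytes, MPI_BYTE, neighbor, 0, comm, &reqs[1]);

    /* Poll with MPI_Testall to overlap communication with other work;
     * this is where the backtraces above show the segfault. */
    while (!done) {
        MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
        /* ... do other work ... */
    }

    /* Effectively a no-op once done, but the real code also calls MPI_Waitall. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}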

@BenWibking
Author

Ok, I have a minimal-ish reproducer using the FillBoundaryComparison test from AMReX: https://github.com/AMReX-Codes/amrex/tree/development/Tests/FillBoundaryComparison.

From this directory, you can simply run make USE_CUDA=TRUE CUDA_ARCH=80.

Then run:

srun ./main3d.gnu.MPI.CUDA.ex amrex.throw_exception=1 amrex.signal_handling=0 amrex.the_arena_is_managed=0 amrex.use_gpu_aware_mpi=1

I get:

MPI initialized with 4 MPI processes
MPI initialized with thread support level 0
Initializing CUDA...
CUDA initialized with 4 devices.
AMReX (23.02-37-gcbe2b291d3ed-dirty) initialized
min length = 32
num Pts    = 841482240
num boxes  = 25680
num levels = 5
841482240 points on level 0
105185280 points on level 1
13148160 points on level 2
1643520 points on level 3
205440 points on level 4
[gpua037:1094940:0:1094940] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f8ed195efe8)
==== backtrace (tid:1094940) ====
 0  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f95f9df2d14]
 1  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2df27) [0x7f95f9df2f27]
 2  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2e1fe) [0x7f95f9df31fe]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7f95fb570b20]
 4  /lib64/libc.so.6(+0x160805) [0x7f95fa744805]
 5  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xce2af) [0x7f95fa0f02af]
 6  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd) [0x7f95fd70c11d]
 7  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121) [0x7f95fd703dc1]
 8  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87) [0x7f95fa0f1327]
 9  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xcf5ec) [0x7f95fa0f15ec]
10  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34) [0x7f95fa04b354]
11  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f) [0x7f95fd56d24f]
12  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6) [0x7f95fd5b64f6]
13  ./main3d.gnu.MPI.CUDA.ex() [0x48f7c0]
14  ./main3d.gnu.MPI.CUDA.ex() [0x435eaa]
15  ./main3d.gnu.MPI.CUDA.ex() [0x41ec9c]
16  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f95fa607493]
17  ./main3d.gnu.MPI.CUDA.ex() [0x42277e]
=================================
[gpua037:1094939:0:1094939] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f7ac59b6640)
==== backtrace (tid:1094939) ====
 0  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f81ec57dd14]
 1  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2df27) [0x7f81ec57df27]
 2  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2e1fe) [0x7f81ec57e1fe]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7f81edcfbb20]
 4  /lib64/libc.so.6(+0x160805) [0x7f81ececf805]
 5  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xce2af) [0x7f81ec87b2af]
 6  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd) [0x7f81efe9711d]
 7  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121) [0x7f81efe8edc1]
 8  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87) [0x7f81ec87c327]
 9  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xcf5ec) [0x7f81ec87c5ec]
10  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34) [0x7f81ec7d6354]
11  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f) [0x7f81efcf824f]
12  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6) [0x7f81efd414f6]
13  ./main3d.gnu.MPI.CUDA.ex() [0x48f7c0]
14  ./main3d.gnu.MPI.CUDA.ex() [0x435eaa]
15  ./main3d.gnu.MPI.CUDA.ex() [0x41ec9c]
16  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f81ecd92493]
17  ./main3d.gnu.MPI.CUDA.ex() [0x42277e]
=================================
[gpua037:1094938:0:1094938] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f9081f71ef0)
==== backtrace (tid:1094938) ====
 0  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7f97a8747d14]
 1  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2df27) [0x7f97a8747f27]
 2  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2e1fe) [0x7f97a87481fe]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7f97a9ec5b20]
 4  /lib64/libc.so.6(+0x160805) [0x7f97a9099805]
 5  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xce2af) [0x7f97a8a452af]
 6  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd) [0x7f97ac06111d]
 7  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121) [0x7f97ac058dc1]
 8  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87) [0x7f97a8a46327]
 9  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xcf5ec) [0x7f97a8a465ec]
10  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34) [0x7f97a89a0354]
11  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f) [0x7f97abec224f]
12  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6) [0x7f97abf0b4f6]
13  ./main3d.gnu.MPI.CUDA.ex() [0x48f7c0]
14  ./main3d.gnu.MPI.CUDA.ex() [0x435eaa]
15  ./main3d.gnu.MPI.CUDA.ex() [0x41ec9c]
16  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f97a8f5c493]
17  ./main3d.gnu.MPI.CUDA.ex() [0x42277e]
=================================
[gpua037:1094937:0:1094937] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7fc32be1e920)
==== backtrace (tid:1094937) ====
 0  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fca50fecd14]
 1  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2df27) [0x7fca50fecf27]
 2  /sw/spack/delta-2022-03/apps/ucx/1.12.1-gcc-11.2.0-dtz76ev/lib/libucs.so.0(+0x2e1fe) [0x7fca50fed1fe]
 3  /lib64/libpthread.so.0(+0x12b20) [0x7fca5276ab20]
 4  /lib64/libc.so.6(+0x160805) [0x7fca5193e805]
 5  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xce2af) [0x7fca512ea2af]
 6  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd) [0x7fca5490611d]
 7  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121) [0x7fca548fddc1]
 8  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87) [0x7fca512eb327]
 9  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xcf5ec) [0x7fca512eb5ec]
10  /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34) [0x7fca51245354]
11  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f) [0x7fca5476724f]
12  /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6) [0x7fca547b04f6]
13  ./main3d.gnu.MPI.CUDA.ex() [0x48f7c0]
14  ./main3d.gnu.MPI.CUDA.ex() [0x435eaa]
15  ./main3d.gnu.MPI.CUDA.ex() [0x41ec9c]
16  /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fca51801493]
17  ./main3d.gnu.MPI.CUDA.ex() [0x42277e]
=================================
srun: error: gpua037: task 2: Segmentation fault (core dumped)
srun: error: gpua037: task 3: Segmentation fault (core dumped)
srun: error: gpua037: task 1: Segmentation fault (core dumped)
srun: error: gpua037: task 0: Segmentation fault (core dumped)

@BenWibking
Author

I can also reproduce this issue when Open MPI is built with OFI rather than UCX (and UCX explicitly disabled):

[gpua025:322643] *** Process received signal ***
[gpua025:322643] Signal: Segmentation fault (11)
[gpua025:322643] Signal code: Invalid permissions (2)
[gpua025:322643] Failing at address: 0x7f04a995d450
[gpua025:322640] *** Process received signal ***
[gpua025:322640] Signal: Segmentation fault (11)
[gpua025:322640] Signal code: Invalid permissions (2)
[gpua025:322640] Failing at address: 0x7f677b91da80
[gpua025:322641] *** Process received signal ***
[gpua025:322641] Signal: Segmentation fault (11)
[gpua025:322641] Signal code: Invalid permissions (2)
[gpua025:322641] Failing at address: 0x7f9839f71ef0
[gpua025:322642] *** Process received signal ***
[gpua025:322642] Signal: Segmentation fault (11)
[gpua025:322642] Signal code: Invalid permissions (2)
[gpua025:322642] Failing at address: 0x7fa6f5db6ce8
[gpua025:322643] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f0bd0794b20]
[gpua025:322640] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f6ea2feab20]
[gpua025:322641] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7f9f602d5b20]
[gpua025:322642] [ 0] /lib64/libpthread.so.0(+0x12b20)[0x7fae1e854b20]
[gpua025:322643] [ 1] /lib64/libc.so.6(+0x160805)[0x7f0bcf968805]
[gpua025:322640] [ 1] /lib64/libc.so.6(+0x160805)[0x7f6ea21be805]
[gpua025:322640] [ 2] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc13df)[0x7f6ea1b663df]
[gpua025:322641] [ 1] /lib64/libc.so.6(+0x160805)[0x7f9f5f4a9805]
[gpua025:322642] [ 1] /lib64/libc.so.6(+0x160805)[0x7fae1da28805]
[gpua025:322643] [ 2] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc13df)[0x7f0bcf3103df]
[gpua025:322641] [ 2] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc13df)[0x7f9f5ee513df]
[gpua025:322642] [ 2] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc13df)[0x7fae1d3d03df]
[gpua025:322640] [ 3] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd)[0x7f6ea4a94efd]
[gpua025:322643] [ 3] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd)[0x7f0bd223eefd]
[gpua025:322641] [ 3] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd)[0x7f9f61d7fefd]
[gpua025:322642] [ 3] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x1cd)[0x7fae202feefd]
[gpua025:322640] [ 4] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121)[0x7f6ea4a8cba1]
[gpua025:322643] [ 4] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121)[0x7f0bd2236ba1]
[gpua025:322643] [ 5] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87)[0x7f0bcf311457]
[gpua025:322640] [ 5] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87)[0x7f6ea1b67457]
[gpua025:322640] [ 6] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc271c)[0x7f6ea1b6771c]
[gpua025:322641] [ 4] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121)[0x7f9f61d77ba1]
[gpua025:322642] [ 4] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x121)[0x7fae202f6ba1]
[gpua025:322642] [ 5] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87)[0x7fae1d3d1457]
[gpua025:322643] [ 6] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc271c)[0x7f0bcf31171c]
[gpua025:322643] [ 7] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34)[0x7f0bcf2766a4]
[gpua025:322640] [ 7] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34)[0x7f6ea1acc6a4]
[gpua025:322641] [ 5] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x87)[0x7f9f5ee52457]
[gpua025:322641] [ 6] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc271c)[0x7f9f5ee5271c]
[gpua025:322642] [ 6] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xc271c)[0x7fae1d3d171c]
[gpua025:322642] [ 7] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34)[0x7fae1d3366a4]
[gpua025:322640] [ 8] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f)[0x7f6ea48eb79f]
[gpua025:322641] [ 7] /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x34)[0x7f9f5edb76a4]
[gpua025:322643] [ 8] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f)[0x7f0bd209579f]
[gpua025:322642] [ 8] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f)[0x7fae2015579f]
[gpua025:322641] [ 8] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x5f)[0x7f9f61bd679f]
[gpua025:322640] [ 9] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6)[0x7f6ea49349e6]
[gpua025:322640] [10] ./main3d.gnu.MPI.CUDA.ex[0x48f170]
[gpua025:322643] [ 9] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6)[0x7f0bd20de9e6]
[gpua025:322643] [10] ./main3d.gnu.MPI.CUDA.ex[0x48f170]
[gpua025:322640] [11] ./main3d.gnu.MPI.CUDA.ex[0x43590a]
[gpua025:322640] [12] ./main3d.gnu.MPI.CUDA.ex[0x41e6fc]
[gpua025:322640] [13] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f6ea2081493]
[gpua025:322640] [14] ./main3d.gnu.MPI.CUDA.ex[0x4221de]
[gpua025:322640] *** End of error message ***
[gpua025:322641] [ 9] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6)[0x7f9f61c1f9e6]
[gpua025:322641] [10] ./main3d.gnu.MPI.CUDA.ex[0x48f170]
[gpua025:322641] [11] ./main3d.gnu.MPI.CUDA.ex[0x43590a]
[gpua025:322642] [ 9] /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0xa6)[0x7fae2019e9e6]
[gpua025:322642] [10] ./main3d.gnu.MPI.CUDA.ex[0x48f170]
[gpua025:322642] [11] ./main3d.gnu.MPI.CUDA.ex[0x43590a]
[gpua025:322642] [12] ./main3d.gnu.MPI.CUDA.ex[0x41e6fc]
[gpua025:322643] [11] ./main3d.gnu.MPI.CUDA.ex[0x43590a]
[gpua025:322643] [12] ./main3d.gnu.MPI.CUDA.ex[0x41e6fc]
[gpua025:322643] [13] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f0bcf82b493]
[gpua025:322643] [14] ./main3d.gnu.MPI.CUDA.ex[0x4221de]
[gpua025:322643] *** End of error message ***
[gpua025:322641] [12] ./main3d.gnu.MPI.CUDA.ex[0x41e6fc]
[gpua025:322641] [13] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7f9f5f36c493]
[gpua025:322641] [14] ./main3d.gnu.MPI.CUDA.ex[0x4221de]
[gpua025:322641] *** End of error message ***
[gpua025:322642] [13] /lib64/libc.so.6(__libc_start_main+0xf3)[0x7fae1d8eb493]
[gpua025:322642] [14] ./main3d.gnu.MPI.CUDA.ex[0x4221de]
[gpua025:322642] *** End of error message ***

Is something going wrong inside the progress engine for messages that use CUDA device buffers?

@edgargabriel
Member

edgargabriel commented Feb 17, 2023

How are you forcing the use of the ucx PML in your environment? Do you set an environment variable or similar? I see many output lines from ob1 and the sm BTL, which should not really be involved in code that uses GPU buffers.

I usually do something like

mpirun --mca pml ucx --mca osc ucx -np x ...

for GPU code

@BenWibking
Author

BenWibking commented Feb 17, 2023

I am not forcing it to use UCX at all. I assume it will use it automatically if needed. It crashes in the same way when I do not build with UCX support.

I've run addr2line to get the line numbers where it crashes. For some reason it is using the sm BTL to copy the buffer, and it dereferences a device pointer there:

 4: /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xd6cf7) [0x7fe8f3f94cf7]
    sm_prepare_src
/projects/cvz/bwibking/mpi_build/ompi/opal/mca/btl/sm/btl_sm_module.c:491:39

 5: /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(+0x30a157) [0x7fe8f6fc7157]
    mca_bml_base_prepare_src
../../../../ompi/mca/bml/bml.h:339:12

 6: /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_send_request_schedule_once+0x2d8) [0x7fe8f6fca561]
    mca_pml_ob1_send_request_schedule_once
/projects/cvz/bwibking/mpi_build/ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.c:1179:9

 7: /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(+0x2ff1e8) [0x7fe8f6fbc1e8]
    mca_pml_ob1_send_request_schedule_exclusive
/projects/cvz/bwibking/mpi_build/ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.h:327:14

 8: /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(+0x2ff249) [0x7fe8f6fbc249]
    mca_pml_ob1_send_request_schedule
/projects/cvz/bwibking/mpi_build/ompi/ompi/mca/pml/ob1/pml_ob1_sendreq.h:351:5

 9: /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_ack+0x340) [0x7fe8f6fbe345]
    mca_pml_ob1_recv_frag_callback_ack
/projects/cvz/bwibking/mpi_build/ompi/ompi/mca/pml/ob1/pml_ob1_recvfrag.c:772:9

10: /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(mca_btl_sm_poll_handle_frag+0x1a8) [0x7fe8f3f98746]
    mca_btl_sm_poll_handle_frag
/projects/cvz/bwibking/mpi_build/ompi/opal/mca/btl/sm/btl_sm_component.c:455:9

11: /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xda82a) [0x7fe8f3f9882a]
    mca_btl_sm_poll_fifo
/projects/cvz/bwibking/mpi_build/ompi/opal/mca/btl/sm/btl_sm_component.c:471:47

12: /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(+0xdaae9) [0x7fe8f3f98ae9]
    mca_btl_sm_component_progress
/projects/cvz/bwibking/mpi_build/ompi/opal/mca/btl/sm/btl_sm_component.c:563:11

13: /projects/cvz/bwibking/ompi5_main_20230215/lib/libopen-pal.so.0(opal_progress+0x30) [0x7fe8f3ee54c6]
    opal_progress
runtime/opal_progress.c:224:16

14: /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(ompi_request_default_test_all+0x14c) [0x7fe8f6d61747]
    ompi_request_default_test_all
request/req_test.c:213:16

15: /projects/cvz/bwibking/ompi5_main_20230215/lib/libmpi.so.0(PMPI_Testall+0x21f) [0x7fe8f6dfae8d]
    PMPI_Testall
/projects/cvz/bwibking/mpi_build/ompi/ompi/mpi/c/testall.c:98:8

@BenWibking
Author

How are you forcing the use of the ucx PML in your environment? Do you set an environment variable or similar? I see many output lines from ob1 and the sm BTL, which should not really be involved in code that uses GPU buffers.

I usually do something like

mpirun --mca pml ucx --mca osc ucx -np x ...

for GPU code

The cluster I'm running on uses SLURM, so I have to use srun to launch things properly. Can I force the UCX PML via an environment variable?

@edgargabriel
Member

edgargabriel commented Feb 17, 2023

I would highly recommend forcing UCX. I use UCX 1.13.x with GPUs literally on a daily basis, but you have to tell Open MPI to use UCX, either with the command line I showed or by setting the corresponding environment variables:

export OMPI_MCA_pml=ucx
export OMPI_MCA_osc=ucx

If you see ob1 or sm BTL error messages, you know it did not use the correct components; they should not be used with GPU code.

@BenWibking
Author

According to the Open MPI FAQ, "the Open MPI library will automatically detect that the pointer being passed in is a CUDA device memory pointer and do the right thing" and "CUDA-aware support is available in the sm, smcuda, tcp, and openib BTLs" (https://www.open-mpi.org/faq/?category=runcuda#mpi-cuda-support).

Is this not the case anymore? If not, I hope the documentation will be fixed to reflect this.
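
For what it's worth, Open MPI does expose a check for whether CUDA-aware support was compiled in, via the MPIX_Query_cuda_support() extension. A minimal sketch (note that this only reports whether CUDA support is built into the library, not whether a particular PML/BTL path actually handles device buffers correctly):

#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>   /* Open MPI extensions: MPIX_CUDA_AWARE_SUPPORT, MPIX_Query_cuda_support() */
#endif

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("Compile-time CUDA-aware support: yes\n");
#else
    printf("Compile-time CUDA-aware support: no (or unknown)\n");
#endif

#if defined(MPIX_CUDA_AWARE_SUPPORT)
    printf("Run-time CUDA-aware support: %s\n",
           MPIX_Query_cuda_support() ? "yes" : "no");
#endif

    MPI_Finalize();
    return 0;
}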

@edgargabriel
Member

edgargabriel commented Feb 17, 2023

I think for 5.0 you need to look at this documentation:
https://docs.open-mpi.org/en/v5.0.x/

I'll let somebody else answer your last question in detail, since I personally only use Open MPI with UCX for GPU code. The openib component is definitely gone in the 5.0 release.

The one additional remark I have is to make sure that your UCX library has also been compiled with support for the GPUs that you want to use.

@BenWibking
Author

BenWibking commented Feb 17, 2023

@edgargabriel Thanks for your suggestions. I tried forcing the ucx pml and it works!

However, I still get a segmentation fault if I force the ob1 pml (or let it be selected by default), which should still work with GPU buffers according to the new 5.0 documentation ("CUDA-aware support is available in... Both CUDA-ized shared memory (smcuda) and TCP (tcp) BTLs with the OB1 (ob1) PML" https://docs.open-mpi.org/en/v5.0.x/tuning-apps/networking/cuda.html#what-kind-of-cuda-support-exists-in-open-mpi).

@BenWibking BenWibking changed the title [5.0.0rc10/main] CUDA-aware MPI is broken [5.0.0rc10/main] CUDA-aware MPI is broken when using the ob1 pml Feb 17, 2023
@BenWibking BenWibking changed the title [5.0.0rc10/main] CUDA-aware MPI is broken when using the ob1 pml [5.0.0rc10/main] CUDA-aware MPI is broken when using the ob1 PML Feb 17, 2023
@wckzhang
Contributor

No, the ob1 PML is supposed to work with CUDA buffers. I've tested the path myself with both ob1 and mtl/ofi outside of UCX, and it should work.

There should be copying mechanisms in place to handle this.

@BenWibking
Author

No, the ob1 PML is supposed to work with CUDA buffers. I've tested the path myself with both ob1 and mtl/ofi outside of UCX, and it should work.

There should be copying mechanisms in place to handle this.

Are you able to reproduce this with the reproducer I linked? I realize it's a heavy lift to build, but I haven't been able to reproduce this with the OSU benchmarks. Presumably there is some kind of race condition or issue with handling in-flight messages in the progress engine.

@wckzhang
Contributor

No, I have not made an attempt to reproduce yet, and I probably won't have the time to do so for a few weeks. Were you able to reproduce it using the pml/cm & mtl/ofi path?

@wckzhang
Contributor

Also, do you have a core dump at hand?

@BenWibking
Author

No, I have not made an attempt to reproduce yet, and I probably won't have the time to do so for a few weeks. Were you able to reproduce it using the pml/cm & mtl/ofi path?

No, I haven't tried building with libfabric. I can try this, though. Aside from building with OFI, what env vars or flags do I need to set for both of these?

@BenWibking
Author

Also, do you have a core dump at hand?

No core dump, only the backtrace that I copied above. It tries to dereference a device pointer in opal/mca/btl/sm/btl_sm_module.c:491:39. I can try to get it to save a core dump on the cluster I'm using if you want to look at it directly.
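
To illustrate the failure mode outside of MPI (a standalone, hypothetical sketch, not Open MPI code): a plain host-side memcpy from a cudaMalloc'd pointer faults with exactly this kind of "invalid permissions for mapped object" error, which appears to be what the sm BTL is doing with the device buffer here:

/* Standalone illustration (hypothetical), not Open MPI code.
 * The host CPU cannot dereference ordinary device memory, so a plain
 * memcpy from a cudaMalloc'd pointer segfaults; a CUDA-aware path has
 * to detect the device pointer (e.g. via cuPointerGetAttribute) and
 * use cudaMemcpy or a GPU-capable transport instead. */
#include <string.h>
#include <cuda_runtime.h>

int main(void)
{
    char host_buf[4096];
    char *d_buf = NULL;

    cudaMalloc((void **)&d_buf, sizeof(host_buf));

    /* What a CUDA-aware copy path does: works fine. */
    cudaMemcpy(host_buf, d_buf, sizeof(host_buf), cudaMemcpyDeviceToHost);

    /* What a non-CUDA-aware path effectively does: segfaults with
     * "invalid permissions for mapped object" on typical systems. */
    memcpy(host_buf, d_buf, sizeof(host_buf));

    cudaFree(d_buf);
    return 0;
}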

@BenWibking
Author

BenWibking commented Feb 27, 2023

No, I have not made an attempt to reproduce yet, and I probably won't have the time to do so for a few weeks. Were you able to reproduce it using the pml/cm & mtl/ofi path?

No, I haven't tried building with libfabric. I can try this, though. Aside from building with OFI, what env vars or flags do I need to set for both of these?

@wckzhang I've re-built with libfabric and set OMPI_MCA_pml=cm and OMPI_MCA_mtl=ofi. In this case, the job is killed immediately without any output:

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 1507916.0 ON gpua079 CANCELLED AT 2023-02-27T11:55:50 ***
srun: error: gpua079: tasks 0-2: Killed
srun: error: gpua079: task 3: Exited with exit code 1

With OMPI_MCA_pml=ob1, the segfault and backtrace are the same as before. With OMPI_MCA_pml=ucx, everything works fine.

(I've also set ulimit -c unlimited, but I don't get a core dump in either run. I'll follow up with the sysadmins to see what is going on.)

@BenWibking
Author

I've rebuilt libfabric 1.17.0 from source with --with-cuda, and I now get MPI_ERR_OTHER when using GPU device buffers in combination with pml/cm and mtl/ofi (it works fine with host buffers):

amrex::Error::2::AMReX MPI Error: File ../../Src/Base/AMReX_ParallelDescriptor.cpp, line 1694, MPI_Irecv(buf, n, Mpi_typemap<char>::type(), pid, tag, comm, &req): MPI_ERR_OTHER: known error not in list !!!

@BenWibking
Author

@wckzhang I've uploaded the core dumps for both ob1 and cm+ofi crashes here: https://cloudstor.aarnet.edu.au/plus/s/Qsc6m9liCJXpi9P

@janjust
Contributor

janjust commented Mar 30, 2023

@wckzhang is there any update on this issue? It sounds like a blocker, considering @devreal also encountered it.

@devreal
Contributor

devreal commented Mar 30, 2023

I will post a patch soon that modifies the BTLs.

@janjust
Contributor

janjust commented Apr 6, 2023

Fixed with #11564.
