
prov/shm: One Sided Segfault on ARM Platforms #8894

Closed · a-szegel opened this issue May 4, 2023 · 3 comments

a-szegel (Contributor) commented May 4, 2023

Describe the bug
The AWS team runs nightly OMPI MTT tests, and while running the IBM benchmarks we discovered a likely one-sided SHM bug. We have only seen this bug on ARM platforms (that doesn't mean it couldn't happen on x86, just that it hasn't been observed there yet). A wild guess of mine is that it has to do with SHM not using memory barriers. I want to try this test with SM2 (once SM2 supports one-sided operations) and see if the problem goes away.
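For context on the memory-barrier guess: aarch64's memory model is weaker than x86-TSO, so two plain stores can become visible to another core out of order unless release/acquire semantics order them. A minimal illustrative sketch of that bug class (a hypothetical C11 message-passing example, not shm's actual code):

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static int data;          /* payload written by the producer */
static atomic_int ready;  /* flag the consumer polls */

static void *producer(void *arg)
{
    (void)arg;
    data = 42;
    /* Release ordering publishes the write to data before the flag.
     * Downgraded to memory_order_relaxed (i.e., no barrier), x86-TSO
     * still keeps the two stores ordered, but aarch64 may not. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    /* Acquire pairs with the release above. */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;
    assert(data == 42); /* may fail on aarch64 if both sides are relaxed */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}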

To Reproduce
Install OMPI v4.0.x
Install libfabric main
Install IBM benchmarks

Run Test: ibm/onesided/c_reqops

This segfault happens every time on ARM platforms.

Expected behavior
I expect the test to pass with flying colors (no segfault).

Output

(env) ┌─[ec2-user@ip-172-31-24-134]─[98:01h19]─[~]
└─╼ mpirun -np 256 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/onesided/c_reqops
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-2,172.31.22.34' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-3,172.31.20.84' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-1,172.31.25.36' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-4,172.31.18.126' (ECDSA) to the list of known hosts.
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] *** Process received signal ***
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] Signal: Bus error (7)
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] Signal code: Invalid address alignment (1)
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] Failing at address: 0xac95003
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 0] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x16dee0)[0xffffaa75aee0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 1] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb1449860]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 2] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x1b6c8c)[0xffffaa7a3c8c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 3] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x4fce0)[0xffffaa63cce0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 4] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x162b40)[0xffffaa74fb40]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 5] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x162f88)[0xffffaa74ff88]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 6] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x164384)[0xffffaa751384]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 7] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x1647bc)[0xffffaa7517bc]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 8] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x1650b0)[0xffffaa7520b0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 9] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x68a74)[0xffffaa655a74]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [10] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x68d68)[0xffffaa655d68]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [11] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0xf2768)[0xffffaa6df768]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [12] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0xf2d98)[0xffffaa6dfd98]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [13] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x66770)[0xffffaa653770]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [14] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x68110)[0xffffaa655110]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [15] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_btl_ofi.so(+0xa45c)[0xffffaa88645c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [16] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_context_progress+0x5c)[0xffffaa887fd4]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [17] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_btl_ofi.so(+0x55dc)[0xffffaa8815dc]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [18] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/libopen-pal.so.40(opal_progress+0x34)[0xffffb0e054b8]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [19] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0xab3c)[0xffffa8aa4b3c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [20] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0xb778)[0xffffa8aa5778]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [21] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0xb954)[0xffffa8aa5954]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [22] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0x108a0)[0xffffa8aaa8a0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [23] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_rget_accumulate+0x104)[0xffffa8aaac88]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [24] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/libmpi.so.40(PMPI_Rget_accumulate+0x550)[0xffffb12d352c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [25] /home/ec2-user/ompi-tests/ibm/onesided/c_reqops[0x402220]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [26] /lib64/libc.so.6(__libc_start_main+0xe4)[0xffffb1044da4]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [27] /home/ec2-user/ompi-tests/ibm/onesided/c_reqops[0x401648]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 20196 on node queue-c6gn16xlarge-st-c6gn16xlarge-4 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
(env) ┌─[ec2-user@ip-172-31-24-134]─[99:01h20]─[~]─[✗]
└─╼ gdb /home/ec2-user/ompi-tests/ibm/onesided/c_reqops core.20196
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-36.amzn2.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ec2-user/ompi-tests/ibm/onesided/c_reqops...done.
[New LWP 20196]
[New LWP 20220]
[New LWP 20210]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/ec2-user/ompi-tests/ibm/onesided/c_reqops'.
Program terminated with signal SIGBUS, Bus error.
#0  0x0000ffffaa7a3c8c in __aarch64_swp4_acq_rel () from /home/ec2-user/libfabric/install/lib/libfabric.so.1
[Current thread is 1 (Thread 0xffffb143f650 (LWP 20196))]
Missing separate debuginfos, use: debuginfo-install glibc-2.26-62.amzn2.aarch64 hwloc-libs-1.11.8-4.amzn2.aarch64 libatomic-7.3.1-15.amzn2.aarch64 libevent-2.0.21-4.amzn2.0.3.aarch64 libgcc-7.3.1-15.amzn2.aarch64 libibverbs-core-43.0-1.amzn2.0.2.aarch64 libnl3-3.2.28-4.amzn2.0.1.aarch64 librdmacm-43.0-1.amzn2.0.2.aarch64 libtool-ltdl-2.4.2-22.2.amzn2.0.2.aarch64 numactl-libs-2.0.9-7.amzn2.aarch64 zlib-1.2.7-19.amzn2.0.2.aarch64
(gdb) bt
#0  0x0000ffffaa7a3c8c in __aarch64_swp4_acq_rel () from /home/ec2-user/libfabric/install/lib/libfabric.so.1
#1  0x0000ffffaa63cce0 in ofi_readwrite_OFI_OP_READWRITE_uint32_t (dst=0xac95003, src=0xffffa931a980, res=0xfffff734edd8, cnt=1) at prov/util/src/util_atomic.c:939
#2  0x0000ffffaa74fb40 in smr_do_atomic (src=0xffffa931a980, dst=0xac95003, cmp=0x0, datatype=FI_UINT32, op=FI_ATOMIC_WRITE, cnt=1, flags=6) at prov/shm/src/smr_progress.c:648
#3  0x0000ffffaa74ff88 in smr_progress_inject_atomic (cmd=0xffffa92c9ee0, ioc=0xfffff734ff08, ioc_count=1, len=0xfffff734fee0, ep=0x9bb41b0, err=0) at prov/shm/src/smr_progress.c:711
#4  0x0000ffffaa751384 in smr_progress_cmd_atomic (ep=0x9bb41b0, cmd=0xffffa92c9ee0, rma_cmd=0xffffa92c9fe0) at prov/shm/src/smr_progress.c:1043
#5  0x0000ffffaa7517bc in smr_progress_cmd (ep=0x9bb41b0) at prov/shm/src/smr_progress.c:1126
#6  0x0000ffffaa7520b0 in smr_ep_progress (util_ep=0x9bb41b0) at prov/shm/src/smr_progress.c:1304
#7  0x0000ffffaa655a74 in ofi_cq_progress (cq=0x9b73ce0) at prov/util/src/util_cq.c:498
#8  0x0000ffffaa655d68 in ofi_peer_cq_read (cq_fid=0x9b73ce0, buf=0x0, count=0) at prov/util/src/util_cq.c:589
#9  0x0000ffffaa6df768 in fi_cq_read (cq=0x9b73ce0, buf=0x0, count=0) at ./include/rdma/fi_eq.h:394
#10 0x0000ffffaa6dfd98 in efa_rdm_cq_readfrom (cq_fid=0x9bab5c0, buf=0xfffff73501d0, count=64, src_addr=0x0) at prov/efa/src/rdm/efa_rdm_cq.c:92
#11 0x0000ffffaa653770 in fi_cq_readfrom (cq=0x9bab5c0, buf=0xfffff73501d0, count=64, src_addr=0x0) at ./include/rdma/fi_eq.h:400
#12 0x0000ffffaa655110 in ofi_cq_read (cq_fid=0x9bab5c0, buf=0xfffff73501d0, count=64) at prov/util/src/util_cq.c:264
#13 0x0000ffffaa88645c in fi_cq_read (cq=0x9bab5c0, buf=0xfffff73501d0, count=64) at /home/ec2-user/libfabric/install/include/rdma/fi_eq.h:394
#14 0x0000ffffaa887fd4 in mca_btl_ofi_context_progress (context=0x9bb4a00) at btl_ofi_context.c:385
#15 0x0000ffffaa8815dc in mca_btl_ofi_component_progress () at btl_ofi_component.c:736
#16 0x0000ffffb0e054b8 in opal_progress () at runtime/opal_progress.c:231
#17 0x0000ffffa8aa4b3c in ompi_osc_rdma_progress (module=0xcd39fd0) at osc_rdma.h:395
#18 0x0000ffffa8aa5778 in ompi_osc_rdma_btl_cswap (result=0xfffff73506d8, flags=0, value=-9223372036854775808, compare=0, address_handle=0xffffb11eb218, address=281473653322304, 
    endpoint=0xac9fdd0, module=0xcd39fd0) at osc_rdma_lock.h:227
#19 ompi_osc_rdma_lock_btl_cswap (result=0xfffff73506d8, value=-9223372036854775808, compare=0, address=281473653322304, peer=0x9a0de60, module=0xcd39fd0) at osc_rdma_lock.h:240
#20 ompi_osc_rdma_lock_try_acquire_exclusive (module=0xcd39fd0, peer=0x9a0de60, offset=16) at osc_rdma_lock.h:364
#21 0x0000ffffa8aa5954 in ompi_osc_rdma_lock_acquire_exclusive (module=0xcd39fd0, peer=0x9a0de60, offset=16) at osc_rdma_lock.h:400
#22 0x0000ffffa8aaa8a0 in ompi_osc_rdma_rget_accumulate_internal (win=0x9af8c60, origin_addr=0xfffff7350ae8, origin_count=1, origin_datatype=0x420b90 <ompi_mpi_int>, 
    result_addr=0xfffff7350ae4, result_count=1, result_datatype=0x420b90 <ompi_mpi_int>, target_rank=0, target_disp=0, target_count=1, target_datatype=0x420b90 <ompi_mpi_int>, 
    op=0x420390 <ompi_mpi_op_replace>, request_out=0xfffff7350a78) at osc_rdma_accumulate.c:918
#23 0x0000ffffa8aaac88 in ompi_osc_rdma_rget_accumulate (origin_addr=0xfffff7350ae8, origin_count=1, origin_datatype=0x420b90 <ompi_mpi_int>, result_addr=0xfffff7350ae4, result_count=1, 
    result_datatype=0x420b90 <ompi_mpi_int>, target_rank=0, target_disp=0, target_count=1, target_datatype=0x420b90 <ompi_mpi_int>, op=0x420390 <ompi_mpi_op_replace>, win=0x9af8c60, 
    request=0xfffff7350a78) at osc_rdma_accumulate.c:980
#24 0x0000ffffb12d352c in PMPI_Rget_accumulate (origin_addr=0xfffff7350ae8, origin_count=1, origin_datatype=0x420b90 <ompi_mpi_int>, result_addr=0xfffff7350ae4, result_count=1, 
    result_datatype=0x420b90 <ompi_mpi_int>, target_rank=0, target_disp=0, target_count=1, target_datatype=0x420b90 <ompi_mpi_int>, op=0x420390 <ompi_mpi_op_replace>, win=0x9af8c60, 
    request=0xfffff7350a78) at prget_accumulate.c:139
#25 0x0000000000402220 in main (argc=1, argv=0xfffff7350cb8) at c_reqops.c:199
(gdb) 
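Note the faulting frame: __aarch64_swp4_acq_rel is GCC's outline-atomics helper for a 4-byte atomic swap with acquire-release semantics, and dst=0xac95003 is not 4-byte aligned. aarch64 raises SIGBUS ("Invalid address alignment") for unaligned atomics even where a plain unaligned store would succeed. A standalone sketch of the same failure mode (hypothetical, not from the issue):

#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    /* Misalign a 4-byte slot by 3, mirroring dst=0xac95003, which
     * ends in ...3 and therefore is not 4-byte aligned. */
    unsigned char *buf = malloc(16);
    uint32_t *dst = (uint32_t *)(buf + 3);

    /* With GCC's default outline atomics on aarch64 this call is
     * routed through __aarch64_swp4_acq_rel (LSE SWPAL, or an
     * LDAXR/STLXR loop on older cores); the unaligned address makes
     * it die with SIGBUS, just like the backtrace above. */
    uint32_t old = __atomic_exchange_n(dst, 42u, __ATOMIC_ACQ_REL);

    (void)old;
    free(buf);
    return 0;
}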

Environment:
OS = AL2
Instance: AWS c6gn.16xlarge

a-szegel added the bug label May 4, 2023
a-szegel changed the title from "prov/xxx: short, one-line description" to "prov/shm: One Sided Segfault on ARM Platforms" May 4, 2023
a-szegel (Contributor, Author) commented May 4, 2023

I was able to reproduce this on one instance with two ranks: mpirun -np 2 -N 2

aingerson (Contributor) commented

The destination address for the actual atomic operation looks wrong:
#2 0x0000ffffaa74fb40 in smr_do_atomic (src=0xffffa931a980, dst=0xac95003, cmp=0x0, datatype=FI_UINT32, op=FI_ATOMIC_WRITE, cnt=1, flags=6) at prov/shm/src/smr_progress.c:648

@a-szegel Can you see what the original fi_atomic parameters were? Specifically the addr and/or the key, and what memory address that key points to / was registered with. I want to see if this is coming from the application or if shm is doing something weird with the address translation.
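For reference, the initiator-side call being asked about has this signature in libfabric's rdma/fi_atomic.h; the addr/key pair is what shm ultimately turns into the dst pointer in frame #2, so capturing those at the call site is the comparison being requested (my framing of where to look, not a confirmed diagnosis):

/* Signature from libfabric's rdma/fi_atomic.h. The caller (here,
 * OMPI's btl_ofi) supplies the target address/offset in `addr` and
 * the memory-registration key in `key`; comparing that pair against
 * what the target region was actually registered with shows whether
 * the bad dst originates in the application or in shm's address
 * translation. */
static inline ssize_t
fi_atomic(struct fid_ep *ep, const void *buf, size_t count, void *desc,
          fi_addr_t dest_addr, uint64_t addr, uint64_t key,
          enum fi_datatype datatype, enum fi_op op, void *context);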

a-szegel (Contributor, Author) commented May 4, 2023

I was able to take libfabric out of the picture, and we still have this issue with BTL SM. Closing this issue in favor of open-mpi/ompi#11646.

a-szegel closed this as completed May 4, 2023