Describe the bug
The AWS team runs nightly OMPI MTT tests, and we discovered a likely one-sided SHM bug while running the IBM benchmarks. So far we have only seen this bug on ARM platforms (that doesn't mean it can't happen on x86, it just hasn't been observed there yet). My wild guess is that it has to do with SHM not using memory barriers; a sketch of what that would mean is below. I want to try this test with SM2 and see whether the problem goes away (once SM2 supports one-sided operations).
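For illustration only (this is my speculation, not libfabric code, and all names in it are made up), here is a minimal C11 sketch of the producer/consumer ordering an SHM command queue would need on a weakly ordered architecture like aarch64:

```c
/* Illustrative only -- a guess at what "SHM not using memory barriers" would
 * mean in practice; this is not libfabric code. On a weakly ordered arch
 * like aarch64, publishing a command with plain/relaxed stores can let the
 * consumer see the flag before the payload it guards. */
#include <stdatomic.h>
#include <stdint.h>

struct shm_cmd {
    uint32_t payload;
    atomic_uint ready;
};

/* Producer side: the payload must become visible before the flag. */
static void post_cmd(struct shm_cmd *cmd, uint32_t value)
{
    cmd->payload = value;
    /* Without release ordering, aarch64 may reorder these two stores. */
    atomic_store_explicit(&cmd->ready, 1, memory_order_release);
}

/* Consumer side: acquire pairs with the producer's release. */
static int poll_cmd(struct shm_cmd *cmd, uint32_t *out)
{
    if (!atomic_load_explicit(&cmd->ready, memory_order_acquire))
        return 0;
    *out = cmd->payload;
    return 1;
}
```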
To Reproduce
Install OMPI v4.0.x
Install libfabric main
Install IBM benchmarks
Run Test: ibm/onesided/c_reqops
This crash (a SIGBUS, see the output below) happens every time on ARM platforms; a rough sketch of the call pattern the test exercises follows.
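For context, here is a hedged approximation (not the actual c_reqops.c source) of the failing call pattern suggested by frames #22-#25 of the backtrace below: a request-based MPI_Rget_accumulate of a single MPI_INT with MPI_REPLACE against target rank 0.

```c
/* Hedged sketch, NOT the real c_reqops.c: just the call pattern implied by
 * the backtrace (MPI_INT payload, MPI_REPLACE, target rank 0, request-based
 * get-accumulate inside a passive-target epoch). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, origin = 42, result = 0, target_buf = 0;
    MPI_Win win;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank exposes one int; rank 0 is the target of the accumulate. */
    MPI_Win_create(&target_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
    /* Request-based get-accumulate: replace the target int, fetch old value. */
    MPI_Rget_accumulate(&origin, 1, MPI_INT, &result, 1, MPI_INT,
                        0 /* target rank */, 0 /* disp */, 1, MPI_INT,
                        MPI_REPLACE, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Win_unlock(0, win);

    if (rank == 0)
        printf("old value fetched: %d\n", result);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```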
Expected behavior
I expect the test to pass with flying colors (no crash).
Output
(env) ┌─[ec2-user@ip-172-31-24-134]─[98:01h19]─[~]
└─╼ mpirun -np 256 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/onesided/c_reqops
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-2,172.31.22.34' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-3,172.31.20.84' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-1,172.31.25.36' (ECDSA) to the list of known hosts.
Warning: Permanently added 'queue-c6gn16xlarge-st-c6gn16xlarge-4,172.31.18.126' (ECDSA) to the list of known hosts.
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] *** Process received signal ***
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] Signal: Bus error (7)
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] Signal code: Invalid address alignment (1)
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] Failing at address: 0xac95003
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 0] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x16dee0)[0xffffaa75aee0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 1] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb1449860]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 2] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x1b6c8c)[0xffffaa7a3c8c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 3] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x4fce0)[0xffffaa63cce0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 4] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x162b40)[0xffffaa74fb40]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 5] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x162f88)[0xffffaa74ff88]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 6] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x164384)[0xffffaa751384]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 7] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x1647bc)[0xffffaa7517bc]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 8] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x1650b0)[0xffffaa7520b0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [ 9] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x68a74)[0xffffaa655a74]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [10] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x68d68)[0xffffaa655d68]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [11] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0xf2768)[0xffffaa6df768]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [12] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0xf2d98)[0xffffaa6dfd98]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [13] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x66770)[0xffffaa653770]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [14] /home/ec2-user/libfabric/install/lib/libfabric.so.1(+0x68110)[0xffffaa655110]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [15] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_btl_ofi.so(+0xa45c)[0xffffaa88645c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [16] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_btl_ofi.so(mca_btl_ofi_context_progress+0x5c)[0xffffaa887fd4]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [17] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_btl_ofi.so(+0x55dc)[0xffffaa8815dc]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [18] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/libopen-pal.so.40(opal_progress+0x34)[0xffffb0e054b8]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [19] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0xab3c)[0xffffa8aa4b3c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [20] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0xb778)[0xffffa8aa5778]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [21] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0xb954)[0xffffa8aa5954]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [22] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(+0x108a0)[0xffffa8aaa8a0]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [23] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_rget_accumulate+0x104)[0xffffa8aaac88]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [24] /home/ec2-user/PortaFiducia/build/libraries/openmpi/v4.1.x-debug/install/lib/libmpi.so.40(PMPI_Rget_accumulate+0x550)[0xffffb12d352c]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [25] /home/ec2-user/ompi-tests/ibm/onesided/c_reqops[0x402220]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [26] /lib64/libc.so.6(__libc_start_main+0xe4)[0xffffb1044da4]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] [27] /home/ec2-user/ompi-tests/ibm/onesided/c_reqops[0x401648]
[queue-c6gn16xlarge-st-c6gn16xlarge-4:20196] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 20196 on node queue-c6gn16xlarge-st-c6gn16xlarge-4 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
(env) ┌─[ec2-user@ip-172-31-24-134]─[99:01h20]─[~]─[✗]
└─╼ gdb /home/ec2-user/ompi-tests/ibm/onesided/c_reqops core.20196
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-36.amzn2.0.1
Copyright (C) 2017 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/ec2-user/ompi-tests/ibm/onesided/c_reqops...done.
[New LWP 20196]
[New LWP 20220]
[New LWP 20210]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/ec2-user/ompi-tests/ibm/onesided/c_reqops'.
Program terminated with signal SIGBUS, Bus error.
#0 0x0000ffffaa7a3c8c in __aarch64_swp4_acq_rel () from /home/ec2-user/libfabric/install/lib/libfabric.so.1
[Current thread is 1 (Thread 0xffffb143f650 (LWP 20196))]
Missing separate debuginfos, use: debuginfo-install glibc-2.26-62.amzn2.aarch64 hwloc-libs-1.11.8-4.amzn2.aarch64 libatomic-7.3.1-15.amzn2.aarch64 libevent-2.0.21-4.amzn2.0.3.aarch64 libgcc-7.3.1-15.amzn2.aarch64 libibverbs-core-43.0-1.amzn2.0.2.aarch64 libnl3-3.2.28-4.amzn2.0.1.aarch64 librdmacm-43.0-1.amzn2.0.2.aarch64 libtool-ltdl-2.4.2-22.2.amzn2.0.2.aarch64 numactl-libs-2.0.9-7.amzn2.aarch64 zlib-1.2.7-19.amzn2.0.2.aarch64
(gdb) bt
#0 0x0000ffffaa7a3c8c in __aarch64_swp4_acq_rel () from /home/ec2-user/libfabric/install/lib/libfabric.so.1
#1 0x0000ffffaa63cce0 in ofi_readwrite_OFI_OP_READWRITE_uint32_t (dst=0xac95003, src=0xffffa931a980, res=0xfffff734edd8, cnt=1) at prov/util/src/util_atomic.c:939
#2 0x0000ffffaa74fb40 in smr_do_atomic (src=0xffffa931a980, dst=0xac95003, cmp=0x0, datatype=FI_UINT32, op=FI_ATOMIC_WRITE, cnt=1, flags=6) at prov/shm/src/smr_progress.c:648
#3 0x0000ffffaa74ff88 in smr_progress_inject_atomic (cmd=0xffffa92c9ee0, ioc=0xfffff734ff08, ioc_count=1, len=0xfffff734fee0, ep=0x9bb41b0, err=0) at prov/shm/src/smr_progress.c:711
#4 0x0000ffffaa751384 in smr_progress_cmd_atomic (ep=0x9bb41b0, cmd=0xffffa92c9ee0, rma_cmd=0xffffa92c9fe0) at prov/shm/src/smr_progress.c:1043
#5 0x0000ffffaa7517bc in smr_progress_cmd (ep=0x9bb41b0) at prov/shm/src/smr_progress.c:1126
#6 0x0000ffffaa7520b0 in smr_ep_progress (util_ep=0x9bb41b0) at prov/shm/src/smr_progress.c:1304
#7 0x0000ffffaa655a74 in ofi_cq_progress (cq=0x9b73ce0) at prov/util/src/util_cq.c:498
#8 0x0000ffffaa655d68 in ofi_peer_cq_read (cq_fid=0x9b73ce0, buf=0x0, count=0) at prov/util/src/util_cq.c:589
#9 0x0000ffffaa6df768 in fi_cq_read (cq=0x9b73ce0, buf=0x0, count=0) at ./include/rdma/fi_eq.h:394
#10 0x0000ffffaa6dfd98 in efa_rdm_cq_readfrom (cq_fid=0x9bab5c0, buf=0xfffff73501d0, count=64, src_addr=0x0) at prov/efa/src/rdm/efa_rdm_cq.c:92
#11 0x0000ffffaa653770 in fi_cq_readfrom (cq=0x9bab5c0, buf=0xfffff73501d0, count=64, src_addr=0x0) at ./include/rdma/fi_eq.h:400
#12 0x0000ffffaa655110 in ofi_cq_read (cq_fid=0x9bab5c0, buf=0xfffff73501d0, count=64) at prov/util/src/util_cq.c:264
#13 0x0000ffffaa88645c in fi_cq_read (cq=0x9bab5c0, buf=0xfffff73501d0, count=64) at /home/ec2-user/libfabric/install/include/rdma/fi_eq.h:394
#14 0x0000ffffaa887fd4 in mca_btl_ofi_context_progress (context=0x9bb4a00) at btl_ofi_context.c:385
#15 0x0000ffffaa8815dc in mca_btl_ofi_component_progress () at btl_ofi_component.c:736
#16 0x0000ffffb0e054b8 in opal_progress () at runtime/opal_progress.c:231
#17 0x0000ffffa8aa4b3c in ompi_osc_rdma_progress (module=0xcd39fd0) at osc_rdma.h:395
#18 0x0000ffffa8aa5778 in ompi_osc_rdma_btl_cswap (result=0xfffff73506d8, flags=0, value=-9223372036854775808, compare=0, address_handle=0xffffb11eb218, address=281473653322304,
endpoint=0xac9fdd0, module=0xcd39fd0) at osc_rdma_lock.h:227
#19 ompi_osc_rdma_lock_btl_cswap (result=0xfffff73506d8, value=-9223372036854775808, compare=0, address=281473653322304, peer=0x9a0de60, module=0xcd39fd0) at osc_rdma_lock.h:240
#20 ompi_osc_rdma_lock_try_acquire_exclusive (module=0xcd39fd0, peer=0x9a0de60, offset=16) at osc_rdma_lock.h:364
#21 0x0000ffffa8aa5954 in ompi_osc_rdma_lock_acquire_exclusive (module=0xcd39fd0, peer=0x9a0de60, offset=16) at osc_rdma_lock.h:400
#22 0x0000ffffa8aaa8a0 in ompi_osc_rdma_rget_accumulate_internal (win=0x9af8c60, origin_addr=0xfffff7350ae8, origin_count=1, origin_datatype=0x420b90 <ompi_mpi_int>,
result_addr=0xfffff7350ae4, result_count=1, result_datatype=0x420b90 <ompi_mpi_int>, target_rank=0, target_disp=0, target_count=1, target_datatype=0x420b90 <ompi_mpi_int>,
op=0x420390 <ompi_mpi_op_replace>, request_out=0xfffff7350a78) at osc_rdma_accumulate.c:918
#23 0x0000ffffa8aaac88 in ompi_osc_rdma_rget_accumulate (origin_addr=0xfffff7350ae8, origin_count=1, origin_datatype=0x420b90 <ompi_mpi_int>, result_addr=0xfffff7350ae4, result_count=1,
result_datatype=0x420b90 <ompi_mpi_int>, target_rank=0, target_disp=0, target_count=1, target_datatype=0x420b90 <ompi_mpi_int>, op=0x420390 <ompi_mpi_op_replace>, win=0x9af8c60,
request=0xfffff7350a78) at osc_rdma_accumulate.c:980
#24 0x0000ffffb12d352c in PMPI_Rget_accumulate (origin_addr=0xfffff7350ae8, origin_count=1, origin_datatype=0x420b90 <ompi_mpi_int>, result_addr=0xfffff7350ae4, result_count=1,
result_datatype=0x420b90 <ompi_mpi_int>, target_rank=0, target_disp=0, target_count=1, target_datatype=0x420b90 <ompi_mpi_int>, op=0x420390 <ompi_mpi_op_replace>, win=0x9af8c60,
request=0xfffff7350a78) at prget_accumulate.c:139
#25 0x0000000000402220 in main (argc=1, argv=0xfffff7350cb8) at c_reqops.c:199
(gdb)
Environment:
OS = AL2
Instance: AWS c6gn.16xlarge
The destination address for the actual atomic operation looks wrong:
#2 0x0000ffffaa74fb40 in smr_do_atomic (src=0xffffa931a980, dst=0xac95003, cmp=0x0, datatype=FI_UINT32, op=FI_ATOMIC_WRITE, cnt=1, flags=6) at prov/shm/src/smr_progress.c:648
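For reference, dst=0xac95003 is not 4-byte aligned, and the __aarch64_swp4_acq_rel outline atomic in frame #0 faults on unaligned addresses. Here is a standalone sketch (unrelated to the shm code path, only demonstrating the fault mode on aarch64 with GCC's LSE outline atomics):

```c
/* Hedged reproduction of the fault mode, not of the original bug path:
 * a 32-bit atomic exchange on a misaligned address raises SIGBUS with
 * "Invalid address alignment" on aarch64, matching the signal seen above. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Carve a deliberately misaligned uint32_t out of a byte buffer. */
    unsigned char *buf = malloc(8);
    uint32_t *dst = (uint32_t *)(buf + 3);   /* odd offset, like 0x...003 */
    uint32_t old;

    /* With -moutline-atomics (the GCC default on newer toolchains) this
     * lowers to __aarch64_swp4_acq_rel; the unaligned address faults. */
    old = __atomic_exchange_n(dst, 0xdeadbeefu, __ATOMIC_ACQ_REL);

    printf("previous value: %u\n", old);
    free(buf);
    return 0;
}
```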
@a-szegel Can you see what the original fi_atomic parameters were?
Specifically the addr and/or the key, and what memory address that key points to / was registered with. I want to see whether this is coming from the application or whether shm is doing something weird with the address translation; a sketch of how those parameters could be captured is below.
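One way to capture that at the initiator, assuming the offending operation goes out through fi_inject_atomic (the path that lands in smr_progress_inject_atomic on the target). dbg_inject_atomic is a hypothetical helper for illustration, not an existing OMPI or libfabric symbol:

```c
/* Hedged sketch: log the initiator-side addr/key before shm translates them,
 * so they can be compared against the bad dst seen in smr_do_atomic(). */
#include <stdio.h>
#include <inttypes.h>
#include <rdma/fabric.h>
#include <rdma/fi_atomic.h>

static ssize_t dbg_inject_atomic(struct fid_ep *ep, const void *buf,
                                 size_t count, fi_addr_t dest_addr,
                                 uint64_t addr, uint64_t key,
                                 enum fi_datatype datatype, enum fi_op op)
{
    fprintf(stderr, "fi_inject_atomic: addr=0x%" PRIx64 " key=0x%" PRIx64
            " dt=%d op=%d count=%zu\n",
            addr, key, (int)datatype, (int)op, count);
    return fi_inject_atomic(ep, buf, count, dest_addr, addr, key,
                            datatype, op);
}
```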