DAOS: verbs;rxm - latest ofi main causes mem corruption when running at scale #6973
Comments
DAOS validation retried the Frontera scenarios above with three versions of OFI (v1.12.0, 1.13.0, and 7d6d2a1) using verbs;ofi_rxm. Results are:
Besides the original crash report above, testing with OFI 7d6d2a1 produced another crash signature:
c155-073.frontera.tacc.utexas.edu ERROR 2021/08/06 16:53:51 daos_engine:0 libfabric:62384:verbs:eq:vrb_set_rnr_timer():474 Unable to modify QP attribute
We've bisected it to:
• 885e643 - GOOD
See PR #7091. Both verbs and tcp have the potential for use-after-free or double free under connection failures. The underlying problems are in the verbs/tcp providers and have been there for some time. #7091 is an easy change that should avoid the problems until the underlying providers are fixed. One issue remains open even with #7091: why the QP creation failed in the first place, which is what triggered the reject path.
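As a rough illustration of the class of bug described above (this is not the actual change in #7091 and not libfabric code), the sketch below shows one common defensive pattern: every teardown path goes through a single release helper that clears the pointer after freeing it, so a reject path followed by a normal close cannot free the same buffer twice. All names (`demo_conn`, `demo_conn_open`, `demo_conn_release`, `demo_handle_reject`, `demo_conn_close`) are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical connection object; NOT a libfabric type. */
struct demo_conn {
    char *msg_buf;   /* resource owned by the connection */
    int   releases;  /* instrumentation: how many times msg_buf was freed */
};

/* Allocate the connection's buffer. Returns 0 on success, -1 on failure. */
static int demo_conn_open(struct demo_conn *conn, size_t size)
{
    conn->msg_buf = malloc(size);
    conn->releases = 0;
    return conn->msg_buf ? 0 : -1;
}

/* Single release point shared by every teardown path. */
static void demo_conn_release(struct demo_conn *conn)
{
    if (!conn->msg_buf)
        return;            /* already released: repeat calls are no-ops */
    free(conn->msg_buf);
    conn->msg_buf = NULL;  /* clear the pointer so no stale copy remains */
    conn->releases++;
}

/* Error path: connection setup failed, so reject and clean up. */
static void demo_handle_reject(struct demo_conn *conn)
{
    demo_conn_release(conn);
}

/* Normal teardown, which may also run after a reject. */
static void demo_conn_close(struct demo_conn *conn)
{
    demo_conn_release(conn);
}
```

Calling `demo_handle_reject()` and then `demo_conn_close()` on the same connection frees the buffer once. Without the `NULL` reset in the release helper, the second call would pass a dangling pointer to `free()`, and glibc may only report the damage later, for example inside an unrelated `realloc()`.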
Changes in #7091 were merged. Initial testing shows memory corruption issues have been resolved.
When testing the latest OFI main on the Frontera TACC system at scale, one of the scenarios that previously passed a test stage with v1.12.0 using verbs;ofi_rxm now crashes one of the DAOS servers with the following backtrace:
OFI: 7d6d2a1
System: Frontera
Servers: 16 * 1 engine
Clients: 40
Procs per client: 56
Program ran: "ior easy" with stonewalling set to 20 seconds.
Crash:
c171-141: ERROR: daos_engine:0 ** Error in `/work2/08126/dbohninx/frontera/BUILDS/daos-8250/latest/daos/install/bin/daos_engine': double free or corruption (!prev): 0x00002b26a403e1f0 **
c171-141: ERROR: daos_engine:0 ======= Backtrace: =========
c171-141: /lib64/libc.so.6(+0x7f3e4)[0x2b26852463e4]
c171-141: /lib64/libc.so.6(+0x846e0)[0x2b268524b6e0]
c171-141: ERROR: daos_engine:0 /lib64/libc.so.6(realloc+0x1d2)[0x2b268524cd82]
Backtrace from core file:
#0 0x00002b26851fd387 in raise () from /lib64/libc.so.6
#1 0x00002b26851fea78 in abort () from /lib64/libc.so.6
#2 0x00002b268523fed7 in __libc_message () from /lib64/libc.so.6
#3 0x00002b26852463e4 in malloc_printerr () from /lib64/libc.so.6
#4 0x00002b268524b6e0 in _int_realloc () from /lib64/libc.so.6
#5 0x00002b268524cd82 in realloc () from /lib64/libc.so.6
#6 0x00002b268ac4fec1 in ofi_bufpool_grow () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#7 0x00002b268ac8d5c9 in rxm_alloc_conn () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#8 0x00002b268ac8d8ef in rxm_add_conn () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#9 0x00002b268ac8e084 in rxm_handle_event.isra.7 () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#10 0x00002b268ac8e428 in rxm_conn_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#11 0x00002b268ac98bee in rxm_ep_do_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#12 0x00002b268ac98c81 in rxm_ep_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#13 0x00002b268ac4b52d in ofi_cq_progress () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#14 0x00002b268ac4aa6b in ofi_cq_readfrom () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../lib64/../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#15 0x00002b2686708e3e in fi_cq_readfrom (src_addr=0x3e08150, count=16, buf=0x3e081d0, cq=0x2b28f0034ab0) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/prereq/release/ofi/include/rdma/fi_eq.h:400
#16 na_ofi_cq_read (max_count=16, actual_count=, src_err_addrlen=, src_err_addr=, src_addrs=0x3e08150, cq_events=0x3e081d0, context=0x2b28f00896d0)
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/na/na_ofi.c:3288
#17 na_ofi_progress (na_class=0x2b28f002c070, context=0x2b28f00896d0, timeout=0) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/na/na_ofi.c:5451
#18 0x00002b26867040c1 in NA_Progress (na_class=na_class@entry=0x2b28f002c070, context=context@entry=0x2b28f00896d0, timeout=timeout@entry=0)
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/na/na.c:1267
#19 0x00002b26862dea70 in hg_core_progress_na (na_class=0x2b28f002c070, na_context=0x2b28f00896d0, timeout=0, progressed_ptr=progressed_ptr@entry=0x3e08670 "")
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:3896
#20 0x00002b26862e09e4 in hg_core_poll (progressed_ptr=, timeout=, context=0x2b28f00874e0)
at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:3838
#21 hg_core_progress (context=0x2b28f00874e0, timeout=0) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:3693
#22 0x00002b26862e5f1b in HG_Core_progress (context=, timeout=) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury_core.c:5056
#23 0x00002b26862d8063 in HG_Progress (context=context@entry=0x2b28f002c0a0, timeout=) at /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/build/external/release/mercury/src/mercury.c:2022
#24 0x00002b26838fa151 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x2b28f0026d38, timeout=timeout@entry=0) at src/cart/crt_hg.c:1277
#25 0x00002b26838bd005 in crt_progress (crt_ctx=0x2b28f0026d20, timeout=0) at src/cart/crt_context.c:1454
#26 0x000000000043c7dd in dss_srv_handler (arg=0x3d173f0) at src/engine/srv.c:474
#27 0x00002b26846e3dba in ABTD_ythread_func_wrapper () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#28 0x00002b26846e3f61 in make_fcontext () from /work2/08126/dbohninx/frontera/BUILDS/daos-8250/20210805/daos/install/bin/../prereq/release/argobots/lib/libabt.so.1
#29 0x0000000000000000 in ?? ()