hdf5 parallel test suite segfaults in psm2 provider #5478

Closed
frostedcmos opened this issue Dec 4, 2019 · 5 comments

frostedcmos commented Dec 4, 2019

After recent updates, running the hdf5 parallel test suite with DAOS/CART/Mercury started to segfault with the psm2 provider (trace below). The same test suite passes over ofi+sockets.

It is not yet clear which change caused this, because the test is not run automatically on each update. I am filing this with the information available to keep track of it and will add more details as they become available.

Components used:
PSM2: PSM2_11.2.78
OFI: 8634070

Core was generated by `/home/mschaara/source/mpio-box/hdf5/build/testpar/.libs/testphdf5'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fb53dfa0406 in ips_scbctrl_bufalloc (scb=scb@entry=0x7fb537977840) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_scb.c:235
235 SLIST_REMOVE_HEAD(&scbc->sbuf_free, next);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 libatomic-4.8.5-28.el7_5.1.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libgfortran-4.8.5-16.el7.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 libquadmath-4.8.5-16.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libuuid-2.23.2-43.el7.x86_64 libyaml-0.1.4-11.el7_0.x86_64 numactl-libs-2.0.9-7.el7.x86_64 sssd-client-1.15.2-50.el7_4.2.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00007fb53dfa0406 in ips_scbctrl_bufalloc (scb=scb@entry=0x7fb537977840) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_scb.c:235
#1 0x00007fb53dfa05ce in ips_scbctrl_alloc (scbc=0x1e52298, scbnum=0, scbnum@entry=1, len=len@entry=296, flags=<optimized out>)
at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_scb.c:263
#2 0x00007fb53dfb6b5a in ips_am_short_reply (tok=<optimized out>, handler=0, args=0x7fff26b48c80, nargs=3, src=0x34c0a60, len=272, flags=0, completion_fn=0x0, completion_ctxt=0x0)
at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_am.c:396
#3 0x00007fb53eb20a68 in psmx2_am_rma_handler () from /home/mschaara/install/deps_daos/ofi/lib/libfabric.so.1
#4 0x00007fb53dfb62f9 in ips_am_run_handler (p_hdr=p_hdr@entry=0x7fb538e1f440, ipsaddr=ipsaddr@entry=0x206d070, proto_am=proto_am@entry=0x1e52220, payload=0x7fb538c65818, paylen=0)
at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_am.c:468
#5 0x00007fb53dfb7196 in ips_proto_am (rcv_ev=0x7fff26b49040) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_am.c:600
#6 0x00007fb53dfa14d1 in ips_proto_process_packet (rcv_ev=0x7fff26b49040) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_help.h:554
#7 ips_recvhdrq_progress (recvq=recvq@entry=0x1e57cd8) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_recvhdrq.c:593
#8 0x00007fb53df9e402 in ips_ptl_poll (ptl_gen=0x1e51f80, _ignored=<optimized out>) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ptl.c:541
#9 0x00007fb53df9c6f7 in __psmi_poll_internal (ep=0x1e51bc0, poll_amsh=poll_amsh@entry=1) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/psm.c:1041
#10 0x00007fb53df96900 in psmi_mq_ipeek_inner (status_copy=<optimized out>, status=0x0, oreq=0x7fff26b492c8, mq=0x1e4c060) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/psm_mq.c:1157
#11 __psm2_mq_ipeek (mq=0x1e4c060, oreq=0x7fff26b492c8, status=0x0) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/psm_mq.c:1196
#12 0x00007fb53eb12121 in psmx2_cq_poll_mq () from /home/mschaara/install/deps_daos/ofi/lib/libfabric.so.1
#13 0x00007fb53eb14d2c in psmx2_cq_readfrom () from /home/mschaara/install/deps_daos/ofi/lib/libfabric.so.1
#14 0x00007fb53f222ede in fi_cq_readfrom (src_addr=0x7fff26b494c0, count=16, buf=0x7fff26b49540, cq=0x2028430) at /home/mschaara/install/deps_daos/ofi/include/rdma/fi_eq.h:391
#15 na_ofi_cq_read (max_count=16, na_class=0x1e38340, actual_count=<optimized out>, src_err_addrlen=<optimized out>, src_err_addr=<optimized out>, src_addrs=0x7fff26b494c0,
cq_events=0x7fff26b49540, context=0x1e495d0) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/na/na_ofi.c:2492
#16 na_ofi_progress (na_class=0x1e38340, context=0x1e495d0, timeout=0) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/na/na_ofi.c:4231
#17 0x00007fb53f220008 in NA_Progress (na_class=0x1e38340, context=0x1e495d0, timeout=timeout@entry=0) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/na/na.c:1230
#18 0x00007fb53f43e382 in hg_core_progress_na_cb (arg=0x2026180, error=<optimized out>, progressed=0x7fff26b4996f "") at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury_core.c:3038
#19 0x00007fb53f0167e7 in hg_poll_wait (poll_set=<optimized out>, timeout=timeout@entry=1, progressed=progressed@entry=0x7fff26b49ccf "")
at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/util/mercury_poll.c:465
#20 0x00007fb53f43e6d3 in hg_core_progress_poll (context=0x2026180, timeout=1) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury_core.c:3280
#21 0x00007fb53f44381c in HG_Core_progress (context=<optimized out>, timeout=timeout@entry=1) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury_core.c:4850
#22 0x00007fb53f43b39d in HG_Progress (context=context@entry=0x201c2d0, timeout=timeout@entry=1) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury.c:2225
#23 0x00007fb53ff3ddab in crt_hg_progress (hg_ctx=hg_ctx@entry=0x2020fc8, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1364
#24 0x00007fb53feef03b in crt_progress (crt_ctx=0x2020fb0, timeout=-1, cond_cb=0x7fb540f1e060 <ev_progress_cb>, arg=0x7fff26b49e08) at src/cart/crt_context.c:1286
#25 0x00007fb540f21e80 in daos_event_priv_wait () at src/client/api/event.c:1216
#26 0x00007fb540f25dd7 in dc_task_schedule (task=<optimized out>, instant=<optimized out>) at src/client/api/task.c:139
#27 0x00007fb540f198a2 in daos_array_write (iod=iod@entry=0x7fff26b49f50, sgl=sgl@entry=0x7fff26b49f40, csums=csums@entry=0x0, ev=<optimized out>, oh=..., th=...) at src/client/api/array.c:213

shefty commented Dec 5, 2019

Do you have a known version where this worked? 1.8?

frostedcmos commented

We are now in the process of finding a version where it worked.

mwheinz commented Dec 19, 2019

Question: Is the DAOS library multi-threaded?

frostedcmos commented Dec 19, 2019 via email

github-actions commented

There has been no activity on this issue for more than 360 days. Marking it stale.