hdf5 parallel test suite segfaults in psm2 provider #5478
Do you have a known version where this worked? 1.8?
We are currently working on finding a version where it worked.
Question: Is the DAOS library multi-threaded?
Multi-threading in DAOS is mostly done via Argobots.
Numerous CaRT tests and samples use pthreads for multi-threading.
DAOS performs all communication via CaRT, which provides higher-level interfaces for operations such as broadcast/incast, group/rank management, and others. CaRT itself is built on top of Mercury (https://mercury-hpc.github.io/).
DAOS -> CaRT -> Mercury -> libfabric -> providers
Sent in reply to Michael Heinz's message of Thursday, December 19, 2019 11:09 AM.
There has been no activity on this issue for more than 360 days. Marking it stale.
After recent updates, we started seeing a segfault with the psm2 provider (trace below) when running the hdf5 parallel test suite with DAOS/CART/Mercury. The same test suite passes over ofi+sockets.
Unfortunately, it is currently unclear which exact change caused this, as the test is not run automatically on each update. Filing this with the available information to keep track; we will add more details as they become available.
Components used:
PSM2: PSM2_11.2.78
OFI: 8634070
Core was generated by `/home/mschaara/source/mpio-box/hdf5/build/testpar/.libs/testphdf5'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fb53dfa0406 in ips_scbctrl_bufalloc (scb=scb@entry=0x7fb537977840) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_scb.c:235
235 SLIST_REMOVE_HEAD(&scbc->sbuf_free, next);
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 libatomic-4.8.5-28.el7_5.1.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-16.el7.x86_64 libgfortran-4.8.5-16.el7.x86_64 libibverbs-15-7.el7_5.x86_64 libnl3-3.2.28-4.el7.x86_64 libquadmath-4.8.5-16.el7.x86_64 librdmacm-15-7.el7_5.x86_64 libuuid-2.23.2-43.el7.x86_64 libyaml-0.1.4-11.el7_0.x86_64 numactl-libs-2.0.9-7.el7.x86_64 sssd-client-1.15.2-50.el7_4.2.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0 0x00007fb53dfa0406 in ips_scbctrl_bufalloc (scb=scb@entry=0x7fb537977840) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_scb.c:235
#1 0x00007fb53dfa05ce in ips_scbctrl_alloc (scbc=0x1e52298, scbnum=0, scbnum@entry=1, len=len@entry=296, flags=)
at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_scb.c:263
#2 0x00007fb53dfb6b5a in ips_am_short_reply (tok=, handler=0, args=0x7fff26b48c80, nargs=3, src=0x34c0a60, len=272, flags=0, completion_fn=0x0, completion_ctxt=0x0)
at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_am.c:396
#3 0x00007fb53eb20a68 in psmx2_am_rma_handler () from /home/mschaara/install/deps_daos/ofi/lib/libfabric.so.1
#4 0x00007fb53dfb62f9 in ips_am_run_handler (p_hdr=p_hdr@entry=0x7fb538e1f440, ipsaddr=ipsaddr@entry=0x206d070, proto_am=proto_am@entry=0x1e52220, payload=0x7fb538c65818, paylen=0)
at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_am.c:468
#5 0x00007fb53dfb7196 in ips_proto_am (rcv_ev=0x7fff26b49040) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_am.c:600
#6 0x00007fb53dfa14d1 in ips_proto_process_packet (rcv_ev=0x7fff26b49040) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_proto_help.h:554
#7 ips_recvhdrq_progress (recvq=recvq@entry=0x1e57cd8) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_recvhdrq.c:593
#8 0x00007fb53df9e402 in ips_ptl_poll (ptl_gen=0x1e51f80, _ignored=) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ptl.c:541
#9 0x00007fb53df9c6f7 in __psmi_poll_internal (ep=0x1e51bc0, poll_amsh=poll_amsh@entry=1) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/psm.c:1041
#10 0x00007fb53df96900 in psmi_mq_ipeek_inner (status_copy=, status=0x0, oreq=0x7fff26b492c8, mq=0x1e4c060) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/psm_mq.c:1157
#11 __psm2_mq_ipeek (mq=0x1e4c060, oreq=0x7fff26b492c8, status=0x0) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/psm_mq.c:1196
#12 0x00007fb53eb12121 in psmx2_cq_poll_mq () from /home/mschaara/install/deps_daos/ofi/lib/libfabric.so.1
#13 0x00007fb53eb14d2c in psmx2_cq_readfrom () from /home/mschaara/install/deps_daos/ofi/lib/libfabric.so.1
#14 0x00007fb53f222ede in fi_cq_readfrom (src_addr=0x7fff26b494c0, count=16, buf=0x7fff26b49540, cq=0x2028430) at /home/mschaara/install/deps_daos/ofi/include/rdma/fi_eq.h:391
#15 na_ofi_cq_read (max_count=16, na_class=0x1e38340, actual_count=, src_err_addrlen=, src_err_addr=, src_addrs=0x7fff26b494c0,
cq_events=0x7fff26b49540, context=0x1e495d0) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/na/na_ofi.c:2492
#16 na_ofi_progress (na_class=0x1e38340, context=0x1e495d0, timeout=0) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/na/na_ofi.c:4231
#17 0x00007fb53f220008 in NA_Progress (na_class=0x1e38340, context=0x1e495d0, timeout=timeout@entry=0) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/na/na.c:1230
#18 0x00007fb53f43e382 in hg_core_progress_na_cb (arg=0x2026180, error=, progressed=0x7fff26b4996f "") at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury_core.c:3038
#19 0x00007fb53f0167e7 in hg_poll_wait (poll_set=, timeout=timeout@entry=1, progressed=progressed@entry=0x7fff26b49ccf "")
at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/util/mercury_poll.c:465
#20 0x00007fb53f43e6d3 in hg_core_progress_poll (context=0x2026180, timeout=1) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury_core.c:3280
#21 0x00007fb53f44381c in HG_Core_progress (context=, timeout=timeout@entry=1) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury_core.c:4850
#22 0x00007fb53f43b39d in HG_Progress (context=context@entry=0x201c2d0, timeout=timeout@entry=1) at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/mercury.c:2225
#23 0x00007fb53ff3ddab in crt_hg_progress (hg_ctx=hg_ctx@entry=0x2020fc8, timeout=timeout@entry=1000) at src/cart/crt_hg.c:1364
#24 0x00007fb53feef03b in crt_progress (crt_ctx=0x2020fb0, timeout=-1, cond_cb=0x7fb540f1e060 <ev_progress_cb>, arg=0x7fff26b49e08) at src/cart/crt_context.c:1286
#25 0x00007fb540f21e80 in daos_event_priv_wait () at src/client/api/event.c:1216
#26 0x00007fb540f25dd7 in dc_task_schedule (task=, instant=) at src/client/api/task.c:139
#27 0x00007fb540f198a2 in daos_array_write (iod=iod@entry=0x7fff26b49f50, sgl=sgl@entry=0x7fff26b49f40, csums=csums@entry=0x0, ev=, oh=..., th=...) at src/client/api/array.c:213
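For anyone triaging a trace like the one above, it can help to label each frame with the layer it belongs to (psm2, libfabric, Mercury, CaRT, DAOS). The following is only a rough sketch: the path fragments in `LAYERS` are assumptions taken from the build layout visible in this particular trace, the helper names are made up for illustration, and the sample frames are abbreviated copies of frames from the trace with their arguments elided.

```python
import re

# Assumed path fragments for each layer, based only on the paths in this trace.
# Order matters: the first matching pattern wins.
LAYERS = [
    ("psm2", re.compile(r"/psm2/")),
    ("libfabric", re.compile(r"libfabric\.so|/ofi/")),
    ("mercury", re.compile(r"/mercury/")),
    ("cart", re.compile(r"src/cart/")),
    ("daos", re.compile(r"src/client/")),
]

# Matches "#N [0xADDR in ]function_name" at the start of a gdb frame line.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\w+)")


def parse_frames(trace):
    """Return [frame_number, function, full_text] per frame.

    Lines that do not start a new frame (gdb wraps long frames) are
    appended to the text of the preceding frame.
    """
    frames = []
    for line in trace.splitlines():
        line = line.strip()
        if not line:
            continue
        match = FRAME_RE.match(line)
        if match:
            frames.append([int(match.group(1)), match.group(2), line])
        elif frames:
            frames[-1][2] += " " + line
    return frames


def classify(text):
    """Return the first layer whose pattern occurs in the frame text."""
    for layer, pattern in LAYERS:
        if pattern.search(text):
            return layer
    return "unknown"


def layer_of_frames(trace):
    """Return (frame_number, function, layer) for every frame in the trace."""
    return [(num, func, classify(text)) for num, func, text in parse_frames(trace)]


# Abbreviated frames from the trace in this issue; arguments replaced by "...".
SAMPLE = """\
#0 0x00007fb53dfa0406 in ips_scbctrl_bufalloc (...) at /home/mschaara/source/deps_daos/daos/_build.external/psm2/ptl_ips/ips_scb.c:235
#3 0x00007fb53eb20a68 in psmx2_am_rma_handler () from /home/mschaara/install/deps_daos/ofi/lib/libfabric.so.1
#16 na_ofi_progress (...)
    at /home/mschaara/source/deps_daos/daos/_build.external/mercury/src/na/na_ofi.c:4231
#23 0x00007fb53ff3ddab in crt_hg_progress (...) at src/cart/crt_hg.c:1364
#25 0x00007fb540f21e80 in daos_event_priv_wait () at src/client/api/event.c:1216
"""

if __name__ == "__main__":
    for num, func, layer in layer_of_frames(SAMPLE):
        print(f"#{num:<3} {layer:<10} {func}")
```

Read this way, the trace shows the client's progress loop (DAOS, CaRT, Mercury) polling the libfabric completion queue, with the crash occurring inside psm2 while the psmx2 provider's RMA active-message handler sends a reply.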