DAOS: rxm crash in rxm_conn_close() on the server when client exits during rdma transfer #6665
Could you please provide information on the setup used and the steps (command + parameters) to recreate this issue? We have done the daos and ior setup on our cluster.
Hi, unfortunately we don't have a simple reproduction scenario right now, and it requires a few more components to be set up/installed before you can hit the issue. I am currently in the process of trying to reduce the reproduction to a cart-level test (which would eliminate the need for the daos server and other components), but at this point I have not been able to hit the problem with simpler samples.

In order to recreate this you will need 3 nodes: 2 servers and 1 client. Beyond the daos stack, this requires the following:

HDF5 library to be built from here:
configure line to use:

Next is to compile vol-daos with hdf5. Download from: https://github.com/HDFGroup/vol-daos
Once you update PATH and LD_LIBRARY_PATH to point to the hdf5 prefix location, the vol ccmake will pick that up automatically. The steps to build vol-daos are specified in:
In short, you should be able to just do the following:

Once everything is compiled, you should have the 'h5_partest_t_shapesame' test available, which reproduces the problem.

Next step is to start the daos server:
mkdir /tmp/daos_server
configure the daos_server.yml file (example below)
Mohamad was able to reproduce this problem running ior and waiting 2 seconds. Example of command run:
DAOS_CONT and DAOS_POOL can be set using the helper script above once the daos server has started.
I am still having issues with the make of vol-daos.
I don't know what ior is, but if we use that, can we reproduce the problem more easily than by following that 2-page recipe?
According to Mohamad, yes: the ior run can also be used to reproduce this without having to set up daos-vol/hdf5.
I've just now been able to reproduce the problem much more easily using a cart-level server and client.

Step 1: Start the server using the script below. (All runs assume you are starting from the top of the daos/ directory.) Modify the 'HOST' environment variable to your hostname; also change INTERFACE_1/INTERFACE_2.

```
export CRT_PHY_ADDR_STR="ofi+tcp;ofi_rxm"
SERVER_APP="./install/bin/crt_launch -e install/lib/daos/TESTING/tests/test_group_np_srv --name selftest_srv_grp --cfg_path=."
set -x
```

In a separate terminal launch self_test:
Wait a few seconds and ctrl+c out of it.
Just running the servers and the clients: one of the servers seems to be crashing because it cannot create a mount point at /mnt/daos. The client and the other server report that they are listening.
[nnanal@cst-icx3 ~]$ daos_agent -i -s /tmp/daos_agent/ -o ~/daos_agent.yml
@nikhilnanal please try with just the cart-level reproducers, as they require significantly less setup. In your case you might need to first mkdir /mnt/daos and make sure it is chmod-ed/chowned to the same user who is launching daos.
I tried to run the script using cart; however, it cannot find crt_launch.
orterun was unable to launch the specified application as it could not access
Executable: ./install/bin/crt_launch
while attempting to start process rank 0. 4 total processes failed to start.
Is there a specific version of daos I should build? I'm building daos v1.1.3.
daos compiles some of the samples/tests optionally, based on whether MPI is found on your system or not. After that:
If everything is correct you should get crt_launch in your install/bin/ directory.
@nikhilnanal
I was able to build crt_launch, and it did run the test once with the output above, but the second time and every time afterwards that I've tried to run it, it gives these errors:
By default, for Open MPI 4.0 and later, infiniband ports on a device
Local host: cst-icx1
WARNING: There was an error initializing an OpenFabrics device.
Local host: cst-icx1
The OpenIB errors were present in the first trial as well, so I am not sure what's causing the HG errors.
Make sure there are no runaway processes from the first run; you might need to kill test_group_np_srv and test_group_np_cli manually.
Okay, that seems to work, thank you.
You can run it on the same node; I only use 1 node in my own reproduction.
Okay, now it is showing:
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
hg_core_send_output_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] hg WARN src/cart/crt_hg.c:1153 crt_hg_reply_send_cb() hg_cbinfo->ret: 22, opc: 0xff030007.
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261919/262079] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261915/262084] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261918/262077] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
The expectation is for the servers to continue working when the client is ctrl-c-ed out.
Bulk transfer errors are expected, as we are terminating a bulk transfer mid-transaction; however, the subsequent server crash is not.
OK, thank you. I'll try to debug from here.
I'm checking the source code and suspect that rxm_conn_handle_notify() can leave a freed handle in rxm_cmap::handles_av. Could you confirm whether this change makes sense, or whether I have misunderstood the code? Thanks.
I've tried @gnailzenh's patch locally and it didn't seem to fix the issue.
It's possible for handle->fi_addr == FI_ADDR_NOTAVAIL. The peer's address does not need to be in the AV. The initialization of handle only sets either fi_addr or peer, but it's not obvious to me if that requirement is always maintained. The if-else suggests we could add assert(handle->fi_addr == FI_ADDR_NOTAVAIL) in the if case. If that gets hit, then you've at least found one issue.
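To make that suggestion concrete, below is a minimal sketch of the if/else shape being discussed, with the proposed assert in the if case. The structures and the function name are simplified stand-ins, not the actual rxm source:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FI_ADDR_NOTAVAIL ((uint64_t) -1)   /* same value as fi_addr_t -1 in rdma/fabric.h */

/* simplified stand-ins for the rxm cmap / connection handle structures */
struct peer_entry {
	int refcnt;
};

struct conn_handle {
	uint64_t fi_addr;          /* index into handles_av, or FI_ADDR_NOTAVAIL */
	struct peer_entry *peer;   /* set only when the address is not in the AV */
};

struct cmap {
	struct conn_handle *handles_av[64];
};

static void del_handle(struct cmap *cmap, struct conn_handle *handle)
{
	if (handle->peer) {
		/*
		 * If the "either fi_addr or peer" invariant really holds,
		 * this assert can never fire.  If it does fire, the handle
		 * is also referenced from handles_av, and freeing it below
		 * would leave a dangling pointer in the AV.
		 */
		assert(handle->fi_addr == FI_ADDR_NOTAVAIL);
		free(handle->peer);
		handle->peer = NULL;
	} else if (handle->fi_addr != FI_ADDR_NOTAVAIL) {
		cmap->handles_av[handle->fi_addr] = NULL;
	}
	free(handle);
}
```

If the assert fires while running the reproducer, that would point to the scenario described above, where a handle freed on notify is still reachable through handles_av.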
@frostedcmos - did the segfault occur while running a debug version of libfabric?
No, the retry just tonight was using the release version:
I suspect the completion error may be closer to the actual problem. Thanks. I'll analyze the tcp and rxm error reporting, particularly around the handling of internal messages.
I believe in the case of a loopback connection (endpoint connects to itself) you will find that handle->fi_addr is valid and the peer exists as well. Other cases should be one or the other, since the peer is moved if the AV entry is later added. So it may still be a good idea to check both independently.
PR #6707 attempts to handle completion errors better. It's hard to test those changes since it requires generating completion errors, though. But it might help with the segfault in rxm_handle_comp_error() that was hit. @nikhilnanal - if you're at the point where you can reproduce the crash, testing the changes in that PR would be useful, at least to ensure that it didn't make things worse. @swelch - do you know where in the code path that occurs? I don't mind checking both independently to be safe, but I'd like to understand if there is a real issue that we could be hitting.
@shefty - I believe it occurs when the local side has initiated a connect request to itself (hence the address is in its AV). When processing the connect request in rxm_cmap_process_connreq(), the AV handle state will show as RXM_CMAP_CONNREQ_SENT; since the local and requester addresses are the same, a peer handle is allocated and its fi_addr set to the associated AV entry. The peer handle is used to create a new message endpoint and accept the connection. It seems like the notify will close the peer but leave the handle in the AV valid. However, the more I think about it, we may get a notify for both the AV handle and the peer handle in this case; if that is true, it may ultimately close both the peer and the AV handle (but the connection is unusable when the first is received).
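Continuing the simplified model from the earlier sketch (still illustrative stand-in types, not the actual rxm code), a cleanup that checks both fields independently, as suggested above, might look like this:

```c
/*
 * Hypothetical cleanup that tolerates the loopback case in which both
 * handle->peer and handle->fi_addr can be valid at the same time
 * (types reused from the sketch above).
 */
static void del_handle_loopback_safe(struct cmap *cmap, struct conn_handle *handle)
{
	if (handle->peer) {
		free(handle->peer);
		handle->peer = NULL;
	}
	if (handle->fi_addr != FI_ADDR_NOTAVAIL &&
	    cmap->handles_av[handle->fi_addr] == handle) {
		/* clear the AV slot only if it still points at this handle */
		cmap->handles_av[handle->fi_addr] = NULL;
	}
	free(handle);
}
```

Whichever form is used, the point being made in this thread is that a handle must not be freed while handles_av still references it.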
@swelch -- Thanks, I see that path now. So it is possible for both of those to be set. I'll add a fix to my open PR to address the problem pointed out by @gnailzenh.
#6707 - updated with fi_addr fix.
Thanks. I also noticed there is a "TODO" in rxm_eq_sread():
Because we are using auto-progress, is this a race that can happen?
Hmm... that sounds like it's describing a real race. I don't know for certain without doing a deep dive through the code to see how the cleanup occurs.
I did find issues in the tcp provider where it could report completions for transfers that were NOT initiated by the upper-level user, e.g. an internal ack. There are fixes for this in master.
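As a rough illustration of that bug class (not the actual tcp provider code; all names below are made up), the point is that completions for provider-internal transfers such as an ack have to be consumed internally rather than surfaced on the user-visible CQ:

```c
#include <stdbool.h>
#include <stdio.h>

/* simplified stand-in for a posted transfer */
struct xfer {
	void *user_context;   /* context supplied by the application, if any */
	bool internal;        /* true when the provider posted this transfer itself */
};

static void report_completion(struct xfer *xfer, int status)
{
	if (xfer->internal) {
		/*
		 * e.g. an internal ack: consume it quietly and never surface
		 * the completion (or a completion error) on the user CQ.
		 */
		return;
	}
	printf("user completion: ctx=%p status=%d\n", xfer->user_context, status);
}
```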
Hi, is there a tag for these fixes, or could you provide commit hashes of those patches?
There's not a tag, but there will be a v1.13 release within about 3 weeks.
Btw, I've rewritten the connection management code in rxm, which I hope will start us down a path of fixing all of the DAOS connection-related issues. See #6778. The code is still under testing, and I'm hesitant to pull it into v1.13 without broader testing.
Has there been any update on this? Is it planned to be merged anytime soon, post v1.13?
Update: With tcp;ofi_rxm: Reproducer still runs with sockets and verbs;ofi_rxm. |
#7110 resolved the issue in local testing, when added on top of other fixes in main. |
In a test we have multiple servers and clients attempting to rdma to/from those servers.
If during rdma we kill the client via CTRL+C, the server side code crashes with the following trace:
OFI: 1.12.0
Provider: tcp;ofi_rxm
(gdb) bt
#0 0x00007f57edaefcb3 in rxm_conn_close () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#1 0x00007f57edaf152d in rxm_conn_handle_event () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#2 0x00007f57edaf277b in rxm_msg_eq_progress () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#3 0x00007f57edaf290d in rxm_cmap_connect () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#4 0x00007f57edaf2d61 in rxm_get_conn () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#5 0x00007f57edaf7fc2 in rxm_ep_tsend () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#6 0x00007f57f2fa27ca in fi_tsend (context=0x7f5720174dc8, tag=&lt;optimized out&gt;, dest_addr=&lt;optimized out&gt;, desc=&lt;optimized out&gt;, len=&lt;optimized out&gt;, buf=&lt;optimized out&gt;, ep=&lt;optimized out&gt;)
at /home/mschaara/install/daos/prereq/dev/ofi/include/rdma/fi_tagged.h:114
#7 na_ofi_cq_process_retries (context=0x7f5720043b10) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na_ofi.c:3380
#8 na_ofi_progress (na_class=0x7f572002c0f0, context=0x7f5720043b10, timeout=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na_ofi.c:5161
#9 0x00007f57f2f99c21 in NA_Progress (na_class=na_class@entry=0x7f572002c0f0, context=context@entry=0x7f5720043b10, timeout=timeout@entry=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na.c:1168
#10 0x00007f57f31c3370 in hg_core_progress_na (na_class=0x7f572002c0f0, na_context=0x7f5720043b10, timeout=0, progressed_ptr=progressed_ptr@entry=0x2dc28c0 "")
at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3896
#11 0x00007f57f31c51a4 in hg_core_poll (progressed_ptr=&lt;optimized out&gt;, timeout=&lt;optimized out&gt;, context=0x7f572002c4d0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3838
#12 hg_core_progress (context=0x7f572002c4d0, timeout=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3693
#13 0x00007f57f31ca38b in HG_Core_progress (context=&lt;optimized out&gt;, timeout=&lt;optimized out&gt;) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:5056
#14 0x00007f57f31bcd52 in HG_Progress (context=context@entry=0x7f572002c120, timeout=&lt;optimized out&gt;) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury.c:2020
#15 0x00007f57f5c6ce41 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7f5720026dc8, timeout=timeout@entry=0) at src/cart/crt_hg.c:1233
#16 0x00007f57f5c2faa5 in crt_progress (crt_ctx=0x7f5720026db0, timeout=0) at src/cart/crt_context.c:1394
#17 0x0000000000422225 in dss_srv_handler (arg=0x2cd1410) at src/engine/srv.c:470
#18 0x00007f57f4bbc7ea in ABTD_ythread_func_wrapper () from /home/mschaara/install/daos/bin/../prereq/dev/argobots/lib/libabt.so.0
#19 0x00007f57f4bbc991 in make_fcontext () from /home/mschaara/install/daos/bin/../prereq/dev/argobots/lib/libabt.so.0
#20 0x0000000000000000 in ?? ()
(gdb) q