
DAOS: rxm crash in rxm_conn_close() on the server when client exits during rdma transfer #6665

Closed
frostedcmos opened this issue Mar 30, 2021 · 42 comments


@frostedcmos

In a test we have multiple servers and clients performing RDMA transfers to/from those servers.
If, during a transfer, we kill the client via Ctrl+C, the server-side code crashes with the following trace:

OFI: 1.12.0
Provider: tcp;ofi_rxm

(gdb) bt
#0 0x00007f57edaefcb3 in rxm_conn_close () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#1 0x00007f57edaf152d in rxm_conn_handle_event () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#2 0x00007f57edaf277b in rxm_msg_eq_progress () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#3 0x00007f57edaf290d in rxm_cmap_connect () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#4 0x00007f57edaf2d61 in rxm_get_conn () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#5 0x00007f57edaf7fc2 in rxm_ep_tsend () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#6 0x00007f57f2fa27ca in fi_tsend (context=0x7f5720174dc8, tag=, dest_addr=, desc=, len=, buf=, ep=)
at /home/mschaara/install/daos/prereq/dev/ofi/include/rdma/fi_tagged.h:114
#7 na_ofi_cq_process_retries (context=0x7f5720043b10) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na_ofi.c:3380
#8 na_ofi_progress (na_class=0x7f572002c0f0, context=0x7f5720043b10, timeout=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na_ofi.c:5161
#9 0x00007f57f2f99c21 in NA_Progress (na_class=na_class@entry=0x7f572002c0f0, context=context@entry=0x7f5720043b10, timeout=timeout@entry=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na.c:1168
#10 0x00007f57f31c3370 in hg_core_progress_na (na_class=0x7f572002c0f0, na_context=0x7f5720043b10, timeout=0, progressed_ptr=progressed_ptr@entry=0x2dc28c0 "")
at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3896
#11 0x00007f57f31c51a4 in hg_core_poll (progressed_ptr=, timeout=, context=0x7f572002c4d0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3838
#12 hg_core_progress (context=0x7f572002c4d0, timeout=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3693
#13 0x00007f57f31ca38b in HG_Core_progress (context=, timeout=) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:5056
#14 0x00007f57f31bcd52 in HG_Progress (context=context@entry=0x7f572002c120, timeout=) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury.c:2020
#15 0x00007f57f5c6ce41 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7f5720026dc8, timeout=timeout@entry=0) at src/cart/crt_hg.c:1233
#16 0x00007f57f5c2faa5 in crt_progress (crt_ctx=0x7f5720026db0, timeout=0) at src/cart/crt_context.c:1394
#17 0x0000000000422225 in dss_srv_handler (arg=0x2cd1410) at src/engine/srv.c:470
#18 0x00007f57f4bbc7ea in ABTD_ythread_func_wrapper () from /home/mschaara/install/daos/bin/../prereq/dev/argobots/lib/libabt.so.0
#19 0x00007f57f4bbc991 in make_fcontext () from /home/mschaara/install/daos/bin/../prereq/dev/argobots/lib/libabt.so.0
#20 0x0000000000000000 in ?? ()
(gdb) q

@nikhilnanal
Contributor

nikhilnanal commented Apr 6, 2021

Could you please provide information on the setup used and the steps (command + parameters) to recreate this issue? We have done the daos and ior setup on our cluster.

@frostedcmos
Author

frostedcmos commented Apr 6, 2021

Hi,

Unfortunately we don't have a simple reproduction scenario right now, and it requires a few more components to be set up/installed before you can hit the issue. I am currently trying to reduce the reproduction to a cart-level test (which would eliminate the need for the daos server and other components). However, at this point I have not been able to hit the problem with simpler samples.

In order to recreate this you will need 3 nodes: 2 servers and 1 client.
Assuming you have daos already compiled/installed:

Beyond the daos stack, this requires the following:

HDF5 library to be built from here:
https://github.com/HDFGroup/hdf5/tree/hdf5-1_13_0-rc5

configure line to use:
./configure --prefix=... --disable-fortran --enable-parallel --enable-map-api
make all
make install

Next is to compile vol-daos with hdf5:

Download from: https://github.com/HDFGroup/vol-daos
Set PATH and LD_LIBRARY_PATH to point to the install location of hdf5, e.g.
export PATH=/path/to/hdf5_install/bin:$PATH
export LD_LIBRARY_PATH=/path/to/hdf5_install/lib:$LD_LIBRARY_PATH

Once you update PATH and LD_LIBRARY_PATH to point to the hdf5 prefix location, the vol-daos ccmake configuration will pick it up automatically.

steps to build vol-daos are specified in:
https://github.com/HDFGroup/vol-daos/blob/master/README.md

In short you should be able to just do the following (after configuring/generating in ccmake, build and install with make, as in the standard CMake flow):
cd hdf5_vol_daos-X
mkdir build
cd build
ccmake ..
make
make install

Once everything is compiled, you should have the 'h5_partest_t_shapesame' test available, which reproduces the problem.

The next step is to start the daos server:

mkdir /tmp/daos_server
mkdir /mnt/daos/

Configure the daos_server.yml file; example below.
Instead of access_points: ['wolf-12vm1:10001'], use the host of one of your servers.

name: daos_server
access_points: ['wolf-12vm1:10001']
port: 10001
provider: ofi+tcp;ofi_rxm
socket_dir: /tmp/daos_server
nr_hugepages: 4096
control_log_mask: INFO
control_log_file: /tmp/msc_control.log
crt_timeout: 120
transport_config:
  allow_insecure: true

servers:
-
  targets: 8
  nr_xs_helpers: 1
  first_core: 1
  fabric_iface_port: 31418
  fabric_iface: ib0
  log_mask: ERR
  log_file: /tmp/msc.log
  env_vars:
#  - HG_NA_LOG_LEVEL=warning
#  - HG_LOG_LEVEL=warning
#  - FI_LOG_LEVEL=warn
  - DAOS_MD_CAP=1024
  - CRT_CREDIT_EP_CTX=0
  - FI_UNIVERSE_SIZE=16383
  - DD_STDERR=ERR
  # Storage definitions
  scm_mount: /mnt/daos  # map to -s /mnt/daos
  scm_class: ram
  scm_size: 64


Start daos server on server nodes; replace wolf-12vm1,wolf-12vm2 with your two server nodes:
clush -w wolf-12vm1,wolf-12vm2 -f 8 -o "-t -t" daos_server start -o /path/to/daos_server.yml --recreate-superblocks

Start daos agent on your client node:
mkdir -p /tmp/daos_agent
/path/to/daos/install/bin/daos_agent -i -s /tmp/daos_agent -o /path/to/daos_agent.yml

Example daos_agent.yml (replace access_points with one of server hosts):

access_points: ['wolf-12vm1']
port: 10001
transport_config:
  allow_insecure: true

runtime_dir: /tmp/daos_agent
log_file: /tmp/daos_agent.log

Before running application:
Make sure to set the following on client node:
export HDF5_PLUGIN_PATH=/path/to/vol-daos/build/bin
export HDF5_VOL_CONNECTOR=daos

The easiest way to run is to create a pool and export it in the DAOS_POOL environment variable.
Helper script to create the pool and set all the other environment variables (see the usage note after the script):

#!/bin/bash

command="dmg pool create -s=$1 -n=$2 -o=/path/to/daos.yml"
matches=$( $command | grep UUID | tail -1 )
echo $matches
pool=$(echo $matches | awk -F '[/= ,]' '{print $3}')

export DAOS_POOL=$pool
export DAOS_SVCL=1
export DAOS_CONT=$(uuidgen)
export DAOS_FUSE=/tmp/dfuse

echo "DAOS_POOL = $DAOS_POOL"
echo "DAOS_CONT = $DAOS_CONT"
echo "DAOS_FUSE = $DAOS_FUSE"

Example of daos.yml (replace host with server host):
hostlist: ['wolf-12vm1:10001']
transport_config:
  allow_insecure: true

Once that is done you can finally run the test:
mpirun -np 10 --genvall /home/mschaara/source/vol-daos/build/bin/h5_partest_t_shapesame

Wait for a couple of seconds, then Ctrl+C out of the app.

@frostedcmos
Author

Mohamad was able to reproduce this problem running ior and waiting ~2 seconds before interrupting it.

Example of command run:
mpirun -np 64 --hostfile ~/config/cli_hosts --genvall ior -a DFS --dfs.cont $DAOS_CONT --dfs.pool $DAOS_POOL -b 1g -t 1m -o /testFile -g

DAOS_CONT and DAOS_POOL can be set using the helper script above once the daos server has started.

@nikhilnanal
Contributor

I am still having issues with the vol-daos build.

  1. An undefined reference to daos_obj_generate_oid.
    .....
    ...
    /usr/bin/cc -Wall -Wextra -Winline -Wcast-qual -std=gnu99 -Wshadow -O2 -g -DNDEBUG -rdynamic CMakeFiles/h5daos_test_oclass.dir/h5daos_test_oclass.c.o -o ../../bin/h5daos_test_oclass -L/home/nnanal/gitrepos/daos/install/lib64 -Wl,-rpath,/home/nnanal/gitrepos/vol-daos/build/bin:/home/nnanal/gitrepos/daos/install/lib64:/home/nnanal/gitrepos/hdf5/installdir/lib ../../bin/libhdf5_vol_daos.so.1.1.0 /home/nnanal/gitrepos/daos/install/lib64/libdaos.so.1.1.0 -lduns /home/nnanal/gitrepos/hdf5/installdir/lib/libhdf5.so -lz -ldl -lm -lmpi -lrt -lpthread -ldl -lm -lmpi -lrt -lpthread -luuid
    ../../bin/libhdf5_vol_daos.so.1.1.0: undefined reference to `daos_obj_generate_oid'
    collect2: error: ld returned 1 exit status
    make[2]: *** [test/daos_vol/CMakeFiles/h5daos_test_oclass.dir/build.make:100: bin/h5daos_test_oclass] Error 1
    make[2]: Leaving directory '/home/nnanal/gitrepos/vol-daos/build'
    make[1]: *** [CMakeFiles/Makefile2:1066: test/daos_vol/CMakeFiles/h5daos_test_oclass.dir/all] Error 2
    make[1]: Leaving directory '/home/nnanal/gitrepos/vol-daos/build'
    make: *** [Makefile:163: all] Error 2

  2. The README.md specifies an option HDF5_C_INCLUDE_DIR, but I could not find it in the ccmake configuration. Other parameters are present and have been configured as required.
    Here is the ccmake configuration:
    BUILD_DOCUMENTATION              OFF
    BUILD_EXAMPLES                   OFF
    BUILD_SHARED_LIBS                ON
    BUILD_TESTING                    ON
    BZRCOMMAND                       BZRCOMMAND-NOTFOUND
    CMAKE_AR                         /usr/bin/ar
    CMAKE_ARCHIVE_OUTPUT_DIRECTORY   /home/nnanal/gitrepos/vol-daos/build/bin
    CMAKE_BUILD_TYPE                 RelWithDebInfo
    CMAKE_COLOR_MAKEFILE             ON
    CMAKE_CXX_COMPILER               /usr/bin/c++
    CMAKE_CXX_COMPILER_AR            /usr/bin/gcc-ar
    CMAKE_CXX_COMPILER_RANLIB        /usr/bin/gcc-ranlib
    CMAKE_CXX_FLAGS
    CMAKE_CXX_FLAGS_DEBUG            -g
    CMAKE_CXX_FLAGS_MINSIZEREL       -Os -DNDEBUG
    CMAKE_CXX_FLAGS_RELEASE          -O2 -DNDEBUG
    CMAKE_CXX_FLAGS_RELWITHDEBINFO   -O2 -g -DNDEBUG
    CMAKE_CXX_FLAGS_UBSAN            -O1 -g -fsanitize=undefined -fno-omit-frame-pointer
    CMAKE_C_COMPILER                 /usr/bin/cc
    CMAKE_C_COMPILER_AR              /usr/bin/gcc-ar
    CMAKE_C_COMPILER_RANLIB          /usr/bin/gcc-ranlib
    CMAKE_C_FLAGS                    -Wall -Wextra -Winline -Wcast-qual -std=gnu99 -Wshadow
    CMAKE_C_FLAGS_DEBUG              -g
    CMAKE_C_FLAGS_MINSIZEREL         -Os -DNDEBUG
    CMAKE_C_FLAGS_RELEASE            -O2 -DNDEBUG
    CMAKE_C_FLAGS_RELWITHDEBINFO     -O2 -g -DNDEBUG
    CMAKE_C_FLAGS_UBSAN              -O1 -g -fsanitize=undefined -fno-omit-frame-pointer
    CMAKE_EXE_LINKER_FLAGS
    CMAKE_EXE_LINKER_FLAGS_DEBUG
    CMAKE_EXE_LINKER_FLAGS_MINSIZE
    CMAKE_EXE_LINKER_FLAGS_RELEASE
    CMAKE_EXE_LINKER_FLAGS_RELWITH
    CMAKE_EXPORT_COMPILE_COMMANDS    OFF
    CMAKE_INSTALL_PREFIX             /usr/local
    CMAKE_LIBRARY_OUTPUT_DIRECTORY   /home/nnanal/gitrepos/vol-daos/build/bin
    CMAKE_LINKER                     /usr/bin/ld
    CMAKE_MAKE_PROGRAM               /usr/bin/gmake
    CMAKE_MODULE_LINKER_FLAGS
    CMAKE_MODULE_LINKER_FLAGS_DEBU
    CMAKE_MODULE_LINKER_FLAGS_MINS
    CMAKE_MODULE_LINKER_FLAGS_RELE
    CMAKE_MODULE_LINKER_FLAGS_RELW
    CMAKE_NM                         /usr/bin/nm
    CMAKE_OBJCOPY                    /usr/bin/objcopy
    CMAKE_OBJDUMP                    /usr/bin/objdump
    CMAKE_RANLIB                     /usr/bin/ranlib
    CMAKE_RUNTIME_OUTPUT_DIRECTORY   /home/nnanal/gitrepos/vol-daos/build/bin
    CMAKE_SHARED_LINKER_FLAGS
    CMAKE_SHARED_LINKER_FLAGS_DEBU
    CMAKE_SHARED_LINKER_FLAGS_MINS
    CMAKE_SHARED_LINKER_FLAGS_RELE
    CMAKE_SHARED_LINKER_FLAGS_RELW
    CMAKE_SKIP_INSTALL_RPATH         OFF
    CMAKE_SKIP_RPATH                 OFF
    CMAKE_STATIC_LINKER_FLAGS
    CMAKE_STATIC_LINKER_FLAGS_DEBU
    CMAKE_STATIC_LINKER_FLAGS_MINS
    CMAKE_STATIC_LINKER_FLAGS_RELE
    CMAKE_STATIC_LINKER_FLAGS_RELW
    CMAKE_STRIP                      /usr/bin/strip
    CMAKE_VERBOSE_MAKEFILE           OFF
    COVERAGE_COMMAND                 /usr/bin/gcov
    COVERAGE_EXTRA_FLAGS             -l
    CPACK_SOURCE_RPM                 OFF
    CPACK_SOURCE_TBZ2                ON
    CPACK_SOURCE_TGZ                 ON
    CPACK_SOURCE_TXZ                 ON
    CPACK_SOURCE_TZ                  ON
    CPACK_SOURCE_ZIP                 OFF
    CTEST_SUBMIT_RETRY_COUNT         3
    CTEST_SUBMIT_RETRY_DELAY         5
    CVSCOMMAND                       CVSCOMMAND-NOTFOUND
    CVS_UPDATE_OPTIONS               -d -A -P
    DAOS_AGENT_EXECUTABLE            /home/nnanal/gitrepos/daos/install/bin/daos_agent
    DAOS_DMG_EXECUTABLE              /home/nnanal/gitrepos/daos/install/bin/dmg
    DAOS_INCLUDE_DIR                 /home/nnanal/gitrepos/daos/install/include
    DAOS_LIBRARY                     /home/nnanal/gitrepos/daos/install/lib64/libdaos.so.1.1.0
    DAOS_POOL_SIZE                   4
    DAOS_SERVER_EXECUTABLE           /home/nnanal/gitrepos/daos/install/bin/daos_server
    DAOS_SERVER_IFACE                lo
    DAOS_SERVER_SCM_MNT              /mnt/daos
    DAOS_SERVER_SCM_SIZE             8
    DAOS_SERVER_TRANSPORT            ofi+sockets
    DAOS_UNS_LIBRARY                 /home/nnanal/gitrepos/daos/install/lib64/libduns.so
    DART_TESTING_TIMEOUT             1500
    GITCOMMAND                       /usr/bin/git
    GIT_EXECUTABLE                   /usr/bin/git
    HDF5_C_COMPILER_EXECUTABLE       /home/nnanal/gitrepos/hdf5/installdir/bin/h5pcc
    HDF5_C_LIBRARY_dl                /usr/lib64/libdl.so
    HDF5_C_LIBRARY_hdf5              /home/nnanal/gitrepos/hdf5/installdir/lib/libhdf5.so
    HDF5_C_LIBRARY_m                 /usr/lib64/libm.so
    HDF5_C_LIBRARY_z                 /usr/lib64/libz.so
    HDF5_DIFF_EXECUTABLE             /home/nnanal/gitrepos/hdf5/installdir/bin/h5diff
    HDF5_DIR                         HDF5_DIR-NOTFOUND
    HDF5_VOL_DAOS_ENABLE_COVERAGE    OFF
    HDF5_VOL_DAOS_ENABLE_DEBUG       OFF
    HDF5_VOL_DAOS_ENABLE_MEM_TRACK   OFF
    HDF5_VOL_DAOS_TESTING_USE_SYST   ON
    HDF5_VOL_TEST_ENABLE_ASYNC       OFF
    HDF5_VOL_TEST_ENABLE_PARALLEL    ON
    HDF5_VOL_TEST_ENABLE_PART        OFF
    HGCOMMAND                        HGCOMMAND-NOTFOUND
    MAKECOMMAND                      /usr/bin/cmake --build . --config "${CTEST_CONFIGURATION_TYPE}" -- -i
    MEMORYCHECK_COMMAND              /usr/bin/valgrind
    MEMORYCHECK_SUPPRESSIONS_FILE
    MPIEXEC_EXECUTABLE               /opt/intel/oneapi/mpi/2021.1-beta09/bin/mpiexec
    MPIEXEC_MAX_NUMPROCS             48
    MPIEXEC_NUMPROC_FLAG             -n
    MPIEXEC_POSTFLAGS
    MPIEXEC_PREFLAGS
    MPI_C_ADDITIONAL_INCLUDE_DIRS
    MPI_C_COMPILER                   /opt/intel/oneapi/mpi/2021.1-beta09/bin/mpigcc
    MPI_C_COMPILE_DEFINITIONS
    MPI_C_COMPILE_OPTIONS
    MPI_C_HEADER_DIR                 /opt/intel/oneapi/mpi/2021.1-beta09/include
    MPI_C_LIB_NAMES                  mpi;rt;pthread;dl
    MPI_C_LINK_FLAGS                 -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /opt/intel/oneapi/mpi/2021.1-beta09/lib/release -Xlinker -
    MPI_dl_LIBRARY                   /usr/lib64/libdl.so
    MPI_mpi_LIBRARY                  /opt/intel/oneapi/mpi/2021.1-beta09/lib/release/libmpi.so
    MPI_pthread_LIBRARY              /usr/lib64/libpthread.so
    MPI_rt_LIBRARY                   /usr/lib64/librt.so
    P4COMMAND                        P4COMMAND-NOTFOUND
    PKG_CONFIG_EXECUTABLE            /usr/bin/pkg-config
    SCPCOMMAND                       /usr/bin/scp
    SITE                             jfcst-dev
    SLURM_SBATCH_COMMAND             SLURM_SBATCH_COMMAND-NOTFOUND
    SLURM_SRUN_COMMAND               SLURM_SRUN_COMMAND-NOTFOUND
    SVNCOMMAND                       SVNCOMMAND-NOTFOUND
    UUID_INCLUDE_DIR                 /usr/include
    UUID_LIBRARY                     /usr/lib64/libuuid.so

@shefty
Member

shefty commented Apr 7, 2021

I don't know what ior is, but if we use that, can we reproduce the problem more easily than by following that 2-page recipe?

@frostedcmos
Author

According to Mohamad, yes: the ior run can also be used to reproduce this without the daos-vol/hdf5 setup.
You would still need to launch daos (so that part of the setup is still needed), but once daos is running, all you should need to do is run ior as listed and Ctrl+C out of it after a few seconds.

@frostedcmos
Author

I've just now been able to reproduce the problem much more easily using a cart-level server and client.

Step 1: Start the server using the script below. (All runs assume you are starting from the top of the daos/ directory.)

Modify the 'HOST' environment variable to your hostname; also change INTERFACE_1/INTERFACE_2
if you don't have an ib0 interface (change them to eth0).

export CRT_PHY_ADDR_STR="ofi+tcp;ofi_rxm"
unset OFI_DOMAIN
export D_LOG_MASK=WARN
export INTERFACE_1=ib0
export INTERFACE_2=ib0
export CRT_TIMEOUT=10
export CRT_ATTACH_INFO_PATH="."
HOST="wolf-55"

SERVER_APP="./install/bin/crt_launch -e install/lib/daos/TESTING/tests/test_group_np_srv --name selftest_srv_grp --cfg_path=."

set -x
export OTHER_ENVARS="-x D_LOG_MASK -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT"
orterun -H ${HOST}:5 --np 4 -x OFI_INTERFACE=${INTERFACE_1} ${OTHER_ENVARS} ${SERVER_APP} : -H ${HOST}:5 --np 1 -x OFI_INTERFACE=${INTERFACE_2} ${OTHER_ENVARS} ${SERVER_APP}

Once it launches you should see output like this in the terminal:

SRV [rank=2 pid=118658] Basic server started, group_size=5
SRV [rank=2 pid=118658] Protocol registered
SRV [rank=2 pid=118658] Contexts created 1
SRV [rank=0 pid=118656] Basic server started, group_size=5
SRV [rank=0 pid=118656] Protocol registered
SRV [rank=0 pid=118656] Contexts created 1
SRV [rank=1 pid=118657] Basic server started, group_size=5
SRV [rank=1 pid=118657] Protocol registered
SRV [rank=1 pid=118657] Contexts created 1
SRV [rank=4 pid=118660] Basic server started, group_size=5
SRV [rank=4 pid=118660] Protocol registered
SRV [rank=4 pid=118660] Contexts created 1
SRV [rank=3 pid=118659] Basic server started, group_size=5
SRV [rank=3 pid=118659] Protocol registered
SRV [rank=3 pid=118659] Contexts created 1
SRV [rank=0 pid=118656] Group config file saved

In a separate terminal, launch self_test:
export CRT_PHY_ADDR_STR="ofi+tcp;ofi_rxm"
export OFI_INTERFACE=ib0
./install/bin/self_test --group-name selftest_srv_grp --endpoint 0-4:0 -q --message-sizes "b100048576" --max-inflight-rpcs 16 --repetitions 100 -t -n -p .

Wait a few seconds and Ctrl+C out of it.
In the terminal with the servers you will see output similar to:
--------------------------------------------------------------------------
orterun noticed that process rank 0 with PID 0 on node wolf-55 exited on signal 11 (Segmentation fault).

One of the traces from the generated core files (there seem to be a few different failure points) is:
(gdb) bt
#0 0x00007f321de352cc in rxm_handle_comp_error () from /home/aaoganez/github/daos/install/lib/daos/TESTING/tests/../../../../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#1 0x00007f321de2b350 in rxm_conn_handle_event () from /home/aaoganez/github/daos/install/lib/daos/TESTING/tests/../../../../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#2 0x00007f321de2c03c in rxm_conn_progress () from /home/aaoganez/github/daos/install/lib/daos/TESTING/tests/../../../../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#3 0x00007f321fecfe25 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f321ee49bad in clone () from /lib64/libc.so.6

@nikhilnanal
Contributor

Just running the servers and the clients: one of the servers seems to crash because it cannot create a mount point at /mnt/daos. The client and the other server report that they are listening.
[nnanal@cst-icx1 ~]$ clush -w cst-icx1,cst-icx2 -f 8 -o "-t -t" daos_server start -o ~/daos_server.yml --recreate-superblocks
cst-icx1: DAOS Server config loaded from /home/nnanal/daos_server.yml
cst-icx2: DAOS Server config loaded from /home/nnanal/daos_server.yml
cst-icx1: daos_server logging to file /tmp/daos_server.log
cst-icx2: daos_server logging to file /tmp/daos_server.log
cst-icx1: DAOS Control Server v1.1.3 (pid 22023) listening on 0.0.0.0:10001
cst-icx1: Checking DAOS I/O Engine instance 0 storage ...
cst-icx1: instance 0 exited: server: code = 638 description = "the SCM mountpoint at /mnt/daos is unavailable and can't be created/mounted"
cst-icx1: ERROR: removing socket file: removing instance 0 socket file: no dRPC client set (data plane not started?)
cst-icx1: &&& RAS EVENT id: [engine_status_down] ts: [2021-04-07T15:48:18.788085-0700] host: [cst-icx1.cluster] type: [STATE_CHANGE] sev: [ERROR] msg: [DAOS rank exited unexpectedly] pid: [22023]

cst-icx2: DAOS Control Server v1.1.3 (pid 14875) listening on 0.0.0.0:10001

[nnanal@cst-icx3 ~]$ daos_agent -i -s /tmp/daos_agent/ -o ~/daos_agent.yml
DAOS Agent v1.1.3 (pid 9583) listening on /tmp/daos_agent/daos_agent.sock

@frostedcmos
Author

@nikhilnanal please try just the cart-level reproducers, as they require significantly less setup.

In your case you might need to first mkdir /mnt/daos and make sure it is chmod-ed/chown-ed to the same user who launches daos.

@nikhilnanal
Contributor

I tried to run the script using cart. However, it cannot find crt_launch.

  • export 'OTHER_ENVARS=-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10'
  • OTHER_ENVARS='-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10'
  • orterun -H cst-icx1:5 --np 4 -x OFI_INTERFACE=mlx5 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10 ./install/bin/crt_launch -e install/lib/daos/TESTING/tests/test_group_np_srv --name selftest_srv_grp --cfg_path=. : -H cst-icx1:5 --np 1 -x OFI_INTERFACE=mlx5 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10 ./install/bin/crt_launch -e install/lib/daos/TESTING/tests/test_group_np_srv --name selftest_srv_grp --cfg_path=.

orterun was unable to launch the specified application as it could not access
or execute an executable:

Executable: ./install/bin/crt_launch
Node: cst-icx1

while attempting to start process rank 0.

4 total processes failed to start

Is there a specific version of daos I should build? I'm building daos v1.1.3.
Here is the command I used to build daos:
scons --config=force --build-deps=yes install (https://daos-stack.github.io/admin/installation/). Are there any other options that I should additionally provide?

@frostedcmos
Author

daos compiles some of the samples/tests optionally, based on whether MPI is found on your system.
In order to get crt_launch built, you need to set the MPI paths using:
module list (to see which MPIs are available)
module load [mpi variant you want to compile against]

after that:
scons --build-deps=yes --config=force MPI_PKG=any install

If everything is correct you should get crt_launch in your install/bin/ directory.

@frostedcmos
Author

@nikhilnanal
Please let us know whether you are able to reproduce the issue on the daos end, or are still having problems launching this test.

@nikhilnanal
Contributor

nikhilnanal commented Apr 15, 2021

I was able to build crt_launch, and it ran the test once with the output above. But the second time, and every time I've tried since, it gives these errors:

  • export 'OTHER_ENVARS=-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10'
  • OTHER_ENVARS='-x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10'
  • orterun -H cst-icx1:5 --np 4 -x OFI_INTERFACE=mlx5 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10 ./install/bin/crt_launch -e install/lib/daos/TESTING/tests/test_group_np_srv --name selftest_srv_grp --cfg_path=. : -H cst-icx1:5 --np 1 -x OFI_INTERFACE=mlx5 -x D_LOG_MASK=WARN -x CRT_PHY_ADDR_STR -x CRT_DISABLE_MEM_PIN=1 -x CRT_TIMEOUT=10 ./install/bin/crt_launch -e install/lib/daos/TESTING/tests/test_group_np_srv --name selftest_srv_grp --cfg_path=.

By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

Local host: cst-icx1
Local adapter: mlx5_0
Local port: 1



WARNING: There was an error initializing an OpenFabrics device.

Local host: cst-icx1
Local device: mlx5_0


WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

Local host: cst-icx1

04/15-10:18:04.49 cst-icx1 CaRT[260251/260251] fi WARN src/gurt/fault_inject.c:669 d_fault_inject_init() Fault Injection not initialized feature not included in build
04/15-10:18:04.49 cst-icx1 CaRT[260251/260251] crt WARN src/cart/crt_init.c:145 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
04/15-10:18:04.49 cst-icx1 CaRT[260251/260251] crt WARN src/cart/crt_init.c:410 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
04/15-10:18:04.49 cst-icx1 CaRT[260252/260252] fi WARN src/gurt/fault_inject.c:669 d_fault_inject_init() Fault Injection not initialized feature not included in build
04/15-10:18:04.49 cst-icx1 CaRT[260252/260252] crt WARN src/cart/crt_init.c:145 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
04/15-10:18:04.49 cst-icx1 CaRT[260252/260252] crt WARN src/cart/crt_init.c:410 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
04/15-10:18:04.49 cst-icx1 CaRT[260253/260253] fi WARN src/gurt/fault_inject.c:669 d_fault_inject_init() Fault Injection not initialized feature not included in build
04/15-10:18:04.49 cst-icx1 CaRT[260253/260253] crt WARN src/cart/crt_init.c:145 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
04/15-10:18:04.49 cst-icx1 CaRT[260253/260253] crt WARN src/cart/crt_init.c:410 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
04/15-10:18:04.49 cst-icx1 CaRT[260254/260254] fi WARN src/gurt/fault_inject.c:669 d_fault_inject_init() Fault Injection not initialized feature not included in build
04/15-10:18:04.49 cst-icx1 CaRT[260254/260254] crt WARN src/cart/crt_init.c:145 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
04/15-10:18:04.49 cst-icx1 CaRT[260254/260254] crt WARN src/cart/crt_init.c:410 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
04/15-10:18:04.49 cst-icx1 CaRT[260250/260250] fi WARN src/gurt/fault_inject.c:669 d_fault_inject_init() Fault Injection not initialized feature not included in build
04/15-10:18:04.49 cst-icx1 CaRT[260250/260250] crt WARN src/cart/crt_init.c:145 data_init() FI_UNIVERSE_SIZE was not set; setting to 2048
04/15-10:18:04.49 cst-icx1 CaRT[260250/260250] crt WARN src/cart/crt_init.c:410 crt_init_opt() FI_OFI_RXM_USE_SRX not set, set=1
04/15-10:18:04.53 cst-icx1 CaRT[260252/260252] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:916

hg_core_init(): Could not initialize NA class

04/15-10:18:04.53 cst-icx1 CaRT[260252/260252] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:4223

HG_Core_init_opt(): Cannot initialize HG core layer

04/15-10:18:04.53 cst-icx1 CaRT[260252/260252] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury.c:1010

HG_Init_opt(): Could not create HG core class

04/15-10:18:04.53 cst-icx1 CaRT[260252/260252] hg ERR src/cart/crt_hg.c:525 crt_hg_class_init() Could not initialize HG class.
04/15-10:18:04.53 cst-icx1 CaRT[260252/260252] rpc ERR src/cart/crt_context.c:210 crt_context_create() crt_hg_ctx_init() failed, DER_HG(-1020): 'Transport layer mercury error'
04/15-10:18:04.53 cst-icx1 CaRT[260252/260252] misc ERR src/utils/crt_launch/crt_launch.c:171 get_self_uri() crt_context_create() failed; rc=-1020
04/15-10:18:04.53 cst-icx1 CaRT[260252/260252] misc ERR src/utils/crt_launch/crt_launch.c:320 main() Failed to retrieve self uri
04/15-10:18:04.53 cst-icx1 CaRT[260251/260251] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:916

hg_core_init(): Could not initialize NA class

04/15-10:18:04.53 cst-icx1 CaRT[260251/260251] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:4223

HG_Core_init_opt(): Cannot initialize HG core layer

04/15-10:18:04.53 cst-icx1 CaRT[260251/260251] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury.c:1010

HG_Init_opt(): Could not create HG core class

04/15-10:18:04.53 cst-icx1 CaRT[260251/260251] hg ERR src/cart/crt_hg.c:525 crt_hg_class_init() Could not initialize HG class.
04/15-10:18:04.53 cst-icx1 CaRT[260251/260251] rpc ERR src/cart/crt_context.c:210 crt_context_create() crt_hg_ctx_init() failed, DER_HG(-1020): 'Transport layer mercury error'
04/15-10:18:04.53 cst-icx1 CaRT[260251/260251] misc ERR src/utils/crt_launch/crt_launch.c:171 get_self_uri() crt_context_create() failed; rc=-1020
04/15-10:18:04.53 cst-icx1 CaRT[260251/260251] misc ERR src/utils/crt_launch/crt_launch.c:320 main() Failed to retrieve self uri
04/15-10:18:04.53 cst-icx1 CaRT[260250/260250] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:916

hg_core_init(): Could not initialize NA class

04/15-10:18:04.53 cst-icx1 CaRT[260250/260250] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:4223

HG_Core_init_opt(): Cannot initialize HG core layer

04/15-10:18:04.53 cst-icx1 CaRT[260250/260250] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury.c:1010

HG_Init_opt(): Could not create HG core class

04/15-10:18:04.53 cst-icx1 CaRT[260250/260250] hg ERR src/cart/crt_hg.c:525 crt_hg_class_init() Could not initialize HG class.
04/15-10:18:04.53 cst-icx1 CaRT[260250/260250] rpc ERR src/cart/crt_context.c:210 crt_context_create() crt_hg_ctx_init() failed, DER_HG(-1020): 'Transport layer mercury error'
04/15-10:18:04.53 cst-icx1 CaRT[260250/260250] misc ERR src/utils/crt_launch/crt_launch.c:171 get_self_uri() crt_context_create() failed; rc=-1020
04/15-10:18:04.53 cst-icx1 CaRT[260250/260250] misc ERR src/utils/crt_launch/crt_launch.c:320 main() Failed to retrieve self uri
04/15-10:18:04.53 cst-icx1 CaRT[260253/260253] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:916

hg_core_init(): Could not initialize NA class

04/15-10:18:04.53 cst-icx1 CaRT[260253/260253] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:4223

HG_Core_init_opt(): Cannot initialize HG core layer

04/15-10:18:04.53 cst-icx1 CaRT[260253/260253] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury.c:1010

HG_Init_opt(): Could not create HG core class

04/15-10:18:04.53 cst-icx1 CaRT[260253/260253] hg ERR src/cart/crt_hg.c:525 crt_hg_class_init() Could not initialize HG class.
04/15-10:18:04.53 cst-icx1 CaRT[260253/260253] rpc ERR src/cart/crt_context.c:210 crt_context_create() crt_hg_ctx_init() failed, DER_HG(-1020): 'Transport layer mercury error'
04/15-10:18:04.53 cst-icx1 CaRT[260253/260253] misc ERR src/utils/crt_launch/crt_launch.c:171 get_self_uri() crt_context_create() failed; rc=-1020
04/15-10:18:04.53 cst-icx1 CaRT[260253/260253] misc ERR src/utils/crt_launch/crt_launch.c:320 main() Failed to retrieve self uri
04/15-10:18:04.53 cst-icx1 CaRT[260254/260254] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:916

hg_core_init(): Could not initialize NA class

04/15-10:18:04.53 cst-icx1 CaRT[260254/260254] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:4223

HG_Core_init_opt(): Cannot initialize HG core layer

04/15-10:18:04.53 cst-icx1 CaRT[260254/260254] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury.c:1010

HG_Init_opt(): Could not create HG core class

04/15-10:18:04.53 cst-icx1 CaRT[260254/260254] hg ERR src/cart/crt_hg.c:525 crt_hg_class_init() Could not initialize HG class.
04/15-10:18:04.53 cst-icx1 CaRT[260254/260254] rpc ERR src/cart/crt_context.c:210 crt_context_create() crt_hg_ctx_init() failed, DER_HG(-1020): 'Transport layer mercury error'
04/15-10:18:04.53 cst-icx1 CaRT[260254/260254] misc ERR src/utils/crt_launch/crt_launch.c:171 get_self_uri() crt_context_create() failed; rc=-1020
04/15-10:18:04.53 cst-icx1 CaRT[260254/260254] misc ERR src/utils/crt_launch/crt_launch.c:320 main() Failed to retrieve self uri
[jfcst-dev:3853119] 4 more processes have sent help message help-mpi-btl-openib.txt / ib port not selected
[jfcst-dev:3853119] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[jfcst-dev:3853119] 2 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[jfcst-dev:3853119] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
I have verified that the IB ports are active.

@nikhilnanal
Contributor

The OpenIB errors were present in the first trial as well, so I am not sure what's causing the HG errors.

@frostedcmos
Copy link
Author

Make sure there are no runaway processes from the first run; you might need to kill test_group_np_srv and test_group_np_cli manually.

@nikhilnanal
Contributor

Okay, that seems to work, thank you.
Another question I had about this step:
"In a separate terminal launch self_test:
export CRT_PHY_ADDR_STR="ofi+tcp;ofi_rxm"
export OFI_INTERFACE=ib0
./install/bin/self_test --group-name selftest_srv_grp --endpoint 0-4:0 -q --message-sizes "b100048576" --max-inflight-rpcs 16 --repetitions 100 -t -n -p .
"
Should this be run on a different node (client/server style) or on the same node that is running the first part of the script?

@frostedcmos
Author

You can run it on the same node; I only use 1 node in my own reproduction.

@nikhilnanal
Contributor

Okay, now it is showing up:
04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_bulk.c:2359

hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)

04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] st ERR src/cart/crt_self_test_service.c:509 crt_self_test_msg_bulk_put_cb() BULK_GET failed; bci_rc=-1020
04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_core.c:3194

hg_core_send_output_cb(): NA callback returned error (NA_PROTOCOL_ERROR)

04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] hg WARN src/cart/crt_hg.c:1153 crt_hg_reply_send_cb() hg_cbinfo->ret: 22, opc: 0xff030007.
04/15-11:15:26.70 cst-icx1 CaRT[261919/262079] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_bulk.c:2359

hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)

04/15-11:15:26.70 cst-icx1 CaRT[261919/262079] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
04/15-11:15:26.70 cst-icx1 CaRT[261919/262079] st ERR src/cart/crt_self_test_service.c:509 crt_self_test_msg_bulk_put_cb() BULK_GET failed; bci_rc=-1020
04/15-11:15:26.70 cst-icx1 CaRT[261915/262084] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_bulk.c:2359

hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)

04/15-11:15:26.70 cst-icx1 CaRT[261915/262084] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
04/15-11:15:26.70 cst-icx1 CaRT[261915/262084] st ERR src/cart/crt_self_test_service.c:509 crt_self_test_msg_bulk_put_cb() BULK_GET failed; bci_rc=-1020
04/15-11:15:26.70 cst-icx1 CaRT[261918/262077] external ERR # HG -- error -- /home/nnanal/gitrepos/daos/build/external/release/mercury/src/mercury_bulk.c:2359

hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)

04/15-11:15:26.70 cst-icx1 CaRT[261918/262077] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
04/15-11:15:26.70 cst-icx1 CaRT[261918/262077] st ERR src/cart/crt_self_test_service.c:509 crt_self_test_msg_bulk_put_cb() BULK_GET failed; bci_rc=-1020

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


orterun noticed that process rank 3 with PID 0 on node cst-icx1 exited on signal 11 (Segmentation fault).

What is the expected outcome after pressing Ctrl+C, other than the segmentation fault? Is it expected to terminate cleanly, or should the orterun processes not terminate?

@frostedcmos
Author

The expectation is for the servers to continue working when the client is Ctrl+C-ed out.

@frostedcmos
Author

Bulk transfer errors are expected, since we are terminating the bulk mid-transaction; however, the subsequent server crash is not.

@nikhilnanal
Contributor

OK, thank you. I'll try to debug from here.

@gnailzenh

I'm checking the source code and suspect rxm_conn_handle_notify() can leave a freed handle in rxm_cmap::handles_av. Could you confirm whether this change makes sense, or whether I have misunderstood the code? Thanks.

diff --git a/prov/rxm/src/rxm_conn.c b/prov/rxm/src/rxm_conn.c
index 30dd5c9d7..10206b520 100644
--- a/prov/rxm/src/rxm_conn.c
+++ b/prov/rxm/src/rxm_conn.c
@@ -1109,9 +1109,8 @@ static int rxm_conn_handle_notify(struct fi_eq_entry *eq_entry)
                dlist_remove(&handle->peer->entry);
                free(handle->peer);
                handle->peer = NULL;
-       } else {
-               cmap->handles_av[handle->fi_addr] = NULL;
        }
+       cmap->handles_av[handle->fi_addr] = NULL;
        rxm_conn_free(handle);
        return 0;
 }

@frostedcmos
Author

I've tried @gnailzenh's patch locally and it didn't seem to fix the issue.
Also, as an additional note: after a number of local experiments here, it appears that it's not precisely the aborted rdma that crashes the servers; it is the subsequent rpc that hits the rxm connection crash.
In the case of the cart-level reproducer above, the servers 'ping' each other every few seconds, and a ping after the aborted rdma transaction is the one generating the crash, which points to:
Program terminated with signal 11, Segmentation fault.
#0 0x00007f754b086f0c in rxm_handle_comp_error ()

@shefty
Member

shefty commented Apr 22, 2021

It's possible for handle->fi_addr == FI_ADDR_NOTAVAIL. The peer's address does not need to be in the AV. The initialization of handle only sets either fi_addr or peer, but it's not obvious to me if that requirement is always maintained. The if-else suggests we could add assert(handle->fi_addr == FI_ADDR_NOTAVAIL) in the if case. If that gets hit, then you've at least found one issue.
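For concreteness, here is where that assert would sit, based on the diff quoted earlier. This is only a sketch; the if (handle->peer) condition and the surrounding context are my assumption, since the snippet only shows the branch bodies:

#include <assert.h>

	if (handle->peer) {
		/* If fi_addr and peer are truly mutually exclusive, this
		 * assert should never fire; if it does, both fields were
		 * set for the same handle. */
		assert(handle->fi_addr == FI_ADDR_NOTAVAIL);
		dlist_remove(&handle->peer->entry);
		free(handle->peer);
		handle->peer = NULL;
	} else {
		cmap->handles_av[handle->fi_addr] = NULL;
	}
	rxm_conn_free(handle);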

@shefty
Member

shefty commented Apr 22, 2021

@frostedcmos - Is the segfault while running a debug version of libfabric?

@frostedcmos
Author

No, the retry just tonight was using the release version:
#0 0x00007f171bfc7f0c in rxm_handle_comp_error ()
from /home/aaoganez/github/daos/install/lib/daos/TESTING/tests/../../../../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#1 0x00007f171bfbc580 in rxm_conn_handle_event ()
from /home/aaoganez/github/daos/install/lib/daos/TESTING/tests/../../../../prereq/release/mercury/lib/../../ofi/lib/libfabric.so.1
#2 0x00007f171bfbd293 in rxm_conn_progress ()

@shefty
Member

shefty commented Apr 22, 2021

I suspect the completion error may be closer to the actual problem. Thanks. I'll analyze the tcp and rxm error reporting, particularly around the handling for internal messages.

@swelch
Contributor

swelch commented Apr 22, 2021

It's possible for handle->fi_addr == FI_ADDR_NOTAVAIL. The peer's address does not need to be in the AV. The initialization of handle only sets either fi_addr or peer, but it's not obvious to me if that requirement is always maintained. The if-else suggests we could add assert(handle->fi_addr == FI_ADDR_NOTAVAIL) in the if case. If that gets hit, then you've at least found one issue.

I believe in the case of a loopback connection (endpoint connects to itself) you will find that handle->fi_addr is valid and the peer exists as well. Other cases should be one or the other, since the peer is moved if the AV entry is later added. So it may still be a good idea to check both independently.
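A minimal sketch of the "check both independently" idea, using the field names from the quoted diff (the surrounding context is assumed, not taken verbatim from the source):

	if (handle->peer) {
		dlist_remove(&handle->peer->entry);
		free(handle->peer);
		handle->peer = NULL;
	}
	if (handle->fi_addr != FI_ADDR_NOTAVAIL)
		cmap->handles_av[handle->fi_addr] = NULL;
	rxm_conn_free(handle);

This clears the AV slot whenever fi_addr is valid, which also covers the loopback case where both fields are set.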

@shefty
Member

shefty commented Apr 23, 2021

PR #6707 attempts to handle completion errors better. It's hard to test those changes since it requires generating completion errors though. But it might help with the segfault in rxm_handle_comp_error() that was hit. @nikhilnanal - if you're at the point where you can reproduce the crash, testing the changes in that PR would be useful, at least to ensure that it didn't make things worse.

@swelch - Do you know where in the code path that occurs? I don't mind check both independently to be safe, but I'd like to understand if there is a real issue that we could be hitting.

@swelch
Contributor

swelch commented Apr 23, 2021

@swelch - Do you know where in the code path that occurs? I don't mind check both independently to be safe, but I'd like to understand if there is a real issue that we could be hitting.

@shefty - I believe it occurs when the local side has initiated a connect request to itself (hence the address is in its AV). Then, when processing the connect request in rxm_cmap_process_connreq(), the AV handle state will show as RXM_CMAP_CONNREQ_SENT; since the local and requester addresses are the same, a peer handle is allocated and its fi_addr set to the associated AV entry. The peer handle is used to create a new message endpoint and accept the connection. It seems the notify will close the peer but leave the handle in the AV valid. However, the more I think about it, we may get a notify for both the AV handle and the peer handle in this case; if true, it may ultimately close both the peer and the AV handle (but the connection is unusable once the first notify is received).

@shefty
Member

shefty commented Apr 23, 2021

@swelch -- Thanks, I see that path now. So it is possible for those both to be set. I'll add a fix to my open PR to address the problem pointed out by @gnailzenh.

@shefty
Member

shefty commented Apr 23, 2021

#6707 - updated with fi_addr fix.

@gnailzenh

Thanks, I also noticed there is a "TODO" in rxm_eq_sread():

+               /* TODO convert this to poll + fi_eq_read so that we can grab
+                * rxm_ep lock before reading the EQ. This is needed to avoid
+                * processing events / error entries from closed MSG EPs. This
+                * can be done only for non-Windows OSes as Windows doesn't
+                * have poll for a generic file descriptor. */

Because we are using auto-progress, is this a race that can happen?
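For reference, a minimal sketch of the pattern that TODO describes, assuming the EQ was opened with FI_WAIT_FD; the lock parameter is illustrative (the real rxm code would use its own ep lock), so this is not the actual provider implementation:

#include <poll.h>
#include <pthread.h>
#include <rdma/fabric.h>
#include <rdma/fi_eq.h>

static ssize_t eq_read_locked(struct fid_eq *eq, pthread_mutex_t *ep_lock,
			      int timeout_ms)
{
	struct fi_eq_cm_entry entry;
	struct pollfd pfd;
	uint32_t event;
	ssize_t ret;
	int fd;

	/* Get the EQ's wait fd so we can wait without blocking inside
	 * fi_eq_sread(); requires the EQ to be opened with FI_WAIT_FD. */
	ret = fi_control(&eq->fid, FI_GETWAIT, &fd);
	if (ret)
		return ret;

	pfd.fd = fd;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, timeout_ms) <= 0)
		return -FI_EAGAIN;

	/* Take the lock before reading, so events/error entries from MSG
	 * EPs that are being closed cannot be processed concurrently. */
	pthread_mutex_lock(ep_lock);
	ret = fi_eq_read(eq, &event, &entry, sizeof(entry), 0);
	pthread_mutex_unlock(ep_lock);
	return ret;
}

Whether this fully closes the race under auto-progress would still need the deep dive mentioned in the next comment.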

@shefty
Member

shefty commented Apr 27, 2021

Hmm... that sounds like it's describing a real race. I don't know for certain without doing a deep dive through the code to see how the cleanup occurs.

@shefty
Member

shefty commented May 15, 2021

I did find issues in the tcp provider where it could report completions for transfers that were NOT initiated by the upper level user. E.g. an internal ack. There are fixes for this in master.
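Schematically, the invariant such a fix enforces looks like this (purely illustrative; the flag and helper names here are hypothetical, not the actual tcp provider code):

	/* Completion processing: only transfers initiated by the user get
	 * written to the user's CQ; internal transfers (e.g. an ack) are
	 * released silently. */
	if (tx_entry->flags & INTERNAL_XFER)
		free_xfer_entry(tx_entry);      /* no CQ entry reported */
	else
		report_user_completion(cq, tx_entry->context);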

@gnailzenh

Hi, is there a tag for these fixes, or could you provide commit hashes of those patches?

@shefty
Member

shefty commented Jun 9, 2021

There's not a tag, but there will be a v1.13 release within about 3 weeks.

@shefty
Member

shefty commented Jun 9, 2021

Btw, I've rewritten the connection management code in rxm, which I hope will start us down a path of fixing all of the DAOS connection related issues. See #6778. The code is still under testing, and I'm hesitant to pull it into v1.13 without broader testing.

@frostedcmos
Author

Has there been any update on this? Is it planned to be merged into a post-v1.13 release anytime soon?

@j-xiong
Contributor

j-xiong commented Jul 16, 2021

#6778 has been replaced with #6833. This is being evaluated and will be merged once it is shown to be good.

@frostedcmos
Author

Update:
There appears to be a new regression, introduced sometime after v1.12, related to this ticket.

With tcp;ofi_rxm:
Reproducer runs with v1.12 (servers start up and the client connects to them)
Reproducer fails to run with v1.13 (servers start up but clients can't connect to them anymore)
Reproducer also fails to run with 7d6d2a1 (same behavior as with v1.13)

Reproducer still runs with sockets and verbs;ofi_rxm.

@shefty
Member

shefty commented Oct 1, 2021

#7110 resolved the issue in local testing, when added on top of other fixes in main.
