
Cray CXI SHS11.1 and openmpi@main fail with intra-node communication #13148


Closed
germanne opened this issue Mar 17, 2025 · 19 comments

@germanne

germanne commented Mar 17, 2025

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

Branch main, 10 March 2025

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Was installed via Spack, using OSS libfabric and cxi support. Compiler args:

--enable-shared --disable-silent-rules --disable-sphinx --enable-builtin-atomics --disable-static --with-slingshot --enable-mpi1-compatibility --without-psm --without-psm2 --without-fca --without-cma --without-knem --with-xpmem=/usr --without-hcoll --without-mxm --with-ofi=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/libfabric-main-iegtmu74ojgb2pvywqfrlzvedlcz7cps --without-ucc --without-ucx --without-verbs --with-cray-xpmem --without-sge --without-alps --without-loadleveler --without-tm --with-slurm --without-lsf --disable-memchecker --with-libevent=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-7.5.0/libevent-2.1.12-p5qzh7hez5qdbtl7j3avjhf4mry5fm4n --without-lustre --with-pmix=internal --with-zlib=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-7.5.0/zlib-ng-2.2.3-o7xbhxhnqpk5ljcpjmgple6rv3i75z2h --with-hwloc=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/hwloc-2.11.1-3jxzkohocpqjyvd2irytycprfi2bom5q --disable-java --disable-mpi-java --disable-io-romio --with-gpfs=no --enable-dlopen --with-cuda=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/cuda-12.8.0-ne7ulo7g6hhe7dv5nhh4nxchdenls3r2 --with-cuda-libdir=/afs/psi.ch/sys/spack/develop/opt/spack/testing/[padded-to-256-chars]/linux-sles15-aarch64/gcc-14.2.0/cuda-12.8.0-ne7ulo7g6hhe7dv5nhh4nxchdenls3r2/lib64/stubs --enable-wrapper-rpath --disable-wrapper-runpath --with-wrapper-ldflags=-Wl,-rpath,/afs/psi.ch/sys/spack/develop/opt/spack/unstable/linux-sles15-aarch64/gcc-7.5.0/gcc-14.2.0-tln2ck4lolcipi2fj2klu5dei3oac4sv/lib/gcc/aarch64-unknown-linux-gnu/14.2.0 -Wl,-rpath,/afs/psi.ch/sys/spack/develop/opt/spack/unstable/linux-sles15-aarch64/gcc-7.5.0/gcc-14.2.0-tln2ck4lolcipi2fj2klu5dei3oac4sv/lib64 CFLAGS=-DYY_BUF_SIZE=1048576 --disable-debug

spack spec:

-   scyqclc  openmpi@main%gcc@14.2.0+atomics+cuda~debug~gpfs~internal-hwloc~internal-libevent+internal-pmix~java~lustre~memchecker~openshmem~romio+rsh~static~two_level_namespace+vt+wrapper-rpath build_system=autotools cuda_arch=90 fabrics=ofi,xpmem romio-filesystem=none schedulers=slurm arch=linux-sles15-aarch64
[+]  mcdzcmr      ^autoconf@2.72%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  btyzacb      ^automake@1.16.5%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  ne7ulo7      ^cuda@12.8.0%gcc@14.2.0~allow-unsupported-compilers~dev build_system=generic arch=linux-sles15-aarch64
[+]  gzc3f4t          ^libxml2@2.13.5%gcc@7.5.0~http+pic~python+shared build_system=autotools arch=linux-sles15-aarch64
[+]  bu4jqoi              ^libiconv@1.17%gcc@7.5.0 build_system=autotools libs=shared,static arch=linux-sles15-aarch64
[+]  5ss23k5              ^pkg-config@0.29.2%gcc@7.5.0+internal_glib build_system=autotools arch=linux-sles15-aarch64
[e]  fkyyhdc              ^xz@5.2.3%gcc@7.5.0~pic build_system=autotools libs=shared,static arch=linux-sles15-aarch64
[+]  nyb2vfy      ^gcc-runtime@14.2.0%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[e]  3egpojh      ^glibc@2.31%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  hpibhrn      ^gnuconfig@2024-07-27%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  3jxzkoh      ^hwloc@2.11.1%gcc@14.2.0~cairo+cuda~gl~level_zero~libudev+libxml2~nvml~opencl+pci~rocm build_system=autotools cuda_arch=90 libs=shared,static arch=linux-sles15-aarch64
[+]  4eajtzs          ^libpciaccess@0.17%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  txi65ob              ^gcc-runtime@7.5.0%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  nwu26be              ^util-macros@1.20.1%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  wfg3cd7                  ^gcc-runtime@12.3%gcc@12.3 build_system=generic arch=linux-sles15-aarch64
[+]  zia4ebj          ^ncurses@6.5%gcc@7.5.0~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-sles15-aarch64
[+]  p5qzh7h      ^libevent@2.1.12%gcc@7.5.0+openssl build_system=autotools arch=linux-sles15-aarch64
[+]  m3gwtgf          ^openssl@3.4.0%gcc@7.5.0~docs+shared build_system=generic certs=mozilla arch=linux-sles15-aarch64
[+]  3aq2syu              ^ca-certificates-mozilla@2023-05-30%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[e]  eyczfjv              ^perl@5.26.1%gcc@7.5.0+cpanm+opcode+open+shared+threads build_system=generic patches=0eac10e,8cf4302 arch=linux-sles15-aarch64
 -   zgkq6vw      ^libfabric@main%gcc@14.2.0+cuda~debug~kdreg~level_zero+uring build_system=autotools cuda_arch=90 fabrics=cxi,sockets,tcp,udp,xpmem arch=linux-sles15-aarch64
[+]  u5d4zw4          ^curl@8.11.1%gcc@7.5.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2+nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-sles15-aarch64
[+]  6mvnrnk              ^nghttp2@1.64.0%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  5t2hvib          ^json-c@0.16%gcc@7.5.0~ipo build_system=cmake build_type=Release generator=make arch=linux-sles15-aarch64
[+]  u2nmjzn              ^cmake@3.31.4%gcc@7.5.0~doc+ncurses+ownlibs~qtgui build_system=generic build_type=Release arch=linux-sles15-aarch64
 -   nhjhwto          ^libcxi@main%gcc@14.2.0+cuda~level_zero~rocm build_system=autotools arch=linux-sles15-aarch64
 -   5wgs3er              ^cassini-headers@main%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
 -   u6zzijb              ^cxi-driver@main%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  7cnxi2c              ^libconfig@1.7.3%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  rq33jed                  ^automake@1.16.5%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  dcjewuo                      ^autoconf@2.72%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  l6mjl5c                  ^gcc-runtime@14.2.0%gcc@14.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  p7ozge3                  ^libtool@2.4.7%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  trzm7v5                      ^findutils@4.9.0%gcc@7.5.0 build_system=autotools patches=440b954 arch=linux-sles15-aarch64
[+]  jppuqwv                      ^m4@1.4.19%gcc@7.5.0+sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-sles15-aarch64
[+]  ivhh3c7                          ^libsigsegv@2.14%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  oj3ovvr                  ^texinfo@7.1%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  laxgbis                      ^gcc-runtime@12.3%gcc@12.3 build_system=generic arch=linux-sles15-aarch64
[+]  fmk7hej                      ^ncurses@6.5%gcc@7.5.0~symlinks+termlib abi=none build_system=autotools patches=7a351bc arch=linux-sles15-aarch64
 -   pjzsvu4              ^libfuse@2.9.9%gcc@14.2.0~strip~system_install~useroot+utils build_system=meson buildtype=release default_library=shared arch=linux-sles15-aarch64
[+]  yczbssx                  ^meson@1.5.1%gcc@7.5.0 build_system=python_pip patches=0f0b1bd arch=linux-sles15-aarch64
[+]  zvoecxo                      ^py-pip@24.3.1%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  qqyoi74                      ^py-setuptools@75.8.0%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  wk6kswr                      ^py-wheel@0.41.2%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  q7jjsue                      ^python@3.12.8%gcc@7.5.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic arch=linux-sles15-aarch64
[+]  3hlzxo5                          ^expat@2.6.4%gcc@7.5.0+libbsd build_system=autotools arch=linux-sles15-aarch64
[+]  lducxxr                              ^libbsd@0.10.0%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  qerkf3p                          ^gdbm@1.24%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  ds2kwc3                          ^libffi@3.4.6%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  mfeth7l                          ^readline@8.2%gcc@7.5.0 build_system=autotools patches=1ea4349,24f587b,3d9885e,5911a5b,622ba38,6c8adf8,758e2ec,79572ee,a177edc,bbf97f1,c7b45ff,e0013d9,e065038 arch=linux-sles15-aarch64
[+]  teiwdpd                          ^sqlite@3.46.0%gcc@7.5.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-sles15-aarch64
[+]  suaqfjx                          ^util-linux-uuid@2.40.2%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  hjcwama                      ^python-venv@1.0%gcc@7.5.0 build_system=generic arch=linux-sles15-aarch64
[+]  saybo2v                  ^ninja@1.12.1%gcc@7.5.0~re2c build_system=generic arch=linux-sles15-aarch64
[+]  zq5s6y2                      ^gcc-runtime@13.2.0%gcc@13.2.0 build_system=generic arch=linux-sles15-aarch64
[+]  3bgmubm                      ^python@3.8.19%gcc@7.5.0~bz2~crypt+ctypes~dbm~debug+libxml2+lzma~nis~optimizations+pic~pyexpat+pythoncmd~readline+shared~sqlite3~ssl~tkinter~uuid+zlib build_system=generic patches=0d98e93,4c24573,ebdca64,f2fd060 arch=linux-sles15-aarch64
[e]  gnju5co                          ^gettext@0.20.2%gcc@7.5.0+bzip2+curses+git~libunistring+libxml2+pic+shared+tar+xz build_system=autotools arch=linux-sles15-aarch64
[+]  s2jrvfo                          ^zlib-ng@2.1.6%gcc@7.5.0+compat+new_strategies+opt+pic+shared build_system=autotools arch=linux-sles15-aarch64
[+]  qlqjhch                              ^gnuconfig@2022-09-17%gcc@12.3 build_system=generic arch=linux-sles15-aarch64
[+]  xnbrchw              ^libnl@3.3.0%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  3aey2ot                  ^flex@2.6.3%gcc@7.5.0+lex~nls build_system=autotools arch=linux-sles15-aarch64
[+]  3e23eaj              ^libuv@1.48.0%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  7sa6suu              ^libyaml@0.2.5%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  pnnsys3              ^lm-sensors@3-6-0%gcc@14.2.0 build_system=makefile arch=linux-sles15-aarch64
[+]  sawly4e                  ^bison@3.8.2%gcc@7.5.0~color build_system=autotools arch=linux-sles15-aarch64
[+]  ghouivi                      ^m4@1.4.19%gcc@7.5.0~sigsegv build_system=autotools patches=9dc5fbd,bfdffa7 arch=linux-sles15-aarch64
[+]  r2g4qhm                  ^flex@2.6.3%gcc@7.5.0+lex~nls build_system=autotools arch=linux-sles15-aarch64
[+]  5cb63ad          ^liburing@2.3%gcc@14.2.0 build_system=autotools arch=linux-sles15-aarch64
[+]  3gxsior      ^numactl@2.0.18%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  l2qugjv          ^autoconf@2.72%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  tvosith          ^automake@1.16.5%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  2xmogbm      ^openssh@9.9p1%gcc@7.5.0+gssapi build_system=autotools arch=linux-sles15-aarch64
[+]  7cgifzm          ^krb5@1.21.3%gcc@7.5.0+shared build_system=autotools arch=linux-sles15-aarch64
[+]  q7hhrig              ^bison@3.8.2%gcc@7.5.0~color build_system=autotools arch=linux-sles15-aarch64
[+]  obetosr          ^libedit@3.1-20240808%gcc@7.5.0 build_system=autotools arch=linux-sles15-aarch64
[+]  w6hnyfk          ^libxcrypt@4.4.35%gcc@7.5.0~obsolete_api build_system=autotools patches=4885da3 arch=linux-sles15-aarch64
[+]  iacvnhj      ^pkg-config@0.29.2%gcc@7.5.0+internal_glib build_system=autotools arch=linux-sles15-aarch64
[e]  2d7jkg5      ^slurm@24.05.3%gcc@7.5.0+cgroup~cray_shasta+gtk~hdf5+hwloc+mariadb+nvml+pam+pmix+readline+restd~rsmi build_system=autotools sysconfdir=PREFIX/etc arch=linux-sles15-aarch64
[e]  znxqplr      ^xpmem@2.9.6-1.1%gcc@14.2.0+kernel-module build_system=autotools arch=linux-sles15-aarch64

Please describe the system on which you are running

  • Operating system/version: SLES15 14.21-150500.55.65_13.0.73-cray_shasta_c_64k aarch64
  • Computer hardware: Grace Hopper GPU, aarch64
  • Network type: CXI, SHS11.1

Details of the problem

Multi-node jobs run without any problem. Multi-task jobs on a single node fail with the following error:

[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, but peer [[46903,1],0] on gpu001 selected pml cm

ompi_info reports that the btl ofi component is present, but it still seems to fail:

ompi_info
...
MCA btl: ofi (MCA v2.1.0, API v3.3.0, Component v5.1.0)
shell$ mpirun --mca btl_base_verbose 100 -np 2 osu_bw -d cuda D D
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: registering framework btl components
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component self
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component self register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component ofi
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component ofi register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component sm
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component sm register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component tcp
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component tcp register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: found loaded component smcuda
[gpu001.merlin7.psi.ch:177843] mca: base: components_register: component smcuda register function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: opening btl components
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component self
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component self open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component ofi
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component ofi open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component sm
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component sm open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component tcp
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component tcp open function successful
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: found loaded component smcuda
[gpu001.merlin7.psi.ch:177843] btl: smcuda: cuda_max_send_size=131072, max_send_size=32768, max_frag_size=131072
[gpu001.merlin7.psi.ch:177843] mca: base: components_open: component smcuda open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: registering framework btl components
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component self
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component self register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component ofi
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component ofi register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component sm
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component sm register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component tcp
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component tcp register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: found loaded component smcuda
[gpu001.merlin7.psi.ch:177842] mca: base: components_register: component smcuda register function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: opening btl components
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component self
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component self open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component ofi
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component ofi open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component sm
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component sm open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component tcp
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component tcp open function successful
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: found loaded component smcuda
[gpu001.merlin7.psi.ch:177842] btl: smcuda: cuda_max_send_size=131072, max_send_size=32768, max_frag_size=131072
[gpu001.merlin7.psi.ch:177842] mca: base: components_open: component smcuda open function successful
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_domain).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: gpu001
  Location: mtl_ofi_component.c:1007
  Error: Function not implemented (70368744177702)
--------------------------------------------------------------------------
[gpu001.merlin7.psi.ch:177843] select: initializing btl component self
[gpu001.merlin7.psi.ch:177843] select: init of component self returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component ofi
[gpu001.merlin7.psi.ch:177842] select: initializing btl component self
[gpu001.merlin7.psi.ch:177842] select: init of component self returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component ofi
[gpu001.merlin7.psi.ch:177843] select: init of component ofi returned failure
[gpu001.merlin7.psi.ch:177842] select: init of component ofi returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component sm
[gpu001.merlin7.psi.ch:177842] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component tcp
[gpu001.merlin7.psi.ch:177842] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[gpu001.merlin7.psi.ch:177842] btl: tcp: Found match: 127.0.0.1 (lo)
[gpu001.merlin7.psi.ch:177842] btl: tcp: Using interface: sppp 
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x323ce000: if nmn0 kidx 2 cnt 0 addr 10.100.36.33 IPv4 bw 1000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32860a90: if hsn0 kidx 3 cnt 0 addr 172.30.138.1 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32861280: if hsn2 kidx 4 cnt 0 addr 172.30.138.3 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32861b70: if hsn3 kidx 5 cnt 0 addr 172.30.138.4 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: 0x32862380: if hsn1 kidx 6 cnt 0 addr 172.30.138.2 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177842] btl:tcp: Attempting to bind to AF_INET port 1024
[gpu001.merlin7.psi.ch:177842] btl:tcp: Successfully bound to AF_INET port 1024
[gpu001.merlin7.psi.ch:177842] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 0 2 IPv4 10.100.36.33
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 1 3 IPv4 172.30.138.1
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 2 4 IPv4 172.30.138.3
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 3 5 IPv4 172.30.138.4
[gpu001.merlin7.psi.ch:177842] btl: tcp: exchange: 4 6 IPv4 172.30.138.2
[gpu001.merlin7.psi.ch:177842] select: init of component tcp returned success
[gpu001.merlin7.psi.ch:177842] select: initializing btl component smcuda
[gpu001.merlin7.psi.ch:177842] select: init of component smcuda returned success
[gpu001.merlin7.psi.ch:177843] mca: base: close: component ofi closed
[gpu001.merlin7.psi.ch:177843] mca: base: close: unloading component ofi
[gpu001.merlin7.psi.ch:177843] select: initializing btl component sm
[gpu001.merlin7.psi.ch:177843] select: init of component sm returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component tcp
[gpu001.merlin7.psi.ch:177843] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[gpu001.merlin7.psi.ch:177843] btl: tcp: Found match: 127.0.0.1 (lo)
[gpu001.merlin7.psi.ch:177843] btl: tcp: Using interface: sppp 
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a72a80: if nmn0 kidx 2 cnt 0 addr 10.100.36.33 IPv4 bw 1000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a72f80: if hsn0 kidx 3 cnt 0 addr 172.30.138.1 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a73660: if hsn2 kidx 4 cnt 0 addr 172.30.138.3 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a6d130: if hsn3 kidx 5 cnt 0 addr 172.30.138.4 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: 0x37a6da20: if hsn1 kidx 6 cnt 0 addr 172.30.138.2 IPv4 bw 200000 lt 100
[gpu001.merlin7.psi.ch:177843] btl:tcp: Attempting to bind to AF_INET port 1024
[gpu001.merlin7.psi.ch:177843] btl:tcp: Attempting to bind to AF_INET port 1025
[gpu001.merlin7.psi.ch:177843] btl:tcp: Successfully bound to AF_INET port 1025
[gpu001.merlin7.psi.ch:177843] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 0 2 IPv4 10.100.36.33
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 1 3 IPv4 172.30.138.1
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 2 4 IPv4 172.30.138.3
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 3 5 IPv4 172.30.138.4
[gpu001.merlin7.psi.ch:177843] btl: tcp: exchange: 4 6 IPv4 172.30.138.2
[gpu001.merlin7.psi.ch:177843] select: init of component tcp returned success
[gpu001.merlin7.psi.ch:177843] select: initializing btl component smcuda
[gpu001.merlin7.psi.ch:177843] select: init of component smcuda returned success
[gpu001.merlin7.psi.ch:177843] [[46903,1],1] selected pml ob1, but peer [[46903,1],0] on gpu001 selected pml cm
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable
from another.  This *usually* means that an underlying communication
plugin -- such as a BTL or an MTL -- has either not loaded or not
allowed itself to be used.  Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are
   available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
   if using MTL-based communications) to see exactly which
   communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_mpi_instance_init failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[gpu001:00000] *** An error occurred in MPI_Init
[gpu001:00000] *** reported by process [3073835009,281470681743361]
[gpu001:00000] *** on a NULL communicator
[gpu001:00000] *** Unknown error
[gpu001:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[gpu001:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

It seems very similar to issue #12038, but since I am using the main branch, this should have been fixed in the meantime...

Thanks a lot for any help in advance!

germanne changed the title from "Cray CXI SHS11.1 and openmpi@main fail with inter-node communication" to "Cray CXI SHS11.1 and openmpi@main fail with intra-node communication" Mar 17, 2025
@hppritcha
Member

You need to set some non-default MCA parameters for the OFI BTL. If you use environment variables to set the MCA params, here they are:

OMPI_MCA_btl_ofi_mode=2

You may also need to set the following PRRTE MCA params:

PRTE_MCA_ras_slurm_use_entire_allocation=1
PRTE_MCA_ras_base_launch_orted_on_hn=1

We set these in the Spack-generated module files we use at NERSC and on our internal SS11 systems.
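
For a one-off run, roughly the same thing can be passed on the mpirun command line instead of via the environment; a sketch (assuming PRRTE's --prtemca option for the PRTE-level parameters, and substituting your own binary):

# command-line equivalent of the environment variables above (sketch)
mpirun --mca btl_ofi_mode 2 \
       --prtemca ras_slurm_use_entire_allocation 1 \
       --prtemca ras_base_launch_orted_on_hn 1 \
       -np 2 ./your_app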

We have not done extensive testing of Open MPI against the libfabric 2.0.0 or 2.1.0 release candidates, if you are using one of those.

Here's a snippet from the modules.yaml file I like to use:

      openmpi:
        environment:
          set:
            'FI_CXI_RX_MATCH_MODE': 'software'
            'FI_PROVIDER': 'cxi'
            'OMPI_MCA_btl_ofi_disable_sep': 'true'
            'OMPI_MCA_btl_ofi_mode': '2'
            'OMPI_MCA_btl_ofi_provider_include': 'cxi'
            'OMPI_MCA_mtl_ofi_provider_include': 'cxi'
            'OMPI_MCA_pml': 'cm'
            'PMIX_MCA_psec': 'native'
            'PRTE_MCA_ras_base_launch_orted_on_hn': '1'
            'PRTE_MCA_ras_slurm_use_entire_allocation': '1'
            'SLURM_MPI_TYPE': 'pmix'

We're setting FI_CXI_RX_MATCH_MODE to software, as it typically seems to give better performance, although mileage varies depending on the app.

We set OMPI_MCA_pml to cm as it's the quickest way to find out if something's not working. We are finding, though, that a number of apps do better using the ob1 PML.

@hppritcha
Member

Hmm...this may be a new problem. Could you see what happens if you try to force the run to use the OB1 PML?

mpirun --mca pml ob1 ........

Also, what happens if you use libfabric@1.22.0 ?

@germanne
Author

germanne commented Mar 17, 2025

Dear Howard,

Thank you so much for your very quick answer! Unfortunately, I still get the exact same error, even with all the variables set...

I get this when running the command requested above:

[gpu005][[60577,1],0][btl_ofi_module.c:88:mca_btl_ofi_add_procs] error receiving modex
[gpu005][[60577,1],0][btl_ofi_component.c:244:mca_btl_ofi_exit] BTL OFI will now abort.

I do not use libfabric@1.22.0 because, if I remember correctly, it either fails to build when trying to use the open-source CXI provider or it ends up failing performance-wise.

@germanne
Author

germanne commented Mar 17, 2025

Edit: I tried pinning the libfabric version to 1.22.0; unfortunately, I get the exact same error.
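
For reference, this is roughly how the pin looks in the Spack spec (a sketch; the actual variants follow the spec posted above):

spack install openmpi@main fabrics=ofi,xpmem ^libfabric@1.22.0 fabrics=cxi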

@hppritcha
Member

Does the test run successfully if we get OFI out of the picture?

mpirun -np 2 --mca pml ob1 --mca btl ^ofi

@germanne
Author

Indeed, it's then successful! But obviously with poor bandwidth.

@germanne
Author

germanne commented Mar 17, 2025

Ah, something I forgot to mention but that is really important: this exact same Open MPI build was working with SHS 2.1.3. We upgraded the system image and are unfortunately now unable to run intra-node Open MPI jobs correctly anymore.

@hppritcha
Member

Could you try running with the OB1 PML and the OFI BTL, with FI_LOG_LEVEL=debug set, to see what's going on?
I suspect that there's some kind of "optimization" in the Slurm/OFI interaction when you only request one node that disables the use of OFI.
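
Something along these lines (a sketch; adjust the binary and flags to your setup):

# force ob1 + the OFI BTL and turn on libfabric debug logging (sketch)
FI_LOG_LEVEL=debug mpirun -np 2 --mca pml ob1 --mca btl self,sm,ofi \
    --mca btl_base_verbose 100 ./osu_bw -d cuda D D 2>&1 | tee fi_debug.log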

@hppritcha
Member

Something else to try: if you build Open MPI against the system libfabric, does it work?

@germanne
Author

germanne commented Mar 18, 2025

The output with FI_LOG_LEVEL=debug is attached: slurm-23652.txt

This line is definitely suspicious:

libfabric:15515:1742284920::cxi:domain:cxip_domain():1845<warn> gpu003.merlin7.psi.ch: cxip_gen_auth_key failed: -38:Function not implemented
libfabric:15516:1742284920::cxi:core:fi_param_get_():372<info> variable req_buf_min_posted=<not set>

Note that I am still setting:

export OMPI_MCA_btl_ofi_mode=2
export PRTE_MCA_ras_slurm_use_entire_allocation=1
export PRTE_MCA_ras_base_launch_orted_on_hn=1

@germanne
Author

germanne commented Mar 18, 2025

Another interesting data point: using the system libfabric doesn't solve the problem either...

Currently Loaded Modules:
 1) zstd/1.5.6-jcbw                      2) gcc/14.2.0
 3) hwloc/2.11.1-GH200-gpu               4) libfabric/1.22.0
 5) xpmem/2.9.6-1.1                      6) zlib-ng/2.2.3-o7xb
 7) openmpi/main-v6sz-GH200-gpu          8) cuda/12.8.0-ne7u
 9) osu-micro-benchmarks/7.5-GH200-gpu  

[gpu003][[39322,1],0][btl_ofi_module.c:88:mca_btl_ofi_add_procs] error receiving modex
[gpu003][[39322,1],0][btl_ofi_component.c:244:mca_btl_ofi_exit] BTL OFI will now abort.
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

And I checked with ldd: no OSS libraries, only the system ones.

Edit: it might only be a problem with the newest version of Open MPI; it seems I don't hit it using 5.0.3!

@germanne
Author

OK... I kind of found a solution: requesting our nodes with --exclusive solves the problem. That is definitely not ideal for us, but it is at least a workaround.
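
For anyone hitting the same thing, the workaround is just to request the node exclusively, e.g. (sketch):

# in the batch script
#SBATCH --exclusive
# or interactively
salloc -N 1 -n 2 --exclusive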

@hppritcha
Member

I think we need to poke around with how Slurm is configured. You may want to see if there were changes made to the Slurm version and configuration as part of the upgrade to SHS 11.1. I am pointing our Slurm admin to this issue to take a look.

@hppritcha
Member

Could you run this command

scontrol show config | grep -i switchparameters

and paste the output into this issue?

On one of our XC systems this reports:

SwitchParameters        = vnis=32768-65535,def_les=16,max_ptes=2031,max_tgqs=494,def_eqs=2012,max_eqs=1023,max_acs=1018,max_cts=1023,job_vni=user

@germanne
Author

germanne commented Mar 19, 2025

We tried the three SwitchParameters settings shown below on our test system; unfortunately, it doesn't help. I will try to recompile Slurm, because it was built against libfabric 1.15.2.0; I am not sure this should be a problem, but still.

gpu009:~ # scontrol show conf | grep Switch
SwitchParameters        = vnis=32768-65535,def_les=16,max_ptes=2031,max_tgqs=494,def_eqs=2012,max_eqs=1023,max_acs=1018,max_cts=1023,job_vni=user
SwitchType              = switch/hpe_slingshot
tgmerlin7-slurmctld01:~ #  scontrol show config | grep -iE 'switch|mpi'
MpiDefault              = cray_shasta
MpiParams               = ports=20000-32767
SwitchParameters        = vnis=32768-65535,job_vni,job_vni,def_tles=0,def_les=0
SwitchType              = switch/hpe_slingshot
login001:~ # scontrol show config | grep -iE 'switch|mpi'
MpiDefault              = pmi2
MpiParams               = (null)
SwitchParameters        = (null)
SwitchType              = (null)
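
Regarding the libfabric-at-build-time question above, a quick way to see which libfabric/libcxi the Slurm Slingshot switch plugin actually links is something like this (a sketch; the plugin path is an assumption and will differ per install):

ldd /usr/lib64/slurm/switch_hpe_slingshot.so | grep -iE 'fabric|cxi'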

@germanne
Author

FYI: I just experimented with another version of Slurm and got roughly the same problem; this was just to make sure it's not Slurm 24.11.3 that's causing it.

[gpu009][[42113,0],0][btl_ofi_module.c:88:mca_btl_ofi_add_procs] error receiving modex
[gpu009][[42113,0],0][btl_ofi_component.c:244:mca_btl_ofi_exit] BTL OFI will now abort.

@germanne
Author

Thank you very much for all the help, and apologies for the time this took. The problem was a setting on the CXI network of the machine that was configured incorrectly; fortunately, we were able to correct it. It has nothing to do with Slurm, Open MPI, or libfabric themselves.

@hppritcha
Member

Would you mind sharing that CXI network setting here? I'm sure it's only a matter of time before this resurfaces somewhere else.

germanne reopened this Mar 31, 2025
@germanne
Author

germanne commented Mar 31, 2025

Sure, sorry, I was just checking how much is OK to share.

So, after the upgrade from SHS 2.1.3 to SHS 11.1, the kernel module cxi_core was renamed to cxi_ss1. We enabled the CXI service generically for all nodes (cxi_service -s 1 enable), but on the GPU nodes only cxi0 was updated. Each interface can be enabled with:

cxi_service -d cxi0 enable -s 1
cxi_service -d cxi1 enable -s 1
cxi_service -d cxi2 enable -s 1
cxi_service -d cxi3 enable -s 1

Or directly by setting the proper boot parameter cxi_ss1.disable_default_svc=0. We are now getting consistent, but not optimal, performance:

  • 24 GB/s with osu_bw D D using CXI alone. By compiling libfabric with Linkx support we can reach up to 120 GB/s:
mpirun  --mca mtl ofi --mca opal_common_ofi_provider_include "shm+cxi:linkx" --map-by ppr:1:l3cache --bind-to core --np 2 osu_bw -d cuda D D
# OSU MPI-CUDA Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       0.23
2                       0.47
4                       0.94
8                       1.89
16                      3.81
32                      7.60
64                     15.24
128                    30.43
256                    60.89
512                   120.38
1024                  241.64
2048                  483.08
4096                  962.28
8192                 1928.12
16384                3841.19
32768                7663.60
65536               15275.76
131072              30226.04
262144              58536.63
524288              80619.33
1048576             98711.76
2097152            111745.70
4194304            120530.83

However, using Linkx between two nodes we still get very poor performance. A new issue will be opened in case this is not solved, but for now this is being addressed at the libfabric level.
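
For the boot-parameter route mentioned above, one way to check and persist the setting (a sketch; the modprobe.d path is the standard mechanism, adjust for your image):

# check how the default CXI service flag is set on the running kernel
grep -o 'cxi_ss1\.disable_default_svc=[01]' /proc/cmdline
# or persist it as a module option instead of a kernel argument
# (assumed path; may require rebuilding the initrd if the module loads early)
echo 'options cxi_ss1 disable_default_svc=0' > /etc/modprobe.d/cxi_ss1.conf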
