Test test_square_sparse_rma fails #599

Closed
mbanck opened this issue Apr 2, 2022 · 3 comments

mbanck commented Apr 2, 2022

Describe the bug

On Debian unstable, running the testsuite fails in test_square_sparse_rma:

10: Test command: /usr/bin/mpiexec "-n" "8" "/<<PKGBUILDDIR>>/obj-x86_64-linux-gnu/tests/dbcsr_perf" "/<<PKGBUILDDIR>>/tests/inputs/test_square_sparse_rma.perf"
10: Environment variables:
10:  OMP_NUM_THREADS=2
10: Test timeout computed to be: 1500
10:  DBCSR| CPU Multiplication driver                                           BLAS
10:  DBCSR| Multrec recursion limit                                              512
10:  DBCSR| Multiplication stack size                                           1000
10:  DBCSR| Maximum elements for images                                    UNLIMITED
10:  DBCSR| Multiplicative factor virtual images                                   1
10:  DBCSR| Use multiplication densification                                       T
10:  DBCSR| Multiplication size stacks                                             3
10:  DBCSR| Use memory pool for CPU allocation                                     F
10:  DBCSR| Number of 3D layers                                               SINGLE
10:  DBCSR| Use MPI memory allocation                                              F
10:  DBCSR| Use RMA algorithm                                                      T
10:  DBCSR| Use Communication thread                                               T
10:  DBCSR| Communication thread load                                            100
10:  DBCSR| MPI: My node id                                                        0
10:  DBCSR| MPI: Number of nodes                                                   8
10:  DBCSR| OMP: Current number of threads                                         2
10:  DBCSR| OMP: Max number of threads                                             2
10:  DBCSR| Split modifier for TAS multiplication algorithm                  1.0E+00
10:  numthreads           2
10:  numnodes           8
10:  matrix_sizes        1000        1000        1000
10:  sparsities  0.90000000000000002       0.90000000000000002       0.90000000000000002
10:  trans NN
10:  symmetries NNN
10:  type            3
10:  alpha_in   1.0000000000000000        0.0000000000000000
10:  beta_in   1.0000000000000000        0.0000000000000000
10:  limits           1        1000           1        1000           1        1000
10:  retain_sparsity F
10:  nrep          10
10:  bs_m           1           5
10:  bs_n           1           5
10:  bs_k           1           5
10: --------------------------------------------------------------------------
10: A system call failed during shared memory initialization that should
10: not have.  It is likely that your MPI job will now either abort or
10: experience performance degradation.
10:
10:   Local host:  curie
10:   System call: open(2)
10:   Error:       No such file or directory (errno 2)
10: --------------------------------------------------------------------------
10:
10:  *******************************************************************************
10:  *   ___                                                                       *
10:  *  /   \                                                                      *
10:  * [ABORT]                                                                     *
10:  *  \___/     MPI error 53 in mpi_win_create @ mp_win_create_dv : MPI_ERR_WIN: *
10:  *    |                               invalid window                           *
10:  *  O/|                                                                        *
10:  * /| |                                                                        *
10:  * / \                                                     dbcsr_mpiwrap.F:852 *
10:  *******************************************************************************
10:
10:
10:  ===== Routine Calling Stack =====
10:
10:             7 mp_win_create_dv
10:             6 win_setup
10:             5 multiply_3D
10:             4 dbcsr_multiply_generic
10:             3 perf_multiply
10:             2 dbcsr_perf_multiply_low
10:             1 dbcsr_performance_driver
10: --------------------------------------------------------------------------
10: MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
10: with errorcode 1.
10:
10: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
10: You may or may not see output from other processes, depending on
10: exactly when Open MPI kills them.
10: --------------------------------------------------------------------------
10: [curie:1097886] 1 more process has sent help message help-opal-shmem-mmap.txt / sys call fail
10: [curie:1097886] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
10/19 Test #10: dbcsr_perf:inputs/test_square_sparse_rma.perf .........***Failed    0.44 sec

That test is the only one with Use RMA algorithm T, so that looks related. If the test needs two nodes (due to RMA?), I guess it should be skipped when only one is available?

Hrm, tests/CMakeLists.txt has (around line 30):

if ("${MPI_Fortran_LIBRARY_VERSION_STRING}" MATCHES "Open MPI v2.1"
    OR "${MPI_Fortran_LIBRARY_VERSION_STRING}" MATCHES "Open MPI v3.1")
  list(FILTER DBCSR_PERF_TESTS EXCLUDE REGEX "_rma")
endif ()

So (as Debian unstable now has OpenMPI v4.1) this should be extended to 4.1?
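
If so, the existing filter could presumably just grow one more match; a minimal, untested sketch following the snippet above:

if ("${MPI_Fortran_LIBRARY_VERSION_STRING}" MATCHES "Open MPI v2.1"
    OR "${MPI_Fortran_LIBRARY_VERSION_STRING}" MATCHES "Open MPI v3.1"
    OR "${MPI_Fortran_LIBRARY_VERSION_STRING}" MATCHES "Open MPI v4.1")
  list(FILTER DBCSR_PERF_TESTS EXCLUDE REGEX "_rma")
endif ()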

To Reproduce

Run the test suite.

Expected behavior

The test suite passes and/or skips tests that cannot pass in the given environment.

Environment:

  • Operating system & version: Debian unstable
  • Compiler vendor & version: gfortran-11.2.0
  • Build environment (make or cmake): cmake
  • Configuration of DBCSR (either the cmake flags or the Makefile.inc): cmake -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=None -DCMAKE_INSTALL_SYSCONFDIR=/etc -DCMAKE_INSTALL_LOCALSTATEDIR=/var -DCMAKE_EXPORT_NO_PACKAGE_REGISTRY=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DCMAKE_FIND_PACKAGE_NO_PACKAGE_REGISTRY=ON "-GUnix Makefiles" -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_INSTALL_LIBDIR=lib/x86_64-linux-gnu
  • MPI implementation and version: OpenMPI 4.1.2
  • If CUDA is being used: CUDA version and GPU architecture: no
  • BLAS/LAPACK implementation and version: reference blas/lapack 3.10.0
  • If applicable: Runtime information (how many nodes, type of nodes, ...): single node (compile/test suite)

alazzaro commented Apr 4, 2022

As far as I can see, this is an error on the OpenMPI side.
The message:

10: A system call failed during shared memory initialization that should
10: not have.  It is likely that your MPI job will now either abort or
10: experience performance degradation.
10:
10:   Local host:  curie
10:   System call: open(2)
10:   Error:       No such file or directory (errno 2)

is coming from OpenMPI. DBCSR then aborts when it makes the call to create the RMA window.
In the past we had several problems with RMA and OpenMPI; that's why we mask some versions (2.1 and 3.1), as you saw in the CMake file.
We can definitely do some tricks to avoid the window allocations when a single node is used, but in principle it must work no matter how many nodes we use. Note that we run this test as part of our CI. I also assume you were already testing with a previous OpenMPI version and it worked...
So, the question now is: could you test with two ranks? If it doesn't work, then it is something in OpenMPI or in the system you are using (according to the error message...).
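
For reference, a two-rank run of just this input could be reproduced outside ctest roughly like this (command adapted from the log at the top of this issue; adjust the paths to your build tree):

OMP_NUM_THREADS=2 mpiexec -n 2 ./obj-x86_64-linux-gnu/tests/dbcsr_perf ./tests/inputs/test_square_sparse_rma.perf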


alazzaro commented Apr 4, 2022

After a bit of asking Google, I found this post:

open-mpi/ompi#7393

@alazzaro

RMA will not be used anymore when OpenMPI is involved (due to the many problems on the OpenMPI-RMA side).
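
If that also means skipping the RMA perf tests for any Open MPI release, the version checks in tests/CMakeLists.txt could collapse into a single match; again only an untested sketch based on the existing filter:

if ("${MPI_Fortran_LIBRARY_VERSION_STRING}" MATCHES "Open MPI")
  list(FILTER DBCSR_PERF_TESTS EXCLUDE REGEX "_rma")
endif ()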
