Numerous MueLu tests randomly failing and timing out (hanging) in build Trilinos-atdm-waterman_cuda-9.2_shared_opt starting around 9/12/2019? #6077

Closed
bartlettroscoe opened this issue Oct 10, 2019 · 2 comments
Labels
ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs client: ATDM Any issue primarily impacting the ATDM project impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area pkg: MueLu type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Oct 10, 2019

CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52

Next Action Status

As of 2019-12-17, the last failure of a test other than MueLu_ParameterListInterpreterTpetraXXX was on 2019-11-11. There have been no comments from Trilinos developers, and it is not clear how the tests got fixed.

Description

As shown in this query between 9/1/2019 and 10/10/2019 there are numerous (106 total as of now) MueLu tests failing and timing out (hanging) in the build:

  • Trilinos-atdm-waterman_cuda-9.2_shared_opt

The list of tests randomly failing and timing out (hanging) includes:

| Test Name | Number of fails/hangs |
|---|---|
| MueLu_Aggregation_MPI_4 | 1 |
| MueLu_BlockCrs-Tpetra_MPI_4 | 1 |
| MueLu_CreateOperatorTpetra_MPI_4 | 2 |
| MueLu_Driver_TogglePFactory_sa_tent_Tpetra_MPI_4 | 2 |
| MueLu_Driver_TogglePFactory_semi_tent_line_Tpetra_MPI_4 | 2 |
| MueLu_Driver_TogglePFactory_tent_tent_Tpetra_MPI_4 | 1 |
| MueLu_DriverTpetra_Distance2Coloring_MPI_4 | 1 |
| MueLu_DriverTpetra_Milestone_MPI_4 | 1 |
| MueLu_DriverTpetra_MPI_4 | 1 |
| MueLu_DriverTpetra_WithGlobalConstants_MPI_4 | 2 |
| MueLu_DriverTpetraIfpack2LinePartitioner_MPI_4 | 1 |
| MueLu_DriverTpetraILU_MPI_4 | 1 |
| MueLu_DriverTpetraSingleReduceCG_MPI_4 | 2 |
| MueLu_DriverTpetraYaml_MPI_4 | 2 |
| MueLu_FixedMatrixPattern-Tpetra_MPI_4 | 3 |
| MueLu_ImportPerformance_Tpetra_MPI_4 | 1 |
| MueLu_Maxwell3D-Tpetra_MPI_4 | 1 |
| MueLu_ParameterListInterpreterTpetra_MPI_1 | 5 |
| MueLu_ParameterListInterpreterTpetra_MPI_4 | 8 |
| MueLu_ReadMatrixTpetra_MPI_4 | 3 |
| MueLu_SimpleTpetra_MPI_4 | 1 |
| MueLu_SimpleTpetraYaml_MPI_4 | 5 |
| MueLu_StandardReuse-Tpetra_MPI_4 | 3 |
| MueLu_Stratimikos_MPI_4 | 4 |
| MueLu_Structured_Elasticity3D_Tpetra_MPI_4 | 2 |
| MueLu_Structured_Laplace2D_Shift_Tpetra_MPI_4 | 2 |
| MueLu_Structured_Laplace2D_Tpetra_MPI_4 | 2 |
| MueLu_Structured_Line_Tpetra_MPI_4 | 7 |
| MueLu_UnitTestsIntrepid2Tpetra_MPI_4 | 4 |
| MueLu_UnitTestsTpetra_MPI_1 | 3 |
| MueLu_UnitTestsTpetra_MPI_4 | 31 |
| MueLu_VarDofDriver_MPI_2 | 3 |

Some of these failures are already covered in existing issues:

The failures are of several different kinds, including those shown in the following queries:

This query showing (e.g. here):

mpiexec noticed that process rank 2 with PID 0 on node waterman4 exited on signal 11 (Segmentation fault).

This query showing (e.g. here):

mpiexec noticed that process rank 1 with PID 0 on node waterman4 exited on signal 9 (Killed).

This query showing the test MueLu_Structured_Laplace2D_Tpetra_MPI_4 failing (e.g. here):

p=0: *** Caught standard std::exception of type 'Belos::StatusTestError' :

 /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/belos/src/BelosStatusTestGenResNorm.hpp:578:
 
 Throw number = 1
 
 Throw test that evaluated to true: true
 
 StatusTestGenResNorm::checkStatus(): NaN has been detected.

This query showing (e.g. here):

p=0: *** Caught standard std::exception of type 'MueLu::Exceptions::RuntimeError' :

 /home/atdm-devops-admin/jenkins/waterman/Trilinos-atdm-waterman_cuda-9.2_shared_opt/SRC_AND_BUILD/Trilinos/packages/muelu/src/Transfers/Smoothed-Aggregation/MueLu_SaPFactory_def.hpp:175:
 
 Throw number = 1
 
 Throw test that evaluated to true: !std::isfinite(Teuchos::ScalarTraits<SC>::magnitude(omega))
 
 Prolongator damping factor needs to be finite.
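Since the same few error strings recur across many different tests, it may help to bucket saved test output by signature. A minimal sketch, assuming each test's output has been saved to its own text file; the function name `classify_failure` and the bucket labels are hypothetical and not part of Trilinos, and the matched strings are just the ones quoted above:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Trilinos): print one bucket label for a
# saved test log, based on the error strings quoted in this issue.
classify_failure() {
  local log="$1"
  # -F matches fixed strings, so the parentheses are taken literally.
  if grep -qF 'exited on signal 11 (Segmentation fault)' "$log"; then
    echo "segfault"
  elif grep -qF 'exited on signal 9 (Killed)' "$log"; then
    echo "killed"
  elif grep -qF 'NaN has been detected' "$log"; then
    echo "belos-nan"
  elif grep -qF 'Prolongator damping factor needs to be finite' "$log"; then
    echo "non-finite-omega"
  else
    echo "other"
  fi
}
```

For example, with the logs under a (hypothetical) directory `logs/`, running `for f in logs/*.txt; do classify_failure "$f"; done | sort | uniq -c` would give a count per bucket.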

If we go further back in time, starting at 8/1/2019, and exclude the tests MueLu_UnitTestsTpetra_MPI_* and tests showing the CUDA errors cudaErrorMemoryAllocation and cudaErrorIllegalAddress in this query, it looks like we started seeing failures on 8/14/2019. But those first 6 failures, from 8/14/2019 through 9/10/2019, were timeouts that don't look to be hangs, involving the tests:

  • MueLu_UnitTestsIntrepid2Tpetra_MPI_4
  • MueLu_UnitTestsBlockedTpetra_MPI_4

and no other tests. These two tests are not seen failing at all after 9/10/2019, so it may be safe to assume that whatever started making these tests fail happened sometime on or before 9/12/2019, with the failure of the test MueLu_ParameterListInterpreterTpetra_MPI_4.

Because these are random failures, it is hard to know when something could have changed in Trilinos or in this waterman build that could have triggered these failures.

Current Status on CDash

Steps to Reproduce

One might be able to reproduce these failures on the machine 'waterman' as described in:

More specifically, the commands given for the system 'waterman' are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
    Trilinos-atdm-waterman_cuda-9.2_shared_opt

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
 $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -n 20 ctest -j16

Again, since these are random failures across a bunch of different MueLu tests these may be hard to reproduce.
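Because the failures are random, a single `ctest` pass may well come back clean. One option is to rerun only the MueLu tests several times and stop each test at its first failure. A minimal sketch, assuming the installed CTest supports `--repeat-until-fail`; the wrapper name `rerun_muelu_tests`, the `CTEST_CMD` variable, and the default of 20 repeats are hypothetical choices, not part of the ATDM scripts:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of the ATDM scripts): rerun only the MueLu
# tests, repeating each one up to N times until it fails.
# CTEST_CMD may be overridden, e.g. to launch ctest through bsub.
rerun_muelu_tests() {
  local repeats="${1:-20}"
  # CTEST_CMD is intentionally unquoted so that a multi-word launcher
  # such as 'bsub -x -Is -n 20 ctest' is split into separate arguments.
  ${CTEST_CMD:-ctest} -R '^MueLu_' -j16 \
    --repeat-until-fail "$repeats" --output-on-failure
}
```

On 'waterman' this might be run from the build directory as `CTEST_CMD='bsub -x -Is -n 20 ctest' rerun_muelu_tests 20`. Keeping `-j16` preserves the same test concurrency as the nightly build, which may matter if the failures depend on load on the GPU.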

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: MueLu client: ATDM Any issue primarily impacting the ATDM project ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area labels Oct 10, 2019
@bartlettroscoe
Member Author

@trilinos/muelu and @srajama1,

It is important to note that, as shown in this query, only MueLu tests have failed in this build since 9/1/2019. Therefore, while this might be a system-related issue (including overloading the GPU), that seems unlikely since only MueLu tests are failing. But why are no tests in downstream packages using MueLu, like Panzer or Tempus, failing?

Also, this is the closest build that we currently have to the ATS-2 platform (until we get builds on 'vortex') for the production builds used by the ATDM APP codes.

@bartlettroscoe bartlettroscoe added the impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) label Dec 12, 2019
@bartlettroscoe
Member Author

From looking at this query, it would appear that the last failing MueLu test in the build Trilinos-atdm-waterman_cuda-9.2_shared_opt whose output did not match the regex kokkos/EasyParameterListInterpreter.*[.]xml : failed (the failure shown by the MueLu_ParameterListInterpreterTpetraXXX tests, which are covered in #6361) was on 2019-11-22.

In fact, the last failing MueLu test in this build that was not a MueLu_ParameterListInterpreterTpetraXXX test was the test MueLu_Structured_Laplace2D_Tpetra_MPI_4, which failed on 2019-11-11. Before that, at least one MueLu test in this set of tests was failing randomly almost every day in this build.

It is not clear what fixed these MueLu tests but they seem to be fixed.

Closing this issue as complete (but issue #6361 stays open).
