Numerous MueLu tests randomly failing and timing out (hanging) in build Trilinos-atdm-waterman_cuda-9.2_shared_opt starting around 9/12/2019? #6077
Labels
ATDM Sev: Blocker
Problems that make Trilinos unfit to be adopted by one or more ATDM APPs
client: ATDM
Any issue primarily impacting the ATDM project
impacting: tests
The defect (bug) is primarily a test failure (vs. a build failure)
PA: Linear Solvers
Issues that fall under the Trilinos Linear Solvers Product Area
pkg: MueLu
type: bug
The primary issue is a bug in Trilinos code or tests
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52
Next Action Status
As of 2019-12-17, the last failure of a test other than MueLu_ParameterListInterpreterTpetraXXX was on 2019-11-11. There have been no comments by Trilinos developers, and it is not clear how the test got fixed.
Description
As shown in this query between 9/1/2019 and 10/10/2019 there are numerous (106 total as of now) MueLu tests failing and timing out (hanging) in the build:
Trilinos-atdm-waterman_cuda-9.2_shared_opt
The tests randomly failing and timing out (hanging) include:
Some of these failures are already covered in existing issues:
MueLu_Structured_Line_Tpetra_MPI_4: Test MueLu_Structured_Line_Tpetra_MPI_4 randomly timing out (hanging) in build Trilinos-atdm-waterman_cuda-9.2_shared_opt starting 9/17/2019 #6070
MueLu_UnitTestsTpetra_MPI_*: MueLu_UnitTestsTpetra_MPI* tests failing in build Trilinos-atdm-waterman_cuda-9.2_shared_opt starting 2019-06-02 #5310
There are numerous failures, including ones like those shown in the following queries:
This query showing (e.g. here):
This query showing (e.g. here):
This query showing the test MueLu_Structured_Laplace2D_Tpetra_MPI_4 failing (e.g. here):
This query showing (e.g. here):
If we go further back in time, starting at 8/1/2019, and exclude the tests MueLu_UnitTestsTpetra_MPI_* and tests showing the CUDA errors cudaErrorMemoryAllocation and cudaErrorIllegalAddress, in this query it looks like we started seeing failures on 8/14/2019. But those first 6 failures, from 8/14/2019 through 9/10/2019, were timeouts that don't look to be hangs, involving the tests:
MueLu_UnitTestsIntrepid2Tpetra_MPI_4
MueLu_UnitTestsBlockedTpetra_MPI_4
and no other tests. These two tests are not seen failing at all after 9/10/2019, so it may be safe to assume that whatever made these tests start failing happened sometime on or before 9/12/2019, with the failure of the test MueLu_ParameterListInterpreterTpetra_MPI_4.
Because these are random failures, it is hard to know when something could have changed in Trilinos or in this waterman build that could have triggered these failures.
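The filtering described above (start at a given date, drop the MueLu_UnitTestsTpetra_MPI_* tests, and drop tests whose output shows the two CUDA errors) can be sketched offline in Python. This is only an illustration: the record layout (date, test name, output excerpt), the function names, and the sample records are all hypothetical and are not taken from CDash.

```python
from datetime import date
from fnmatch import fnmatchcase

# Exclusions named in the text above.
EXCLUDED_NAME_PATTERNS = ["MueLu_UnitTestsTpetra_MPI_*"]
EXCLUDED_OUTPUT_STRINGS = ["cudaErrorMemoryAllocation", "cudaErrorIllegalAddress"]


def filter_failures(records, start):
    """Keep failures on or after `start` that match none of the exclusions.

    `records` is an iterable of (day, test_name, output_excerpt) tuples;
    this layout is an assumption made for the sketch.
    """
    kept = []
    for day, name, output in records:
        if day < start:
            continue
        if any(fnmatchcase(name, pat) for pat in EXCLUDED_NAME_PATTERNS):
            continue
        if any(err in output for err in EXCLUDED_OUTPUT_STRINGS):
            continue
        kept.append((day, name, output))
    return sorted(kept)  # chronological order (ties broken by test name)


def first_failure_date(records, start):
    """Return the date of the earliest remaining failure, or None."""
    kept = filter_failures(records, start)
    return kept[0][0] if kept else None


# Made-up sample records for illustration only (not real CDash data).
sample = [
    (date(2019, 7, 30), "MueLu_UnitTestsBlockedTpetra_MPI_4", "Timeout"),
    (date(2019, 8, 5), "MueLu_UnitTestsTpetra_MPI_4", "FAILED"),
    (date(2019, 8, 9), "MueLu_Structured_Line_Tpetra_MPI_4",
     "cudaErrorIllegalAddress"),
    (date(2019, 8, 14), "MueLu_UnitTestsIntrepid2Tpetra_MPI_4", "Timeout"),
    (date(2019, 9, 12), "MueLu_ParameterListInterpreterTpetra_MPI_4", "FAILED"),
]

print(first_failure_date(sample, date(2019, 8, 1)))
```

With real data, the records would have to be exported from the CDash query first; how to do that is outside the scope of this sketch.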
Current Status on CDash
Steps to Reproduce
One might be able to reproduce these failures on the machine 'waterman' as described in:
More specifically, the commands given for the system 'waterman' are provided at:
The exact commands to reproduce this issue should be:
Again, since these are random failures across a bunch of different MueLu tests these may be hard to reproduce.
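Because the failures are random, one option is to rerun a single test many times and stop at the first failure or timeout. The following is a minimal sketch of such a loop, not part of the documented ATDM workflow; the helper name `run_until_failure` and its defaults are hypothetical, and it assumes the command is launched from an already configured build directory with the waterman environment loaded.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=20, timeout_s=600):
    """Run `cmd` up to `max_runs` times.

    Returns (run_number, reason) for the first failing or timed-out run,
    or None if every run passed. A run that exceeds `timeout_s` seconds
    is treated as a possible hang.
    """
    for run in range(1, max_runs + 1):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            return run, "timeout"
        if result.returncode != 0:
            return run, f"exit code {result.returncode}"
    return None


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; on waterman one would
    # substitute something like:
    #   ["ctest", "-R", "MueLu_ParameterListInterpreterTpetra_MPI_4"]
    print(run_until_failure([sys.executable, "-c", "pass"], max_runs=3))
```

Note that `subprocess.run` only kills the process it launched on timeout; child MPI ranks started by a launcher may need separate cleanup.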