Several MueLu_ParameterListXXX tests failing in all CUDA builds on 'white', 'ride', 'waterman', and 'vortex' starting 2019-11-23 and other non-CUDA builds starting 2019-12-14 #6361
Comments
What I don't understand is why the Trilinos cuda-9.2 PR build is not also showing these tests as failing. This is something we should look into.
@bartlettroscoe I have a fix being tested locally for the following type of error:
We essentially need to ignore this complexity output because the aggregation process is now non-deterministic on GPU, which leads to slight variations in operator complexities. However, I strongly suspect that the last error:
is of a different nature and unrelated to my changes.
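The fix described above — ignoring the non-deterministic operator-complexity output when diffing against a gold file — could be sketched roughly as follows. This is a hypothetical illustration, not MueLu's actual test harness; all names are made up.

```python
import re

# Lines reporting quantities that are non-deterministic on GPU
# (e.g. operator complexity) should not be compared against the gold file.
NONDETERMINISTIC = re.compile(r"[Oo]perator complexity")

def filtered(lines):
    """Keep only lines whose values should be reproducible across runs."""
    return [ln for ln in lines if not NONDETERMINISTIC.search(ln)]

def outputs_match(actual_lines, gold_lines):
    """Compare test output to the gold file, ignoring non-deterministic lines."""
    return filtered(actual_lines) == filtered(gold_lines)
```

With this filter, two runs that differ only in the reported operator complexity compare equal, while any other diff still fails the test.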
@lucbv, is it possible for the solver to throw:
and still have the test pass? We could search the test output on CDash to see if that output has been seen in passing versions of this test.
…kkos, see issue trilinos#6361 After the Kokkos refactor of aggregation, non-deterministic aggregates are formed. This means that checking operator complexity against a gold file is wrong. This commit implements logic to ignore OperatorComplexity for Kokkos runs of ParameterListInterpreter.
…omplexity Automatically Merged using Trilinos Pull Request AutoTester PR Title: MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue #6361 PR Author: lucbv
…s:develop' (2a3751d). * trilinos-develop: MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue trilinos#6361 ML: Default fix ML: More experimental coarsening ML: enabling more experimental maxwell stuff ML: enabling more experimental maxwell stuff ML: enabling more experimental maxwell stuff Xpetra: Re-enabled fast TwoMatrixAdd path
…s:develop' (2a3751d). * trilinos-develop: (50 commits) TSQR: Fix minor build error with CUDA TSQR::Matrix: Simplify nonmember functions TSQR::Combine*::apply_inner now takes MatView instead of a pointer TSQR::Combine*::factor_inner now takes MatView instead of a pointer TSQR::Combine: Remove unneeded factor_first overload TSQR::Combine*::factor_pair now takes MatView instead of a pointer MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue trilinos#6361 Ifpack2: spelling Ifpack2: Fixing test to only run in parallel Ifpack2: Fixing test Ifpack2: OverlappingRowMatrix cleanup Ifpack2: Adding unit test for 'reduced' matvec for use in s-step methods TSQR::Combine*::factor_first now takes MatView instead of a pointer TSQR: Remove fill_matrix itself TSQR::DistTsqr: Remove uses of fill_matrix TSQR::SequentialCholeskyTsqr: Remove uses of fill_matrix TSQR::SequentialTsqr: Remove uses of fill_matrix TSQR: Remove copy_matrix itself TSQR: Remove all uses of copy_matrix TSQR: Remove more uses of copy_matrix ...
CC: @trilinos/muelu, @srajama1 FYI: As shown in this query (click the "Show Matching Output" link at the top to see the errors) we are still seeing many failures of the tests:
in the builds:
showing errors like:
and
and
and
and
and
and many others like this. In fact, between the dates [2019-10-01, 2019-12-17], there were 377 of these failures across these various builds. Looking at this query, you can see the last time one of these tests failed that did not match the regex.

Something seems to have changed recently that has added many new failures in the 'cee-rhel6', 'sems-rhel6', and 'sems-rhel7' non-CUDA builds. These tests seem to be among the most fragile of all the Trilinos tests if you look back at the last 1.5 years of effort trying to clean up these Trilinos builds and keep them clean. I wonder if this is not a good approach for writing a portable test suite?
Just to be clear, as shown in this query there were 198 failures of these tests across all of these 'cee-rhel6', 'sems-rhel6', 'sems-rhel7', 'waterman', and 'ride' builds in the recent date range [2019-12-01, 2019-12-17], which is well after PR #6364 was merged on 2019-11-26.
@bartlettroscoe I will at least take a look at the failures in the non-CUDA builds, as the test can reasonably be expected to pass for them.
FYI: I changed to |
CC: @trilinos/framework FYI: As shown in this query, the last two post-push CI builds showed the test |
FYI: And the failing test |
@bartlettroscoe looking at the failure it seems that the test machines are randomly adding line breaks to outputs. This is what is leading to the failures we are seeing.
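If stray line breaks really are the only difference, one defensive fix is to normalize whitespace on both sides before comparing. A minimal sketch, assuming the comparison is a plain text diff (illustrative code, not the actual test harness):

```python
def normalize(text: str) -> str:
    """Collapse all runs of whitespace, including stray newlines, into
    single spaces so a line broken mid-token still matches the gold file."""
    return " ".join(text.split())

def outputs_match(actual: str, gold: str) -> bool:
    """Compare outputs after whitespace normalization."""
    return normalize(actual) == normalize(gold)
```

The trade-off is that genuine whitespace differences (e.g. changed table layout) would no longer be caught.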
I agree, it doesn't look like a MueLu bug:
Looking at the recent changes in MueLu I am actually suspecting that PR #6432 is a potential offender for the non-CUDA errors...
@lucbv, tests that read and write files are fragile in general. Is it possible to refactor these tests so that they write into an
@bartlettroscoe that type of refactor is going to be non-trivial and potentially quite long.
Yes, I believe #6432 is to blame. I'll have another look at this next year.
CC: @trilinos/framework
@cgcgcg, then you may want to revert PR #6432 until this can get fixed, because it looks like this has taken out PR testing, as shown in this query which shows this test failing 7 times in testing of 6 different PRs. Not speaking for other developers, but I have a PR that I would like to get merged before 2020, so it might be good to allow this to happen before then by backing out #6432. And then someone should seriously consider refactoring how these
FYI: As shown in this query and this query, failures with the test appear intermittent. Therefore, perhaps people will get lucky and this test may pass in some PR testing iterations, allowing the PRs to get merged.
@bartlettroscoe wrote:
Valid point, but as @lucbv points out, this would require diverting resources. @trilinos/muelu will discuss this in the new year.
@trilinos/framework FYI: As shown in:
you can see that this test is failing randomly in these two builds 25% and 15% of the time, respectively. That explains how the original PR #6432 that introduced this defect was able to be merged to 'develop': the last of the 4 PR testing iterations that ran for PR #6432 had this test passing. This test killed at least one of the PR testing iterations in PR #6452, as shown here.

The Trilinos PR tester is designed to ignore randomly failing tests. And once one of those tests gets onto the 'develop' branch, if that test has a moderately high probability of failing, then it results in mass failures of PR testing iterations.
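A back-of-envelope calculation using the ~25% per-run failure rate and the 4 PR-testing iterations cited above shows why such a test both slips through its own PR testing and then blocks unrelated PRs:

```python
p_fail = 0.25  # observed per-run failure rate of the flaky test

# Chance the offending PR sees at least one fully passing iteration
# within 4 PR-testing attempts, allowing it to merge:
p_eventually_merges = 1 - p_fail ** 4  # 1 - 0.25^4 ~= 0.996

# Once on 'develop', each unrelated PR's single testing iteration is
# killed by the flaky test with the same per-run probability:
p_blocks_other_pr = p_fail  # 0.25
```

So the flaky test almost certainly merges, yet afterwards roughly one in four PR testing iterations fails because of it.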
@bartlettroscoe Feel free to revert. This should take care of the random component.
@bartlettroscoe I have an updated version of PR #6432 that introduced the random failures: PR #6531.
FYI: As shown in this query, lots more failures of the tests:
in the builds on 'vortex':
showing diff errors (click the "Show Matching Output" link at top right) like:
These seem to just be very fragile tests. Can we just turn these tests off in all of the ATDM Trilinos builds? If you look at this query and this query, you can see that these tests pop up in a lot of bug reports. They either need to be rewritten to be more robust or they need to be disabled in the ATDM Trilinos builds. (They can stay in all of the other builds like the PR builds.) Okay?
@bartlettroscoe The @trilinos/muelu team will have a look and make a decision about the tests. Please don't disable them yet.
@jhux2, okay, let me know how that goes.
@bartlettroscoe We have been working on this issue, and we believe that this should be fixed now. Let's monitor this for a week or so before we declare victory?
Okay. FYI: @rmmilewi will be working on a bot that will automatically update issues like this with the status of the related tests (see #3778). That will eliminate the manual work of following up on GitHub issues like this.
FYI: The tests:
are no longer running in ATDM Trilinos 'complex' builds because GEMMA does not use MueLu. Therefore, these tests are not even running in the builds:
@bartlettroscoe Can we close this?
Results yesterday shown here showed all of these tests passing except for timeouts of the test:
We have seen problems with timeouts in other tests in these builds, such as those reported in #6804, #6801, and #6799 (which is likely a problem with running multiple kernels on the same GPU that will hopefully be addressed by #6840). Excluding timeouts, as shown in this query there have been no failing tests. Therefore, since this issue #6361 is about failures and not timeouts, we can close this issue. (I will open a new issue for the timeouts.)
Closing as complete. |
CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52, @lucbv
Next Action Status
Description
As shown in this query the tests:
MueLu_ParameterListInterpreterTpetra_MPI_1
MueLu_ParameterListInterpreterTpetra_MPI_4
MueLu_ParameterListInterpreterTpetraHeavy_MPI_4
in the builds:
Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
Trilinos-atdm-waterman_cuda-9.2_shared_opt
Trilinos-atdm-waterman-cuda-9.2-debug
Trilinos-atdm-waterman-cuda-9.2-opt
Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
Trilinos-atdm-waterman-cuda-9.2-release-debug
Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug
on the machines 'white', 'ride', and 'waterman' started failing every day starting 2019-11-23.
As shown in this query, all of the failing tests show at least one of the following error outputs.
Like here showing:
and here showing:
and here showing:
But note that these tests pass in the 'sems-rhel7-cuda-9.2' builds. Therefore, there is something different about these Power/GPU machines that is triggering these failures.
The new commits that were pulled into these builds on 2019-11-26, when these failures started, are shown, for example, here.
From looking over that set of commits, it seems likely this was triggered by the commits in the PR #6326 from @lucbv (topic branch
MueLu_aggregation_kokkos_fixes
).
Current Status on CDash
NOTE: Click "Previous" to see status for previous testing day.
Steps to Reproduce
One should be able to reproduce this failure on the machine as described in:
More specifically, the commands given for the system 'white' (SON) and 'ride' (SRN) are provided at:
The exact commands to reproduce this issue should be: