Several MueLu_ParameterListXXX tests failing in all CUDA builds on 'white', 'ride', 'waterman', and 'vortex' starting 2019-11-23 and other non-CUDA builds starting 2019-12-14 #6361

Closed
bartlettroscoe opened this issue Nov 26, 2019 · 32 comments
Labels
  • ATDM Sev: Blocker (Problems that make Trilinos unfit to be adopted by one or more ATDM APPs)
  • client: ATDM (Any issue primarily impacting the ATDM project)
  • impacting: tests (The defect (bug) is primarily a test failure (vs. a build failure))
  • PA: Linear Solvers (Issues that fall under the Trilinos Linear Solvers Product Area)
  • pkg: MueLu
  • type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

CC: @trilinos/muelu, @srajama1 (Trilinos Linear Solvers Product Lead), @bartlettroscoe, @fryeguy52, @lucbv

Next Action Status

Description

As shown in this query the tests:

  • MueLu_ParameterListInterpreterTpetra_MPI_1
  • MueLu_ParameterListInterpreterTpetra_MPI_4
  • MueLu_ParameterListInterpreterTpetraHeavy_MPI_4

in the builds:

  • Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
  • Trilinos-atdm-waterman_cuda-9.2_shared_opt
  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-waterman-cuda-9.2-opt
  • Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
  • Trilinos-atdm-waterman-cuda-9.2-release-debug
  • Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug ('white')
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug

on the machines 'white', 'ride', and 'waterman' began failing every day starting 2019-11-23.

As shown in this query, all of the failing tests show at least one of the following error outputs.

Like here showing:

Testing: kokkos/EasyParameterListInterpreter/coarse1.xml
--- kokkos/Output/coarse1_tpetra.gold_filtered	2019-11-23 07:44:05.270777000 -0700
+++ kokkos/Output/coarse1_tpetra.out_filtered	2019-11-23 07:44:05.301774000 -0700
@@ -276,7 +276,7 @@
 ---                            Multigrid Summary                             ---
 --------------------------------------------------------------------------------
 Number of levels    = 4
-Operator complexity = 1.48
+Operator complexity = 1.47
 Smoother complexity = <ignored>
 Cycle type          = V
 
kokkos/EasyParameterListInterpreter/coarse1.xml : failed

and here showing:

Testing: kokkos/EasyParameterListInterpreter/repartition3_np4.xml
--- kokkos/Output/repartition1_np4_tpetra.gold_filtered	2019-11-26 09:13:16.489926000 -0700
+++ kokkos/Output/repartition1_np4_tpetra.out_filtered	2019-11-26 09:13:16.513926000 -0700
@@ -278,7 +278,7 @@
 ---                            Multigrid Summary                             ---
 --------------------------------------------------------------------------------
 Number of levels    = 3
-Operator complexity = 1.45
+Operator complexity = 1.44
 Smoother complexity = <ignored>
 Cycle type          = V
 
kokkos/EasyParameterListInterpreter/repartition1_np4.xml : failed

and here showing:

Testing: kokkos/EasyParameterListInterpreter/aggregation1.xml
--- kokkos/Output/aggregation1_tpetra.gold_filtered	2019-10-29 03:34:18.334162000 -0600
+++ kokkos/Output/aggregation1_tpetra.out_filtered	2019-10-29 03:34:18.362184000 -0600
@@ -154,54 +154,4 @@
 matrixmatrix: kernel params -> 
  [empty list]
 
-sa: damping factor = 1.33   [default]
-sa: calculate eigenvalue estimate = 0   [default]
-sa: eigenvalue estimate num iterations = 10   [default]
-matrixmatrix: kernel params -> 
- [empty list]
-
-Transpose P (MueLu::TransPFactory)
-matrixmatrix: kernel params -> 
- [empty list]
-
-Computing Ac (MueLu::RAPFactory)
-transpose: use implicit = 0   [default]
-rap: triple product = 0   [default]
-rap: fix zero diagonals = 0   [default]
-rap: relative diagonal floor = {}   [default]
-CheckMainDiagonal = 0   [default]
-RepairMainDiagonal = 0   [default]
-matrixmatrix: kernel params -> 
- [empty list]
-
-Max coarse size (<= 2000) achieved
-Setup Smoother (MueLu::Amesos2Smoother{type = <ignored>})
-keep smoother data = 0   [default]
-PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
-PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
-presmoother -> 
- A = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- Nullspace = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
- fix nullspace = 0   [default]
-
-
---------------------------------------------------------------------------------
----                            Multigrid Summary                             ---
---------------------------------------------------------------------------------
-Number of levels    = 3
-Operator complexity = 1.44
-Smoother complexity = 1.78
-Cycle type          = V
-
-level  rows  nnz    nnz/row  c ratio  procs
-  0  9999  29995  3.00                  1  
-  1  3333  9997   3.00     3.00         1  
-  2  1111  3331   3.00     3.00         1  
-
-Smoother (level 0) both : "Ifpack2::Relaxation": {Initialized: true, Computed: true, Type: Symmetric Gauss-Seidel, sweeps: 1, damping factor: 1, Global matrix dimensions: [9999, 9999], Global nnz: 29995}
-
-Smoother (level 1) both : "Ifpack2::Relaxation": {Initialized: true, Computed: true, Type: Symmetric Gauss-Seidel, sweeps: 1, damping factor: 1, Global matrix dimensions: [3333, 3333], Global nnz: 9997}
-
-Smoother (level 2) pre  : <Direct> solver interface
-Smoother (level 2) post : no smoother
-
+Caught exception: Prolongator damping factor needs to be finite.
kokkos/EasyParameterListInterpreter/aggregation1.xml : failed

But note that these tests pass in the 'sems-rhel7-cuda-9.2' builds. Therefore, there is something different about these Power/GPU machines that is triggering these failures.

The new commits pulled into these builds on 2019-11-26, when these failures started, are shown, for example, here.

From looking over that set of commits, it seems likely this was triggered by the commits in the PR #6326 from @lucbv (topic branch MueLu_aggregation_kokkos_fixes).

Current Status on CDash

NOTE: Click "Previous" to see status for previous testing day.

Steps to Reproduce

One should be able to reproduce this failure on the machine as described in:

More specifically, the commands given for the system 'white' (SON) and 'ride' (SRN) are provided at:

The exact commands to reproduce this issue should be:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh \
    Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_MueLu=ON \
 $TRILINOS_DIR

$ make NP=16

$ bsub -x -Is -q rhel7F -n 16 ctest -j4
@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests client: ATDM Any issue primarily impacting the ATDM project ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs PA: Linear Solvers Issues that fall under the Trilinos Linear Solvers Product Area labels Nov 26, 2019
@bartlettroscoe
Member Author

What I don't understand is why the Trilinos cuda-9.2 PR build is not also showing these tests as failing? This is something we should look into.

@lucbv
Contributor

lucbv commented Nov 26, 2019

@bartlettroscoe I have a fix being tested locally for the following type of error:

-Operator complexity = 1.45
+Operator complexity = 1.44

We essentially need to ignore this complexity output, as the aggregation process is now nondeterministic on GPU, which leads to slight variations in operator complexity.
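
For illustration only, here is a minimal sketch of that kind of filtering, assuming the comparison is a plain text diff of the filtered output. The function name `mask_operator_complexity` is hypothetical and this is not the MueLu test driver's actual code; only the `Operator complexity = ...` line format and the `<ignored>` placeholder are taken from the output shown in this issue.

```python
import re

# Matches summary lines such as "Operator complexity = 1.45".
# Only spaces and tabs are allowed around the value so a match never spans lines.
_OP_COMPLEXITY = re.compile(
    r"^(Operator complexity[ \t]*=[ \t]*)\S+[ \t]*$", re.MULTILINE
)


def mask_operator_complexity(text: str) -> str:
    """Replace the operator complexity value with the <ignored> placeholder."""
    return _OP_COMPLEXITY.sub(r"\g<1><ignored>", text)


gold = "Number of levels    = 3\nOperator complexity = 1.45\nCycle type          = V\n"
out = "Number of levels    = 3\nOperator complexity = 1.44\nCycle type          = V\n"

# After masking, the two outputs compare equal even though the raw values differ.
same_after_masking = mask_operator_complexity(gold) == mask_operator_complexity(out)
```

Masking the value rather than deleting the line keeps the rest of the summary block aligned, so unrelated differences still show up in the diff.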

However I strongly suspect that the last error:

Caught exception: Prolongator damping factor needs to be finite.

is of a different nature and unrelated to my changes.
I'll reference this issue in my PR when my fix is ready.

@bartlettroscoe
Member Author

@lucbv, is it possible for the solver to throw:

Caught exception: Prolongator damping factor needs to be finite.

and still have the test pass? We could search the test output on CDash to see if that output has been seen in passing versions of this test.

lucbv added a commit to lucbv/Trilinos that referenced this issue Nov 26, 2019
…kkos, see issue trilinos#6361

After kokkos refactor of aggregation, non deterministic aggregates are formed.
This means that checking operator complexity against a gold file is wrong.
This commits implements logic to ignore OperatorComplexity for kokkos runs of
ParameterListInterpreter.
@bartlettroscoe bartlettroscoe added ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates and removed ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs labels Nov 26, 2019
trilinos-autotester added a commit that referenced this issue Nov 27, 2019
…omplexity

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue #6361
PR Author: lucbv
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Nov 27, 2019
…s:develop' (2a3751d).

* trilinos-develop:
  MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue trilinos#6361
  ML: Default fix
  ML: More experimental coarsening
  ML: enabling more experimental maxwell stuff
  ML: enabling more experimental maxwell stuff
  ML: enabling more experimental maxwell stuff
  Xpetra: Re-enabled fast TwoMatrixAdd path
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue Nov 27, 2019
…s:develop' (2a3751d).

* trilinos-develop: (50 commits)
  TSQR: Fix minor build error with CUDA
  TSQR::Matrix: Simplify nonmember functions
  TSQR::Combine*::apply_inner now takes MatView instead of a pointer
  TSQR::Combine*::factor_inner now takes MatView instead of a pointer
  TSQR::Combine: Remove unneeded factor_first overload
  TSQR::Combine*::factor_pair now takes MatView instead of a pointer
  MueLu: ignoring OperatorComplexity in ParameterListInterpreter for kokkos, see issue trilinos#6361
  Ifpack2: spelling
  Ifpack2: Fixing test to only run in parallel
  Ifpack2: Fixing test
  Ifpack2: OverlappingRowMatrix cleanup
  Ifpack2: Adding unit test for 'reduced' matvec for use in s-step methods
  TSQR::Combine*::factor_first now takes MatView instead of a pointer
  TSQR: Remove fill_matrix itself
  TSQR::DistTsqr: Remove uses of fill_matrix
  TSQR::SequentialCholeskyTsqr: Remove uses of fill_matrix
  TSQR::SequentialTsqr: Remove uses of fill_matrix
  TSQR: Remove copy_matrix itself
  TSQR: Remove all uses of copy_matrix
  TSQR: Remove more uses of copy_matrix
  ...
@bartlettroscoe bartlettroscoe added the impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) label Dec 12, 2019
@bartlettroscoe bartlettroscoe changed the title Several MueLu_ParameterListXXX tests failing in all CUDA builds on 'white', 'ride', and 'waterman' starting 2019-11-23 Several MueLu_ParameterListXXX tests failing in all CUDA builds on 'white', 'ride', and 'waterman' starting 2019-11-23 and other non-CUDA builds starting 2019-12-14 Dec 17, 2019
@bartlettroscoe
Member Author

CC: @trilinos/muelu, @srajama1

@lucbv,

FYI: As shown in this query (click the "Show Matching Output" link at the top to see the errors) we are still seeing many failures of the tests:

  • MueLu_ParameterListInterpreterTpetra_MPI_1
  • MueLu_ParameterListInterpreterTpetra_MPI_4
  • MueLu_ParameterListInterpreterTpetraHeavy_MPI_1
  • MueLu_ParameterListInterpreterTpetraHeavy_MPI_4

in the builds:

  • Trilinos-atdm-cee-rhel6_gnu-7.2.0_openmpi-1.10.2_serial_shared_opt
  • Trilinos-atdm-cee-rhel6_intel-18.0.2_mpich2-3.2_openmp_static_opt
  • Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-complex-shared-release-debug
  • Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-debug
  • Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-release
  • Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-release-debug
  • Trilinos-atdm-sems-rhel6-intel-17.0.1-openmp-release
  • Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug
  • Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-shared-release-debug
  • Trilinos-atdm-waterman_cuda-9.2_fpic_static_opt
  • Trilinos-atdm-waterman_cuda-9.2_shared_opt
  • Trilinos-atdm-waterman-cuda-9.2-debug
  • Trilinos-atdm-waterman-cuda-9.2-opt
  • Trilinos-atdm-waterman-cuda-9.2-rdc-release-debug
  • Trilinos-atdm-waterman-cuda-9.2-release-debug
  • Trilinos-atdm-white-ride-cuda-10.1-gnu-7.2.0-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug
  • Trilinos-atdm-white-ride-gnu-7.2.0-openmp-release

showing errors like:

kokkos/EasyParameterListInterpreter/coarse2.xml : failed

and

kokkos/EasyParameterListInterpreter/default_p2d.xml : failed

and

kokkos/EasyParameterListInterpreter/smoother12.xml : failed

and

kokkos/EasyParameterListInterpreter/driver_drekar1_np4.xml : failed

and

kokkos/EasyParameterListInterpreter/reuse-tP-2_np4.xml : failed

and

kokkos/EasyParameterListInterpreter/reuse-RAP-2_np4.xml : failed

and many others like this.

In fact, between the dates [2019-10-01, 2019-12-17], there were 377 of these failures across these various builds.

Looking at this query, the last failures of these tests that did not match the regex kokkos/EasyParameterListInterpreter.*[.]xml : failed were timeouts of the test MueLu_ParameterListInterpreterTpetra_MPI_1 in the build Trilinos-atdm-waterman_cuda-9.2_shared_opt, the last of which was on 2019-11-22. Therefore, the recent failures of these MueLu_ParameterListInterpreterTpetraXXX tests all show diff failures.
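
As a rough sketch of how that regex separates diff failures from other kinds of failure (such as timeouts), assuming the test output is available as a string. The helper name `is_diff_failure` and the sample timeout output are hypothetical; only the regex and the `: failed` line format come from this issue.

```python
import re

# Regex quoted above; it matches the per-file "failed" lines that are
# printed when a gold-file diff fails.
DIFF_FAILURE = re.compile(r"kokkos/EasyParameterListInterpreter.*[.]xml : failed")


def is_diff_failure(test_output: str) -> bool:
    """Return True if the output contains at least one gold-file diff failure line."""
    return DIFF_FAILURE.search(test_output) is not None


diff_output = "kokkos/EasyParameterListInterpreter/coarse2.xml : failed\n"
# Hypothetical output for a run that stopped before any diff result was printed.
timeout_output = "Testing: kokkos/EasyParameterListInterpreter/coarse2.xml\n"
```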

Something seems to have changed recently that has added many new failures in the 'cee-rhel6', 'sems-rhel6' and 'sems-rhel7' non-CUDA builds.

These tests seem to be about the most fragile of all the Trilinos tests if you look back at the last 1.5 years of effort trying to clean up these Trilinos builds and keep them clean. I wonder if this is not a good approach for writing a portable test suite?

@bartlettroscoe
Member Author

@lucbv,

Just to be clear, as shown in this query there were 198 failures of these tests across all of these 'cee-rhel6', 'sems-rhel6', 'sems-rhel7', 'waterman', and 'ride' builds in the recent date range [2019-12-01, 2019-12-17], which is well after PR #6364 was merged on 2019-11-26.

@lucbv
Contributor

lucbv commented Dec 17, 2019

@bartlettroscoe I will at least take a look at the failures on non-Cuda builds as the test can reasonably be expected to pass for them.
Regarding the tests for Cuda builds we will probably need to talk about it within the MueLu team and I don't think that it will happen before the shutdown...
I'll keep you posted.

@bartlettroscoe bartlettroscoe added ATDM Sev: Blocker Problems that make Trilinos unfit to be adopted by one or more ATDM APPs and removed ATDM Sev: Nonblocker Problems with Trilinos that should not block ATDM APPs from getting updates labels Dec 17, 2019
@bartlettroscoe
Member Author

FYI: I changed to ATDM Sev: Blocker for now until we can determine this is not a defect that will impact the ATDM APPs.

@bartlettroscoe
Member Author

CC: @trilinos/framework

FYI: As shown in this query, the last two post-push CI builds showed the test MueLu_ParameterListInterpreterTpetra_MPI_4 failing. And from this query, it looks like the failing test MueLu_ParameterListInterpreterTpetra_MPI_4 may have brought down the last testing iteration of PR #6454.

@bartlettroscoe
Member Author

@brian-kelley

FYI: And the failing test MueLu_ParameterListInterpreterTpetra_MPI_1 killed the PR iteration #6457 (comment). That PR has nothing to do with that test failure.

@lucbv
Contributor

lucbv commented Dec 18, 2019

@bartlettroscoe looking at the failure, it seems that the test machines are randomly adding line breaks to outputs. This is what is leading to the failures we are seeing.
Is there something that can be done to force the test machines to be consistent in the output they produce?

@brian-kelley
Contributor

I agree, it doesn't look like a MueLu bug:

Testing: default/FactoryParameterListInterpreter/driver_drekar2_np4.xml
Testing: default/FactoryParameterListInterpreter/repartition1_np4.xml
Testing: default/FactoryParameterListInterpreter/repartition1_np4.xml
Testing: default/FactoryParameterListInterpreter/repartition1_np4.xml
--- default/Output/driver_drekar2_np4_tpetra.gold_filtered	2019-12-18 05:55:21.084907219 -0700
+++ default/Output/driver_drekar2_np4_tpetra.out_filtered	2019-12-18 05:55:21.092907127 -0700
@@ -470,3 +470,5 @@
  Nullspace = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
  fix nullspace = 0   [default]
 
+t]
+
default/FactoryParameterListInterpreter/driver_drekar2_np4.xml : failed

@lucbv
Contributor

lucbv commented Dec 18, 2019

Looking at the recent changes in MueLu I am actually suspecting that PR #6432 is a potential offender for the non CUDA errors...

@bartlettroscoe
Member Author

Is there something that can be done to force the test machines to be consistent in the output they produce?

@lucbv, tests that read and write files are fragile in general. Is it possible to refactor these tests so that they write into a std::ostringstream object in memory and then compare inside of main memory? That should eliminate things like this (unless multiple MPI processes are writing output to the same stream).
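
The suggestion above is about C++ (`std::ostringstream`); what follows is only a Python analogue of the same idea using `io.StringIO`, not a sketch of the MueLu test code. The function `describe_hierarchy` is a hypothetical stand-in for whatever code prints the output being checked.

```python
import io


def describe_hierarchy(stream, num_levels: int) -> None:
    """Hypothetical stand-in for code that prints a multigrid summary to a stream."""
    stream.write(f"Number of levels    = {num_levels}\n")
    stream.write("Cycle type          = V\n")


def output_matches(expected: str, num_levels: int) -> bool:
    """Capture the output in memory and compare it without touching the filesystem."""
    buffer = io.StringIO()
    describe_hierarchy(buffer, num_levels)
    return buffer.getvalue() == expected


expected = "Number of levels    = 3\nCycle type          = V\n"
```

Because the producer takes a stream argument, the same code can write to a file or the console in production and to an in-memory buffer in the test.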

@lucbv
Contributor

lucbv commented Dec 18, 2019

@bartlettroscoe that type of refactor is going to be non-trivial and potentially quite long.
I think it is a decision that the package owner should make, not me...
@jhux2 any thoughts?

@cgcgcg
Contributor

cgcgcg commented Dec 19, 2019

Yes, I believe #6432 is to blame. I'll have another look at this next year.

@bartlettroscoe
Member Author

CC: @trilinos/framework

Yes, I believe #6432 is to blame. I'll have another look at this next year.

@cgcgcg, then you may want to revert PR #6432 until this can get fixed, because it looks like this has taken out PR testing, as shown in this query, which shows this test failing 7 times in testing of 6 different PRs.

Not speaking for other developers but I have a PR that I would like to get merged before 2020 so it might be good to allow this to happen before then by backing out #6432.

And then someone should seriously consider refactoring how these MueLu_ParameterListXXX tests work, because these have been some of the most problematic tests in the Trilinos test suite in ATDM Trilinos testing over the last 2 years.

@bartlettroscoe
Member Author

FYI: As shown in this query and this query, failures with the test MueLu_ParameterListInterpreterTpetra_MPI_1, at least in the build 'cee-rhel6_intel_18.0.2', appear to be random. It failed today and 3 days ago.

Therefore, perhaps people will get lucky and this test may pass in some PR testing iterations, allowing the PRs to get merged.

@jhux2
Member

jhux2 commented Dec 19, 2019

@bartlettroscoe wrote:

tests that read and write files are fragile in general. Is it possible to refactor these tests so that they write into a std::ostringstream object in memory and then compare inside of main memory? That should eliminate things like this (unless multiple MPI processes are writing output to the same stream).

Valid point, but as @lucbv points out, this would require diverting resources. @trilinos/muelu will discuss this in the new year.

@bartlettroscoe
Member Author

@trilinos/framework

FYI: As shown in:

you can see that this test is failing randomly in these two builds 25% and 15% of the time, respectively. That explains how the original PR #6432 that introduced this defect was able to be merged to 'develop'. The last of the 4 PR testing iterations that ran for PR #6432 had this test passing. This test killed at least one of the PR testing iterations in PR #6452 as shown here.

The Trilinos PR tester is designed to ignore randomly failing tests. And once one of those tests gets on the 'develop' branch, if that test has a moderately high probability of failing, then it results in mass failures of PR testing iterations.
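
A back-of-the-envelope calculation of what those rates imply, under assumptions that are not established in this thread: that the failures in the two builds are independent, that both builds gate the same PR testing iteration, and that the quoted 25% and 15% rates apply to every iteration. The function names are made up for this sketch.

```python
def pass_probability(failure_rates):
    """Probability that every build passes in one iteration, assuming independence."""
    p = 1.0
    for rate in failure_rates:
        p *= 1.0 - rate
    return p


def at_least_one_pass(n, p_pass):
    """Probability that at least one of n independent iterations passes."""
    return 1.0 - (1.0 - p_pass) ** n


# Failure rates quoted above for the two builds: 0.75 * 0.85 = 0.6375.
p_iteration_passes = pass_probability([0.25, 0.15])

# Chance that a PR gets at least one fully passing iteration within four tries.
p_within_four = at_least_one_pass(4, p_iteration_passes)
```

Under those assumptions a single iteration passes roughly 64% of the time, so a PR would usually get through within a few retries, yet about one iteration in three would still fail for reasons unrelated to the PR.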

@cgcgcg
Contributor

cgcgcg commented Dec 20, 2019

@bartlettroscoe Feel free to revert. This should take care of the random component.

@bartlettroscoe
Member Author

Feel free to revert. This should take care of the random component.

Looks like @lucbv addressed this in PR #6476. Looks like some PRs are getting merged today.

@cgcgcg
Contributor

cgcgcg commented Jan 6, 2020

@bartlettroscoe I have an updated version of PR #6432 that introduced the random failures: PR #6531.
Locally, I do not see random failures anymore, but I will let the autotester run over this a couple of times to make sure.

@bartlettroscoe
Member Author

bartlettroscoe commented Feb 19, 2020

FYI: As shown in this query, lots more failures of the tests:

  • MueLu_ParameterListInterpreterTpetra_MPI_1
  • MueLu_ParameterListInterpreterTpetra_MPI_4

in the builds on 'vortex':

  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_dbg_cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-2019.06.24_static_opt_cuda-aware-mpi

showing diff errors (click the "Show Matching Output" link at top right) like:

Testing: kokkos/EasyParameterListInterpreter/smoother12.xml
--- kokkos/Output/smoother12_tpetra.gold_filtered	2020-02-18 03:44:57.677781714 -0700
+++ kokkos/Output/smoother12_tpetra.out_filtered	2020-02-18 03:44:57.695103171 -0700
@@ -324,87 +324,6 @@
 matrixmatrix: kernel params -> 
  [empty list]
 
-Setup Smoother (MueLu::Ifpack2Smoother{type = CHEBYSHEV})
-keep smoother data = 0   [default]
-PreSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
-PostSmoother data = Teuchos::RCP<MueLu::SmootherPrototype<ignored> >{ptr=0,node=0,strong_count=0,weak_count=0}   [default]
-smoother -> 
- chebyshev: ratio eigenvalue = <ignored>
- chebyshev: boost factor = 1.1   [unused]
- chebyshev: min diagonal value = 2.22045e-16   [default]
- chebyshev: degree = 1   [default]
- chebyshev: eigenvalue max iterations = 10   [default]
- chebyshev: zero starting solution = 1   [default]
- chebyshev: assume matrix does not change = 0   [default]
-
-Level 5
-Prolongator smoothing (MueLu::SaPFactory_kokkos)
-Build (MueLu::CoalesceDropFactory_kokkos)
-Build (MueLu::AmalgamationFactory_kokkos)
-[empty list]
-
-algorithm = "classical": threshold = 0, blocksize = 1
-aggregation: drop tol = 0   [default]
-aggregation: Dirichlet threshold = 0   [default]
-aggregation: drop scheme = classical   [default]
-filtered matrix: use lumping = 1   [default]
-filtered matrix: reuse graph = 1   [default]
-filtered matrix: reuse eigenvalue = 1   [default]
-lightweight wrap = 1
-
-Build (MueLu::TentativePFactory_kokkos)
-Build (MueLu::UncoupledAggregationFactory_kokkos)
-BuildAggregates (Phase - (Dirichlet))
-BuildAggregatesRandom (Phase 1 (main))
-BuildAggregatesRandom (Phase 2a (secondary))
-BuildAggregatesRandom (Phase 2b (expansion))
-BuildAggregatesRandom (Phase 3 (cleanup))
-aggregation: max agg size = -1   [default]
-aggregation: min agg size = 2   [default]
-aggregation: max selected neighbors = 0   [default]
-aggregation: ordering = natural   [default]
-aggregation: deterministic = 0   [default]
-aggregation: coloring algorithm = serial   [default]
-aggregation: enable phase 1 = 1   [default]
-aggregation: enable phase 2a = 1   [default]
-aggregation: enable phase 2b = 1   [default]
-aggregation: enable phase 3 = 1   [default]
-aggregation: preserve Dirichlet points = 0   [default]
-aggregation: allow user-specified singletons = 0   [default]
-OnePt aggregate map name =    [default]
-OnePt aggregate map factory =    [default]
-
-Nullspace factory (MueLu::NullspaceFactory_kokkos)
-Fine level nullspace = Nullspace
-
-Build (MueLu::CoarseMapFactory_kokkos)
-[empty list]
-
-tentative: calculate qr = 1   [default]
-tentative: build coarse coordinates = 1   [default]
-matrixmatrix: kernel params -> 
- [empty list]
-
-sa: damping factor = 1.33   [default]
-sa: calculate eigenvalue estimate = 0   [default]
-sa: eigenvalue estimate num iterations = 10   [default]
-matrixmatrix: kernel params -> 
- [empty list]
-
-Transpose P (MueLu::TransPFactory)
-matrixmatrix: kernel params -> 
- [empty list]
-
-Computing Ac (MueLu::RAPFactory)
-transpose: use implicit = 0   [default]
-rap: triple product = 0   [default]
-rap: fix zero diagonals = 0   [default]
-rap: relative diagonal floor = {}   [default]
-CheckMainDiagonal = 0   [default]
-RepairMainDiagonal = 0   [default]
-matrixmatrix: kernel params -> 
- [empty list]
-
 Max coarse size (<= 100) achieved
 Setup Smoother (MueLu::Amesos2Smoother{type = <ignored>})
 keep smoother data = 0   [default]
@@ -419,7 +338,7 @@
 --------------------------------------------------------------------------------
 ---                            Multigrid Summary                             ---
 --------------------------------------------------------------------------------
-Number of levels    = 6
+Number of levels    = 5
 Operator complexity = <ignored>
 Smoother complexity = <ignored>
 Cycle type          = V
@@ -430,7 +349,6 @@
   <ignored>  
   <ignored>  
   <ignored>  
-  <ignored>  
 
 Smoother (level 0) both : "Ifpack2::Chebyshev": {Initialized: true, Computed: true, "Ifpack2::Details::Chebyshev":{degree: 1, lambdaMax = <ignored>, alpha: 2, lambdaMin = <ignored>, boost factor: 1.1}, Global matrix dimensions: <ignored>, Global nnz: <ignored>}
 
@@ -440,10 +358,6 @@
 
 Smoother (level 3) both : "Ifpack2::Chebyshev": {Initialized: true, Computed: true, "Ifpack2::Details::Chebyshev":{degree: 1, lambdaMax = <ignored>, alpha: <ignored>, lambdaMin = <ignored>, boost factor: 1.1}, Global matrix dimensions: <ignored>, Global nnz: <ignored>}
 
-Smoother (level 4) both : "Ifpack2::Chebyshev": {Initialized: true, Computed: true, "Ifpack2::Details::Chebyshev":{degree: 1, lambdaMax = <ignored>, alpha: <ignored>, lambdaMin = <ignored>, boost factor: 1.1}, Global matrix dimensions: <ignored>, Global nnz: <ignored>}
-
-Smoother (level 5) pre  : <Direct> solver interface
-Smoother (level 5) post : no smoother
-
-
+Smoother (level 4) pre  : <Direct> solver interface
+Smoother (level 4) post : no smoother
 
kokkos/EasyParameterListInterpreter/smoother12.xml : failed

@bartlettroscoe
Member Author

These seem to just be very fragile tests. Can we just turn these tests off in all of the ATDM Trilinos builds? If you look at this query and this query, you can see that these tests pop up in a lot of bug reports. They either need to be rewritten to be more robust or they just need to be disabled in the ATDM Trilinos builds. (They can stay in all of the other builds, like the PR builds.)

Okay?

@jhux2
Member

jhux2 commented Feb 19, 2020

These seem to just be very fragile tests. Can we just turn these tests off in all of the ATDM Trilinos builds? If you look at this query and this query, you can see that these tests pop up in a lot of bug reports. They either need to be rewritten to be more robust or they just need to be disabled in the ATDM Trilinos builds. (They can stay in all of the other builds, like the PR builds.)

Okay?

@bartlettroscoe The @trilinos/muelu team will have a look and make a decision about the tests. Please don't disable them yet.

@bartlettroscoe
Member Author

@bartlettroscoe The @trilinos/muelu team will have a look and make a decision about the tests. Please don't disable them yet.

@jhux2, okay, let me know how that goes.

@bartlettroscoe bartlettroscoe changed the title Several MueLu_ParameterListXXX tests failing in all CUDA builds on 'white', 'ride', and 'waterman' starting 2019-11-23 and other non-CUDA builds starting 2019-12-14 Several MueLu_ParameterListXXX tests failing in all CUDA builds on 'white', 'ride', 'waterman', and 'vortex' starting 2019-11-23 and other non-CUDA builds starting 2019-12-14 Feb 21, 2020
@cgcgcg
Contributor

cgcgcg commented Mar 6, 2020

@bartlettroscoe We have been working on this issue, and we believe that this should be fixed now. Let's monitor this for a week or so before we declare victory?

@bartlettroscoe
Member Author

We have been working on this issue, and we believe that this should be fixed now. Let's monitor this for a week or so before we declare victory?

Okay

FYI: @rmmilewi will be working on a bot that will automatically update issues like this for the status of the related tests (see #3778). That will eliminate the manual work to follow up on GitHub issues like this.

@bartlettroscoe
Member Author

FYI: The tests:

  • MueLu_ParameterListInterpreterTpetra_MPI_1
  • MueLu_ParameterListInterpreterTpetra_MPI_4

are no longer running in ATDM Trilinos 'complex' builds because GEMMA does not use MueLu. Therefore, these tests are not even running in the builds:

  • Trilinos-atdm-sems-rhel6-gnu-7.2.0-openmp-complex-shared-release-debug
  • Trilinos-atdm-sems-rhel7-intel-18.0.5-openmp-complex-shared-release-debug

@cgcgcg
Contributor

cgcgcg commented Mar 16, 2020

@bartlettroscoe Can we close this?

@bartlettroscoe
Member Author

Can we close this?

Results yesterday shown here showed all of these tests passing except for timeouts of the test:

  • MueLu_ParameterListInterpreterTpetra_MPI_1

in the builds:

  • Trilinos-atdm-waterman_cuda-9.2_shared_opt
  • Trilinos-atdm-waterman-cuda-9.2-opt

We have seen problems with timeouts in other tests in these builds, such as those reported in #6804, #6801, and #6799 (which is likely a problem with running multiple kernels on the same GPU, which will hopefully be addressed by #6840).

Excluding timeouts, as shown in this query there have been no failing MueLu_ParameterList tests since 2020-03-03 (about 2 weeks). Therefore, I think these diffs have been fixed.

Therefore, since this issue #6361 is about failures and not timeouts, we can close this issue. (I will open a new issue for the timeouts).

@bartlettroscoe
Member Author

Closing as complete.
