Four MueLu tests broken in CI build in push to MueLu on 2/19/2018 #2264

bartlettroscoe · 2018-02-19T23:46:40Z

CC: @trilinos/muelu, @trilinos/framework

Next Action Status

The original failure was caused due to a violation of the additive test assumption of branches due to a 10 day time lag between when the branch was tested and when it was merged. See #2171 (comment) and #2312.

Description

The push of the merge commit:

commit 22e7f91ed633a22d6d44baf46d8f1e4fca7bfb3e
Merge: 52f3559 d851185
Author: Luc Berger <lberge@sandia.gov>
Date:   Mon Feb 19 15:38:53 2018 -0700

    Merge pull request #2171 from lucbv/develop
    
    MueLu_structured_aggregation

broke the standard CI build as shown at:

https://testing.sandia.gov/cdash/index.php?project=Trilinos&parentid=3395403&

which involved breaking the four tests:

MueLu_UnitTestsEpetra_MPI_1
MueLu_UnitTestsEpetra_MPI_4
MueLu_UnitTestsTpetra_MPI_1
MueLu_UnitTestsTpetra_MPI_4

This was the only merge commit pulled in the CI iteration as shown at:

https://testing.sandia.gov/cdash/viewNotes.php?buildid=3395404##note3

Therefore, I will back out this merge commit and run the checkin-test-sems.sh script to fix this ASAP (but I will only run the MueLu test to speed this up). Since the CI build passed before this merge commit was pushed, that should be enough testing. Otherwise, everyone's automated PR testing for changes to MueLu or upsstream from MueLu will fail because of this.

The text was updated successfully, but these errors were encountered:

bartlettroscoe · 2018-02-20T00:01:33Z

Wait, looking at PR #2171, it appears that the automated PR testing says that this branch passed as of a day ago (i.e. 2/18/2018). How can that be the case? For one thing, the PR testing should have been broken for all branches that changed any code in Stokhos or upstream from Stokhos since 2/15/2018 as per #2254.

Given the nature of the failures shown for these tests, I can't see how any build involving these tests can be passing. For example, the test MueLu_UnitTestsEpetra_MPI_1 at:

https://testing.sandia.gov/cdash/testDetails.php?test=44776344&build=3395404

shows a failure:

. Aggregates_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_JustStructuredAggregationGlobal_UnitTest ... About to build the coordinates
About to build the aggregates

 debug: scalar = i
 debug: node   = N6Kokkos6Compat23KokkosDeviceWrapperNodeINS_6SerialENS_9HostSpaceEEE
 debug: linAlgebra = Epetra
 version: MueLu development
 
 p=0: *** Caught standard std::exception of type 'Teuchos::Exceptions::InvalidParameterName' :
 
  Error, the parameter {name="meshLayout",type="string",value="Global Lexicographic"}
  in the parameter (sub)list "ANONYMOUS"
  was not found in the list of valid parameters!
  
  The valid parameters and types are:
    {
      "aggregation: preserve Dirichlet points" : bool = 0
      "aggregation: allow user-specified singletons" : bool = 0
      "aggregation: error on nodes with no on-rank neighbors" : bool = 0
      "aggregation: mesh layout" : string = Global Lexicographic
      "aggregation: number of spatial dimensions" : int = 3
      "aggregation: coarsening order" : int = 0
      "aggregation: coarsening rate" : string = {3}
      "aggregation: mesh data" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "Coordinates" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "gNodesPerDim" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "lNodesPerDim" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "OnePt aggregate map name" : string = 
      "OnePt aggregate map factory" : string = 
    }
  
  
  Throw number = 2
  
  
 [FAILED]  (0.000874 sec) Aggregates_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_JustStructuredAggregationGlobal_UnitTest
 Location: /scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/muelu/test/unit_tests/Aggregates.cpp:541

That is a ParameterList error. That is not some floating point rounding error or something. Therefore, every build should generate this same error.

To protect developers still using the checkin-test-sems.sh script, I will disable these tests just for that system. But if this also fails in any of the the ATDM builds tomorrow morning, we will need to back out this merge merge commit tomorrow.

Otherwise, the @trilinos/framework team should investigate how the PR testing could report this branch as passing when it should have been reported as failing (as per #2254).

This will fix the standard CI build but not any other build. These tests will run in every other build they were running before. Build/Test Cases Summary Enabled Packages: MueLu Disabled Packages: PyTrilinos,Claps,TriKota 0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=76,notpassed=0 (20.46 min)

bartlettroscoe · 2018-02-20T00:40:55Z

I just pushed the commit a332932:

commit a33293251d3742083968b9d08cbff08372cbb753
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Mon Feb 19 17:08:07 2018 -0700

    Disable newly failing MueLu_UnitTest for standard CI build (#2264)
    
    This will fix the standard CI build but not any other build. These tests will
    run in every other build they were running before.
    
    Build/Test Cases Summary
    Enabled Packages: MueLu
    Disabled Packages: PyTrilinos,Claps,TriKota
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=76,notpassed=0 (20.46 min)

M       cmake/std/BasicCiTestingSettings.cmake

We should see the standard CI build clear up next CI iteration. We will see what happens with the other builds.

bartlettroscoe · 2018-02-20T15:22:30Z

These four MueLu_UnitTestsXXX tests are now disabled in the standard CI build as shown at:

(see the -4 by the number of failing tests). And the new CDash site specifically shows these as missing at:

https://testing-vm.sandia.gov/cdash/viewTest.php?onlypassed&buildid=3325832

(Yea, that feature that CASL paid for works!)

But as I predicted, these tests are failing in about every other build of Trilinos, including all of the "Clean" builds as shown at:

https://testing-vm.sandia.gov/cdash/queryTests.php?project=Trilinos&date=2018-02-20&filtercombine=and&filtercount=2&showfilters=1&filtercombine=and&field1=testname&compare1=65&value1=MueLu_UnitTests&field2=status&compare2=62&value2=Passed

(NOTE: That same query is broken on the testing.sandia.gov site :-( )

This means that all of the automated PR builds that include changes to MueLu or packages upstream to MueLu should fail.

Again I ask, how is it that the automated PR testing for this branch say that it passed and then to merge it and have it fail everywhere?

lucbv · 2018-02-20T17:00:45Z

@bartlettroscoe I have issued a new pull request (#2266) to fix these dashboard failures.
My previous pull request seemed to pass the auto-tester checks fine so I thought it would be fine to merge it?
Evidently something went wrong with that process. For the current pull request I have pushed the changes to my branch using the checkin script before issue the pull request. The checkin script returned good results and the offending tests are not failing any more. I am expecting the auto-tester to accept my changes?

I think the main reason for this failure is that the auto-tester is not yet pushing results to a dashboard accessible by everyone. I could have realized that a package or test was not properly being tested had been able to read what the tester had done.
I know that it takes time to hook the auto-tester with the dashboard and that's fine, hopefully we will learn something from this though...

bartlettroscoe · 2018-02-20T20:45:56Z

@lucbv,

You did not do anything wrong. You were following the process like you were told to follow and that process somehow allowed broken tests to get merged to 'develop'. Since the current automated PR testing process is opaque, we can only guess what happened.

Therefore, I would recommend that until the automated PR testing process is displaying results on CDash so we know exactly what is being tested, that it might be better to do that final merge and push using the checkin-test-sems.sh script using the tip on merging and pushing topic branches at:

https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing#other-workflows

But in the meantime, every build of Trilinos that runs MueLu tests is currently broken (which should break every auto PR test involving changes to MueLu or upstream packages if it is doing what it should). Therefore, we either need to back out your merge commit or fix the failing tests ASAP.

NOTE: If you want to fix and then push with the checkin-test-sems.sh script, you will first need to revert the targeted disable of these tests in your local branch. You can do this with:

$ cd Trilinos/
$ checkout develop
$ git pull
$ git revert a33293251d3742083968b9d08cbff08372cbb753

Then you should be able to reproduce the test failures using:

$ ./checkin-test-sems.sh --enable-packages=MueLu --no-enable-fwd-packages --local-do-all

And then you can fix them and verify that they are fixed by rerunning that build (see the instructions on locally running that build at https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing#reproducing-failures-manually).

Please let me know if you need any help with this if even it is to just revert your merge commit so you can work on the fix for this offline.

lucbv · 2018-02-20T21:12:32Z

@bartlettroscoe, I ran the checkin script with all tests enabled and it passed fine, although for some reason I did not need to follow the procedure that you outlined... here is the output of my checkin, as you can see both E and T petra unit-test are passing.
ctest.txt

I did these tests in my own branch before merging the pull request, see 9b2061c.
Let me know if this is sufficient to re-enable these tests?

bartlettroscoe · 2018-02-20T21:46:59Z

I ran the checkin script with all tests enabled and it passed fine

@lucbv, that does not seem possible given what we are seeing on CDash. The change made in 9b2061c looks to be unrelated to the failures we are seeing on CDash like at:

https://testing.sandia.gov/cdash/testDetails.php?test=44790492&build=3396692

I just did:

$ cd Trilinos/
$ git checkout develop
$ git pull

$ git log-short --name-status -1
commit 50a975d66e2d3c7dc6dc985dbb36c3275346768f
Merge: 48c2daa 9b2061c
Author: Luc Berger <lberge@sandia.gov>
Date:   Tue Feb 20 12:39:55 2018 -0700

    Merge pull request #2266 from lucbv/structured_aggregation_cleanup
    
    MueLu: Structured aggregation cleanup

$ git revert a332932

and now I am running:

$ ./checkin-test-sems.sh --enable-all-packages=off --enable-packages=MueLu --no-enable-fwd-packages --local-do-all

I will post the results here when finished.

bartlettroscoe · 2018-02-20T21:49:16Z

But note that the following commits have been pushed the the 'develop' branch since the nightly tests ran last night:

50a975d "Merge pull request #2266 from lucbv/structured_aggregation_cleanup"
Author: Luc Berger <lberge@sandia.gov>
Date:   Tue Feb 20 12:39:55 2018 -0700 (2 hours ago)

48c2daa "Merge pull request #2248 from jhux2/fix-eti-issue-2232"
Author: Jonathan Hu <jhu@sandia.gov>
Date:   Tue Feb 20 12:05:46 2018 -0500 (5 hours ago)

9b2061c "MueLu: removing warning due to helper function that generates structured grids"
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date:   Tue Feb 20 09:35:59 2018 -0700 (5 hours ago)

M       packages/muelu/test/unit_tests/MueLu_TestHelpers.hpp

605e805 "MueLu: clean up of unit tests and StructuredAggregationFactory"
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date:   Tue Feb 20 08:01:05 2018 -0700 (7 hours ago)

M       packages/muelu/src/Graph/StructuredAggregation/MueLu_StructuredAggregationFactory_decl.hpp
M       packages/muelu/src/Graph/StructuredAggregation/MueLu_StructuredAggregationFactory_def.hpp
M       packages/muelu/test/unit_tests/Aggregates.cpp
M       packages/muelu/test/unit_tests/StructuredAggregationFactory.cpp

1b01a63 "MueLu: change template parameter name"
Author: Jonathan Hu <jhu@sandia.gov>
Date:   Mon Feb 19 17:45:56 2018 -0700 (21 hours ago)

M       packages/muelu/test/unit_tests/Galeri.cpp

039d70c "MueLu: tests: add ETI for certain Galeri functions"
Author: Jonathan Hu <jhu@sandia.gov>
Date:   Mon Jan 29 15:43:40 2018 -0800 (4 days ago)

M       packages/muelu/test/unit_tests/CMakeLists.txt
A       packages/muelu/test/unit_tests/Galeri.cpp
M       packages/muelu/test/unit_tests/MueLu_TestHelpers.hpp
M       packages/muelu/test/unit_tests/Repartition.cpp
M       packages/muelu/test/unit_tests/TentativePFactory.cpp
M       packages/muelu/test/unit_tests/UnsmooshFactory.cpp
M       packages/muelu/test/unit_tests/VariableDofLaplacianFactory.cpp

Did one of these commits fix the failing tests? Me thinks one of those commits from @jhux2 might have fixed these tests. We will see. If so, that would explain why you are now seeing passing tests.

lucbv · 2018-02-20T21:51:35Z

@bartlettroscoe I think that you want to look at 605e805, I think that it should fix the issue with the unit-tests, it was also included today's pull request. The previous commit I was pointing you to was so you can see the checkin script message saying the tests have passed.

…2264)" This reverts commit a332932. Build/Test Cases Summary Enabled Packages: MueLu Disabled Packages: PyTrilinos,Claps,TriKota 0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=80,notpassed=0 (2.26 min)

bartlettroscoe · 2018-02-20T21:57:59Z

Yup, it passed:

$ grep "MueLu_UnitTest" MPI_RELEASE_DEBUG_SHARED_PT/ctest.out 
      Start 21: MueLu_UnitTestsTpetra_MPI_1
      Start 22: MueLu_UnitTestsTpetra_MPI_4
      Start 19: MueLu_UnitTestsEpetra_MPI_1
 1/80 Test #19: MueLu_UnitTestsEpetra_MPI_1 ..................................   Passed    7.21 sec
      Start 20: MueLu_UnitTestsEpetra_MPI_4
 7/80 Test #20: MueLu_UnitTestsEpetra_MPI_4 ..................................   Passed    5.54 sec
21/80 Test #22: MueLu_UnitTestsTpetra_MPI_4 ..................................   Passed   28.87 sec
24/80 Test #21: MueLu_UnitTestsTpetra_MPI_1 ..................................   Passed   30.28 sec

Looks like @jhux2 fixed the failing tests. I will push the revert of the commit a332932.

But none of this explains how the PR testing reported in #2171 passed but resulted in these tests failing. Someone must try to figure that out so that the process and/or tools can be changed to avoid this in the future.

I will bet almost anything that if you checkout the merge commit 22e7f91 locally and then run:

$ ./checkin-test-sems.sh --enable-all-packages=off --enable-packages=MueLu  \
  --no-enable-fwd-packages --local-do-all

that it will show these four tests as failing.

bartlettroscoe · 2018-02-20T21:58:53Z

I just pushed the commit:

commit ee474ba23f3d36af6ae5461c07093ebbbd1964bc
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Tue Feb 20 14:36:22 2018 -0700

    Revert "Disable newly failing MueLu_UnitTest for standard CI build (#2264)"
    
    This reverts commit a33293251d3742083968b9d08cbff08372cbb753.
    
    Build/Test Cases Summary
    Enabled Packages: MueLu
    Disabled Packages: PyTrilinos,Claps,TriKota
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=80,notpassed=0 (2.26 min)

M       cmake/std/BasicCiTestingSettings.cmake

which showed these 4 tests passing.

Now we wait to see these four tests showing up again in the CI build and watch to see that all of the other builds are fixed w.r.t. to these failing tests.

lucbv · 2018-02-21T14:11:35Z

@bartlettroscoe it seems that my pull request fixed these issues as the dashboard is now clean. I am closing this issue. Thanks for the help

bartlettroscoe · 2018-02-21T15:14:55Z

And we see these four tests have been added back to the standard CI build in the CI iteration this morning at:

https://testing.sandia.gov/cdash/viewTest.php?onlydelta&buildid=3399091

bartlettroscoe · 2018-03-30T14:43:03Z

Just to summarize, this failure was caused due to a violation of the additive test assumption of branches due to a 10 day time lag between when the branch was tested and when it was merged. See #2171 (comment) for details. The need to address this in the auto PR system was added to #2312.

bartlettroscoe added pkg: MueLu Framework tasks Framework tasks (used internally by Framework team) labels Feb 19, 2018

This was referenced Feb 20, 2018

Address basic stability of Trilinos 'develop' branch short-term #1304

Closed

MueLu_structured_aggregation #2171

Merged

bartlettroscoe mentioned this issue Feb 20, 2018

Test Stokhos_TpetraCrsMatrixUQPCEUnitTest_Serial[_MPI_4] fails standard CI build and several other builds including the ATDM build Trilinos-atdm-sems-gcc-7-2-0 #2254

Closed

lucbv closed this as completed Feb 21, 2018

bartlettroscoe mentioned this issue Mar 1, 2018

Issues to be address before making automated PR testing and merging mandatory #2312

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Four MueLu tests broken in CI build in push to MueLu on 2/19/2018 #2264

Four MueLu tests broken in CI build in push to MueLu on 2/19/2018 #2264

bartlettroscoe commented Feb 19, 2018 •

edited

Loading

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 20, 2018 •

edited

Loading

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 20, 2018 •

edited

Loading

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 21, 2018

bartlettroscoe commented Feb 21, 2018

bartlettroscoe commented Mar 30, 2018

Four MueLu tests broken in CI build in push to MueLu on 2/19/2018 #2264

Four MueLu tests broken in CI build in push to MueLu on 2/19/2018 #2264

Comments

bartlettroscoe commented Feb 19, 2018 • edited Loading

Next Action Status

Description

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 20, 2018 • edited Loading

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 20, 2018 • edited Loading

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

bartlettroscoe commented Feb 20, 2018

lucbv commented Feb 21, 2018

bartlettroscoe commented Feb 21, 2018

bartlettroscoe commented Mar 30, 2018

bartlettroscoe commented Feb 19, 2018 •

edited

Loading

lucbv commented Feb 20, 2018 •

edited

Loading

lucbv commented Feb 20, 2018 •

edited

Loading