Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Four MueLu tests broken in CI build in push to MueLu on 2/19/2018 #2264

Closed
bartlettroscoe opened this issue Feb 19, 2018 · 14 comments
Closed
Labels
Framework tasks Framework tasks (used internally by Framework team) pkg: MueLu

Comments

@bartlettroscoe
Copy link
Member

bartlettroscoe commented Feb 19, 2018

CC: @trilinos/muelu, @trilinos/framework

Next Action Status

The original failure was caused due to a violation of the additive test assumption of branches due to a 10 day time lag between when the branch was tested and when it was merged. See #2171 (comment) and #2312.

Description

The push of the merge commit:

commit 22e7f91ed633a22d6d44baf46d8f1e4fca7bfb3e
Merge: 52f3559 d851185
Author: Luc Berger <lberge@sandia.gov>
Date:   Mon Feb 19 15:38:53 2018 -0700

    Merge pull request #2171 from lucbv/develop
    
    MueLu_structured_aggregation

broke the standard CI build as shown at:

which involved breaking the four tests:

  • MueLu_UnitTestsEpetra_MPI_1
  • MueLu_UnitTestsEpetra_MPI_4
  • MueLu_UnitTestsTpetra_MPI_1
  • MueLu_UnitTestsTpetra_MPI_4

This was the only merge commit pulled in the CI iteration as shown at:

Therefore, I will back out this merge commit and run the checkin-test-sems.sh script to fix this ASAP (but I will only run the MueLu test to speed this up). Since the CI build passed before this merge commit was pushed, that should be enough testing. Otherwise, everyone's automated PR testing for changes to MueLu or upsstream from MueLu will fail because of this.

@bartlettroscoe bartlettroscoe added pkg: MueLu Framework tasks Framework tasks (used internally by Framework team) labels Feb 19, 2018
@bartlettroscoe
Copy link
Member Author

Wait, looking at PR #2171, it appears that the automated PR testing says that this branch passed as of a day ago (i.e. 2/18/2018). How can that be the case? For one thing, the PR testing should have been broken for all branches that changed any code in Stokhos or upstream from Stokhos since 2/15/2018 as per #2254.

Given the nature of the failures shown for these tests, I can't see how any build involving these tests can be passing. For example, the test MueLu_UnitTestsEpetra_MPI_1 at:

shows a failure:

. Aggregates_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_JustStructuredAggregationGlobal_UnitTest ... About to build the coordinates
About to build the aggregates

 debug: scalar = i
 debug: node   = N6Kokkos6Compat23KokkosDeviceWrapperNodeINS_6SerialENS_9HostSpaceEEE
 debug: linAlgebra = Epetra
 version: MueLu development
 
 p=0: *** Caught standard std::exception of type 'Teuchos::Exceptions::InvalidParameterName' :
 
  Error, the parameter {name="meshLayout",type="string",value="Global Lexicographic"}
  in the parameter (sub)list "ANONYMOUS"
  was not found in the list of valid parameters!
  
  The valid parameters and types are:
    {
      "aggregation: preserve Dirichlet points" : bool = 0
      "aggregation: allow user-specified singletons" : bool = 0
      "aggregation: error on nodes with no on-rank neighbors" : bool = 0
      "aggregation: mesh layout" : string = Global Lexicographic
      "aggregation: number of spatial dimensions" : int = 3
      "aggregation: coarsening order" : int = 0
      "aggregation: coarsening rate" : string = {3}
      "aggregation: mesh data" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "Coordinates" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "gNodesPerDim" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "lNodesPerDim" : Teuchos::RCP<MueLu::FactoryBase const> = Teuchos::RCP<MueLu::FactoryBase const>{ptr=0,node=0,strong_count=0,weak_count=0}
      "OnePt aggregate map name" : string = 
      "OnePt aggregate map factory" : string = 
    }
  
  
  Throw number = 2
  
  
 [FAILED]  (0.000874 sec) Aggregates_double_int_int_Kokkos_Compat_KokkosSerialWrapperNode_JustStructuredAggregationGlobal_UnitTest
 Location: /scratch/rabartl/Trilinos.base/SEMSCIBuild/Trilinos/packages/muelu/test/unit_tests/Aggregates.cpp:541

That is a ParameterList error. That is not some floating point rounding error or something. Therefore, every build should generate this same error.

To protect developers still using the checkin-test-sems.sh script, I will disable these tests just for that system. But if this also fails in any of the the ATDM builds tomorrow morning, we will need to back out this merge merge commit tomorrow.

Otherwise, the @trilinos/framework team should investigate how the PR testing could report this branch as passing when it should have been reported as failing (as per #2254).

bartlettroscoe added a commit that referenced this issue Feb 20, 2018
This will fix the standard CI build but not any other build. These tests will
run in every other build they were running before.

Build/Test Cases Summary
Enabled Packages: MueLu
Disabled Packages: PyTrilinos,Claps,TriKota
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=76,notpassed=0 (20.46 min)
@bartlettroscoe
Copy link
Member Author

I just pushed the commit a332932:

commit a33293251d3742083968b9d08cbff08372cbb753
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Mon Feb 19 17:08:07 2018 -0700

    Disable newly failing MueLu_UnitTest for standard CI build (#2264)
    
    This will fix the standard CI build but not any other build. These tests will
    run in every other build they were running before.
    
    Build/Test Cases Summary
    Enabled Packages: MueLu
    Disabled Packages: PyTrilinos,Claps,TriKota
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=76,notpassed=0 (20.46 min)

M       cmake/std/BasicCiTestingSettings.cmake

We should see the standard CI build clear up next CI iteration. We will see what happens with the other builds.

@bartlettroscoe
Copy link
Member Author

These four MueLu_UnitTestsXXX tests are now disabled in the standard CI build as shown at:

(see the -4 by the number of failing tests). And the new CDash site specifically shows these as missing at:

(Yea, that feature that CASL paid for works!)

But as I predicted, these tests are failing in about every other build of Trilinos, including all of the "Clean" builds as shown at:

(NOTE: That same query is broken on the testing.sandia.gov site :-( )

This means that all of the automated PR builds that include changes to MueLu or packages upstream to MueLu should fail.

Again I ask, how is it that the automated PR testing for this branch say that it passed and then to merge it and have it fail everywhere?

@lucbv
Copy link
Contributor

lucbv commented Feb 20, 2018

@bartlettroscoe I have issued a new pull request (#2266) to fix these dashboard failures.
My previous pull request seemed to pass the auto-tester checks fine so I thought it would be fine to merge it?
Evidently something went wrong with that process. For the current pull request I have pushed the changes to my branch using the checkin script before issue the pull request. The checkin script returned good results and the offending tests are not failing any more. I am expecting the auto-tester to accept my changes?

I think the main reason for this failure is that the auto-tester is not yet pushing results to a dashboard accessible by everyone. I could have realized that a package or test was not properly being tested had been able to read what the tester had done.
I know that it takes time to hook the auto-tester with the dashboard and that's fine, hopefully we will learn something from this though...

@bartlettroscoe
Copy link
Member Author

@lucbv,

You did not do anything wrong. You were following the process like you were told to follow and that process somehow allowed broken tests to get merged to 'develop'. Since the current automated PR testing process is opaque, we can only guess what happened.

Therefore, I would recommend that until the automated PR testing process is displaying results on CDash so we know exactly what is being tested, that it might be better to do that final merge and push using the checkin-test-sems.sh script using the tip on merging and pushing topic branches at:

But in the meantime, every build of Trilinos that runs MueLu tests is currently broken (which should break every auto PR test involving changes to MueLu or upstream packages if it is doing what it should). Therefore, we either need to back out your merge commit or fix the failing tests ASAP.

NOTE: If you want to fix and then push with the checkin-test-sems.sh script, you will first need to revert the targeted disable of these tests in your local branch. You can do this with:

$ cd Trilinos/
$ checkout develop
$ git pull
$ git revert a33293251d3742083968b9d08cbff08372cbb753

Then you should be able to reproduce the test failures using:

$ ./checkin-test-sems.sh --enable-packages=MueLu --no-enable-fwd-packages --local-do-all

And then you can fix them and verify that they are fixed by rerunning that build (see the instructions on locally running that build at https://github.com/trilinos/Trilinos/wiki/Policies-%7C-Safe-Checkin-Testing#reproducing-failures-manually).

Please let me know if you need any help with this if even it is to just revert your merge commit so you can work on the fix for this offline.

@lucbv
Copy link
Contributor

lucbv commented Feb 20, 2018

@bartlettroscoe, I ran the checkin script with all tests enabled and it passed fine, although for some reason I did not need to follow the procedure that you outlined... here is the output of my checkin, as you can see both E and T petra unit-test are passing.
ctest.txt

I did these tests in my own branch before merging the pull request, see 9b2061c.
Let me know if this is sufficient to re-enable these tests?

@bartlettroscoe
Copy link
Member Author

I ran the checkin script with all tests enabled and it passed fine

@lucbv, that does not seem possible given what we are seeing on CDash. The change made in 9b2061c looks to be unrelated to the failures we are seeing on CDash like at:

I just did:

$ cd Trilinos/
$ git checkout develop
$ git pull

$ git log-short --name-status -1
commit 50a975d66e2d3c7dc6dc985dbb36c3275346768f
Merge: 48c2daa 9b2061c
Author: Luc Berger <lberge@sandia.gov>
Date:   Tue Feb 20 12:39:55 2018 -0700

    Merge pull request #2266 from lucbv/structured_aggregation_cleanup
    
    MueLu: Structured aggregation cleanup

$ git revert a332932

and now I am running:

$ ./checkin-test-sems.sh --enable-all-packages=off --enable-packages=MueLu --no-enable-fwd-packages --local-do-all

I will post the results here when finished.

@bartlettroscoe
Copy link
Member Author

But note that the following commits have been pushed the the 'develop' branch since the nightly tests ran last night:

50a975d "Merge pull request #2266 from lucbv/structured_aggregation_cleanup"
Author: Luc Berger <lberge@sandia.gov>
Date:   Tue Feb 20 12:39:55 2018 -0700 (2 hours ago)

48c2daa "Merge pull request #2248 from jhux2/fix-eti-issue-2232"
Author: Jonathan Hu <jhu@sandia.gov>
Date:   Tue Feb 20 12:05:46 2018 -0500 (5 hours ago)

9b2061c "MueLu: removing warning due to helper function that generates structured grids"
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date:   Tue Feb 20 09:35:59 2018 -0700 (5 hours ago)

M       packages/muelu/test/unit_tests/MueLu_TestHelpers.hpp

605e805 "MueLu: clean up of unit tests and StructuredAggregationFactory"
Author: Luc Berger-Vergiat <lberge@sandia.gov>
Date:   Tue Feb 20 08:01:05 2018 -0700 (7 hours ago)

M       packages/muelu/src/Graph/StructuredAggregation/MueLu_StructuredAggregationFactory_decl.hpp
M       packages/muelu/src/Graph/StructuredAggregation/MueLu_StructuredAggregationFactory_def.hpp
M       packages/muelu/test/unit_tests/Aggregates.cpp
M       packages/muelu/test/unit_tests/StructuredAggregationFactory.cpp

1b01a63 "MueLu: change template parameter name"
Author: Jonathan Hu <jhu@sandia.gov>
Date:   Mon Feb 19 17:45:56 2018 -0700 (21 hours ago)

M       packages/muelu/test/unit_tests/Galeri.cpp

039d70c "MueLu: tests: add ETI for certain Galeri functions"
Author: Jonathan Hu <jhu@sandia.gov>
Date:   Mon Jan 29 15:43:40 2018 -0800 (4 days ago)

M       packages/muelu/test/unit_tests/CMakeLists.txt
A       packages/muelu/test/unit_tests/Galeri.cpp
M       packages/muelu/test/unit_tests/MueLu_TestHelpers.hpp
M       packages/muelu/test/unit_tests/Repartition.cpp
M       packages/muelu/test/unit_tests/TentativePFactory.cpp
M       packages/muelu/test/unit_tests/UnsmooshFactory.cpp
M       packages/muelu/test/unit_tests/VariableDofLaplacianFactory.cpp

Did one of these commits fix the failing tests? Me thinks one of those commits from @jhux2 might have fixed these tests. We will see. If so, that would explain why you are now seeing passing tests.

@lucbv
Copy link
Contributor

lucbv commented Feb 20, 2018

@bartlettroscoe I think that you want to look at 605e805, I think that it should fix the issue with the unit-tests, it was also included today's pull request. The previous commit I was pointing you to was so you can see the checkin script message saying the tests have passed.

bartlettroscoe added a commit that referenced this issue Feb 20, 2018
…2264)"

This reverts commit a332932.

Build/Test Cases Summary
Enabled Packages: MueLu
Disabled Packages: PyTrilinos,Claps,TriKota
0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=80,notpassed=0 (2.26 min)
@bartlettroscoe
Copy link
Member Author

Yup, it passed:

$ grep "MueLu_UnitTest" MPI_RELEASE_DEBUG_SHARED_PT/ctest.out 
      Start 21: MueLu_UnitTestsTpetra_MPI_1
      Start 22: MueLu_UnitTestsTpetra_MPI_4
      Start 19: MueLu_UnitTestsEpetra_MPI_1
 1/80 Test #19: MueLu_UnitTestsEpetra_MPI_1 ..................................   Passed    7.21 sec
      Start 20: MueLu_UnitTestsEpetra_MPI_4
 7/80 Test #20: MueLu_UnitTestsEpetra_MPI_4 ..................................   Passed    5.54 sec
21/80 Test #22: MueLu_UnitTestsTpetra_MPI_4 ..................................   Passed   28.87 sec
24/80 Test #21: MueLu_UnitTestsTpetra_MPI_1 ..................................   Passed   30.28 sec

Looks like @jhux2 fixed the failing tests. I will push the revert of the commit a332932.

But none of this explains how the PR testing reported in #2171 passed but resulted in these tests failing. Someone must try to figure that out so that the process and/or tools can be changed to avoid this in the future.

I will bet almost anything that if you checkout the merge commit 22e7f91 locally and then run:

$ ./checkin-test-sems.sh --enable-all-packages=off --enable-packages=MueLu  \
  --no-enable-fwd-packages --local-do-all

that it will show these four tests as failing.

@bartlettroscoe
Copy link
Member Author

I just pushed the commit:

commit ee474ba23f3d36af6ae5461c07093ebbbd1964bc
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Tue Feb 20 14:36:22 2018 -0700

    Revert "Disable newly failing MueLu_UnitTest for standard CI build (#2264)"
    
    This reverts commit a33293251d3742083968b9d08cbff08372cbb753.
    
    Build/Test Cases Summary
    Enabled Packages: MueLu
    Disabled Packages: PyTrilinos,Claps,TriKota
    0) MPI_RELEASE_DEBUG_SHARED_PT => passed: passed=80,notpassed=0 (2.26 min)

M       cmake/std/BasicCiTestingSettings.cmake

which showed these 4 tests passing.

Now we wait to see these four tests showing up again in the CI build and watch to see that all of the other builds are fixed w.r.t. to these failing tests.

@lucbv
Copy link
Contributor

lucbv commented Feb 21, 2018

@bartlettroscoe it seems that my pull request fixed these issues as the dashboard is now clean. I am closing this issue. Thanks for the help

@lucbv lucbv closed this as completed Feb 21, 2018
@bartlettroscoe
Copy link
Member Author

And we see these four tests have been added back to the standard CI build in the CI iteration this morning at:

@bartlettroscoe
Copy link
Member Author

Just to summarize, this failure was caused due to a violation of the additive test assumption of branches due to a 10 day time lag between when the branch was tested and when it was merged. See #2171 (comment) for details. The need to address this in the auto PR system was added to #2312.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Framework tasks Framework tasks (used internally by Framework team) pkg: MueLu
Projects
None yet
Development

No branches or pull requests

2 participants