
Modify existing GCC 4.8.4 CI build to match selected auto PR build #2462

Closed

bartlettroscoe opened this issue Mar 27, 2018 · 20 comments

Labels: client: ATDM (Any issue primarily impacting the ATDM project), stage: in review (Primary work is completed and now is just waiting for human review and/or test feedback), type: enhancement (Issue is an enhancement, not a bug)

Comments

@bartlettroscoe (Member) commented Mar 27, 2018

CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott

Next Action Status

The post-push CI build and the checkin-test-sems.sh script are now updated to use the updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build. Consideration of using this build in auto PR testing is being addressed in #2788.

Description

This Issue is to scope out and track efforts to upgrade the existing SEMS-based Trilinos CI build (see #482 and #1304) to match the selected GCC 4.8.4 auto PR build as described in #2317 (comment). The existing GCC 4.8.4 CI build shown here has been running for 1.5+ years and has been maintained over that time. That build has many but not all of the settings of the selected GCC 4.8.4 auto PR build listed here. The primary changes that need to be made are:

The most difficult change will likely be to enable OpenMP because of the problem of the threads all binding to the same cores as described in #2422. Therefore, the initial auto PR build may not have OpenMP enabled due to these challenges.
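For reference, the workaround that the later commits on this issue settled on (see the commit notes below and #2422) is to launch the MPI ranks without core binding. A minimal sketch of what that looks like with OpenMPI's mpiexec, where '<test-exe>' is just a placeholder for any MPI test executable:

# Sketch only (see #2422): keep OpenMP threads in different MPI ranks
# from being pinned to the same cores.  '<test-exe>' is a placeholder.
$ export OMP_NUM_THREADS=2
$ mpiexec --bind-to none -np 2 <test-exe>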

Tasks:

  1. Set Xpetra_ENABLE_Experimental=ON and MueLu_ENABLE_Experimental=ON in CI build ... Merged in Enable Xpetra and MueLu Experimental in standard CI build (#2317, #2462) #2467 and was later removed in 7481c76 [DONE]
  2. Switch current CI build from OpenMPI 1.6.5 to 1.10.1 (see build GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP in GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2688) [DONE]
  3. Enable Trilinos_ENABLE_OpenMP=ON and OMP_NUM_THREADS=2 (see build GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP in GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2688) [DONE]
  4. Set up nightly build and clean up tests (see Three ShyLU_DDFROSch_test_frosch_XXX tests failing in new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2691 and Test Teko_testdriver_tpetra_MPI_1 is failing in new GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2712) ... IN PROGRESS ...
  5. Switch auto PR tester to use updated GCC 4.8.4 configuration ...

Related Issues:

@mhoemmen (Contributor)

@bartlettroscoe Can we enable OpenMP but force OMP_NUM_THREADS=1? Some of those Xpetra and MueLu "experimental" build options may not have any effect unless OpenMP is enabled.

@bartlettroscoe (Member, Author)

> Can we enable OpenMP but force OMP_NUM_THREADS=1? Some of those Xpetra and MueLu "experimental" build options may not have any effect unless OpenMP is enabled.

I guess I can try that. But I wonder even with OMP_NUM_THREADS=1 if all of the threads will be bound to the same core or not.
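One quick way to check that (just a sketch; this was not actually run as part of this issue) is OpenMPI's --report-bindings option, which prints the cores each rank gets bound to at launch; '<test-exe>' is a placeholder for any MPI test executable:

# Sketch: show the binding mpiexec applies to each rank.
$ export OMP_NUM_THREADS=1
$ mpiexec --report-bindings -np 2 <test-exe>

# Alternatively, ask each rank which cores the kernel will let it run on:
$ mpiexec -np 2 grep Cpus_allowed_list /proc/self/status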

Also, note that there are ATDM builds of Trilinos that enable experimental MueLu code and that build and run tests with a serial Kokkos node, as shown at:

@mhoemmen (Contributor)

@csiefer2 would know for sure whether disabling OpenMP is adequate. My guess is no, because some of the sparse matrix-matrix multiply code takes different paths if OpenMP is enabled.

@csiefer2 (Member)

OpenMPNode and SerialNode trigger different code paths in chunks of Tpetra. AFAIK MueLu does not do node type specialization (except for Epetra).

What you choose to test for PR doesn't really matter, but they both need to stay working (more or less).

@bartlettroscoe (Member, Author)

> OpenMPNode and SerialNode trigger different code paths in chunks of Tpetra. AFAIK MueLu does not do node type specialization (except for Epetra).
>
> What you choose to test for PR doesn't really matter, but they both need to stay working (more or less).

The GCC 4.8.4 PR build will test the OpenMP path and the Intel 17.x build will test the Serial node path. And the ATDM builds of Trilinos are already testing both paths and have been for many weeks now, as you can see at:

@mhoemmen (Contributor)

@bartlettroscoe Cool, then I'm OK with this :)

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 27, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 28, 2018
, trilinos#2462)

Since the ATDM APPs enable these, so should the CI and auto PR builds (see
@bartlettroscoe (Member, Author)

I submitted PR #2467 to enable Xpetra and MueLu experimental code in the standard CI build. If someone can quickly review that, then I can merge.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 28, 2018
@bartlettroscoe (Member, Author)

I tested the full CI build going from OpenMPI 1.6.5 to 1.8.7 in the branch 2462-openmpi-1.6.5-to-1.8.7 in my fork of Trilinos git@github.com:bartlettroscoe/Trilinos.git and it caused 30 tests to time out (see details below). I can't tell if these are hangs or just that MPI communication is taking longer. Someone would need to research that. In any case, we are a no-go for upgrading from OpenMPI 1.6.5 to 1.8.7.

I will try updating from OpenMPI 1.6.5 to 1.10.1 (which is the only other OpenMPI implementation that SEMS provides) and see how that goes.

DETAILED NOTES (click to expand)

(3/27/2018)

I created the branch 2462-openmpi-1.6.5-to-1.8.7 in my fork of Trilinos. I added the commit d36479f to change from OpenMPI 1.6.5 to 1.8.7.

I tested this with:

$ ./checkin-test-sems.sh --enable-all-packages=on --local-do-all

and it returned:

FAILED: Trilinos/MPI_RELEASE_DEBUG_SHARED_PT: passed=2557,notpassed=30

Tue Mar 27 19:21:00 MDT 2018

Enabled Packages: 
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Packages
Hostname: crf450.srn.sandia.gov
Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT

CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake,cmake/std/sems/SEMSDevEnv.cmake -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
Make Options: -j16
CTest Options: -j16 

Pull: Not Performed
Configure: Passed (3.29 min)
Build: Passed (92.01 min)
Test: FAILED (26.13 min)

99% tests passed, 30 tests failed out of 2587

Label Time Summary:
Amesos               =   6.08 sec (14 tests)
Amesos2              =   3.33 sec (9 tests)
Anasazi              =  34.87 sec (71 tests)
AztecOO              =   5.50 sec (17 tests)
Belos                =  35.43 sec (72 tests)
Domi                 = 138.39 sec (125 tests)
Epetra               =  43.01 sec (61 tests)
EpetraExt            =   4.18 sec (11 tests)
FEI                  =  51.74 sec (43 tests)
Galeri               =   7.70 sec (9 tests)
GlobiPack            =   3.08 sec (6 tests)
Ifpack               =  26.56 sec (53 tests)
Ifpack2              =  54.12 sec (35 tests)
Intrepid             = 1467.90 sec (152 tests)
Intrepid2            = 336.96 sec (144 tests)
Isorropia            =   1.74 sec (6 tests)
Kokkos               = 379.99 sec (23 tests)
KokkosKernels        = 701.31 sec (4 tests)
ML                   =  25.33 sec (34 tests)
MiniTensor           =   0.72 sec (2 tests)
MueLu                = 1217.24 sec (84 tests)
NOX                  = 413.98 sec (106 tests)
OptiPack             =   2.69 sec (5 tests)
Panzer               = 808.72 sec (154 tests)
Phalanx              =  21.44 sec (27 tests)
Pike                 =   4.37 sec (7 tests)
Piro                 =  25.49 sec (12 tests)
ROL                  = 3183.59 sec (153 tests)
RTOp                 =  11.01 sec (24 tests)
Rythmos              = 1083.47 sec (83 tests)
SEACAS               =  50.54 sec (14 tests)
STK                  = 109.42 sec (12 tests)
Sacado               = 122.29 sec (292 tests)
Shards               =   1.77 sec (4 tests)
ShyLU_Node           =   2.29 sec (3 tests)
Stokhos              = 436.12 sec (75 tests)
Stratimikos          = 167.50 sec (40 tests)
Teko                 = 362.38 sec (19 tests)
Tempus               = 7650.88 sec (36 tests)
Teuchos              =  50.89 sec (137 tests)
ThreadPool           =   5.48 sec (10 tests)
Thyra                =  35.63 sec (81 tests)
Tpetra               =  71.71 sec (162 tests)
TrilinosCouplings    = 335.97 sec (24 tests)
Triutils             =   0.38 sec (2 tests)
Xpetra               =  27.11 sec (18 tests)
Zoltan               =  26.70 sec (19 tests)
Zoltan2              =  60.17 sec (101 tests)

Total Test time (real) = 1567.87 sec

The following tests FAILED:
	173 - KokkosKernels_graph_serial_MPI_1 (Timeout)
	1984 - MueLu_UnitTestsTpetra_MPI_1 (Timeout)
	1994 - MueLu_ParameterListInterpreterEpetra_MPI_1 (Timeout)
	1998 - MueLu_ParameterListInterpreterTpetra_MPI_1 (Timeout)
	2099 - Rythmos_BackwardEuler_ConvergenceTest_MPI_1 (Timeout)
	2103 - Rythmos_IntegratorBuilder_ConvergenceTest_MPI_1 (Timeout)
	2129 - Tempus_BackwardEuler_MPI_1 (Timeout)
	2131 - Tempus_BackwardEuler_Staggered_FSA_MPI_1 (Timeout)
	2133 - Tempus_BackwardEuler_ASA_MPI_1 (Timeout)
	2134 - Tempus_BDF2_MPI_1 (Timeout)
	2135 - Tempus_BDF2_Combined_FSA_MPI_1 (Timeout)
	2136 - Tempus_BDF2_Staggered_FSA_MPI_1 (Timeout)
	2138 - Tempus_BDF2_ASA_MPI_1 (Timeout)
	2139 - Tempus_ExplicitRK_MPI_1 (Timeout)
	2140 - Tempus_ExplicitRK_Combined_FSA_MPI_1 (Timeout)
	2141 - Tempus_ExplicitRK_Staggered_FSA_MPI_1 (Timeout)
	2143 - Tempus_ExplicitRK_ASA_MPI_1 (Timeout)
	2145 - Tempus_DIRK_MPI_1 (Timeout)
	2146 - Tempus_DIRK_Combined_FSA_MPI_1 (Timeout)
	2147 - Tempus_DIRK_Staggered_FSA_MPI_1 (Timeout)
	2149 - Tempus_DIRK_ASA_MPI_1 (Timeout)
	2150 - Tempus_HHTAlpha_MPI_1 (Timeout)
	2151 - Tempus_Newmark_MPI_1 (Timeout)
	2154 - Tempus_IMEX_RK_Combined_FSA_MPI_1 (Timeout)
	2155 - Tempus_IMEX_RK_Staggered_FSA_MPI_1 (Timeout)
	2157 - Tempus_IMEX_RK_Partitioned_Combined_FSA_MPI_1 (Timeout)
	2158 - Tempus_IMEX_RK_Partitioned_Staggered_FSA_MPI_1 (Timeout)
	2282 - ROL_test_sol_solSROMGenerator_MPI_1 (Timeout)
	2288 - ROL_test_sol_checkAlmostSureConstraint_MPI_1 (Timeout)
	2320 - ROL_example_burgers-control_example_06_MPI_1 (Timeout)

Errors while running CTest

Total time for MPI_RELEASE_DEBUG_SHARED_PT = 121.44 min

Darn, that is not good. That is a lot of timeouts. Now, I can't tell if these are timeouts because things are taking longer or if these are hangs. Someone would need to research that.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Mar 28, 2018
@mhoemmen (Contributor)

@bartlettroscoe I have heard complaints about OpenMPI 1.8.x bugs. The OpenMPI web page considers it "retired" -- in fact, the oldest "not retired" version is 1.10.

@mhoemmen (Contributor)

@prwolfe Have you seen issues like this with OpenMPI 1.8.x?

@bartlettroscoe (Member, Author)

I tested the full CI build going from OpenMPI 1.6.5 to 1.10.1 in the branch 2462-openmpi-1.6.5-to-1.10.1 in my fork of Trilinos git@github.com:bartlettroscoe/Trilinos.git and it caused 34 test failures, all but one of them timeouts (see details below). I can't tell if these are hangs or just that MPI communication is taking longer to complete (which is hard to believe).

I wonder if there is some problem with the way these tests are using MPI. Someone should dig in and try to debug some of these timeouts to see why they are happening. Perhaps there are some real defects in the code that these updated versions of OpenMPI are bringing out?
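As a starting point for that kind of digging (a sketch only, not work that was done here), any one of the timed-out tests can be rerun by itself from the build directory with verbose output and a larger timeout to see whether it hangs or just runs slowly:

# Sketch: rerun a single timed-out test with full output and a larger
# timeout to distinguish a hang from a slow run.
$ cd CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT/
$ ctest -R '^Tempus_BackwardEuler_MPI_1$' -VV --timeout 1200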

DETAILED NOTES (click to expand)

(3/28/2018)

I created the branch 2462-openmpi-1.6.5-to-1.10.1 in my fork of Trilinos git@github.com:bartlettroscoe/Trilinos.git. I added the commit c9e9097 to change from OpenMPI 1.6.5 to 1.10.1.

I tested this with:

$ ./checkin-test-sems.sh --enable-all-packages=on --local-do-all

and it returned:

FAILED: Trilinos/MPI_RELEASE_DEBUG_SHARED_PT: passed=2552,notpassed=34

Wed Mar 28 09:41:16 MDT 2018

Enabled Packages: 
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Packages
Hostname: crf450.srn.sandia.gov
Source Dir: /home/rabartl/Trilinos.base/Trilinos/cmake/tribits/ci_support/../../..
Build Dir: /home/rabartl/Trilinos.base/BUILDS/CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT

CMake Cache Varibles: -DTrilinos_TRIBITS_DIR:PATH=/home/rabartl/Trilinos.base/Trilinos/cmake/tribits -DTrilinos_ENABLE_TESTS:BOOL=ON -DTrilinos_TEST_CATEGORIES:STRING=BASIC -DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF -DDART_TESTING_TIMEOUT:STRING=300.0 -DBUILD_SHARED_LIBS=ON -DTrilinos_DISABLE_ENABLED_FORWARD_DEP_PACKAGES=ON -DTrilinos_ENABLE_SECONDARY_TESTED_CODE:BOOL=OFF -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/MpiReleaseDebugSharedPtSettings.cmake,cmake/std/BasicCiTestingSettings.cmake,cmake/std/sems/SEMSDevEnv.cmake -DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_PACKAGES:BOOL=ON -DTrilinos_ENABLE_ALL_FORWARD_DEP_PACKAGES:BOOL=ON -DTrilinos_ENABLE_PyTrilinos:BOOL=OFF -DTrilinos_ENABLE_Claps:BOOL=OFF -DTrilinos_ENABLE_TriKota:BOOL=OFF
Make Options: -j16
CTest Options: -j16 

Pull: Not Performed
Configure: Passed (3.04 min)
Build: Passed (99.75 min)
Test: FAILED (26.52 min)

99% tests passed, 34 tests failed out of 2586

Label Time Summary:
Amesos               =   6.20 sec (14 tests)
Amesos2              =   3.17 sec (9 tests)
Anasazi              =  37.92 sec (71 tests)
AztecOO              =   5.27 sec (17 tests)
Belos                =  35.64 sec (72 tests)
Domi                 = 122.76 sec (125 tests)
Epetra               =  42.48 sec (61 tests)
EpetraExt            =   5.19 sec (11 tests)
FEI                  =  49.67 sec (43 tests)
Galeri               =   6.93 sec (9 tests)
GlobiPack            =   3.39 sec (6 tests)
Ifpack               =  27.06 sec (53 tests)
Ifpack2              =  60.88 sec (35 tests)
Intrepid             = 1661.11 sec (152 tests)
Intrepid2            = 360.30 sec (144 tests)
Isorropia            =   1.67 sec (6 tests)
Kokkos               = 491.07 sec (23 tests)
KokkosKernels        = 653.29 sec (4 tests)
ML                   =  21.03 sec (34 tests)
MiniTensor           =   1.04 sec (2 tests)
MueLu                = 1227.70 sec (83 tests)
NOX                  = 415.46 sec (106 tests)
OptiPack             =   2.10 sec (5 tests)
Panzer               = 876.56 sec (154 tests)
Phalanx              =  17.27 sec (27 tests)
Pike                 =   3.69 sec (7 tests)
Piro                 =  20.75 sec (12 tests)
ROL                  = 3246.73 sec (153 tests)
RTOp                 =  11.13 sec (24 tests)
Rythmos              = 992.68 sec (83 tests)
SEACAS               =  56.99 sec (14 tests)
STK                  = 127.17 sec (12 tests)
Sacado               = 115.83 sec (292 tests)
Shards               =   1.84 sec (4 tests)
ShyLU_Node           =   1.30 sec (3 tests)
Stokhos              = 285.61 sec (75 tests)
Stratimikos          = 178.14 sec (40 tests)
Teko                 = 433.87 sec (19 tests)
Tempus               = 7758.58 sec (36 tests)
Teuchos              =  51.55 sec (137 tests)
ThreadPool           =   5.48 sec (10 tests)
Thyra                =  35.87 sec (81 tests)
Tpetra               =  66.82 sec (162 tests)
TrilinosCouplings    = 386.35 sec (24 tests)
Triutils             =   0.41 sec (2 tests)
Xpetra               =  25.48 sec (18 tests)
Zoltan               =  26.04 sec (19 tests)
Zoltan2              =  54.41 sec (101 tests)

Total Test time (real) = 1590.85 sec

The following tests FAILED:
	173 - KokkosKernels_graph_serial_MPI_1 (Timeout)
	1506 - Teko_testdriver_tpetra_MPI_1 (Failed)
	1983 - MueLu_UnitTestsTpetra_MPI_1 (Timeout)
	1993 - MueLu_ParameterListInterpreterEpetra_MPI_1 (Timeout)
	1997 - MueLu_ParameterListInterpreterTpetra_MPI_1 (Timeout)
	2098 - Rythmos_BackwardEuler_ConvergenceTest_MPI_1 (Timeout)
	2102 - Rythmos_IntegratorBuilder_ConvergenceTest_MPI_1 (Timeout)
	2127 - Tempus_ForwardEuler_MPI_1 (Timeout)
	2128 - Tempus_BackwardEuler_MPI_1 (Timeout)
	2129 - Tempus_BackwardEuler_Combined_FSA_MPI_1 (Timeout)
	2130 - Tempus_BackwardEuler_Staggered_FSA_MPI_1 (Timeout)
	2132 - Tempus_BackwardEuler_ASA_MPI_1 (Timeout)
	2133 - Tempus_BDF2_MPI_1 (Timeout)
	2134 - Tempus_BDF2_Combined_FSA_MPI_1 (Timeout)
	2135 - Tempus_BDF2_Staggered_FSA_MPI_1 (Timeout)
	2137 - Tempus_BDF2_ASA_MPI_1 (Timeout)
	2138 - Tempus_ExplicitRK_MPI_1 (Timeout)
	2139 - Tempus_ExplicitRK_Combined_FSA_MPI_1 (Timeout)
	2140 - Tempus_ExplicitRK_Staggered_FSA_MPI_1 (Timeout)
	2142 - Tempus_ExplicitRK_ASA_MPI_1 (Timeout)
	2144 - Tempus_DIRK_MPI_1 (Timeout)
	2145 - Tempus_DIRK_Combined_FSA_MPI_1 (Timeout)
	2146 - Tempus_DIRK_Staggered_FSA_MPI_1 (Timeout)
	2148 - Tempus_DIRK_ASA_MPI_1 (Timeout)
	2149 - Tempus_HHTAlpha_MPI_1 (Timeout)
	2150 - Tempus_Newmark_MPI_1 (Timeout)
	2153 - Tempus_IMEX_RK_Combined_FSA_MPI_1 (Timeout)
	2154 - Tempus_IMEX_RK_Staggered_FSA_MPI_1 (Timeout)
	2156 - Tempus_IMEX_RK_Partitioned_Combined_FSA_MPI_1 (Timeout)
	2157 - Tempus_IMEX_RK_Partitioned_Staggered_FSA_MPI_1 (Timeout)
	2281 - ROL_test_sol_solSROMGenerator_MPI_1 (Timeout)
	2287 - ROL_test_sol_checkAlmostSureConstraint_MPI_1 (Timeout)
	2319 - ROL_example_burgers-control_example_06_MPI_1 (Timeout)
	2327 - ROL_example_parabolic-control_example_03_MPI_1 (Timeout)

Errors while running CTest

Total time for MPI_RELEASE_DEBUG_SHARED_PT = 129.32 min

Wow, that is even more test timeouts!

@bartlettroscoe (Member, Author)

> I have heard complaints about OpenMPI 1.8.x bugs. The OpenMPI web page considers it "retired" -- in fact, the oldest "not retired" version is 1.10.

Okay, given that OpenMPI 1.10 is the oldest OpenMPI version that is still supported, we should try to debug what is causing these timeouts. I will submit an experimental build to CDash and then we can go from there.

@prwolfe (Contributor) commented Mar 28, 2018

We had lots of issues with 1.8 -- that's why we abandoned it. Basically, it was slow and would not properly place processes. In fact, we have had some issues with 1.10, but those responded well to placement directives.

@mhoemmen (Contributor)

I remember the "let's try 1.8 .... oh that was bad let's not" episode :(

bartlettroscoe added a commit that referenced this issue Mar 28, 2018
…-experimental

Enable Xpetra and MueLu Experimental in standard CI build (#2317, #2462)
@bartlettroscoe (Member, Author)

I merged #2467 which enables experimental code in Xpetra and MueLu in the GCC 4.8.4 CI build.

@bartlettroscoe (Member, Author)

I ran the full Trilinos CI build and test suites with OpenMPI 1.6.5 (the current version used) and OpenMPI 1.10.1 on my machine crf450 and submitted to CDash using an all-at-once configure, build, and test:

The machine was loaded by other builds, so I don't totally trust the timing numbers it showed, but it seems that some tests and package test suites run much faster with OpenMPI 1.10.1 and others run much slower with OpenMPI 1.10.1 vs. OpenMPI 1.6.5. Overall the tests took:

  • OpenMPI 1.6.5: 53m5s run with ctest -j4
  • OpenMPI 1.10.1: 1h8m32s with ctest -j4

You can see some of the detailed numbers on the CDash pages above and in the below notes.

I rebooted my machine crf450 and I will run these again and see what happens. But if I see numbers similar to these again, I will post a new Trilinos GitHub issue to focus on problems with Trilinos when using OpenMPI 1.10.1.

DETAILED NOTES (click to expand)

(3/28/2018)

Doing an experimental submit to CDash so we can see the output from these timing out tests and then start to try to diagnose why they are failing:

$ cd MPI_RELEASE_DEBUG_SHARED_PT/

$ rm -r CMake*

$ source ~/Trilinos.base/Trilinos/cmake/load_sems_dev_env.sh

$ ./do-configure -DCTEST_BUILD_FLAGS=-j16 -DCTEST_PARALLEL_LEVEL=16

$ time make dashboard &> make.dashboard.out

real    343m56.832s
user    1232m32.619s
sys     228m59.516s

This submitted to:

Interestingly, when running the tests package-by-package, there were fewer timeouts (16 total). The only timeouts were in Teko (1) and Tempus (15).

(3/29/2018)

A) Initial all-at-once configure, build, test and submit with 2462-openmpi-1.6.5-to-1.10.1:

I will then do an all-at-once configure, build, test, and submit and see what happens:

$ cd CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT/

$ rm -r CMake*

$ source ~/Trilinos.base/Trilinos/cmake/load_sems_dev_env.sh

$ export PATH=/home/vera_env/common_tools/cmake-3.10.1/bin:$PATH

$ which cmake
/home/vera_env/common_tools/cmake-3.10.1/bin/cmake

$ which mpirun
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.10.1/bin/mpirun

$ ./do-configure -DCTEST_BUILD_FLAGS=-j16 -DCTEST_PARALLEL_LEVEL=4 \
   -DTrilinos_CTEST_DO_ALL_AT_ONCE=TRUE -DTrilinos_CTEST_USE_NEW_AAO_FEATURES=ON

$ time make dashboard &> make.dashboard.out

real    159m59.918s
user    1657m20.003s
sys     132m37.954s

This submitted to:

This showed 18 timeouts for the packages Tempus (14), MueLu (1), ROL (1), Rythmos (1), and Teko (1).

There is a lot of data shown on CDash.

B) Baseline all-at-once configure, build, test and submit with 2462-openmpi-1.6.5-to-1.10.1-base:

Now, for a basis of comparison, I should compare with the OpenMPI 1.6.5 build. I can do this by creating another branch that is for the exact same version of Trilinos:

$ cd Trilinos/
$ git checkout -b 2462-openmpi-1.6.5-to-1.10.1-base 65c7ac6
$ git push -u rab-github 2462-openmpi-1.6.5-to-1.10.1-base

$ git log-short -1 --name-status

65c7ac6 "Merge branch 'develop' of github.com:trilinos/Trilinos into develop"
Author: Chris Siefert <csiefer@sandia.gov>
Date:   Tue Mar 27 16:24:26 2018 -0600 (2 days ago)

Now run the all-at-once configure, build, test, and submit again:

$ cd CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT/

$ source ~/Trilinos.base/Trilinos/cmake/load_sems_dev_env.sh

$ export PATH=/home/vera_env/common_tools/cmake-3.10.1/bin:$PATH

$ which cmake
/home/vera_env/common_tools/cmake-3.10.1/bin/cmake

$ which mpirun
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.6.5/bin/mpirun

$ rm -r CMake*

$ time ./do-configure -DCTEST_BUILD_FLAGS=-j16 -DCTEST_PARALLEL_LEVEL=4 \
   -DTrilinos_CTEST_DO_ALL_AT_ONCE=TRUE -DTrilinos_CTEST_USE_NEW_AAO_FEATURES=ON \
   -DDART_TESTING_TIMEOUT=1200 \
   &> configure.2462-openmpi-1.6.5-to-1.10.1-base.out

real    2m43.743s
user    1m35.215s
sys     0m36.769s

$ time make dashboard &> make.dashboard.2462-openmpi-1.6.5-to-1.10.1-base.out

real    153m14.541s
user    1335m57.220s
sys     107m48.393s

This passed all of the tests and submitted to:

And the local ctest -S output showed all passing:

100% tests passed, 0 tests failed out of 2586

Subproject Time Summary:
Amesos               =  71.86 sec*proc (14 tests)
Amesos2              =  36.33 sec*proc (9 tests)
Anasazi              = 383.07 sec*proc (71 tests)
AztecOO              =  57.40 sec*proc (17 tests)
Belos                = 374.60 sec*proc (72 tests)
Domi                 = 417.55 sec*proc (125 tests)
Epetra               = 120.94 sec*proc (61 tests)
EpetraExt            =  53.28 sec*proc (11 tests)
FEI                  =  94.68 sec*proc (43 tests)
Galeri               =  10.34 sec*proc (9 tests)
GlobiPack            =   0.53 sec*proc (6 tests)
Ifpack               = 201.11 sec*proc (48 tests)
Ifpack2              = 102.01 sec*proc (35 tests)
Intrepid             = 136.61 sec*proc (152 tests)
Intrepid2            =  36.17 sec*proc (144 tests)
Isorropia            =  27.71 sec*proc (6 tests)
Kokkos               =  39.54 sec*proc (23 tests)
KokkosKernels        =  70.40 sec*proc (4 tests)
ML                   = 158.93 sec*proc (34 tests)
MiniTensor           =   0.12 sec*proc (2 tests)
MueLu                = 773.40 sec*proc (80 tests)
NOX                  = 364.33 sec*proc (106 tests)
OptiPack             =  21.45 sec*proc (5 tests)
Panzer               = 1855.67 sec*proc (154 tests)
Phalanx              =   3.23 sec*proc (27 tests)
Pike                 =   2.74 sec*proc (7 tests)
Piro                 =  90.63 sec*proc (12 tests)
ROL                  = 1916.41 sec*proc (153 tests)
RTOp                 =  27.06 sec*proc (24 tests)
Rythmos              = 154.85 sec*proc (83 tests)
SEACAS               =   7.05 sec*proc (14 tests)
STK                  =  46.83 sec*proc (12 tests)
Sacado               =  44.83 sec*proc (292 tests)
Shards               =   0.35 sec*proc (4 tests)
ShyLU_Node           =   0.18 sec*proc (3 tests)
Stokhos              = 134.80 sec*proc (75 tests)
Stratimikos          =  51.75 sec*proc (40 tests)
Teko                 = 196.13 sec*proc (19 tests)
Tempus               = 2215.52 sec*proc (36 tests)
Teuchos              = 104.16 sec*proc (137 tests)
ThreadPool           =  19.33 sec*proc (10 tests)
Thyra                = 171.53 sec*proc (81 tests)
Tpetra               = 526.10 sec*proc (162 tests)
TrilinosCouplings    =  67.80 sec*proc (24 tests)
Triutils             =   8.64 sec*proc (2 tests)
Xpetra               = 157.05 sec*proc (18 tests)
Zoltan               = 813.39 sec*proc (19 tests)
Zoltan2              = 479.54 sec*proc (101 tests)

Total Test time (real) = 3190.31 sec

The most expensive tests were:

$ grep " Test " make.dashboard.2462-openmpi-1.6.5-to-1.10.1-base.out | grep "sec$" | sort -nr -k 7 | head -n 30
  6/2586 Test #2140: Tempus_ExplicitRK_Staggered_FSA_MPI_1 .........................................................   Passed  215.29 sec
  5/2586 Test #2157: Tempus_IMEX_RK_Partitioned_Staggered_FSA_MPI_1 ................................................   Passed  214.29 sec
  4/2586 Test #2145: Tempus_DIRK_Combined_FSA_MPI_1 ................................................................   Passed  188.22 sec
  3/2586 Test #2146: Tempus_DIRK_Staggered_FSA_MPI_1 ...............................................................   Passed  161.72 sec
  9/2586 Test #2156: Tempus_IMEX_RK_Partitioned_Combined_FSA_MPI_1 .................................................   Passed  160.28 sec
  7/2586 Test #2139: Tempus_ExplicitRK_Combined_FSA_MPI_1 ..........................................................   Passed  146.90 sec
 10/2586 Test #2148: Tempus_DIRK_ASA_MPI_1 .........................................................................   Passed  125.28 sec
  8/2586 Test #2149: Tempus_HHTAlpha_MPI_1 .........................................................................   Passed  124.29 sec
 15/2586 Test #2142: Tempus_ExplicitRK_ASA_MPI_1 ...................................................................   Passed  117.84 sec
 19/2586 Test #2154: Tempus_IMEX_RK_Staggered_FSA_MPI_1 ............................................................   Passed  102.77 sec
 17/2586 Test #2153: Tempus_IMEX_RK_Combined_FSA_MPI_1 .............................................................   Passed   97.35 sec
149/2586 Test #564: Zoltan_hg_simple_zoltan_parallel ..............................................................   Passed   89.07 sec
 36/2586 Test #2150: Tempus_Newmark_MPI_1 ..........................................................................   Passed   76.28 sec
147/2586 Test #2368: ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4 .......................................   Passed   70.90 sec
151/2586 Test #558: Zoltan_ch_simple_zoltan_parallel ..............................................................   Passed   67.63 sec
148/2586 Test #2533: PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3 ....................................   Passed   65.29 sec
146/2586 Test #2529: PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4 ..................................   Passed   64.84 sec
 45/2586 Test #2133: Tempus_BDF2_MPI_1 .............................................................................   Passed   55.00 sec
 27/2586 Test #2128: Tempus_BackwardEuler_MPI_1 ....................................................................   Passed   51.35 sec
 40/2586 Test #2281: ROL_test_sol_solSROMGenerator_MPI_1 ...........................................................   Passed   49.37 sec
 14/2586 Test #2134: Tempus_BDF2_Combined_FSA_MPI_1 ................................................................   Passed   45.76 sec
 46/2586 Test #2102: Rythmos_IntegratorBuilder_ConvergenceTest_MPI_1 ...............................................   Passed   45.22 sec
150/2586 Test #2363: ROL_example_PDE-OPT_0ld_poisson_example_01_MPI_4 ..............................................   Passed   43.39 sec
 12/2586 Test #2130: Tempus_BackwardEuler_Staggered_FSA_MPI_1 ......................................................   Passed   40.56 sec
 11/2586 Test #2129: Tempus_BackwardEuler_Combined_FSA_MPI_1 .......................................................   Passed   40.44 sec
 13/2586 Test #2135: Tempus_BDF2_Staggered_FSA_MPI_1 ...............................................................   Passed   39.86 sec
 53/2586 Test #1997: MueLu_ParameterListInterpreterTpetra_MPI_1 ....................................................   Passed   37.92 sec
 20/2586 Test #173: KokkosKernels_graph_serial_MPI_1 ..............................................................   Passed   33.98 sec
153/2586 Test #2382: ROL_example_PDE-OPT_topo-opt_elasticity_example_01_MPI_4 ......................................   Passed   33.44 sec
152/2586 Test #2246: ROL_adapters_tpetra_test_sol_TpetraSimulatedConstraintInterfaceCVaR_MPI_4 .....................   Passed   33.05 sec

Now this is a solid basis of comparison for using OpenMPI 1.10.1.

C) Follow-up all-at-once configure, build, test and submit with 2462-openmpi-1.6.5-to-1.10.1:

That is not a lot of free memory left. It may have been that my machine was swapping to disk when trying to run the tests. I should try running the tests again locally, but this time using fewer processes and a larger timeout, after going back to the branch 2462-openmpi-1.6.5-to-1.10.1:

$ cd CHECKIN/MPI_RELEASE_DEBUG_SHARED_PT/

$ source ~/Trilinos.base/Trilinos/cmake/load_sems_dev_env.sh

$ export PATH=/home/vera_env/common_tools/cmake-3.10.1/bin:$PATH

$ which cmake
/home/vera_env/common_tools/cmake-3.10.1/bin/cmake

$ which mpirun
/projects/sems/install/rhel6-x86_64/sems/compiler/gcc/4.8.4/openmpi/1.10.1/bin/mpirun

$ rm -r CMake*

$ time ./do-configure -DCTEST_BUILD_FLAGS=-j16 -DCTEST_PARALLEL_LEVEL=16 \
  -DTrilinos_CTEST_DO_ALL_AT_ONCE=TRUE -DTrilinos_CTEST_USE_NEW_AAO_FEATURES=ON \
  &> configure.2462-openmpi-1.6.5-to-1.10.1.out

real    2m47.845s
user    1m38.024s
sys     0m38.445s

$ time make dashboard &> make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out

real    182m12.103s
user    1554m1.530s
sys     123m35.152s

This posted results to:

The test results shown in the ctest -S output were:

99% tests passed, 5 tests failed out of 2586

Subproject Time Summary:
Amesos               =  18.34 sec*proc (14 tests)
Amesos2              =   8.02 sec*proc (9 tests)
Anasazi              = 110.75 sec*proc (71 tests)
AztecOO              =   7.67 sec*proc (17 tests)
Belos                = 105.01 sec*proc (72 tests)
Domi                 = 162.23 sec*proc (125 tests)
Epetra               =  32.93 sec*proc (61 tests)
EpetraExt            =  14.38 sec*proc (11 tests)
FEI                  =  38.18 sec*proc (43 tests)
Galeri               =   3.93 sec*proc (9 tests)
GlobiPack            =   1.12 sec*proc (6 tests)
Ifpack               =  52.86 sec*proc (48 tests)
Ifpack2              =  57.62 sec*proc (35 tests)
Intrepid             = 465.90 sec*proc (152 tests)
Intrepid2            = 110.21 sec*proc (144 tests)
Isorropia            =   4.74 sec*proc (6 tests)
Kokkos               = 119.54 sec*proc (23 tests)
KokkosKernels        = 219.11 sec*proc (4 tests)
ML                   =  44.09 sec*proc (34 tests)
MiniTensor           =   0.53 sec*proc (2 tests)
MueLu                = 878.95 sec*proc (80 tests)
NOX                  = 252.04 sec*proc (106 tests)
OptiPack             =   6.39 sec*proc (5 tests)
Panzer               = 1802.58 sec*proc (154 tests)
Phalanx              =   8.27 sec*proc (27 tests)
Pike                 =   1.42 sec*proc (7 tests)
Piro                 =  60.27 sec*proc (12 tests)
ROL                  = 2447.89 sec*proc (153 tests)
RTOp                 =   5.18 sec*proc (24 tests)
Rythmos              = 444.69 sec*proc (83 tests)
SEACAS               =  15.40 sec*proc (14 tests)
STK                  =  53.82 sec*proc (12 tests)
Sacado               =  43.39 sec*proc (292 tests)
Shards               =   0.66 sec*proc (4 tests)
ShyLU_Node           =   0.61 sec*proc (3 tests)
Stokhos              = 174.51 sec*proc (75 tests)
Stratimikos          =  67.55 sec*proc (40 tests)
Teko                 = 247.80 sec*proc (19 tests)
Tempus               = 7564.71 sec*proc (36 tests)
Teuchos              =  32.62 sec*proc (137 tests)
ThreadPool           =   3.50 sec*proc (10 tests)
Thyra                =  37.33 sec*proc (81 tests)
Tpetra               = 125.66 sec*proc (162 tests)
TrilinosCouplings    = 140.58 sec*proc (24 tests)
Triutils             =   1.04 sec*proc (2 tests)
Xpetra               =  92.05 sec*proc (18 tests)
Zoltan               =  75.40 sec*proc (19 tests)
Zoltan2              = 152.14 sec*proc (101 tests)

Total Test time (real) = 4116.63 sec

The following tests FAILED:
        1506 - Teko_testdriver_tpetra_MPI_1 (Failed)
        2140 - Tempus_ExplicitRK_Staggered_FSA_MPI_1 (Timeout)
        2145 - Tempus_DIRK_Combined_FSA_MPI_1 (Timeout)
        2146 - Tempus_DIRK_Staggered_FSA_MPI_1 (Timeout)
        2157 - Tempus_IMEX_RK_Partitioned_Staggered_FSA_MPI_1 (Timeout)

The most expensive tests were:

$ grep " Test " make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out | grep "Timeout"

  2/2586 Test #2157: Tempus_IMEX_RK_Partitioned_Staggered_FSA_MPI_1 ................................................***Timeout 600.01 sec
  3/2586 Test #2140: Tempus_ExplicitRK_Staggered_FSA_MPI_1 .........................................................***Timeout 600.05 sec
  4/2586 Test #2145: Tempus_DIRK_Combined_FSA_MPI_1 ................................................................***Timeout 600.26 sec
  5/2586 Test #2146: Tempus_DIRK_Staggered_FSA_MPI_1 ...............................................................***Timeout 600.14 sec

$ grep " Test " make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out | grep "sec$" | sort -nr -k 7 | head -n 30

  9/2586 Test #2156: Tempus_IMEX_RK_Partitioned_Combined_FSA_MPI_1 .................................................   Passed  579.68 sec
  8/2586 Test #2139: Tempus_ExplicitRK_Combined_FSA_MPI_1 ..........................................................   Passed  567.91 sec
  7/2586 Test #2148: Tempus_DIRK_ASA_MPI_1 .........................................................................   Passed  465.64 sec
  6/2586 Test #2149: Tempus_HHTAlpha_MPI_1 .........................................................................   Passed  438.92 sec
 13/2586 Test #2142: Tempus_ExplicitRK_ASA_MPI_1 ...................................................................   Passed  435.95 sec
 15/2586 Test #2154: Tempus_IMEX_RK_Staggered_FSA_MPI_1 ............................................................   Passed  368.65 sec
 19/2586 Test #2153: Tempus_IMEX_RK_Combined_FSA_MPI_1 .............................................................   Passed  353.16 sec
 33/2586 Test #2150: Tempus_Newmark_MPI_1 ..........................................................................   Passed  255.15 sec
 40/2586 Test #2281: ROL_test_sol_solSROMGenerator_MPI_1 ...........................................................   Passed  199.18 sec
 43/2586 Test #2133: Tempus_BDF2_MPI_1 .............................................................................   Passed  193.66 sec
 26/2586 Test #2128: Tempus_BackwardEuler_MPI_1 ....................................................................   Passed  171.63 sec
 14/2586 Test #2134: Tempus_BDF2_Combined_FSA_MPI_1 ................................................................   Passed  167.25 sec
 51/2586 Test #1997: MueLu_ParameterListInterpreterTpetra_MPI_1 ....................................................   Passed  157.52 sec
 42/2586 Test #2102: Rythmos_IntegratorBuilder_ConvergenceTest_MPI_1 ...............................................   Passed  156.30 sec
 10/2586 Test #2129: Tempus_BackwardEuler_Combined_FSA_MPI_1 .......................................................   Passed  150.13 sec
 11/2586 Test #2130: Tempus_BackwardEuler_Staggered_FSA_MPI_1 ......................................................   Passed  147.52 sec
 12/2586 Test #2135: Tempus_BDF2_Staggered_FSA_MPI_1 ...............................................................   Passed  146.82 sec
 24/2586 Test #1993: MueLu_ParameterListInterpreterEpetra_MPI_1 ....................................................   Passed  125.19 sec
 37/2586 Test #1983: MueLu_UnitTestsTpetra_MPI_1 ...................................................................   Passed  120.19 sec
 29/2586 Test #2287: ROL_test_sol_checkAlmostSureConstraint_MPI_1 ..................................................   Passed  115.47 sec
 16/2586 Test #2137: Tempus_BDF2_ASA_MPI_1 .........................................................................   Passed  113.17 sec
 23/2586 Test #2319: ROL_example_burgers-control_example_06_MPI_1 ..................................................   Passed  108.33 sec
 21/2586 Test #2144: Tempus_DIRK_MPI_1 .............................................................................   Passed  107.34 sec
 28/2586 Test #2098: Rythmos_BackwardEuler_ConvergenceTest_MPI_1 ...................................................   Passed  106.76 sec
 17/2586 Test #173: KokkosKernels_graph_serial_MPI_1 ..............................................................   Passed  106.58 sec
186/2586 Test #2529: PanzerAdaptersSTK_CurlLaplacianExample-ConvTest-Quad-Order-4 ..................................   Passed  102.36 sec
 22/2586 Test #2132: Tempus_BackwardEuler_ASA_MPI_1 ................................................................   Passed   93.00 sec
 25/2586 Test #2138: Tempus_ExplicitRK_MPI_1 .......................................................................   Passed   90.59 sec
 34/2586 Test #2327: ROL_example_parabolic-control_example_03_MPI_1 ................................................   Passed   82.84 sec
188/2586 Test #2533: PanzerAdaptersSTK_MixedPoissonExample-ConvTest-Hex-Order-3 ....................................   Passed   82.81 sec

D) Compare test runtimes:

Comparing the most expensive tests shown in make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out vs. the baseline make.dashboard.2462-openmpi-1.6.5-to-1.10.1-base.out, we can clearly see that some tests took much longer with OpenMPI 1.10.1 vs. OpenMPI 1.6.5.

Let's compare a few tests:

| Test Name | OpenMPI 1.6.5 | OpenMPI 1.10.1 |
| --- | --- | --- |
| Tempus_ExplicitRK_Staggered_FSA_MPI_1 | 215.29 sec | Timeout 600.05 sec |
| Tempus_IMEX_RK_Partitioned_Staggered_FSA_MPI_1 | 214.29 sec | Timeout 600.05 sec |
| Tempus_DIRK_Combined_FSA_MPI_1 | 188.22 sec | Timeout 600.26 sec |
| Tempus_IMEX_RK_Partitioned_Combined_FSA_MPI_1 | 160.28 sec | 579.68 sec |
| Tempus_DIRK_ASA_MPI_1 | 125.28 sec | 465.64 sec |
| Tempus_HHTAlpha_MPI_1 | 124.29 sec | 438.92 sec |
| Tempus_ExplicitRK_ASA_MPI_1 | 117.84 sec | 435.95 sec |
| Tempus_IMEX_RK_Staggered_FSA_MPI_1 | 102.77 sec | 368.65 sec |
| Tempus_IMEX_RK_Combined_FSA_MPI_1 | 97.35 sec | 353.16 sec |
| Zoltan_hg_simple_zoltan_parallel | 89.07 sec | 8.07 sec* |
| Tempus_Newmark_MPI_1 | 76.28 sec | 255.15 sec |
| ROL_example_PDE-OPT_0ld_adv-diff-react_example_02_MPI_4 | 70.90 sec | 65.09 sec* |

Note: the OpenMPI 1.10.1 times marked with * were not shown in the list of the 30 most expensive tests for that case. Instead, I had to get the values out of the file make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out on the machine.
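Pulling these numbers out of the two log files by hand is tedious. A rough sketch of automating the comparison, building only on the grep output format already shown above (the awk/join pipeline and the times-*.txt file names are mine, not something used in this work):

# Sketch: extract "test-name time" pairs for the tests that passed in
# each log and join them by test name for a side-by-side comparison.
# (Timed-out tests do not show up here; they are listed by the
# 'grep "Timeout"' command shown above.)
$ grep " Test " make.dashboard.2462-openmpi-1.6.5-to-1.10.1-base.out \
    | grep "Passed" | awk '{print $4, $7}' | sort > times-1.6.5.txt
$ grep " Test " make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out \
    | grep "Passed" | awk '{print $4, $7}' | sort > times-1.10.1.txt
$ join times-1.6.5.txt times-1.10.1.txt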

I need to run these builds and tests again on an unloaded machine before I believe these numbers. But it does look like there is a big performance problem with OpenMPI 1.10.1 vs. OpenMPI 1.6.5 for some builds and some packages.

@bartlettroscoe bartlettroscoe added the type: enhancement Issue is an enhancement, not a bug label Apr 3, 2018
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 20, 2018
These options were removed from the EMPIRE configuration of Trilinos in the
EM-Plamsa/BulidScripts repo as of commit:

  commit 285a5a7cad924a4419ede6eccaaefe687f958fa3
  Author: Jason M. Gates <jmgate@sandia.gov>
  Date:   Thu Mar 29 16:41:22 2018 -0600

      Remove Experimental Flags

      See trilinos#2467.

Therefore, we can hopefully safely assume these are not needed to help protect
EMPIRE's usage of Trilinos anymore.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 5, 2018
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in differnet MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 7, 2018
I just added an if statement to guard calling INCLUDE_DIRECTORIES() to allow
including this file in a ctest -S script.  This makes it so that the same enable/disable
options are seen in the outer ctest -S driver enable/disable logic as the
inner CMake configure.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 7, 2018
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in differnet MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 7, 2018
Just supply the build configuration name and that is it.
searhein pushed a commit to searhein/Trilinos that referenced this issue May 8, 2018
…evelop

* 'develop' of https://github.com/trilinos/Trilinos: (377 commits)
  CTest -S driver for GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build (trilinos#2462)
  Generic drivers for builds (trilinos#2462)
  Add GCC 4.8.4, OpenMPI 1.10.1 build with OpenMP enabled (trilinos#2562)
  Allow to be included in ctest -S driver script (trilinos#2462)
  MueLu: fix 2664 by using appropriate type for coordinate multi-vector
  Tpetra: Assemble RHS in finite element assemble examples (trilinos#2660) (trilinos#2682)
  Teuchos: add unit test for whitespace after string
  Teuchos: allow whitespace after YAML string
  Provide better error message when compiler is not supported (TRIL-200)
  Belos: Change all default parameters to be constexpr (trilinos#2483)
  MueLu: using Teuchos::as<SC> instead of (SC) to cast parameter list entry
  Reduce srun timeouts on toss3 (TRIL-200)
  Switch from CMake 3.5 to 3.10.1 (TRIL-204, TRIL-200)
  Update toss3 drivers to use split ctest -S driver to run tests (TRIL-200, TRIL-204)
  Split driver for rhel6 (TRIL-204)
  Create Split driver scripts for config & build, then test (TRIL-204)
  Print ATDM_CONFIG_ vars to help debug issues (TRIL-171)
  Factor out create-src-and-build-dir.sh (TRIL-204)
  Fix small typo in print statement (TRIL-200)
  Fix list of system dirs (TRIL-200)
  ...

# Conflicts:
#	packages/shylu/shylu_dd/frosch/src/SchwarzOperators/FROSch_GDSWCoarseOperator_def.hpp
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 8, 2018
By loading atdm-env, you can load modules from that env like
'atdm-cmake/3.11.1'.  And it is harmless to load the module
atdm-ninja_fortran/1.7.2 and it gives you access to build with Ninja instead
of Makefiles.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 8, 2018
Ninja is faster at building, so why not use it?  And we need some Ninja testing
in PR testing.

Using CMake 3.11.1 allows for all-at-once submits and faster running of ctest
in parallel.  And it allows for using Ninja and TriBITS generates nice dummy
makefiles.

This removes a hack for CMake 3.11.1 that only worked for my machine crf450.
Now this should work on every SNL COE machine that mounts SEMS.
bartlettroscoe added a commit that referenced this issue May 8, 2018
….1-and-ninja

Use atdm-cmake/3.11.1 module and Ninja for GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build.  This should be the build that satisfies the GCC auto PR build in #2317 and #2462.
searhein pushed a commit to searhein/Trilinos that referenced this issue May 16, 2018
…eorganizing-coarse-space-construction

* 'develop' of https://github.com/searhein/Trilinos: (405 commits)
  CTest -S driver for GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build (trilinos#2462)
  Generic drivers for builds (trilinos#2462)
  Add GCC 4.8.4, OpenMPI 1.10.1 build with OpenMP enabled (trilinos#2562)
  Allow to be included in ctest -S driver script (trilinos#2462)
  MueLu: fix 2664 by using appropriate type for coordinate multi-vector
  Tpetra: Assemble RHS in finite element assemble examples (trilinos#2660) (trilinos#2682)
  Teuchos: add unit test for whitespace after string
  Teuchos: allow whitespace after YAML string
  Provide better error message when compiler is not supported (TRIL-200)
  Belos: Change all default parameters to be constexpr (trilinos#2483)
  MueLu: using Teuchos::as<SC> instead of (SC) to cast parameter list entry
  Reduce srun timeouts on toss3 (TRIL-200)
  Switch from CMake 3.5 to 3.10.1 (TRIL-204, TRIL-200)
  Update toss3 drivers to use split ctest -S driver to run tests (TRIL-200, TRIL-204)
  Split driver for rhel6 (TRIL-204)
  Create Split driver scripts for config & build, then test (TRIL-204)
  Print ATDM_CONFIG_ vars to help debug issues (TRIL-171)
  Factor out create-src-and-build-dir.sh (TRIL-204)
  Fix small typo in print statement (TRIL-200)
  Fix list of system dirs (TRIL-200)
  ...
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 30, 2018
These disables will allow this build to be promoted to the CI build and an
auto PR build (see trilinos#2462).
mhoemmen pushed a commit that referenced this issue May 30, 2018
These disables will allow this build to be promoted to the CI build and an
auto PR build (see #2462).
@bartlettroscoe (Member, Author)

Now that the GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP build is 100% clean as described in #2691 (comment), I will change this over to be the new CI build and the default build for the checkin-test-sems.sh script.

@trilinos/framework,

This build is now ready to be used to replace the existing GCC 4.8.4 auto PR build. The build GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP completely matches the agreed-to GCC 4.8.4 build described in #2317.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 30, 2018
…uild and Ninja (trilinos#2462)

This new build also uses updated OpenMPI 1.10.1 as well as enabling OpenMP.

Now the checkin-test-sems.sh script will use Ninja by default with settings in
the local-checkin-test-defaults.py file.  (But if that file already exists,
you will have to make the updates yourself).
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 30, 2018
…nja (trilinos#2462)

Now that this build is clean, we need to keep it clean.
bartlettroscoe added a commit that referenced this issue May 31, 2018
…p-config

Switch default checkin-test-sems.sh build and post-push CI build to use updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP configuration (see #2462).
@bartlettroscoe (Member, Author)

The post-push CI build linked to from:

is now set to the updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build and it finished the initial build this morning with all 53 packages passing all 2722 tests. It ran all of these tests in a wall-clock time of 24m 56s (on 8 cores).

@trilinos/framework, I think this build should be ready to substitute for the existing GCC 4.8.4 auto PR build. Should we open a new GitHub issue for that?

Otherwise, I am putting this in review.

@bartlettroscoe bartlettroscoe added the stage: in review Primary work is completed and now is just waiting for human review and/or test feedback label Jun 1, 2018
@bartlettroscoe (Member, Author)

Given that issue #2788 exists for using this configuration in the GCC 4.8.4 auto PR build, I am closing this issue #2462 since there is nothing left to do. This updated configuration is being used in the post-push CI build, so we will get an email if there are any failures going forward.
