Modify existing GCC 4.8.4 CI build to match selected auto PR build #2462
Comments
@bartlettroscoe Can we enable OpenMP but force
I guess I can try that, but I wonder if even that would be enough. Also, note that there are ATDM builds of Trilinos that enable experimental MueLu code and that build and run tests with a serial Kokkos node, as shown at:
@csiefer2 would know for sure whether disabling OpenMP is adequate. My guess is no, because some of the sparse matrix-matrix multiply code takes different paths if OpenMP is enabled.
OpenMPNode and SerialNode trigger different code paths in chunks of Tpetra. AFAIK MueLu does not do node type specialization (except for Epetra). What you choose to test for PR doesn't really matter, but they both need to stay working (more or less).
The GCC 4.8.4 PR build will test the OpenMP path and the Intel 17.x build will test the Serial node path. And the ATDM builds of Trilinos are already testing both paths and have been for many weeks now, as you can see at:
@bartlettroscoe Cool, then I'm OK with this :)
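For reference, a minimal configure sketch of the two variants being contrasted here. This is not the actual PR-build script: Trilinos_ENABLE_OpenMP and OMP_NUM_THREADS are the settings named in this issue, while the package selection and the ${TRILINOS_SRC} path are just placeholders.

```
# OpenMP (OpenMP Kokkos node) variant, as targeted for the GCC 4.8.4 PR build:
cmake \
  -D TPL_ENABLE_MPI=ON \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D Trilinos_ENABLE_Tpetra=ON \
  -D Trilinos_ENABLE_MueLu=ON \
  ${TRILINOS_SRC}
export OMP_NUM_THREADS=2   # threads per MPI rank when the tests are run

# Serial Kokkos node variant (the path the Intel 17.x build exercises):
cmake \
  -D TPL_ENABLE_MPI=ON \
  -D Trilinos_ENABLE_OpenMP=OFF \
  -D Trilinos_ENABLE_Tpetra=ON \
  -D Trilinos_ENABLE_MueLu=ON \
  ${TRILINOS_SRC}
```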
This is according to the plan in trilinos#2462.
, trilinos#2462) Since the ATDM APPs enable these, the CI and auto PR builds should as well (see
I submitted PR #2467 to enable Xpetra and MueLu experimental code in the standard CI build. If someone can quickly review that, then I can merge.
This is according to the plan in trilinos#2462.
I tested the full CI build going from OpenMPI 1.6.5 to 1.8.7 in a branch. I will try updating from OpenMPI 1.6.5 to 1.10.1 (which is the only other OpenMPI implementation that SEMS provides) and see how that goes. DETAILED NOTES (3/27/2018): I created the branch and tested it with:
and it returned:
Darn, that is not good. That is a lot of timeouts. Now, I can't tell if these are timeouts because things are taking longer or if these are hangs. Someone would need to research that.
This is according to the plan in trilinos#2462.
@bartlettroscoe I have heard complaints about OpenMPI 1.8.x bugs. The OpenMPI web page considers it "retired" -- in fact, the oldest "not retired" version is 1.10.
@prwolfe Have you seen issues like this with OpenMPI 1.8.x?
I tested the full CI build going from OpenMPI 1.6.5 to 1.10.1 in a branch. I am wondering if there is some problem with the way these tests are using MPI, and whether someone should dig in and try to debug some of these timeouts to see why they are happening. Perhaps there are some real defects in the code that these updated versions of OpenMPI are bringing out? DETAILED NOTES (3/28/2018): I created the branch and tested it with:
and it returned:
Wow, that is even more test timeouts!
Okay, given that OpenMPI 1.10 is the oldest version of OpenMPI that is still supported, we should try to debug what is causing these timeouts. I will submit an experimental build to CDash and then we can go from there.
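A rough sketch of how one might start digging into these timeouts locally. The ctest options themselves are standard; the build-directory path and the test regex are placeholders, not something taken from this issue.

```
cd ${TRILINOS_BUILD_DIR}
# Re-run only the tests that failed or timed out, with a generous per-test
# timeout and full output, to see whether they are just slow or actually hung:
ctest --rerun-failed --timeout 1200 --output-on-failure
# Or zoom in on one suspect suite with reduced parallelism, in case the
# timeouts are a load/oversubscription artifact:
ctest -j 2 -R '^Tempus_' --timeout 1200 --output-on-failure
```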
We had lots of issues with 1.8 - that's why we abandoned it. Basically it was slow and would not properly place processes. In fact, we have had some issues with 1.10, but those responded well to placement directives.
I remember the "let's try 1.8 .... oh that was bad let's not" episode :(
I merged #2467 which enables experimental code in Xpetra and MueLu in the GCC 4.8.4 CI build.
I ran the full Trilinos CI build and test suites with OpenMPI 1.6.5 (the current version used) and OpenMPI 1.10.1 on my machine crf450 and submitted to CDash using an all-at-once configure, build, and test. The machine was loaded by other builds, so I don't totally trust the timing numbers it showed, but it seems that some tests and package test suites run much faster with OpenMPI 1.10.1 and others run much slower with OpenMPI 1.10.1 vs. OpenMPI 1.6.5. Overall the tests took:
You can see some of the detailed numbers on the CDash pages above and in the notes below. I rebooted my machine crf450 and will run these again and see what happens. But if I see numbers similar to this again, I will post a new Trilinos GitHub issue to focus on problems with Trilinos with OpenMPI 1.10.1. DETAILED NOTES (3/28/2018): Doing an experimental submit to CDash so we can see the output from these timing-out tests and then start to try to diagnose why they are failing:
This submitted to: Interestingly, when running the tests package-by-package, there were fewer timeouts (16 total). The only timeouts were in Teko (1) and Tempus (15). (3/29/2018) A) Initial all-at-once configure, build, test, and submit with 2462-openmpi-1.6.5-to-1.10.1: I will do an all-at-once configure, build, test, and submit and see what happens:
This submitted to: This showed 18 timeouts for the packages Tempus (14), MueLu (1), ROL (1), Rythmos (1), and Teko (1). There is a lot of data shown on CDash. B) Baseline all-at-once configure, build, test and submit with 2462-openmpi-1.6.5-to-1.10.1-base: Now, for a basis of comparison, I should compare with the OpenMPI 1.6.5 build. I can do this by creating another branch that is for the exact same version of Trilinos:
Now run the all-at-once configure, build, test, and submit again:
This passed all of the tests and submitted to: And the local ctest -S output showed all passing:
The most expensive tests were:
Now this is a solid basis of comparison for using OpenMPI 1.10.1. C) Follow-up all-at-once configure, build, test, and submit with 2462-openmpi-1.6.5-to-1.10.1: That is not a lot of free memory left. It may be that my machine was swapping to disk when trying to run the tests. I should try running the tests again locally, but this time using fewer processes and a larger timeout, after going back to the branch
This posted results to: The test results shown in the ctest -S output were:
The most expensive tests were:
D) Compare test runtimes: Comparing the most expensive tests shown in make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out vs. the baseline make.dashboard.2462-openmpi-1.6.5-to-1.10.1-base.out, we can clearly see that some tests took much longer with OpenMPI 1.10.1 vs. OpenMPI 1.6.5. Let's compare a few tests:
Note the marked times with OpenMPI 1.10.1 above. I need to run these builds and tests again on an unloaded machine before I believe these numbers. But it does look like there is a big performance problem with OpenMPI 1.10.1 vs. OpenMPI 1.6.5 for some builds and some packages.
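A small sketch of how the per-test times in the two ctest -S logs could be pulled out for a side-by-side comparison. The log file names are the ones mentioned above; the awk field positions assume the standard ctest "Test #N: <name> ... Passed <secs> sec" line layout.

```
for log in make.dashboard.2462-openmpi-1.6.5-to-1.10.1-base.out \
           make.dashboard.2462-openmpi-1.6.5-to-1.10.1.out ; do
  # Keep "<seconds> <test name>" pairs, slowest first:
  grep -E 'Test +#[0-9]+:' "$log" \
    | awk '{ print $(NF-1), $4 }' \
    | sort -rn > "$log.times"
done
# Then compare the two runs, e.g.:  diff -y *.times | less
```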
These options were removed from the EMPIRE configuration of Trilinos in the EM-Plasma/BuildScripts repo as of commit:

commit 285a5a7cad924a4419ede6eccaaefe687f958fa3
Author: Jason M. Gates <jmgate@sandia.gov>
Date:   Thu Mar 29 16:41:22 2018 -0600

    Remove Experimental Flags

    See trilinos#2467.

Therefore, we can hopefully safely assume these are no longer needed to help protect EMPIRE's usage of Trilinos.
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462. NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads in different MPI ranks to the same cores. See trilinos#2422.
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462. NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads in different MPI ranks to the same cores. See trilinos#2422.
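For context, a sketch of how the '--bind-to none' arguments can be wired into the test harness at configure time. MPI_EXEC_PRE_NUMPROCS_FLAGS is the usual TriBITS option for extra mpiexec arguments placed before the -np flag, but treat the exact spelling here as an assumption rather than a quote from the actual build scripts; ${TRILINOS_SRC} is a placeholder.

```
# Make ctest launch each MPI test as
#   mpiexec --bind-to none -np <N> <test-exe> ...
# so OpenMP threads in different ranks are not all pinned to the same cores
# (see trilinos#2422).
cmake \
  -D TPL_ENABLE_MPI=ON \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none" \
  ${TRILINOS_SRC}
```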
I just added an if statement to guard the call to INCLUDE_DIRECTORIES() so that the file can also be included in a ctest -S script. This makes it so that the outer ctest -S driver's enable/disable logic sees the same enable/disable options as the inner CMake configure.
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462. NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads in different MPI ranks to the same cores. See trilinos#2422.
Just supply the build configuration name and that is it.
…evelop

* 'develop' of https://github.com/trilinos/Trilinos: (377 commits)
  CTest -S driver for GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build (trilinos#2462)
  Generic drivers for builds (trilinos#2462)
  Add GCC 4.8.4, OpenMPI 1.10.1 build with OpenMP enabled (trilinos#2562)
  Allow to be included is ctest -S driver script (trilinos#2462)
  MueLu: fix 2664 by using appropriate type for coordinate multi-vector
  Tpetra: Assemble RHS in finite element assemble examples (trilinos#2660) (trilinos#2682)
  Teuchos: add unit test for whitespace after string
  Teuchos: allow whitespace after YAML string
  Provide better error message when compiler is not supported (TRIL-200)
  Belos: Change all default parameters to be constexpr (trilinos#2483)
  MueLu: using Teuchos::as<SC> instead of (SC) to cast parameter list entry
  Reduce srun timeouts on toss3 (TRIL-200)
  Switch from CMake 3.5 to 3.10.1 (TRIL-204, TRIL-200)
  Update toss3 drivers to use split ctest -S driver to run tests (TRIL-200, TRIL-204)
  Split driver for rhel6 (TRIL-204)
  Create Split driver scripts for config & build, then test (TRIL-204)
  Print ATDM_CONFIG_ vars to help debug issues (TRIL-171)
  Factor out create-src-and-build-dir.sh (TRIL-204)
  Fix small typo in print statement (TRIL-200)
  Fix list of system dirs (TRIL-200)
  ...

# Conflicts:
#   packages/shylu/shylu_dd/frosch/src/SchwarzOperators/FROSch_GDSWCoarseOperator_def.hpp
By loading atdm-env, you can load modules from that env like 'atdm-cmake/3.11.1'. And it is harmless to load the module atdm-ninja_fortran/1.7.2, which gives you access to build with Ninja instead of Makefiles.
Ninja is faster at building, so why not use it? And we need some Ninja testing in PR testing. Using CMake 3.11.1 allows for all-at-once submits and faster running of ctest in parallel. It also allows for using Ninja, and TriBITS generates nice dummy makefiles. This removes a hack for CMake 3.11.1 that only worked on my machine crf450. Now this should work on every SNL COE machine that mounts SEMS.
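A quick sketch of what such a local Ninja-based build could look like, using the module names from the two comments above; the source and build directories are placeholders.

```
module load atdm-env
module load atdm-cmake/3.11.1
module load atdm-ninja_fortran/1.7.2

cd ${TRILINOS_BUILD_DIR}
cmake -G Ninja ${TRILINOS_SRC}   # TriBITS also generates dummy makefiles
ninja -j 16                      # build with Ninja
ctest -j 16                      # run the tests in parallel
```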
…eorganizing-coarse-space-construction

* 'develop' of https://github.com/searhein/Trilinos: (405 commits)
  CTest -S driver for GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build (trilinos#2462)
  Generic drivers for builds (trilinos#2462)
  Add GCC 4.8.4, OpenMPI 1.10.1 build with OpenMP enabled (trilinos#2562)
  Allow to be included is ctest -S driver script (trilinos#2462)
  MueLu: fix 2664 by using appropriate type for coordinate multi-vector
  Tpetra: Assemble RHS in finite element assemble examples (trilinos#2660) (trilinos#2682)
  Teuchos: add unit test for whitespace after string
  Teuchos: allow whitespace after YAML string
  Provide better error message when compiler is not supported (TRIL-200)
  Belos: Change all default parameters to be constexpr (trilinos#2483)
  MueLu: using Teuchos::as<SC> instead of (SC) to cast parameter list entry
  Reduce srun timeouts on toss3 (TRIL-200)
  Switch from CMake 3.5 to 3.10.1 (TRIL-204, TRIL-200)
  Update toss3 drivers to use split ctest -S driver to run tests (TRIL-200, TRIL-204)
  Split driver for rhel6 (TRIL-204)
  Create Split driver scripts for config & build, then test (TRIL-204)
  Print ATDM_CONFIG_ vars to help debug issues (TRIL-171)
  Factor out create-src-and-build-dir.sh (TRIL-204)
  Fix small typo in print statement (TRIL-200)
  Fix list of system dirs (TRIL-200)
  ...
These disables will allow this build to be promoted to the CI build and an auto PR build (see trilinos#2462).
These disables will allow this build to be promoted to the CI build and an auto PR build (see #2462).
@trilinos/framework, this build is now ready to be used to replace the existing GCC 4.8.4 auto PR build.
…uild and Ninja (trilinos#2462) This new build also uses the updated OpenMPI 1.10.1 as well as enabling OpenMP. Now the checkin-test-sems.sh script will use Ninja by default with settings in the local-checkin-test-defaults.py file. (But if that file already exists, you will have to make the updates yourself.)
…nja (trilinos#2462) Now that this build is clean, we need to keep it clean.
…p-config Switch default checkin-test-sems.sh build and post-push CI build to use updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP configuration (see #2462).
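A hedged usage sketch for anyone updating their local workflow: --do-all is a standard checkin-test.py option, but the relative path to checkin-test-sems.sh below is an assumption about the local checkout layout.

```
# Run the updated default GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build locally
# before pushing:
cd Trilinos/CHECKIN
../cmake/std/sems/checkin-test-sems.sh --do-all
```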
The post-push CI build linked to from: is now set to the updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build, and it finished the initial build of all 53 packages this morning, passing all 2722 tests. And it ran all of these tests in a wall-clock time of
@trilinos/framework, I think this build should be ready to substitute for the existing GCC 4.8.4 auto PR build. Should we open a new GitHub issue for that? Otherwise, I am putting this in review.
CC: @trilinos/framework, @mhoemmen, @rppawlo, @ibaned, @crtrott
Next Action Status
Post-push CI build and checkin-test-sems.sh script is now updated to use updated GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build. Consideration for using this build in auto PR testing being addressed in #2788.
Description
This Issue is to scope out and track efforts to upgrade the existing SEMS-based Trilinos CI build (see #482 and #1304) to match the selected GCC 4.8.4 auto PR build as described in #2317 (comment). The existing GCC 4.8.4 CI build shown here has been running for 1.5+ years and has been maintained over that time. That build has many but not all of the settings of the selected GCC 4.8.4 auto PR build listed here. The primary changes that need to be made are:
- Enable Xpetra_ENABLE_Experimental=ON and MueLu_ENABLE_Experimental=ON (note objection in Select set of builds for initial mandatory auto PR testing process #2317 (comment)).
- Enable OpenMP (running the tests with OMP_NUM_THREADS=2).

The most difficult change will likely be to enable OpenMP because of the problem of the threads all binding to the same cores as described in #2422. Therefore, the initial auto PR build may not have OpenMP enabled due to these challenges.

Tasks:

- Enable Xpetra_ENABLE_Experimental=ON and MueLu_ENABLE_Experimental=ON in CI build ... Merged in Enable Xpetra and MueLu Experimental in standard CI build (#2317, #2462) #2467 and was later removed in 7481c76 [DONE]
- Update to OpenMPI 1.10.1 (see build GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP in GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2688) [DONE]
- Enable Trilinos_ENABLE_OpenMP=ON and OMP_NUM_THREADS=2 (see build GCC-4.8.4-OpenMPI-1.10.1-MpiReleaseDebugSharedPtOpenMP in GCC 4.8.4 + OpenMPI 1.10.1 + OpenMP build #2688) [DONE]

Related Issues: