Set up robust portable pre-push and post-push CI tools and process based on the SEMS Dev Env #482
Comments
This is already in progress. I have already created drafts for the |
Now that I have been added to the necessary metagroups, I can see the setup for the Jenkins build configurations. Looking at the Trilinos CI Jenkins build at: I can see that it basically just starts a build at 1 am MDT and then polls the main Trilinos GitHub 'develop' branch every 10 minutes. The problem with that is that it will not pick up changes in the extra repos. But I can see the exact script that runs the build:
All of this should be put into a version-controlled script and just that script should be run from Jenkins. Also, that script should source "As a best practice, try not to put a long shell script in here. Instead, consider adding the shell script in SCM and simply call that shell script from Jenkins (via bash -ex myscript.sh or something like that), so that you can track changes in your shell script. " This version-controlled driver script can also be modified to perform a check if everything passed, and if it does, then update the 'master' branch from the 'develop' branch automatically. This might require some changes to the TribitsCTestDriverCore.cmake script to clearly print out (or write a results file) "ALL PASSED". This would allow us to automate the update of the 'master' branch. |
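As a rough illustration of the gating step proposed above, here is a minimal Python sketch. It assumes the driver writes a results text containing a line that reads exactly `ALL PASSED`; that file format and the function names `all_passed` and `promotion_commands` are hypothetical, since the comment above only proposes that TribitsCTestDriverCore.cmake could be changed to produce such output. The git commands are only assembled as lists here, not executed.

```python
def all_passed(results_text):
    """Return True only if some line of the results text is exactly 'ALL PASSED'."""
    return any(line.strip() == "ALL PASSED" for line in results_text.splitlines())


def promotion_commands(results_text, source="develop", target="master"):
    """Return git commands that fast-forward target to source, or [] if not all passed."""
    if not all_passed(results_text):
        return []
    return [
        ["git", "fetch", "origin"],
        ["git", "checkout", target],
        # --ff-only refuses to create a merge commit if target has diverged
        ["git", "merge", "--ff-only", "origin/" + source],
        ["git", "push", "origin", target],
    ]


if __name__ == "__main__":
    # Hypothetical results text, for illustration only
    for cmd in promotion_commands("Configure: passed\nALL PASSED\n"):
        print(" ".join(cmd))
```

Keeping the decision (`all_passed`) separate from the side effects (running git) would make it possible to test the gate without touching a real repository.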
It just occurred to me that we could set up the checkin-test-sems.sh script to automatically query CDash for the latest CI build and then automatically disable any failing tests for that CI build. An updated version of CDash will allow that to occur. You would then extract that list of tests from that Python script and then you would create the MPI_RELEASE_DEBUG_SHARED.config file to contain:
That would save developers from having to always check CDash to see if there are existing failing CI builds. However, if Trilinos developers adopt the usage of the checkin-test-sems.sh script, then it should be very rare that a Trilinos developer ever runs into a failing test that their local changes have not triggered in some way. |
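The generation step described above could look roughly like the following Python sketch. The CDash query itself is not shown; the sketch starts from a list of failing test names that is assumed to be already available. It also assumes that tests are disabled with options of the form `-D<fullTestName>_DISABLE=ON` and that the .config file holds one option per line; both the option form and the test names used are assumptions for illustration, not the content of the original comment.

```python
def disable_options(failing_tests):
    """Build one -D<test>_DISABLE=ON option per unique failing test name, sorted."""
    return ["-D{0}_DISABLE=ON".format(name) for name in sorted(set(failing_tests))]


def config_file_text(failing_tests):
    """Render the options one per line, as they might appear in a .config file."""
    options = disable_options(failing_tests)
    if not options:
        return ""
    return "\n".join(options) + "\n"


if __name__ == "__main__":
    # Hypothetical test names, for illustration only
    print(config_file_text(["PkgB_SomeTest_MPI_4", "PkgA_OtherTest_MPI_1"]), end="")
```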
@bartlettroscoe suggested I make this comment in this ticket, instead of in #158 I am comfortable with having a prepush environment that uses GCC 4.8.4, but for the short time that we have just one build protecting the promotion from develop to master, I really think we need to use GCC 4.7.2 for that promotional testing. That version has to work now, and if it doesn't, it will complicate the integration process for @bmpersc and customers. |
The Pliris package was disabled in an older version of this script.
Build/Test Cases Summary
Enabled Packages: Pliris
Disabled Packages: PyTrilinos,Claps,TriKota
Enabled all Forward Packages
0) MPI_DEBUG => passed: passed=2,notpassed=0 (5.06 min)
1) SERIAL_RELEASE => Test case SERIAL_RELEASE was not run! => Does not affect push readiness! (-1.00 min)
There was an example push yesterday shown here: that demonstrates why we urgently need to get this story completed. I will work this next week to get the new checkin-test.py script in place including selecting the new set of PT packages and TPLs based on the SEMS env. We need to discuss this at the next Trilinos Framework meeting. |
One issue that came up in my conversation with Alejandro Mota today about the difficulty of safely pushing changes to Trilinos was that the SEMS Dev Env is not really available for SNL/CA staff members. That is because the official COE in SNL/CA is RHEL 5! Therefore, most SNL/CA staff members just build their own Linux machines (and they don't use the SNL/NM RHEL 6 COE). So even if they have access to the SRN and the machine where the SEMS Dev Env NFS mount directory is located, it does them no good since they are not running the SNL/NM RHEL 6 COE OS. SEMS really needs to provide build-from-source scripts to install the SEMS Dev Env on a given Linux machine. That, or they need to build a Docker Container for RHEL 6 that has the SEMS Dev Env installed on it. Otherwise, Trilinos needs to provide accounts on push servers at SNL/NM that have the SEMS Dev Env available. |
Why can't SNL/CA upgrade? Having multiple OS installs, especially ones that are so old is really difficult to support at this level. |
I don't know the answer to that. We would have to ask them. I heard this from @amota about his situation at SNL/CA when I suggested that they use the SEMS Dev Env to provide a safe way to push to Trilinos. |
I think we should explore this question some more. RHEL5 was getting old during my PhD! |
Either of these would be helpful for developers outside SNL/CA with their own machines. |
Have you seen the announcement that RHEL5 is to be retired? I think only 6 and 7 are supported now. |
But that is not the immediate issue. The immediate issue is that because RHEL 5 was the official COE at SNL/CA, people went off and built their own Linux workstations not using the official RHEL 6 and 7 COEs. That is the problem. |
I suspect that commit f1225dc is something that would have been caught by this updated pre-push CI testing process using checkin-test-sems.sh |
Being able to do robust git bisection is one of the motivations for using the checkin-test.py script to push all commits. Below is a concrete example. Stefan reports that half of the commits he is trying to bisect on are not even passing configure.

From: Bartlett, Roscoe A
Stefan,
You can first bisect on commits that are marked as good by the checkin-test.py script. These have the string “Build/Test Cases Summary” in the git commit log. In the first round of git bisect, you then skip all commits that don’t have that string and you can bound the true bad commit without hitting false failures due to bad non-complete commits (like you describe below). Details are described here:
Unfortunately, very few Trilinos developers are currently using the checkin-test.py script to push so you will not be able to do very fine-grained bisection with recent commits using that approach (but it will bound the bad commit and then you can do manual bisection from there). We are trying to get things set up so that everyone can use the checkin-test.py script when pushing to Trilinos: I was hoping to have that done before the TUG. Once everyone is using the checkin-test.py script to test and push, you will be able to do pretty fine-grained bisection safely.
Cheers,
-Ross

From: Trilinos-developers [mailto:trilinos-developers-bounces@trilinos.org] On Behalf Of Domino, Stefan Paul
One final data point for the weekend. Half of the SHA1 configure steps are failing during my bisect of Trilinos/master. Moreover, even some that are configuring fail in various Trilinos build errors. Any suggestions on how to bisect such a situation would be most appreciated.
Stefan

From: Trilinos-developers trilinos-developers-bounces@trilinos.org on behalf of "Domino, Stefan Paul" spdomin@sandia.gov
Greetings,
Any other apps seeing new diffs as of: commit 5320963
I have started a GH ticket under NaluCFD to start tracking how many times I need to bisect Trilinos per week:
Best,
Stefan |
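The first-round bisection described in the email can be sketched in Python as follows. This is a minimal sketch that assumes the commit list has already been read from `git log` as (SHA, message) pairs; the function names and the SHAs in the example are hypothetical. It only builds the `git bisect skip` command strings and does not run git.

```python
MARKER = "Build/Test Cases Summary"


def split_for_bisect(commits):
    """Split (sha, message) pairs into (tested, to_skip) lists of SHAs.

    A commit counts as tested if its log message contains MARKER, i.e. it
    was pushed through the checkin-test.py script.
    """
    tested, to_skip = [], []
    for sha, message in commits:
        (tested if MARKER in message else to_skip).append(sha)
    return tested, to_skip


def skip_commands(commits):
    """Return one 'git bisect skip <sha>' command per untested commit."""
    _, to_skip = split_for_bisect(commits)
    return ["git bisect skip " + sha for sha in to_skip]


if __name__ == "__main__":
    # Hypothetical commits, for illustration only
    history = [
        ("aaa1111", "Fix solver\n\nBuild/Test Cases Summary\n0) MPI_DEBUG => passed"),
        ("bbb2222", "WIP: partial refactor"),
    ]
    print("\n".join(skip_commands(history)))
```

Skipping the unmarked commits up front bounds the bad commit between two tested commits; finer manual bisection can then proceed inside that range.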
Perhaps this has already been discussed, but instead of using different shell scripts to call The following code could be added to
fullEnv = None
if use_sems_env:
    required_sems_modules = ['cmake/2.8.11',
                             'gcc/4.7.2/base',
                             'gcc/4.7.2/openmpi/1.6.5',
                             'boost/1.55.0/gcc/4.7.2/base',
                             'superlu/4.3/gcc/4.7.2/base',
                             'netcdf/4.3.2/gcc/4.7.2/openmpi/1.6.5',
                             'hdf5/1.8.12/gcc/4.7.2/openmpi/1.6.5']
    # And whatever else is needed in default environment
    fullEnv = {'PATH': os.environ['PATH'],
               'HOME': os.environ['HOME'],
               'MODULEPATH': os.environ['MODULEPATH']}
    for sems_module in required_sems_modules:
        fullEnv.update(load_sems_module(sems_module, fullEnv))

import os
import re
import subprocess

def load_sems_module(module, env):
    # adjust module command to correct path
    modulecmd = '/usr/local/Cellar/modules/3.2.10/Modules/bin/modulecmd'
    command = '{0} csh load {1}'.format(modulecmd, module)
    proc = subprocess.Popen(command.split(),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            env=env)
    # communicate() waits for the process to exit; calling wait() first
    # can deadlock when the pipe fills. stderr is merged into stdout.
    out, _ = proc.communicate()
    out = out.decode()
    # modulecmd reports failures as "<name>(<line>):ERROR:..."
    if re.search(r'\([0-9]+\):ERROR:', out):
        raise Exception(out)
    # csh-style output is a ';'-separated list of "setenv NAME VALUE" commands
    for line in out.split(';'):
        line = line.split()
        if not line or line[0] != 'setenv':
            continue
        env[line[1]] = ' '.join(line[2:])
    return env

Since the SEMS modules also set Just some thoughts... |
Just wanted to put in my 2 cents -- @tjfulle very generously spent time to contribute to the check-in test script. It would be awesome if we could let him help out there. He was about to push changes, but I wasn't sure if it properly belonged to Trilinos or to TriBITS. |
Contributions to TriBITS are most welcome. However, any non-trivial change needs to be pushed to the TriBITS GitHub repo. @tjfuller, please follow the process outlined at: and then it will get snapshotted to Trilinos as described here: I will respond to the above comment in detail in a bit. But note that the checkin-test.py script has to be more general than Trilinos and Sandia (it is being used heavily at ORNL for CASL). What we need is a more general solution for associating particular build envs as mentioned here. And welcome to SNL and Trilinos! -Ross |
Thanks @bartlettroscoe ! |
We have a problem with this strategy. The only GCC compiler that is available on OSX in the SEMS env is GCC 5.3.0, and Boost 1.55.0 is not present, as shown by:
Therefore, the only way to have one consistent CI env across all platforms is to use GCC 5.3.0 (and Boost 1.58.0 or 1.59.0 but not 1.55.0, which is the current selection). That may not be so bad. If we get all warnings out of GCC 5.3.0 they will likely be gone from GCC 4.7.2 as well. However, there is the problem of people using features of C++11 present in GCC 5.3.0 that are not present in GCC 4.7.2. Therefore, we would need to run a second post-push CI build that tests with GCC 4.7.2 to make sure things are okay and deal with any problems quickly. What do people think about this? Is GCC 5.3.0 a valid choice for the CI build for Trilinos? That is our only option currently. Otherwise, we are going to need to force OSX developers to test and push from a Linux machine that mounts the SEMS env. |
…linos#482) The SEMS Env does not provide any GCC except for 5.3.0. Also, it does not provide Boost 1.55.0, only 1.58.0 and 1.59.0. Also, it does not provide any build of Scotch at all. Therefore, I have changed the default dev env to be GCC 5.3.0, Boost 1.58.0 and disabled the load of Scotch.
Another problem is that the SEMS env does not even provide a Scotch TPL as shown by:
So Scotch had to be removed from the default set of TPLs. However, we could not use the Scotch provided because it is 32 bit, which is not compatible with the 64 bit Parmetis that is installed (see commit b339bc6). I pushed this updated SEMS env that also works for OSX to the branch: |
Now I can force testing of all packages by passing in 'blocking all'.
This helps to avoid a trivial merge commit with lots of changes being made after that.
Now you can pass in a 4th 'all' argument and it will enable all packages with the checkin-test-sems.sh script. That is useful when you want to force the testing of all packages or when there are no packages changed and the script aborts because there are no enables.
These are not used for anything and therefore serve no purpose to set them.
…1304) Before, tracing of added tests was only done for the checkin-test.py script. Now, it will be done for the post-push CI build as well.
The experience over the last 10+ months has shown that the checkin-test-sems.sh script based on the SEMS env is robust and easy to set up and to run (see #1304 for more details and a detailed log). In the last several months, every failure shown on the post-push CI build was a result of either not running the checkin-test-sems.sh script or of running it and seeing failing tests but pushing anyway (after disabling the failing tests locally). Closing as complete. |
Closing as complete |
…env.sh (trilinos#482) The logic that resulted in the SEMSDevEnv.cmake file getting picked up was a little too magical. Therefore, this change is to require that you explicitly list cmake/std/sems/SEMSDevEnv.cmake in the Trilinos_CONFIGURE_OPTIONS_FILE argument. This also got rid of the StdDevEnvs.cmake file and gets rid of that approach. Having to list a single file in Trilinos_CONFIGURE_OPTIONS_FILE is not that big of a deal and it is explicit with no magic. I also got rid of the script load_ci_sems_dev_env.sh because it just sources load_sems_env.sh with no arguments by default anyway and we will maintain that going forward. This just reduces clutter.
For those that like to use -C instead of -DTrilinos_CONFIGURE_OPTIONS_FILE, now you can read this in with -C. But you will need to provide the entire path and can't provide a relative path like with -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=<rel-path>.
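To illustrate the difference between the two ways of passing the options file, here is a small Python sketch that assembles the cmake arguments for either form. The function name `configure_args` and the `base_dir` parameter (the directory the relative path is meant to be resolved against) are hypothetical; the paths in the example are made up.

```python
import os


def configure_args(options_file, use_dash_c, base_dir):
    """Return cmake arguments that read in options_file.

    With -C the path is expanded to a full path under base_dir; with
    -DTrilinos_CONFIGURE_OPTIONS_FILE the relative path is passed through.
    """
    if use_dash_c:
        return ["-C", os.path.normpath(os.path.join(base_dir, options_file))]
    return ["-DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=" + options_file]


if __name__ == "__main__":
    rel = "cmake/std/sems/SEMSDevEnv.cmake"
    # "/home/me/Trilinos" is a made-up base directory
    print(" ".join(configure_args(rel, True, "/home/me/Trilinos")))
    print(" ".join(configure_args(rel, False, "/home/me/Trilinos")))
```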
This caused the standard CI build to break because it was not pulling in the env.
This will test to see how long we can go just rebuilding Trilinos without requiring a rebuild from scratch. This will show how fast rebuilds can be with Trilinos using just 8 cores.
…-server Automatically Merged using Trilinos Pull Request AutoTester PR Title: Change post-push CI server to only rebuild by default (#482) PR Author: bartlettroscoe
Next Action Status:
New CI build is pushed to 'develop', new post-push CI server is running, and new checkin-test-sems.sh script ready for more testing and review ... Not going to pursue other extensions (e.g. mac OSX, tcsh, etc.). See #482 (comment). Next: Leave in review until 1/1/2017 then close.
Blocked By: #158, #410, #362
Blocking: #380
Related To: #370, #475, #476
CC: @trilinos/framework
Description:
Trilinos has not had an effective pre-push CI development process for many years. When the checkin-test.py script was first created (back in 2008 or so), the primary stack of packages was based on Epetra and the main external dependencies were C/C++/Fortran compilers and BLAS and LAPACK. Those dependencies and the major Trilinos customers at the time were used to select the initial set of Primary Tested (initially called Primary Stable) packages that is being used to this day. However, since that time, many new Trilinos packages have been added and important Trilinos customers are relying on many of these newer packages (e.g. SEACAS, STK, Tpetra, Phalanx, Panzer, etc.). In addition, these new Trilinos packages require more dependencies than just BLAS and LAPACK and now TPLs like Boost, HDF5, NetCDF, ParMETIS, SuperLU and others used by Trilinos are also very important to many Trilinos customers.
Another problem with the current pre-push CI testing processes with Trilinos is that Trilinos developers have a variety of different types of machines, OSs, versions of compilers, TPL implementations, etc. that they use to develop on and push changes for Trilinos. This has resulted in people who tried to use the checkin-test.py script suffering failed pushes due to failing tests on their machine not triggered by their changes. In contrast, projects that have a uniform pre-push CI testing env don't experience these types of problems. One example of such a project is CASL VERA, which uses TriBITS and the checkin-test.py script and has a set of uniform development machines where developers almost never see tests that fail in their build of the code that passed on another developer's build. Therefore, the only failed builds and tests are due to their own local changes. In that project, there is no trepidation about running the checkin-test.py script and everyone uses it uniformly for nearly every push.
Another problem with the current CI testing process for Trilinos is that the post-push CI server that posts to CDash enables a different set of packages and TPLs from what the pre-push CI build does (and of course uses different compilers, MPI, etc.). Therefore, a CI build/test failure seen on CDash may not be seen with the checkin-test.py script locally and vice versa. This makes it difficult for developers to determine whether the failures they are seeing on their own machine are due to their local changes, to differences between the env on their machine and the machine running the CI build posting to CDash, to a different set of enabled packages and TPLs, or to something else.
As a result, the stability of the main Trilinos development branch (now the 'develop' branch, see #370) has degraded from what it was 5+ years ago. This is a problem because Trilinos needs to have a more stable 'develop' branch in order to more frequently update from the 'develop' branch to the 'master' branch (see #370).
This story is to address all of these shortcomings of the current Trilinos CI testing process. The new SEMS Dev Env (#158) provides an opportunity to create a fairly portable (at least for SNL staff members) uniform pre-push and post-push CI testing environment for the first time.
Here is the plan for setting up a more effective CI process based on the SEMS Dev Env, the checkin-test.py script, and CTest/CDash:
load_ci_sems_dev_env.sh script, which just calls the local_sems_dev_env.sh script with the selections.
load_ci_sems_dev_env.sh. This should likely only run a single build of Trilinos to speed up the testing/push process. (If there is a single build, it would likely include -DTPL_ENABLE_MPI=ON -DCMAKE_BUILD_TYPE=RELEASE -DTrilinos_ENABLE_DEBUG=ON -DBUILD_SHARED_LIBS=ON -DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON -DTrilinos_ENABLE_FLOAT=OFF -DTrilinos_ENABLE_COMPLEX=OFF. See "Provide Trilinos-global CMake options to make it easier for apps to disable float & complex Scalar types" #362 about turning off float and complex by default.)
After this Story is complete, we can create new Stories to get Trilinos developers to use the checkin-test-sems.sh script and to commit to keeping the CI build(s) 100% all the time with "Stop the Line" urgency to fix.
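The candidate options for the single CI build listed in the plan above could be kept in one place and turned into cmake arguments, as in this minimal Python sketch. The option names and values are taken from the plan; the names `CI_BUILD_OPTIONS` and `cmake_args` are hypothetical, and this is not how checkin-test-sems.sh actually stores its defaults.

```python
# Cache options for the proposed single CI build, in the order given in the plan
CI_BUILD_OPTIONS = [
    ("TPL_ENABLE_MPI", "ON"),
    ("CMAKE_BUILD_TYPE", "RELEASE"),
    ("Trilinos_ENABLE_DEBUG", "ON"),
    ("BUILD_SHARED_LIBS", "ON"),
    ("Trilinos_ENABLE_EXPLICIT_INSTANTIATION", "ON"),
    ("Trilinos_ENABLE_FLOAT", "OFF"),
    ("Trilinos_ENABLE_COMPLEX", "OFF"),
]


def cmake_args(options):
    """Turn (name, value) pairs into -D<name>=<value> cmake arguments."""
    return ["-D{0}={1}".format(name, value) for name, value in options]


if __name__ == "__main__":
    print(" ".join(cmake_args(CI_BUILD_OPTIONS)))
```

Sharing one such list between the pre-push and post-push builds is one way to keep the two from drifting apart, which is the mismatch the description above complains about.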
Definition of Done:
load_ci_sems_dev_env.sh and checkin-test-sems.sh that provides a viable CI build based on the SEMS Dev Env.
load_ci_sems_dev_env.sh and checkin-test-sems.sh has been written and has been reviewed by a few Trilinos developers.
load_ci_sems_dev_env.sh env and the same default build(s) as defined in the checkin-test-sems.sh script.
checkin-test.py script itself to determine what improvements might help with usability and adoption.
Decisions that need to be made:
Tasks:
load_ci_sems_dev_env.sh and checkin-test-sems.sh [Done]
--default-builds for the checkin-test.py and therefore the checkin-test-sems.sh script" [Done]
better-ci-build-482 ... IN PROGRESS ...
checkin-test-sems.sh [Done]
checkint-test-sems.py --local-do-all [Done]