-
Notifications
You must be signed in to change notification settings - Fork 209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Jgfouca/branch for acme split 2018 08 08 #2743
Conversation
Incorporates fix from billsacks/cime@4cc4f2b [BFB]
So that the dot cannot go on to make an invalid case name. [BFB]
Fix format statement preventing write of coupler aux history files. [BFB]
craype, and hdf5/netcdf related modules
Set environment variable HDF5_DISABLE_VERSION_CHECK=2 for edison
When BFBFLAG is set to true and when INFO_DBUG > 1, the routine seq_diag_avect_mct uses a reproducible sum algorithm that is not as accurate as the algorihm implemented in shr_reprosum_calc. In particular, when summing a vector of INFs, the current algorithm returns zero. Here we replace the existing algorithm with a call to shr_reprosum_calc. This change is BFB for standard usage (INFO_DBUG == 1). It is not BFB with respect to the associated diagnostic, written to cpl.log, when INFO_DBUG > 1. However, these diagnostics are not used in the simulation, and simulation results will BFB. [BFB]
shr_reprosum_calc aborts if input summands include INF or NaN values. For debugging purposes, it can be useful to allow INF or NaN values, returning the IEEE standard results for such a situation (either NaN, positive INF, or negative INF, depending on the situation). An optional logical parameter, allow_infnan, is being added to the shr_reprosum_calc. When set to .true. the routine determines whether summands for an existing field contain NaN or INF values and returns the appropriate value without going through the reproducible sum algorithm (which is very slow and requires signficant memory when summing these special values). Other fields in a multiple field call to shr_reprosum_calc will be computed in the usual fashion. When allow_infnan == .false. or when the parameter is omitted, then the routine aborts with an informative error message when the input contain INF or NaN values, as is done currently. The default can be changed (from allow_infnan=.false. to allow_infnan=.true.) via a new optional parameter, repro_sum_allow_infnan_in, in shr_reprosum_setopts. A new drv_in namelist parameter, reprosum_allow_infnan, has also been added that will be passed to shr_reprosum_setopts to set the default. This can be set in user_nl_cpl. Since the default is not being changed, this change is BFB. If allow_infnan is set to .true., then runs that failed because of INFs or NaNs would now continue to run (longer), but jobs that did not fail with the original default will be BFB even with the default changed. [BFB]
Use ENV to retrieve it from the environment.
Fix domain specification for T42 configuration. This previously omitted the ice domain, so when trying to build a T42_T42 grid configuration, the ice domain would remain unset. This is relevant to the single column model, which uses the T42 grid by default.
…version of netcdf/hdf5. Also turn off file locking HDF5_USE_FILE_LOCKING=FALSE. And turn on logging of MPI rank with compute node, which adds a lot to e3sm.log*
An integration test is added for BGCEXP_BCRC_CNPECACNT_1850 compset
Upstreammerge in order to pull-in fixes from recent CIME updates. * master: Update CIME to ESMCI cime5.7.0 2 (#2428) Remove initialization of tbot to posinf
Add capture of system status and current workload for Summit to CIME performance provenance capture logic. [BFB]
Add syslog.summit checkpointing script for monitoring job progress. [BFB]
…pdate-NOLOCK (PR #2424) Update versions of modules for netcdf/hdf5 at NERSC. On edison, remove the variable that was disabling the HDF version check. Turn off file locking HDF5_USE_FILE_LOCKING=FALSE for NERSC machines. Turn on logging of MPI rank and compute node information for Cori.
Update CIME to ESMCI cime5.7.0-3 Squash merge of jgfouca/branch-for-to-acme-2018-07-12 Bug fixes: Another critical V2 build fix. [BFB]
Another upstream merge to pull in more V2 fixes. * master: Update CIME to ESMCI cime5.7.0-3 (#2437) Add an all-active multi-instance test to e3sm_integration Change milti-instance infile names back to old form For all 3 NERSC machines (cori-knl, cori-haswell,edison), update the version of netcdf/hdf5. Also turn off file locking HDF5_USE_FILE_LOCKING=FALSE. And turn on logging of MPI rank with compute node, which adds a lot to e3sm.log*
Any fail in a core phase should cause the test status to be FAIL.
…upgrade_wait_for_tests_cdash * jgfouca/cime/wait_for_test_upgrades: wait_for_test logging working
Currently MPI task to compute node mapping information is output in two locations, once in CAM, where it is truncated after the first 256 MPI tasks, and once in CLM, where it is truncated after the first 100 MPI tasks, both only for these two components. This is not useful in current production runs. The use of environment variables, such as MPICH_CPUMASK_DISPLAY on Cray systems, generate data that are unnecessarily verbose for our needs. Here a share routine is introduced that writes out one line per compute node. Each line contains the compute node name and the list of MPI tasks assigned to that node for a given communicator. This is then called in the driver and writes out the task-to-node mapping for the entire coupled model. Separate branches will then introduce this into the individual components, replacing the current logic in both CAM and CLM, for example. The share routine also optionally returns the number of compute nodes and the task-to-node mapping, which is needed in the internal CAM load balancing. With the call to the shr_taskmap_write routine in the driver, the mapping data generated by the system when setting the corresponding environment variable is redundant. This is removed for the systems currently setting the variable. Fixes #2457 BFB * origin/worleyph/cime/taskmap: Avoid empty env blocks Remove unnecessary white space in task-to-node map output Modify driver output format Uncomment MV2_CPU_MAPPING definition for Anvil Modify task map output format Unset environment variables to output task-to-node mapping Output MPI task to compute node mapping
…(PR #2480) Big update to wait_for_tests/jenkins Changes: * Upgrade cdash XML spoofing for prettier cdash pages for e3sm * Add ability for jenkins jobs to use alternate baseline area, nice for test cleanup * New test test suite that includes DIFFs * Add ability to turn off all test teardowns from scripts_regr command line * New Jenkins test to upload a realistic dashboard result * Make TESTBUILDFAIL produce a bldlog file (So that uploading of log files for failed builds can be tested) * Add logging to TestStatus processing (waiting) [BFB] * jgfouca/cime/upgrade_wait_for_tests_cdash: Make pylint happy Remove useless commented-out code Make TESTBUILDFAIL produce a bldlog file Big update to wait_for_test/jenkins wait_for_test logging working Remove debug stuff Progress
For a long time, only model build logs were being uploaded. [BFB]
…2484) This PR implements "smart" archiving of old jenkins test results. The previous implementation simply deleted any result that looked like it came from a previous jenkins run of the same job. The new implementation will scan these old results, populating the directories $CIME_OUTPUT_ROOT/old_test_archive/old_cases $CIME_OUTPUT_ROOT/old_test_archive/old_builds $CIME_OUTPUT_ROOT/old_test_archive/old_runs $CIME_OUTPUT_ROOT/old_test_archive/old_archives ... with the appropriate directories from previous runs. The system will allow this old_test_archive directory to fill up until it reaches MAX_GB_OLD_TEST_DATA of data. Once that happens, old job data will be delete in chronological order until we are under MAX_GB worth of data. This MAX_GB is a new per-machine setting for e3sm. [BFB] * jgfouca/cime/archive_old_test_results: Restore melvin to 1TB of test data Lots of fixes Progress
* esmci_remote_for_split/master: (651 commits) Make pylint happy Make it so key members are always defined. add exception for archive_metadata :-( add average and aux cpl hist files response to comments fix pylint issues exclude baselines redo regex for extension matching update for mom add debug info add debug info remove unused glob import fix whitespace issue remove debug print statements use archive info have hist_utils use archive.xml info fix merge issue remove trailing whitespace need to specify compiler to mpilibs Use RawConfigParser instead of ConfigParser ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annotations complete.
@@ -180,7 +180,7 @@ ifdef NETCDF_C_PATH | |||
LIB_NETCDF_C:=$(NETCDF_C_PATH)/lib | |||
endif | |||
ifndef LIB_NETCDF_FORTRAN | |||
LIB_NETCDF_FORTRAN:=$(NETCDF_C_PATH)/lib | |||
LIB_NETCDF_FORTRAN:=$(NETCDF_FORTRAN_PATH)/lib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please review this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine
@@ -330,10 +330,11 @@ def _main_func(description): | |||
wrapper=textwrap.TextWrapper() | |||
wrapper.subsequent_indent = "\t\t\t" | |||
wrapper.fix_sentence_endings = True | |||
|
|||
cnt = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bugfix for xmlquery. The output was wrong with --value when multiple values were requested (comma was in the wrong place).
@@ -601,6 +601,11 @@ def build_phase(self, sharedlib_only=False, model_only=False): | |||
TESTRUNPASS.build_phase(self, sharedlib_only, model_only) | |||
else: | |||
if (not sharedlib_only): | |||
blddir = self._case.get_value("EXEROOT") | |||
bldlog = os.path.join(blddir, "{}.bldlog.{}".format(get_model(), get_timestamp("%y%m%d-%H%M%S"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make TESTBUILDFAIL more realistic by having it produce a log file.
rv = None | ||
for perm in itertools.permutations(lines): | ||
ts = TestStatus(test_dir="/", test_name="ERS.foo.A") | ||
ts._parse_test_status("\n".join(perm)) # pylint: disable=protected-access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big increase in robustness of testing. All permutations of phase orders are tested, they should all produce a consistent result.
rv = TEST_PASS_STATUS | ||
run_phase_found = False | ||
for phase in phases: # ensure correct order of processing phases | ||
if phase in self._phase_statuses: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e3sm was having problems with incorrect test status reports for tests, so I refactored this code a bit. Basically, the idea is to give priority to the "core" phases when trying to determine the overall test status.
xmlet.SubElement(phase_elem, "StartDateTime").text = time.ctime(current_time) | ||
xmlet.SubElement(phase_elem, "Start{}Time".format("Test" if phase == "Testing" else phase)).text = str(int(current_time)) | ||
|
||
return site_elem, phase_elem |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of e3sm dashboard upgrades, should not impact cesm.
test_log_path = "/dev/null" | ||
|
||
prior_ts = None | ||
with open(test_log_path, "w") as log_fd: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait_for_tests now logs it's behavior when waiting for a test.
|
||
# TODO: Any further checking of xml output worth doing? | ||
|
||
########################################################################### | ||
def live_test_impl(self, testdir, expected_results, last_phase, last_status): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add new "live" tests for wait_for_test testing. These tests handle dynamic TestStatus files instead of static.
@@ -2911,6 +3010,9 @@ def _main_func(description): | |||
parser.add_argument("--no-cmake", action="store_true", | |||
help="Do not run cmake tests") | |||
|
|||
parser.add_argument("--no-teardown", action="store_true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add ability to disable teardowns from command line.
@@ -935,6 +936,7 @@ subroutine cime_pre_init2() | |||
wall_time_limit=wall_time_limit , & | |||
force_stop_at=force_stop_at , & | |||
reprosum_use_ddpdd=reprosum_use_ddpdd , & | |||
reprosum_allow_infnan=reprosum_allow_infnan, & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Someone else will have to explain fortran changes.
@rljacob , did we make significant changes to the coupler on the e3sm side? |
@@ -180,7 +180,7 @@ ifdef NETCDF_C_PATH | |||
LIB_NETCDF_C:=$(NETCDF_C_PATH)/lib | |||
endif | |||
ifndef LIB_NETCDF_FORTRAN | |||
LIB_NETCDF_FORTRAN:=$(NETCDF_C_PATH)/lib | |||
LIB_NETCDF_FORTRAN:=$(NETCDF_FORTRAN_PATH)/lib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine
@@ -23,6 +23,8 @@ def _get_batch_job_id_for_syslog(case): | |||
return os.environ["SLURM_JOB_ID"] | |||
elif mach in ['mira', 'theta']: | |||
return os.environ["COBALT_JOBID"] | |||
elif mach in ['summit']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that this is just extending bad code already present, but we shouldn't need to check the machine name here, instead we should get case.get_value(BATCH_SYSTEM)
<entry id="reprosum_allow_infnan"> | ||
<type>logical</type> | ||
<category>reprosum</category> | ||
<group>seq_infodata_inparm</group> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yuch - but I guess as long as the default is false...
One issue: scripts/lib/jenkins_generic_job.py py3 expects an exception type |
@jedwards4b , yeah, after tests pass I'll be sure to get travis working. |
There will be some additional fails on Melvin on the dashboard until we fix an OpenMP issue on that machine. |
@jgfouca was the scripts regression test run on a machine with fortran unit test support? |
Test suite: scripts_regresion_tests
Test baseline:
Test namelist changes:
Test status: bit for bit
Fixes #1543
User interface changes?: N
Update gh-pages html (Y/N)?: N
Code review: @jedwards4b