Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jgfouca/branch for acme split 2018 08 08 #2743

Merged
merged 82 commits into from
Aug 8, 2018

Conversation

jgfouca
Copy link
Contributor

@jgfouca jgfouca commented Aug 8, 2018

Test suite: scripts_regresion_tests
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1543

User interface changes?: N

Update gh-pages html (Y/N)?: N

Code review: @jedwards4b

Gautam Bisht and others added 30 commits May 2, 2018 13:47
So that the dot cannot go on to make an invalid case name.

[BFB]
 Fix format statement preventing write of coupler aux history files.

 [BFB]
craype, and hdf5/netcdf related modules
Set environment variable HDF5_DISABLE_VERSION_CHECK=2 for edison
When BFBFLAG is set to true and when INFO_DBUG > 1, the
routine seq_diag_avect_mct uses a reproducible sum algorithm
that is not as accurate as the algorihm implemented in
shr_reprosum_calc. In particular, when summing a vector of INFs,
the current algorithm returns zero. Here we replace the
existing algorithm with a call to shr_reprosum_calc.

This change is BFB for standard usage (INFO_DBUG == 1).
It is not BFB with respect to the associated diagnostic,
written to cpl.log, when INFO_DBUG > 1. However, these diagnostics are
not used in the simulation, and simulation results will BFB.

[BFB]
shr_reprosum_calc aborts if input summands include INF or NaN values.
For debugging purposes, it can be useful to allow INF or NaN values,
returning the IEEE standard results for such a situation (either NaN,
positive INF, or negative INF, depending on the situation). An optional
logical parameter, allow_infnan, is being added to the shr_reprosum_calc.
When set to .true. the routine determines whether summands for an existing
field contain NaN or INF values and returns the appropriate value without
going through the reproducible sum algorithm (which is very slow and
requires signficant memory when summing these special values). Other fields
in a multiple field call to shr_reprosum_calc will be computed in the usual
fashion. When allow_infnan == .false. or when the parameter is omitted, then
the routine aborts with an informative error message when the input
contain INF or NaN values, as is done currently.

The default can be changed (from allow_infnan=.false. to allow_infnan=.true.)
via a new optional parameter, repro_sum_allow_infnan_in, in shr_reprosum_setopts.
A new drv_in namelist parameter, reprosum_allow_infnan, has also been added
that will be passed to shr_reprosum_setopts to set the default. This can
be set in user_nl_cpl.

Since the default is not being changed, this change is BFB. If allow_infnan
is set to .true., then runs that failed because of INFs or NaNs would
now continue to run (longer), but jobs that did not fail with the original
default will be BFB even with the default changed.

[BFB]
Use ENV to retrieve it from the environment.
Fix domain specification for T42 configuration. This previously omitted
the ice domain, so when trying to build a T42_T42 grid configuration,
the ice domain would remain unset. This is relevant to the single column
model, which uses the T42 grid by default.
…version of netcdf/hdf5.

Also turn off file locking HDF5_USE_FILE_LOCKING=FALSE.
And turn on logging of MPI rank with compute node, which adds a lot to e3sm.log*
An integration test is added for BGCEXP_BCRC_CNPECACNT_1850 compset
Upstreammerge in order to pull-in fixes from recent CIME updates.

* master:
  Update CIME to ESMCI cime5.7.0 2 (#2428)
  Remove initialization of tbot to posinf
Add capture of system status and current workload for
Summit to CIME performance provenance capture logic.

[BFB]
Add syslog.summit checkpointing script for monitoring
job progress.

[BFB]
…pdate-NOLOCK (PR #2424)

Update versions of modules for netcdf/hdf5 at NERSC.
On edison, remove the variable that was disabling the HDF version check.
Turn off file locking HDF5_USE_FILE_LOCKING=FALSE for NERSC machines.
Turn on logging of MPI rank and compute node information for Cori.
Update CIME to ESMCI cime5.7.0-3

Squash merge of jgfouca/branch-for-to-acme-2018-07-12

Bug fixes:
Another critical V2 build fix.

[BFB]
Another upstream merge to pull in more V2 fixes.

* master:
  Update CIME to ESMCI cime5.7.0-3 (#2437)
  Add an all-active multi-instance test to e3sm_integration
  Change milti-instance infile names back to old form
  For all 3 NERSC machines (cori-knl, cori-haswell,edison), update the version of netcdf/hdf5. Also turn off file locking HDF5_USE_FILE_LOCKING=FALSE. And turn on logging of MPI rank with compute node, which adds a lot to e3sm.log*
Any fail in a core phase should cause the test status to be FAIL.
…upgrade_wait_for_tests_cdash

* jgfouca/cime/wait_for_test_upgrades:
  wait_for_test logging working
Currently MPI task to compute node mapping information is
output in two locations, once in CAM, where it is
truncated after the first 256 MPI tasks, and once in CLM,
where it is truncated after the first 100 MPI tasks,
both only for these two components. This is not useful in current
production runs. The use of environment variables, such as
MPICH_CPUMASK_DISPLAY on Cray systems, generate data that are
unnecessarily verbose for our needs. Here a share routine is
introduced that writes out one line per compute node. Each line
contains the compute node name and the list of MPI tasks assigned
to that node for a given communicator. This is then called
in the driver and writes out the task-to-node mapping for the
entire coupled model. Separate branches will then introduce
this into the individual components, replacing the current logic
in both CAM and CLM, for example.

The share routine also optionally returns the number of compute
nodes and the task-to-node mapping, which is needed in the
internal CAM load balancing.

With the call to the shr_taskmap_write routine in the
driver, the mapping data generated by the system when setting
the corresponding environment variable is redundant. This
is removed for the systems currently setting the variable.

Fixes #2457

BFB

* origin/worleyph/cime/taskmap:
  Avoid empty env blocks
  Remove unnecessary white space in task-to-node map output
  Modify driver output format
  Uncomment MV2_CPU_MAPPING definition for Anvil
  Modify task map output format
  Unset environment variables to output task-to-node mapping
  Output MPI task to compute node mapping
…(PR #2480)

Big update to wait_for_tests/jenkins

Changes:

* Upgrade cdash XML spoofing for prettier cdash pages for e3sm
* Add ability for jenkins jobs to use alternate baseline area, nice for test cleanup
* New test test suite that includes DIFFs
* Add ability to turn off all test teardowns from scripts_regr command line
* New Jenkins test to upload a realistic dashboard result
* Make TESTBUILDFAIL produce a bldlog file (So that uploading of log files for failed builds can be tested)
* Add logging to TestStatus processing (waiting)

[BFB]

* jgfouca/cime/upgrade_wait_for_tests_cdash:
  Make pylint happy
  Remove useless commented-out code
  Make TESTBUILDFAIL produce a bldlog file
  Big update to wait_for_test/jenkins
  wait_for_test logging working
  Remove debug stuff
  Progress
For a long time, only model build logs were being uploaded.

[BFB]
…2484)

This PR implements "smart" archiving of old jenkins test results.

The previous implementation simply deleted any result that looked like it came from a previous jenkins run of the same job.

The new implementation will scan these old results, populating the directories
$CIME_OUTPUT_ROOT/old_test_archive/old_cases
$CIME_OUTPUT_ROOT/old_test_archive/old_builds
$CIME_OUTPUT_ROOT/old_test_archive/old_runs
$CIME_OUTPUT_ROOT/old_test_archive/old_archives
... with the appropriate directories from previous runs.

The system will allow this old_test_archive directory to fill up until
it reaches MAX_GB_OLD_TEST_DATA of data. Once that happens, old job
data will be delete in chronological order until we are under MAX_GB
worth of data. This MAX_GB is a new per-machine setting for e3sm.

[BFB]

* jgfouca/cime/archive_old_test_results:
  Restore melvin to 1TB of test data
  Lots of fixes
  Progress
* esmci_remote_for_split/master: (651 commits)
  Make pylint happy
  Make it so key members are always defined.
  add exception for archive_metadata :-(
  add average and aux cpl hist files
  response to comments
  fix pylint issues
  exclude baselines
  redo regex for extension matching
  update for mom
  add debug info
  add debug info
  remove unused glob import
  fix whitespace issue
  remove debug print statements
  use archive info
  have hist_utils use archive.xml info
  fix merge issue
  remove trailing whitespace
  need to specify compiler to mpilibs
  Use RawConfigParser instead of ConfigParser
  ...
Copy link
Contributor Author

@jgfouca jgfouca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotations complete.

@@ -180,7 +180,7 @@ ifdef NETCDF_C_PATH
LIB_NETCDF_C:=$(NETCDF_C_PATH)/lib
endif
ifndef LIB_NETCDF_FORTRAN
LIB_NETCDF_FORTRAN:=$(NETCDF_C_PATH)/lib
LIB_NETCDF_FORTRAN:=$(NETCDF_FORTRAN_PATH)/lib
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine

@@ -330,10 +330,11 @@ def _main_func(description):
wrapper=textwrap.TextWrapper()
wrapper.subsequent_indent = "\t\t\t"
wrapper.fix_sentence_endings = True

cnt = 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bugfix for xmlquery. The output was wrong with --value when multiple values were requested (comma was in the wrong place).

@@ -601,6 +601,11 @@ def build_phase(self, sharedlib_only=False, model_only=False):
TESTRUNPASS.build_phase(self, sharedlib_only, model_only)
else:
if (not sharedlib_only):
blddir = self._case.get_value("EXEROOT")
bldlog = os.path.join(blddir, "{}.bldlog.{}".format(get_model(), get_timestamp("%y%m%d-%H%M%S")))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make TESTBUILDFAIL more realistic by having it produce a log file.

rv = None
for perm in itertools.permutations(lines):
ts = TestStatus(test_dir="/", test_name="ERS.foo.A")
ts._parse_test_status("\n".join(perm)) # pylint: disable=protected-access
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big increase in robustness of testing. All permutations of phase orders are tested, they should all produce a consistent result.

rv = TEST_PASS_STATUS
run_phase_found = False
for phase in phases: # ensure correct order of processing phases
if phase in self._phase_statuses:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e3sm was having problems with incorrect test status reports for tests, so I refactored this code a bit. Basically, the idea is to give priority to the "core" phases when trying to determine the overall test status.

xmlet.SubElement(phase_elem, "StartDateTime").text = time.ctime(current_time)
xmlet.SubElement(phase_elem, "Start{}Time".format("Test" if phase == "Testing" else phase)).text = str(int(current_time))

return site_elem, phase_elem
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lots of e3sm dashboard upgrades, should not impact cesm.

test_log_path = "/dev/null"

prior_ts = None
with open(test_log_path, "w") as log_fd:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait_for_tests now logs it's behavior when waiting for a test.


# TODO: Any further checking of xml output worth doing?

###########################################################################
def live_test_impl(self, testdir, expected_results, last_phase, last_status):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add new "live" tests for wait_for_test testing. These tests handle dynamic TestStatus files instead of static.

@@ -2911,6 +3010,9 @@ def _main_func(description):
parser.add_argument("--no-cmake", action="store_true",
help="Do not run cmake tests")

parser.add_argument("--no-teardown", action="store_true",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add ability to disable teardowns from command line.

@@ -935,6 +936,7 @@ subroutine cime_pre_init2()
wall_time_limit=wall_time_limit , &
force_stop_at=force_stop_at , &
reprosum_use_ddpdd=reprosum_use_ddpdd , &
reprosum_allow_infnan=reprosum_allow_infnan, &
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone else will have to explain fortran changes.

@jgfouca
Copy link
Contributor Author

jgfouca commented Aug 8, 2018

@rljacob , did we make significant changes to the coupler on the e3sm side?

@jgfouca jgfouca requested a review from rljacob August 8, 2018 18:41
@@ -180,7 +180,7 @@ ifdef NETCDF_C_PATH
LIB_NETCDF_C:=$(NETCDF_C_PATH)/lib
endif
ifndef LIB_NETCDF_FORTRAN
LIB_NETCDF_FORTRAN:=$(NETCDF_C_PATH)/lib
LIB_NETCDF_FORTRAN:=$(NETCDF_FORTRAN_PATH)/lib
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine

@@ -23,6 +23,8 @@ def _get_batch_job_id_for_syslog(case):
return os.environ["SLURM_JOB_ID"]
elif mach in ['mira', 'theta']:
return os.environ["COBALT_JOBID"]
elif mach in ['summit']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that this is just extending bad code already present, but we shouldn't need to check the machine name here, instead we should get case.get_value(BATCH_SYSTEM)

<entry id="reprosum_allow_infnan">
<type>logical</type>
<category>reprosum</category>
<group>seq_infodata_inparm</group>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yuch - but I guess as long as the default is false...

@jedwards4b
Copy link
Contributor

One issue:

scripts/lib/jenkins_generic_job.py
No exception type(s) specified
114
except:

py3 expects an exception type

@jgfouca
Copy link
Contributor Author

jgfouca commented Aug 8, 2018

@jedwards4b , yeah, after tests pass I'll be sure to get travis working.

@jgfouca
Copy link
Contributor Author

jgfouca commented Aug 8, 2018

There will be some additional fails on Melvin on the dashboard until we fix an OpenMP issue on that machine.

@jgfouca jgfouca merged commit 07d476d into master Aug 8, 2018
@jgfouca jgfouca deleted the jgfouca/branch-for-acme-split-2018-08-08 branch August 8, 2018 22:46
@jedwards4b
Copy link
Contributor

@jgfouca was the scripts regression test run on a machine with fortran unit test support?
The Fortran unit testing is broken:
https://my.cdash.org/index.php?project=CIME&date=2018-08-13

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.