Commit

Set default NTEST_PARALLEL_JOBS=MAX_MPITASKS_PER_NODE and limit E3SM machines

This resets the default value of NTEST_PARALLEL_JOBS to MAX_MPITASKS_PER_NODE
so as to not make any behavioral changes to CESM.

Warning: This is not a safe value on machines with batch systems whose
login nodes are more limited than the compute nodes; NTEST_PARALLEL_JOBS
should therefore be set explicitly on these systems.

@jgfouca found via E3SM testing that limiting to 4 parallel jobs was
required for many of the testing machines with batch systems to prevent
hammering login nodes. Therefore, we set that value for these E3SM
machines:
* cori-haswell
* cori-knl
* blues
* anvil
* bebop
* theta
* titan
* summit

Warning: Non-test E3SM machines that have a batch system may still
oversubscribe parallel test jobs.
jhkennedy committed May 22, 2019
1 parent c23ea15 commit cae6b73
Showing 2 changed files with 9 additions and 1 deletion.
8 changes: 8 additions & 0 deletions config/e3sm/machines/config_machines.xml
@@ -218,6 +218,7 @@
<CCSM_CPRNC>/project/projectdirs/acme/tools/cprnc.cori/cprnc</CCSM_CPRNC>
<GMAKE_J>8</GMAKE_J>
<TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>nersc_slurm</BATCH_SYSTEM>
<SUPPORTED_BY>e3sm</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
@@ -356,6 +357,7 @@
<CCSM_CPRNC>/project/projectdirs/acme/tools/cprnc.cori/cprnc</CCSM_CPRNC>
<GMAKE_J>8</GMAKE_J>
<TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>nersc_slurm</BATCH_SYSTEM>
<SUPPORTED_BY>e3sm</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>128</MAX_TASKS_PER_NODE>
@@ -1038,6 +1040,7 @@
<CCSM_CPRNC>/home/ccsm-data/tools/cprnc</CCSM_CPRNC>
<GMAKE_J>4</GMAKE_J>
<TESTS>e3sm_integration</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>pbs</BATCH_SYSTEM>
<SUPPORTED_BY>acme</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>16</MAX_TASKS_PER_NODE>
@@ -1141,6 +1144,7 @@
<CCSM_CPRNC>/lcrc/group/acme/tools/cprnc/cprnc</CCSM_CPRNC>
<GMAKE_J>8</GMAKE_J>
<TESTS>e3sm_integration</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>slurm</BATCH_SYSTEM>
<SUPPORTED_BY>E3SM</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>36</MAX_TASKS_PER_NODE>
@@ -1255,6 +1259,7 @@
<CCSM_CPRNC>/lcrc/group/acme/tools/cprnc/cprnc</CCSM_CPRNC>
<GMAKE_J>8</GMAKE_J>
<TESTS>e3sm_integration</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>slurm</BATCH_SYSTEM>
<SUPPORTED_BY>E3SM</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>36</MAX_TASKS_PER_NODE>
@@ -1570,6 +1575,7 @@
<CCSM_CPRNC>/projects/ccsm/acme/tools/cprnc/cprnc</CCSM_CPRNC>
<GMAKE_J>8</GMAKE_J>
<TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>cobalt_theta</BATCH_SYSTEM>
<SUPPORTED_BY>E3SM</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>128</MAX_TASKS_PER_NODE>
@@ -2177,6 +2183,7 @@
<CCSM_CPRNC>/lustre/atlas1/cli900/world-shared/cesm/tools/cprnc/cprnc.titan</CCSM_CPRNC>
<GMAKE_J>8</GMAKE_J>
<TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>pbs</BATCH_SYSTEM>
<ALLOCATE_SPARE_NODES>TRUE</ALLOCATE_SPARE_NODES>
<SUPPORTED_BY>E3SM</SUPPORTED_BY>
@@ -3065,6 +3072,7 @@
<CCSM_CPRNC>/gpfs/alpine/cli115/world-shared/e3sm/tools/cprnc.summit/cprnc</CCSM_CPRNC>
<GMAKE_J>32</GMAKE_J>
<TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
<BATCH_SYSTEM>lsf</BATCH_SYSTEM>
<SUPPORTED_BY>e3sm</SUPPORTED_BY>
<MAX_TASKS_PER_NODE>84</MAX_TASKS_PER_NODE>
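The commit message warns that other batch machines without an explicit NTEST_PARALLEL_JOBS may still oversubscribe their login nodes. A minimal sketch of how such entries could be listed follows. It assumes machine entries are `<machine MACH="...">` elements under a `<config_machines>` root and that machines without a scheduler use `<BATCH_SYSTEM>none</BATCH_SYSTEM>`; neither detail appears in this diff, and the machine names and the `machines_missing_limit` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the wrapper elements and machine names are assumptions,
# not copied from config_machines.xml.
SAMPLE = """
<config_machines>
  <machine MACH="example-test-machine">
    <NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
    <BATCH_SYSTEM>slurm</BATCH_SYSTEM>
  </machine>
  <machine MACH="example-batch-machine">
    <BATCH_SYSTEM>pbs</BATCH_SYSTEM>
  </machine>
  <machine MACH="example-workstation">
    <BATCH_SYSTEM>none</BATCH_SYSTEM>
  </machine>
</config_machines>
"""


def machines_missing_limit(xml_text):
    """Return names of batch machines that do not set NTEST_PARALLEL_JOBS."""
    root = ET.fromstring(xml_text)
    missing = []
    for machine in root.iter("machine"):
        # Treat a missing BATCH_SYSTEM element the same as "none" (assumption).
        batch = (machine.findtext("BATCH_SYSTEM") or "none").strip()
        if batch.lower() == "none":
            continue
        if machine.find("NTEST_PARALLEL_JOBS") is None:
            missing.append(machine.get("MACH"))
    return missing


print(machines_missing_limit(SAMPLE))
```

Machines reported by a check like this would fall back to MAX_MPITASKS_PER_NODE after this commit, so they are the candidates for an explicit limit.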
2 changes: 1 addition & 1 deletion scripts/lib/CIME/test_scheduler.py
@@ -194,7 +194,7 @@ def __init__(self, test_names, test_data=None,
         if parallel_jobs is None:
             mach_parallel_jobs = self._machobj.get_value("NTEST_PARALLEL_JOBS")
             if mach_parallel_jobs is None:
-                mach_parallel_jobs = 3
+                mach_parallel_jobs = self._machobj.get_value("MAX_MPITASKS_PER_NODE")
             self._parallel_jobs = min(len(test_names), mach_parallel_jobs)
         else:
             self._parallel_jobs = parallel_jobs
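The fallback order in this hunk can be sketched as a standalone function. This is an illustration only, not the CIME code: `resolve_parallel_jobs` is a hypothetical name, a plain dict stands in for `self._machobj`, and the numeric values are made up for the example.

```python
def resolve_parallel_jobs(test_names, parallel_jobs, machine_values):
    """Pick the number of parallel test jobs using the order shown in the hunk.

    machine_values is a hypothetical stand-in for the machine object: a dict
    mapping setting names to integer values.
    """
    # 1. An explicit caller-supplied value is used as-is.
    if parallel_jobs is not None:
        return parallel_jobs
    # 2. Otherwise prefer the per-machine NTEST_PARALLEL_JOBS setting.
    mach_parallel_jobs = machine_values.get("NTEST_PARALLEL_JOBS")
    # 3. If the machine does not set it, fall back to MAX_MPITASKS_PER_NODE.
    if mach_parallel_jobs is None:
        mach_parallel_jobs = machine_values["MAX_MPITASKS_PER_NODE"]
    # Never run more jobs than there are tests.
    return min(len(test_names), mach_parallel_jobs)


tests = ["test_%02d" % i for i in range(40)]
limited = {"NTEST_PARALLEL_JOBS": 4, "MAX_MPITASKS_PER_NODE": 32}  # made-up values
unset = {"MAX_MPITASKS_PER_NODE": 32}  # made-up value

print(resolve_parallel_jobs(tests, None, limited))  # machine limit applies
print(resolve_parallel_jobs(tests, None, unset))    # falls back to tasks per node
print(resolve_parallel_jobs(tests, 2, unset))       # explicit value wins
```

With the made-up values above, a machine that leaves NTEST_PARALLEL_JOBS unset runs up to 32 jobs at once, which is the oversubscription risk the commit message describes for login nodes.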
