Merge pull request #3115 from ESMCI/jhkennedy/test-concurrent-tasks
Provide default for the number of create_test parallel jobs
E3SM's e3sm_developer test suite will launch a large number of parallel builds on the login node unless create_test is explicitly told how many parallel jobs (-j/--parallel-jobs) to use. This is because the current default comes from the MAX_MPITASKS_PER_NODE machine/env config variable, which on Cori-knl is 64.

This commit:

- sets the default number of parallel jobs to 3
- adds a machine config (XML or environment) variable, NTEST_PARALLEL_JOBS, which can be set to override the default on a per-machine basis

The parallel-jobs setting priority is now (highest to lowest):

1. the -j/--parallel-jobs command line argument
2. the NTEST_PARALLEL_JOBS config_machines.xml or environment variable
3. the default value
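The resolution order above can be sketched in a few lines. This is an illustrative sketch, not CIME's actual API: `resolve_parallel_jobs` and the machine object are stand-ins, and the final fallback mirrors the old MAX_MPITASKS_PER_NODE path rather than a hard-coded constant.

```python
# Illustrative sketch of the new priority chain; the real logic lives in
# CIME's TestScheduler, which reads settings through a Machines object.

def resolve_parallel_jobs(machobj, test_names, cmdline_jobs=None):
    # 1. -j/--parallel-jobs from the command line wins outright.
    if cmdline_jobs is not None:
        return cmdline_jobs
    # 2. Next, the per-machine NTEST_PARALLEL_JOBS setting (XML or env).
    jobs = machobj.get_value("NTEST_PARALLEL_JOBS")
    # 3. Finally, fall back to the machine-derived default.
    if jobs is None:
        jobs = machobj.get_value("MAX_MPITASKS_PER_NODE")
    # Never launch more parallel jobs than there are tests to run.
    return min(len(test_names), jobs)
```

The `min()` guard means a short test list never over-subscribes the node, whichever source supplied the job count.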
Test suite: scripts_regression_tests.py on Cori-knl
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #2923
User interface changes?: N
Update gh-pages html (Y/N)?: N
jedwards4b authored May 22, 2019
2 parents ea3d1bf + cae6b73 commit a224635
Showing 3 changed files with 15 additions and 2 deletions.
8 changes: 8 additions & 0 deletions config/e3sm/machines/config_machines.xml
@@ -218,6 +218,7 @@
 <CCSM_CPRNC>/project/projectdirs/acme/tools/cprnc.cori/cprnc</CCSM_CPRNC>
 <GMAKE_J>8</GMAKE_J>
 <TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>nersc_slurm</BATCH_SYSTEM>
 <SUPPORTED_BY>e3sm</SUPPORTED_BY>
 <MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
@@ -356,6 +357,7 @@
 <CCSM_CPRNC>/project/projectdirs/acme/tools/cprnc.cori/cprnc</CCSM_CPRNC>
 <GMAKE_J>8</GMAKE_J>
 <TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>nersc_slurm</BATCH_SYSTEM>
 <SUPPORTED_BY>e3sm</SUPPORTED_BY>
 <MAX_TASKS_PER_NODE>128</MAX_TASKS_PER_NODE>
@@ -1038,6 +1040,7 @@
 <CCSM_CPRNC>/home/ccsm-data/tools/cprnc</CCSM_CPRNC>
 <GMAKE_J>4</GMAKE_J>
 <TESTS>e3sm_integration</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>pbs</BATCH_SYSTEM>
 <SUPPORTED_BY>acme</SUPPORTED_BY>
 <MAX_TASKS_PER_NODE>16</MAX_TASKS_PER_NODE>
@@ -1141,6 +1144,7 @@
 <CCSM_CPRNC>/lcrc/group/acme/tools/cprnc/cprnc</CCSM_CPRNC>
 <GMAKE_J>8</GMAKE_J>
 <TESTS>e3sm_integration</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>slurm</BATCH_SYSTEM>
 <SUPPORTED_BY>E3SM</SUPPORTED_BY>
 <MAX_TASKS_PER_NODE>36</MAX_TASKS_PER_NODE>
@@ -1255,6 +1259,7 @@
 <CCSM_CPRNC>/lcrc/group/acme/tools/cprnc/cprnc</CCSM_CPRNC>
 <GMAKE_J>8</GMAKE_J>
 <TESTS>e3sm_integration</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>slurm</BATCH_SYSTEM>
 <SUPPORTED_BY>E3SM</SUPPORTED_BY>
 <MAX_TASKS_PER_NODE>36</MAX_TASKS_PER_NODE>
@@ -1570,6 +1575,7 @@
 <CCSM_CPRNC>/projects/ccsm/acme/tools/cprnc/cprnc</CCSM_CPRNC>
 <GMAKE_J>8</GMAKE_J>
 <TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>cobalt_theta</BATCH_SYSTEM>
 <SUPPORTED_BY>E3SM</SUPPORTED_BY>
 <MAX_TASKS_PER_NODE>128</MAX_TASKS_PER_NODE>
@@ -2177,6 +2183,7 @@
 <CCSM_CPRNC>/lustre/atlas1/cli900/world-shared/cesm/tools/cprnc/cprnc.titan</CCSM_CPRNC>
 <GMAKE_J>8</GMAKE_J>
 <TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>pbs</BATCH_SYSTEM>
 <ALLOCATE_SPARE_NODES>TRUE</ALLOCATE_SPARE_NODES>
 <SUPPORTED_BY>E3SM</SUPPORTED_BY>
@@ -3065,6 +3072,7 @@
 <CCSM_CPRNC>/gpfs/alpine/cli115/world-shared/e3sm/tools/cprnc.summit/cprnc</CCSM_CPRNC>
 <GMAKE_J>32</GMAKE_J>
 <TESTS>e3sm_developer</TESTS>
+<NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
 <BATCH_SYSTEM>lsf</BATCH_SYSTEM>
 <SUPPORTED_BY>e3sm</SUPPORTED_BY>
 <MAX_TASKS_PER_NODE>84</MAX_TASKS_PER_NODE>
3 changes: 3 additions & 0 deletions config/xml_schemas/config_machines.xsd
@@ -46,6 +46,7 @@
 <xs:element name="GMAKE" type="xs:string"/>
 <xs:element name="GMAKE_J" type="xs:integer"/>
 <xs:element name="TESTS" type="xs:string"/>
+<xs:element name="NTEST_PARALLEL_JOBS" type="xs:integer"/>
 <xs:element name="BATCH_SYSTEM" type="xs:NCName"/>
 <xs:element name="ALLOCATE_SPARE_NODES" type="upperBoolean"/>
 <xs:element name="SUPPORTED_BY" type="xs:string"/>
@@ -140,6 +141,8 @@
 <xs:element ref="GMAKE_J" minOccurs="0" maxOccurs="1"/>
 <!-- TESTS: (acme only) list of tests to run on this machine -->
 <xs:element ref="TESTS" minOccurs="0" maxOccurs="1"/>
+<!-- NTEST_PARALLEL_JOBS: number of parallel jobs create_test will launch -->
+<xs:element ref="NTEST_PARALLEL_JOBS" minOccurs="0" maxOccurs="1"/>
 <!-- BATCH_SYSTEM: batch system used on this machine (none is okay) -->
 <xs:element ref="BATCH_SYSTEM" minOccurs="1" maxOccurs="1"/>
 <!-- ALLOCATE_SPARE_NODES: allocate spare nodes when job is launched default False-->
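Given the schema ordering in config_machines.xsd, a machine entry that opts into the new setting places the element between TESTS and BATCH_SYSTEM. A trimmed, hypothetical fragment (machine name and surrounding elements are illustrative, not a real entry):

```xml
<!-- Hypothetical, trimmed machine entry: only the elements around the
     new setting are shown, in the order the schema expects. -->
<machine MACH="my-machine">
  <TESTS>e3sm_developer</TESTS>
  <NTEST_PARALLEL_JOBS>4</NTEST_PARALLEL_JOBS>
  <BATCH_SYSTEM>slurm</BATCH_SYSTEM>
  <SUPPORTED_BY>e3sm</SUPPORTED_BY>
</machine>
```

The same value can instead be supplied as an NTEST_PARALLEL_JOBS environment variable on a per-run basis.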
6 changes: 4 additions & 2 deletions scripts/lib/CIME/test_scheduler.py
@@ -192,8 +192,10 @@ def __init__(self, test_names, test_data=None,
         self._walltime = walltime

         if parallel_jobs is None:
-            self._parallel_jobs = min(len(test_names),
-                                      self._machobj.get_value("MAX_MPITASKS_PER_NODE"))
+            mach_parallel_jobs = self._machobj.get_value("NTEST_PARALLEL_JOBS")
+            if mach_parallel_jobs is None:
+                mach_parallel_jobs = self._machobj.get_value("MAX_MPITASKS_PER_NODE")
+            self._parallel_jobs = min(len(test_names), mach_parallel_jobs)
         else:
             self._parallel_jobs = parallel_jobs
