
Number of nodes on Titan #1255

Closed
amametjanov opened this issue Feb 3, 2017 · 39 comments · Fixed by #1788

Comments

amametjanov (Member) commented Feb 3, 2017

Tried to run on Titan with the recommended PE layout from here: https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL2000-ne30_oEC-Titan-931.

I was expecting 675 nodes for ATM and 256 nodes for OCN, for a total of 931 nodes. However, case.run shows #PBS -l nodes=1699: i.e. the total requested node count is 1699. This can only happen by assigning 4x more nodes to OCN: 675+1024=1699. In other words, the scripts are assuming that each OCN MPI rank has 4 threads like the ATM ranks do, but the PE layout asked for OCN to be single-threaded. So the job is charged 1.82x more core-hours than it needs. Appears to be a bug.
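The node arithmetic behind this report can be sketched as follows. This is an illustrative reconstruction, not CIME code: `nodes_needed` is a hypothetical helper, and it assumes Titan's 16 CPU cores per node.

```python
import math

CORES_PER_NODE = 16  # Titan XK7: 16 CPU cores per node (assumption stated above)

def nodes_needed(ntasks, nthrds, cores_per_node=CORES_PER_NODE):
    """Nodes a component needs: one core per (MPI rank x OpenMP thread)."""
    return math.ceil(ntasks * nthrds / cores_per_node)

# Correct per-component counts for the 931-node PE layout:
atm_nodes = nodes_needed(2700, 4)   # 2700 ranks x 4 threads / 16 = 675 nodes
ocn_nodes = nodes_needed(4096, 1)   # 4096 ranks x 1 thread  / 16 = 256 nodes
print(atm_nodes + ocn_nodes)        # 931

# The bug: OCN counted as if its ranks also ran 4 threads each:
ocn_buggy = nodes_needed(4096, 4)   # 1024 nodes
print(atm_nodes + ocn_buggy)        # 1699
```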

worleyph (Contributor) commented Feb 3, 2017

Can you point me at your case directory? I don't think that I have run into this issue on Titan recently - it was broken this way earlier, if I remember correctly. Is this from master?

amametjanov (Member Author) commented:

Yes, git master v1.0.0-beta-223-g628a26d from 2017-02-02

Case-dir: ~azamat/repos/ACME-homme/cime/scripts/cases/A_WCYCL2000-ne30_oEC-intel-01

worleyph (Contributor) commented Feb 3, 2017

Looking at your env_mach_pes.xml:

 <entry id="NTASKS_ATM" value="2700">
 <entry id="NTHRDS_ATM" value="4">
 ...
 <entry id="NTASKS_OCN" value="4096">
 <entry id="NTHRDS_OCN" value="1">

Is the current env_mach_pes.xml the one that you tried? The default for this case has

 <entry id="NTASKS_ATM" value="675">
 <entry id="NTHRDS_ATM" value="2"> 
 ...
 <entry id="NTASKS_OCN" value="512">
 <entry id="NTHRDS_OCN" value="1">

amametjanov (Member Author) commented Feb 3, 2017

Yes, I'm building with this PE layout now. Not with the default.

worleyph (Contributor) commented Feb 3, 2017

Sorry - you complained about a layout with 675 ATM processes, but the one I see has 2700 ATM processes. What should I be looking at?

worleyph (Contributor) commented Feb 3, 2017

SORRY - 675 NODES (not ATM processes). "never mind"

amametjanov (Member Author) commented:

2700 ATM processes with 4 threads per process is 10,800 cores, laid out across 675 nodes at 16 cores per node.

worleyph (Contributor) commented Feb 3, 2017

I see the same problem - the aprun command is correct however.

amametjanov (Member Author) commented:

Yes, aprun is good, just getting over-charged.

amametjanov (Member Author) commented:

And the qstat output after submission:

moab1.ccs.ornl.gov:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
3211787                 azamat      batch    A_WCYCL2000-ne30    --   1699  27184    --   00:45:00 Q       --

worleyph (Contributor) commented Feb 3, 2017

Okay, we got the earlier aprun error fixed (Issue #1158, which may have a similar origin to this problem), but not the calculation of the number of nodes needed. @jgfouca, this is critical for production runs on Titan.

worleyph (Contributor) commented Feb 3, 2017

@amametjanov, in the meantime, you can manually edit the case.run file to request the actual number of nodes needed, to get back to what you were trying to do. (Which you probably already know - just wanted to be certain.)

amametjanov (Member Author) commented:

Yes, everyone should check case.run after case.setup to make sure #PBS -l nodes= is as expected.
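This check can also be scripted. A minimal sketch (`requested_nodes` is a hypothetical helper; the `#PBS` line format is taken from the output quoted in this thread):

```python
import re

def requested_nodes(case_run_text):
    """Extract the node count from a '#PBS -l nodes=N' line, or None if absent."""
    m = re.search(r"^#PBS -l nodes=(\d+)", case_run_text, re.MULTILINE)
    return int(m.group(1)) if m else None

print(requested_nodes("#!/bin/bash\n#PBS -l nodes=1699\n"))  # 1699
```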

jgfouca (Member) commented Feb 6, 2017

task_maker.py got completely refactored in CIME 5.2, so I'm going to wait until we have that before looking too closely at this.

worleyph (Contributor) commented:

@jgfouca, this problem still exists in CIME 5.2 (including in PR #1321).

jgfouca (Member) commented Mar 27, 2017

@worleyph @amametjanov OK, looking at this now

jgfouca (Member) commented Mar 27, 2017

@worleyph @amametjanov I have a fix ready, I just need #1321 to get merged to master.

worleyph (Contributor) commented Mar 27, 2017

When can I do the merge (@rljacob)? Skybridge and melvin tests all failed because of the BFBFLAG change (last I checked). Should I check again tomorrow? What is "good enough" for the merge to master?

rljacob (Member) commented Mar 27, 2017

Go ahead.

worleyph (Contributor) commented:

Now?

worleyph (Contributor) commented:

We have a thunderstorm here right now. Would hate to get in the middle and lose connectivity. If the answer is "now", I'll do the merge as soon as things settle down, or tomorrow morning.

Thanks.

rljacob (Member) commented Mar 28, 2017

Yes.

amametjanov added a commit that referenced this issue Mar 28, 2017
…1351)

Restore optimized num nodes:
On titan, the aprun command is now correct but num_nodes was
incorrect due to not using the 'optimized' num nodes that task maker
was programmed to use with aprun long ago. This commit restores
that capability.

Fixes #1255

[BFB]
agsalin pushed a commit that referenced this issue May 2, 2017
Update queue selection to take walltime into account

Adds concept of strict walltime.

The idea here is to have better support for machines like blues that have a "debug" queue and a "standard" queue. The "debug" has strict limits on both walltime and num_pes and therefore should not be selected as the user's queue if they asked for a long walltime. For other machines, the maxwalltime setting is being used more like a default walltime than a true max.

Test suite: scripts_regression_tests (melvin and skybridge) and some by-hand testing on blues
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1255

User interface changes?: Changes in how walltime is handled

Code review: @jedwards4b @jayeshkrishna @rljacob

amametjanov (Member Author) commented Jun 16, 2017

This still appears to be an issue. Nodes calculated in case.run are incorrect: expected 1926 nodes, but got #PBS -l nodes=2119. Same thing happened when doing a single-node run: expected 1, but got case.run with 2. aprun command args are correct though.

amametjanov reopened this Jun 16, 2017

jayeshkrishna (Contributor) commented Jun 16, 2017

Note that the aprun command (1926 node run on Titan - see https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL2000+ne120_oRRS15+Titan+1926) is correct, the PBS args (number of nodes requested) are wrong.

mt5555 (Contributor) commented Jun 16, 2017

There's a new feature where the job allocates more nodes than aprun uses, so that if a node fails (very common on Titan) it can attempt a restart within the same job. But I thought that would only ask for an extra 4 nodes or something like that. Just thought I'd mention it since it might be related.

worleyph (Contributor) commented:

Look in your env_mach_pes.xml file ...

   <group id="mach_pes">
     <entry id="PCT_SPARE_NODES" value="10">
       <type>integer</type>
       <desc>Percent of extra spare nodes to allocate</desc>
     </entry>
     <entry id="NTASKS">
 ...

The default is an increase of 10% in the requested number of nodes (on Titan). Probably overkill, but I do not know the reasoning that determined this number.

jayeshkrishna (Contributor) commented:

Ok, now it makes sense. 1926 + 10% increase = 1926 + ceil(1926 * 0.1) = 2119 nodes.
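The padding jayeshkrishna describes can be written out explicitly. A sketch, assuming the 10% default and ceiling rounding shown above (`padded_nodes` is a hypothetical name, not the actual CIME function):

```python
import math

def padded_nodes(num_nodes, pct_spare=10):
    """Requested nodes after adding PCT_SPARE_NODES percent, rounded up."""
    return num_nodes + math.ceil(num_nodes * pct_spare / 100.0)

print(padded_nodes(1926))  # 2119: the unexpected #PBS request reported above
print(padded_nodes(1))     # 2: why even a single-node case asked for 2 nodes
```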

mt5555 (Contributor) commented Jun 17, 2017

@jgfouca : 10% seems like a lot of nodes for the default - 192 nodes in this case, meaning we are allowing for 192 failures, each requiring a restart run? Or is it that dozens of nodes fail at a time? If the former, can we change this to max(10%,5)?

worleyph (Contributor) commented:

Since failures can happen multiple times during a long run, maybe more than 5 nodes as the max? 10? 20? Perhaps OLCF can advise. Perhaps they already have?

mt5555 (Contributor) commented Jun 17, 2017

Take a look at Luke's run log:

https://acme-climate.atlassian.net/wiki/display/OCNICE/GMPAS-IAF_oRRS18to6v3

some runs had no failures, a few with 1 failure and a few with 2 failures. One run he listed as "numerous failures".

Pinging our new Titan POC, @minxu74 , can you advise?

jgfouca (Member) commented Jun 17, 2017

10% was a WAG (wild-&$! guess) by me when I initially implemented this capability. I would be happy to change it to something more reasonable if someone can tell me what that is.

vanroekel (Contributor) commented Jun 18, 2017 via email

jgfouca (Member) commented Jun 18, 2017

@vanroekel , thanks. The last piece of information is this: is there a 1-1 correlation between the number of times the magic string (error in the log that would indicate a node failure) appears in the log and the number of nodes that failed?

jgfouca (Member) commented Aug 16, 2017

This has been addressed in CIME:

ESMCI/cime#1816

jgfouca added a commit that referenced this issue Oct 3, 2017
New user interface features:
option --skip-preview-namelist to case.submit to skip calling preview_namelist during case.run
option -M MAIL_USER, --mail-user MAIL_USER to case.submit to set an email for batch system notification
option -m {never,all,begin,end,fail}, --mail-type {never,all,begin,end,fail} to case.submit for when the batch system should send email.
manage_case renamed to more accurate query_config and usage/arguments made easier and output more informative.
Add --pesfile option to create_test
New case.qstatus tool
Allow user to adjust ulimits
Add message at end of successful build
Add new user-mods option to create_clone
New ways to select queue in config_batch.xml
scripts_regression_tests.py now has --compiler and --mpilibs option
new case variables controlling spare nodes (for failing nodes workaround)

User interface bug fixes:
xmlquery options --file and --listall now work together.
xmlquery options clarified and help message improved.
Improve error message for passing incorrect grid alias to create_newcase.
Improve error messages from xmlchange

Removed user features:
Remove long term archiving support. "case.lt_archive" no longer in caseroot.

Other changes to case control system:
Improve ERR test.
New ERS2 test
Add IRT test
Implement NCR with SystemTestsCompareTwo
Rewrite ERP with SystemTestsCompareTwo
Convert the ERR and ERRI test to the compare2 framework
Fix SEQ test on skybridge
new case.test --reset option to manually reset a test to initial conditions.
TOTAL_CORES was inaccurate so remove it.
Only create rest directories if they are going to be populated
get_timing.py should use largest ncpl value, not assume it's the atm value
Improve comment for BASELINE status in TestStatus
Better handling of fails in the build (don't print a python stacktrace)
Add a new directory testmod that creates cpl history files on a daily basis and thereby permits answer changes to be compared more easily.
Add ability for create_test to infer non-default machine from testname
Change all python2 style strings to python3 style strings
Improve handling of hist comparisons in TestStatus.log
version 3.0 schema for entry_id files
Downscale build parallelism on small machines (so we don't overload it)
Bring bless_test_results behavior into an exact match with create_test -g
lcdf4 flag does apply to pnetcdf.
Replace cdf64 logical with PIO_NETCDF_FORMAT character variable
Refactor how spare nodes get computed
Preserve test order in test_scheduler
Allow queue selection by node count instead of task count
Improve error check for non-python sub processes

Data model changes:
Refactored data models
Use %CPLHIST for DATM coupler hist forcing
Add support for data model to read multiple time slices at once
New aquaplanet capability: adds aquaplanet capability to DOCN where the input is from a file rather than analytic; adds new A compsets to the driver config_compsets.xml that have different DOCN functionality: ADSOM - DOCN SOM, ADSOMAQP - DOCN aquaplanet SOM, ADAQP3 - DOCN analytic aquaplanet (mode 3), ADAQFILE - DOCN aquaplanet from file
DATM updates for clm
Change SIAF compset name in dice to IAF to be consistent with other components

Coupler/driver changes:
Map SMB from lnd to glc using bilinear mapping with a conservation correction
Remove more cesm refs in filenames.
seq_comm_mct code cleanup
Implementation of multiple couplers within multi-instance runs.

Other:
User's guide rst files now in cime/doc directory
Get PFUnit working on melvin
Homebrew generic mac port

Fixes #1255
Fixes #1424
Fixes #1426
Fixes #1573
Fixes #1576
Fixes #1585

* jgfouca/update_to_cime_5.3.0.34: (932 commits)
  One more short term archiver fix
  Fix merge conflicts, should mostly work
  cetus xml fix
  Hack to get PET test to work on chama
  Fix HOMME Cmake pointing to wrong cime area
  Update run_acme version
  Fixes for AQUAP
  Removes long term archiving from run_acme
  Move xml compiler and mpilib tags from env and command elements This is required for the new xml schema requirements; compiler and mpilib tags should be in the modules and environment_variables elements rather than their members
  Cherry-pick fixes from older merge branch
  cime_developer now working on melvin
  new git pelayout doc
  Update ChangeLog
  Change atm_grid "48x96" to T31 to match the model_grid_alias that is actually being used.
  Minor fix
  Fix mistake
  Better encapsulation of common calls (create_test, wait_for_tests)
  Add a timeout option to wait_for_tests
  update to sphinx documentation
  Restore task_count
  ...
jgfouca added a commit that referenced this issue Oct 11, 2017
Update CIME to 5.3.0alpha34

(Commit message identical to the Oct 3 merge above; the cherry-picked commit list grew to 942 commits.)
jgfouca added a commit that referenced this issue Oct 12, 2017
Update CIME to 5.3.0alpha34

(Commit message identical to the Oct 11 merge above.)
worleyph (Contributor) commented Oct 13, 2017

@jgfouca and @vanroekel and @mt5555 , the new logic supports both asking CIME to do something reasonable and specifying the extra number of nodes directly:

   <entry id="ALLOCATE_SPARE_NODES" value="TRUE">

or something like

   <entry id="FORCE_SPARE_NODES" value="5">

(Note that FORCE_SPARE_NODES takes precedence when both are set.)

However, the default for Titan is now

   <entry id="ALLOCATE_SPARE_NODES" value="FALSE">
   <entry id="FORCE_SPARE_NODES" value="-999">

whereas the default used to be TRUE (adding 10% more nodes). Just wanted to check whether this is what we wanted (default to be FALSE) or not.
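The precedence worleyph describes can be sketched as follows. This is a hypothetical reconstruction, not the CIME implementation; the 10%-capped-at-10 heuristic is inferred from later comments in this thread and has not been verified against the source.

```python
import math

UNSET = -999  # sentinel value seen in env_mach_pes.xml for FORCE_SPARE_NODES

def spare_nodes(num_nodes, allocate_spare=False, force_spare=UNSET):
    """Extra nodes to request beyond what aprun needs (illustrative only)."""
    # FORCE_SPARE_NODES takes precedence when both are set
    if force_spare != UNSET:
        return force_spare
    if allocate_spare:
        # guessed heuristic: 10% of requested nodes, capped at 10
        return min(10, math.ceil(num_nodes * 0.1))
    return 0

print(spare_nodes(1926, allocate_spare=True))  # 10 (hits the cap)
print(spare_nodes(1926, force_spare=5))        # 5 (forced count wins)
print(spare_nodes(1926))                       # 0 (the new Titan default)
```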

jgfouca (Member) commented Oct 16, 2017

@worleyph I think we want it to be true on Titan; I will investigate.

jgfouca reopened this Oct 16, 2017

vanroekel (Contributor) commented:

@jgfouca and @worleyph I agree that this should be true on Titan. I would also suggest adding only a few spare nodes (~5). 10% is much too large for high-resolution runs and could be too small for low-resolution ones.

jgfouca (Member) commented Oct 16, 2017

@vanroekel, yes, that change was made a while ago. I think it maxes out at 10 nodes.

worleyph (Contributor) commented:

@vanroekel, in my experiments it appears that the default for Titan is 10, though you can also modify this on a case-by-case basis. I did not look at the code to verify what the default is or how it was determined.

jgfouca added a commit that referenced this issue Oct 16, 2017
…1846)

Titan should allocate spare nodes by default.

Fixes #1255

[BFB]

* origin/jgfouca/cime/titan_allocate_spare_nodes:
  Titan should allocate spare nodes by default.
jgfouca added a commit that referenced this issue Oct 16, 2017
…#1846)

Titan should allocate spare nodes by default.

Fixes #1255

[BFB]

* origin/jgfouca/cime/titan_allocate_spare_nodes:
  Titan should allocate spare nodes by default.