
Number of nodes on Titan #1255

Closed
amametjanov opened this issue Feb 3, 2017 · 39 comments · Fixed by #1788

Comments

amametjanov (Member) commented Feb 3, 2017

Tried to run on Titan with the recommended PE layout from here: https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL2000-ne30_oEC-Titan-931.

I was expecting 675 nodes for ATM and 256 nodes for OCN, for a total of 931 nodes. However, case.run shows #PBS -l nodes=1699: i.e. the total requested node count is 1699. This can only happen by assigning 4x more nodes to OCN: 675+1024=1699. In other words, the scripts are assuming that each OCN MPI rank has 4 threads like the ATM ranks do, but the PE layout asked for OCN to be single-threaded. So the job is charged 1.82x more core-hours than it needs. Appears to be a bug.
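The node arithmetic behind this report can be sketched as follows. This is an illustrative reconstruction, not CIME code: `nodes_needed` is a hypothetical helper, and it assumes Titan's 16 CPU cores per node.

```python
import math

CORES_PER_NODE = 16  # Titan XK7: 16 CPU cores per node (assumption stated above)

def nodes_needed(ntasks, nthrds, cores_per_node=CORES_PER_NODE):
    """Nodes a component needs: one core per (MPI rank x OpenMP thread)."""
    return math.ceil(ntasks * nthrds / cores_per_node)

# Correct per-component counts for the 931-node PE layout:
atm_nodes = nodes_needed(2700, 4)   # 2700 ranks x 4 threads / 16 = 675 nodes
ocn_nodes = nodes_needed(4096, 1)   # 4096 ranks x 1 thread  / 16 = 256 nodes
print(atm_nodes + ocn_nodes)        # 931

# The bug: OCN counted as if its ranks also ran 4 threads each:
ocn_buggy = nodes_needed(4096, 4)   # 1024 nodes
print(atm_nodes + ocn_buggy)        # 1699
```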

worleyph (Contributor) commented Feb 3, 2017

Can you point me at your case directory? I don't think that I have run into this issue on Titan recently - it was broken this way earlier, if I remember correctly. Is this from master?

amametjanov (Member Author) commented:

Yes, git master v1.0.0-beta-223-g628a26d from 2017-02-02

Case-dir: ~azamat/repos/ACME-homme/cime/scripts/cases/A_WCYCL2000-ne30_oEC-intel-01

worleyph (Contributor) commented Feb 3, 2017

Looking at your env_mach_pes.xml:

 <entry id="NTASKS_ATM" value="2700">
 <entry id="NTHRDS_ATM" value="4">
 ...
 <entry id="NTASKS_OCN" value="4096">
 <entry id="NTHRDS_OCN" value="1">

Is the current env_mach_pes.xml the one that you tried? The default for this case has

 <entry id="NTASKS_ATM" value="675">
 <entry id="NTHRDS_ATM" value="2"> 
 ...
 <entry id="NTASKS_OCN" value="512">
 <entry id="NTHRDS_OCN" value="1">

amametjanov (Member Author) commented Feb 3, 2017

Yes, I'm building with this PE layout now. Not with the default.

worleyph (Contributor) commented Feb 3, 2017

Sorry - you complained about a layout with 675 ATM processes, but the one I see has 2700 ATM processes. What should I be looking at?

worleyph (Contributor) commented Feb 3, 2017

SORRY - 675 NODES (not ATM processes). "never mind"

amametjanov (Member Author) commented:

2700 ATM processes with 4 threads per process is 10,800 cores, laid out across 675 nodes at 16 cores per node.

worleyph (Contributor) commented Feb 3, 2017

I see the same problem - the aprun command is correct however.

amametjanov (Member Author) commented:

Yes, aprun is good, just getting over-charged.

amametjanov (Member Author) commented:

And the qstat output after submission:

moab1.ccs.ornl.gov:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
3211787                 azamat      batch    A_WCYCL2000-ne30    --   1699  27184    --   00:45:00 Q       --

worleyph (Contributor) commented Feb 3, 2017

Okay, we got the earlier aprun error fixed (Issue #1158, which may have a similar origin to this problem), but not the calculation of the number of nodes needed. @jgfouca, this is critical for production runs on Titan.

worleyph (Contributor) commented Feb 3, 2017

@amametjanov, in the meantime, you can manually edit the case.run file to request the actual number of nodes needed, to get back to what you were trying to do. (Which you probably already know - just wanted to be certain.)

amametjanov (Member Author) commented:

Yes, everyone should check case.run after case.setup to make sure #PBS -l nodes= is as expected.
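This check can also be scripted. A minimal sketch (`requested_nodes` is a hypothetical helper; the `#PBS` line format is taken from the output quoted in this thread):

```python
import re

def requested_nodes(case_run_text):
    """Extract the node count from a '#PBS -l nodes=N' line, or None if absent."""
    m = re.search(r"^#PBS -l nodes=(\d+)", case_run_text, re.MULTILINE)
    return int(m.group(1)) if m else None

print(requested_nodes("#!/bin/bash\n#PBS -l nodes=1699\n"))  # 1699
```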

jgfouca (Member) commented Feb 6, 2017

task_maker.py got completely refactored in CIME 5.2, so I'm going to wait until we have that before looking too closely at this.

worleyph (Contributor) commented:

@jgfouca, this problem still exists in CIME 5.2 (including in PR #1321).

jgfouca (Member) commented Mar 27, 2017

@worleyph @amametjanov OK, looking at this now

jgfouca (Member) commented Mar 27, 2017

@worleyph @amametjanov I have a fix ready, I just need #1321 to get merged to master.

worleyph (Contributor) commented Mar 27, 2017

When can I do the merge (@rljacob)? Skybridge and melvin tests all failed because of the BFBFLAG change (last I checked). Should I check again tomorrow? What is "good enough" for the merge to master?

rljacob (Member) commented Mar 27, 2017

Go ahead.

worleyph (Contributor) commented:

Now?

worleyph (Contributor) commented:

We have a thunderstorm here right now. Would hate to get in the middle and lose connectivity. If the answer is "now", I'll do the merge as soon as things settle down, or tomorrow morning.

Thanks.

rljacob (Member) commented Mar 28, 2017

Yes.

amametjanov added a commit that referenced this issue Mar 28, 2017
…1351)

Restore optimized num nodes:
On titan, the aprun command is now correct but num_nodes was
incorrect due to not using the 'optimized' num nodes that task maker
was programmed to use with aprun long ago. This commit restores
that capability.

Fixes #1255

[BFB]
agsalin pushed a commit that referenced this issue May 2, 2017
Update queue selection to take walltime into account

Adds concept of strict walltime.

The idea here is to have better support for machines like blues that have a "debug" queue and a "standard" queue. The "debug" has strict limits on both walltime and num_pes and therefore should not be selected as the user's queue if they asked for a long walltime. For other machines, the maxwalltime setting is being used more like a default walltime than a true max.

Test suite: scripts_regression_tests (melvin and skybridge) and some by-hand testing on blues
Test baseline:
Test namelist changes:
Test status: bit for bit

Fixes #1255

User interface changes?: Changes in how walltime is handled

Code review: @jedwards4b @jayeshkrishna @rljacob

amametjanov (Member Author) commented Jun 16, 2017

This still appears to be an issue. Nodes calculated in case.run are incorrect: expected 1926 nodes, but got #PBS -l nodes=2119. Same thing happened when doing a single-node run: expected 1, but got case.run with 2. aprun command args are correct though.

amametjanov reopened this Jun 16, 2017

jayeshkrishna (Contributor) commented Jun 16, 2017

Note that the aprun command (1926 node run on Titan - see https://acme-climate.atlassian.net/wiki/display/PERF/A_WCYCL2000+ne120_oRRS15+Titan+1926) is correct, the PBS args (number of nodes requested) are wrong.

mt5555 (Contributor) commented Jun 16, 2017

There's a new feature where the job allocates more nodes than aprun uses, so that if a node fails (very common on Titan) it can attempt a restart within the same job. But I thought that would only ask for an extra 4 nodes or something like that. Just thought I'd mention it since it might be related.

worleyph (Contributor) commented:

Look in your env_mach_pes.xml file ...

   <group id="mach_pes">
     <entry id="PCT_SPARE_NODES" value="10">
       <type>integer</type>
       <desc>Percent of extra spare nodes to allocate</desc>
     </entry>
     <entry id="NTASKS">
 ...

The default is an increase of 10% in the requested number of nodes (on Titan). Probably overkill, but I do not know the reasoning that determined this number.

jayeshkrishna (Contributor) commented:

Ok, now it makes sense. 1926 + 10% increase = 1926 + ceil(1926 * 0.1) = 2119 nodes.
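The padding jayeshkrishna describes can be written out explicitly. A sketch, assuming the 10% default and ceiling rounding shown above (`padded_nodes` is a hypothetical name, not the actual CIME function):

```python
import math

def padded_nodes(num_nodes, pct_spare=10):
    """Requested nodes after adding PCT_SPARE_NODES percent, rounded up."""
    return num_nodes + math.ceil(num_nodes * pct_spare / 100.0)

print(padded_nodes(1926))  # 2119: the unexpected #PBS request reported above
print(padded_nodes(1))     # 2: why even a single-node case asked for 2 nodes
```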

mt5555 (Contributor) commented Jun 17, 2017

@jgfouca : 10% seems like a lot of nodes for the default - 192 nodes in this case, meaning we are allowing for 192 failures, each requiring a restart run? Or is it that dozens of nodes fail at a time? If the former, can we change this to max(10%,5)?

worleyph (Contributor) commented:

Since failures can happen multiple times during a long run, maybe more than 5 nodes as the max? 10? 20? Perhaps OLCF can advise. Perhaps they already have?

mt5555 (Contributor) commented Jun 17, 2017

Take a look at Luke's run log:

https://acme-climate.atlassian.net/wiki/display/OCNICE/GMPAS-IAF_oRRS18to6v3

some runs had no failures, a few with 1 failure and a few with 2 failures. One run he listed as "numerous failures".

Pinging our new Titan POC, @minxu74 , can you advise?

jgfouca (Member) commented Jun 17, 2017

10% was a WAG (wild-&$! guess) by me when I initially implemented this capability. I would be happy to change it to something more reasonable if someone can tell me what that is.

vanroekel (Contributor) commented Jun 18, 2017 via email

jgfouca (Member) commented Jun 18, 2017

@vanroekel , thanks. The last piece of information is this: is there a 1-1 correlation between the number of times the magic string (error in the log that would indicate a node failure) appears in the log and the number of nodes that failed?

jgfouca (Member) commented Aug 16, 2017

This has been addressed in CIME:

ESMCI/cime#1816

jgfouca added a commit that referenced this issue Oct 3, 2017
New user interface features:
option --skip-preview-namelist to case.submit to skip calling preview_namelist during case.run
option -M MAIL_USER, --mail-user MAIL_USER to case.submit to set an email for batch system notification
option -m {never,all,begin,end,fail}, --mail-type {never,all,begin,end,fail} to case.submit for when the batch system should send email.
manage_case renamed to more accurate query_config and usage/arguments made easier and output more informative.
Add --pesfile option to create_test
New case.qstatus tool
Allow user to adjust ulimits
Add message at end of successful build
Add new user-mods option to create_clone
New ways to select queue in config_batch.xml
scripts_regression_tests.py now has --compiler and --mpilibs option
new case variables controlling spare nodes (for failing nodes workaround)

User interface bug fixes:
xmlquery options --file and --listall now work together.
xmlquery options clarified and help message improved.
Improve error message for passing incorrect grid alias to create_newcase.
Improve error messages from xmlchange

Removed user features:
Remove long term archiving support. "case.lt_archive" no longer in caseroot.

Other changes to case control system:
Improve ERR test.
New ERS2 test
Add IRT test
Implement NCR with SystemTestsCompareTwo
Rewrite ERP with SystemTestsCompareTwo
Convert the ERR and ERRI test to the compare2 framework
Fix SEQ test on skybridge
new case.test --reset option to manually reset a test to initial conditions.
TOTAL_CORES was inaccurate so remove it.
Only create rest directories if they are going to be populated
get_timing.py should use largest ncpl value, not assume it's the atm value
Improve comment for BASELINE status in TestStatus
Better handling of fails in the build (don't print a python stacktrace)
Add a new directory testmod that creates cpl history files on a daily basis and thereby permits answer changes to be compared more easily.
Add ability for create_test to infer non-default machine from testname
Change all python2 style strings to python3 style strings
Improve handling of hist comparisons in TestStatus.log
version 3.0 schema for entry_id files
Downscale build parallelism on small machines (so we don't overload it)
Bring bless_test_results behavior into an exact match with create_test -g
lcdf4 flag does apply to pnetcdf.
Replace cdf64 logical with PIO_NETCDF_FORMAT character variable
Refactor how spare nodes get computed
Preserve test order in test_scheduler
Allow queue selection by node count instead of task count
Improve error check for non-python sub processes

Data model changes:
Refactored data models
Use %CPLHIST for DATM coupler hist forcing
Add support for data model to read multiple time slices at once
New aquaplanet capability: adds aquaplanet capability to DOCN where the input is from a file rather than analytic; adds new A compsets to the driver config_compsets.xml that have different DOCN functionality: ADSOM - DOCN SOM, ADSOMAQP - DOCN aquaplanet SOM, ADAQP3 - DOCN analytic aquaplanet (mode 3), ADAQFILE - DOCN aquaplanet from file
DATM updates for clm
Change SIAF compset name in dice to IAF to be consistent with other components

Coupler/driver changes:
Map SMB from lnd to glc using bilinear mapping with a conservation correction
Remove more cesm refs in filenames.
seq_comm_mct code cleanup
Implementation of multiple couplers within multi-instance runs.

Other:
User's guide rst files now in cime/doc directory
Get PFUnit working on melvin
Homebrew generic mac port

Fixes #1255
Fixes #1424
Fixes #1426
Fixes #1573
Fixes #1576
Fixes #1585

* jgfouca/update_to_cime_5.3.0.34: (932 commits)
  One more short term archiver fix
  Fix merge conflicts, should mostly work
  cetus xml fix
  Hack to get PET test to work on chama
  Fix HOMME Cmake pointing to wrong cime area
  Update run_acme version
  Fixes for AQUAP
  Removes long term archiving from run_acme
  Move xml compiler and mpilib tags from env and command elements This is required for the new xml schema requirements; compiler and mpilib tags should be in the modules and environment_variables elements rather than their members
  Cherry-pick fixes from older merge branch
  cime_developer now working on melvin
  new git pelayout doc
  Update ChangeLog
  Change atm_grid "48x96" to T31 to match the model_grid_alias that is actually being used.
  Minor fix
  Fix mistake
  Better encapsulation of common calls (create_test, wait_for_tests)
  Add a timeout option to wait_for_tests
  update to sphinx documentation
  Restore task_count
  ...
jgfouca added a commit that referenced this issue Oct 11, 2017
Update CIME to 5.3.0alpha34

(Commit message identical to the Oct 3 merge above; the cherry-picked commit list grew to 942 commits.)
jgfouca added a commit that referenced this issue Oct 12, 2017
Update CIME to 5.3.0alpha34

(Commit message identical to the Oct 11 merge above.)
worleyph (Contributor) commented Oct 13, 2017

@jgfouca and @vanroekel and @mt5555 , the new logic supports both asking CIME to do something reasonable and specifying the extra number of nodes directly:

   <entry id="ALLOCATE_SPARE_NODES" value="TRUE">

or something like

   <entry id="FORCE_SPARE_NODES" value="5">

(Note that FORCE_SPARE_NODES takes precedence when both are set.)

However, the default for Titan is now

   <entry id="ALLOCATE_SPARE_NODES" value="FALSE">
   <entry id="FORCE_SPARE_NODES" value="-999">

whereas the default used to be TRUE (adding 10% more nodes). Just wanted to check whether this is what we wanted (default to be FALSE) or not.
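The precedence worleyph describes can be sketched as follows. This is a hypothetical reconstruction, not the CIME implementation; the 10%-capped-at-10 heuristic is inferred from later comments in this thread and has not been verified against the source.

```python
import math

UNSET = -999  # sentinel value seen in env_mach_pes.xml for FORCE_SPARE_NODES

def spare_nodes(num_nodes, allocate_spare=False, force_spare=UNSET):
    """Extra nodes to request beyond what aprun needs (illustrative only)."""
    # FORCE_SPARE_NODES takes precedence when both are set
    if force_spare != UNSET:
        return force_spare
    if allocate_spare:
        # guessed heuristic: 10% of requested nodes, capped at 10
        return min(10, math.ceil(num_nodes * 0.1))
    return 0

print(spare_nodes(1926, allocate_spare=True))  # 10 (hits the cap)
print(spare_nodes(1926, force_spare=5))        # 5 (forced count wins)
print(spare_nodes(1926))                       # 0 (the new Titan default)
```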

jgfouca (Member) commented Oct 16, 2017

@worleyph I think we want it to be true on Titan; I will investigate.

jgfouca reopened this Oct 16, 2017

vanroekel (Contributor) commented:

@jgfouca and @worleyph I agree that this should be true on Titan. I would also suggest adding only a few spare nodes (~5). 10% is much too large for high-resolution runs and could be too small for low-resolution ones.

jgfouca (Member) commented Oct 16, 2017

@vanroekel, yes, that change was made a while ago. I think it maxes out at 10 nodes.

worleyph (Contributor) commented:

@vanroekel, in my experiments it appears that the default for Titan is 10, though you can also modify this on a case-by-case basis. I did not look at the code to verify what the default is or how it was determined.

jgfouca added a commit that referenced this issue Oct 16, 2017
…1846)

Titan should allocate spare nodes by default.

Fixes #1255

[BFB]

* origin/jgfouca/cime/titan_allocate_spare_nodes:
  Titan should allocate spare nodes by default.
jgfouca added a commit that referenced this issue Oct 16, 2017
…#1846)

Titan should allocate spare nodes by default.

Fixes #1255

[BFB]

* origin/jgfouca/cime/titan_allocate_spare_nodes:
  Titan should allocate spare nodes by default.