Performance Data Capture and Provenance in CIME5.2 (discussion) #1348

Closed
worleyph opened this issue Mar 27, 2017 · 9 comments

@worleyph
Contributor

CIME 5.2 has lost a few minor capabilities / does things a little differently than before.

a) software_environment.txt is created by case.setup, not case.build. If env_mach_specific is changed, or if the system defaults change, between setup and build, this will not be captured.

b) CaseStatus looks different (note @bmayerornl ). Some of this is an improvement - job submission information is platform-agnostic, for example. Time spent in the pre-run and post-run phases is missing, however, and this is probably important to track, especially if we are trying to reduce it further.

c) Compile time per component is no longer output. Measurement of this is somewhat problematic now as all components are built concurrently, but which one takes the longest might still be useful information.

d) I'll be updating the mach_syslog routines to look more like the optimizations that @amametjanov suggested (and were implemented) for Anvil. Might also eliminate some data that seemed like it might be useful, but which we never look at. ( @bmayerornl , you should weigh in here as well.)
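
For item (a), one option would be to record the environment again when the build runs, so that changes made after setup are not lost. The following is only a minimal sketch under stated assumptions: `write_environment_snapshot` is a hypothetical helper, not an existing CIME function, and the timestamp format used for the default `lid` is likewise an assumption.

```python
import os
import time


def write_environment_snapshot(out_dir, stage, lid=None, environ=None):
    """Write sorted NAME=value lines to <stage>_environment.txt.<lid>.

    Hypothetical helper for illustration only; not part of CIME.
    """
    environ = os.environ if environ is None else environ
    # Assumed lid format; the real $lid is generated elsewhere.
    lid = lid or time.strftime("%y%m%d-%H%M%S")
    path = os.path.join(out_dir, f"{stage}_environment.txt.{lid}")
    with open(path, "w") as fh:
        for name in sorted(environ):
            fh.write(f"{name}={environ[name]}\n")
    return path
```

Calling this once from the build step and once from the run step would give two files that can be diffed against the one written at setup time.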

@rljacob
Member

rljacob commented Mar 27, 2017

I had also complained about the compile time per component. I believe the splitting by component is done at the python level so it should be possible to time each thread (?).
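
A rough sketch of what timing each thread could look like, assuming the components are built by one Python thread each. The names `timed_build` and `build_all` and the `builders` mapping are hypothetical and do not reflect how CIME actually structures its build.

```python
import threading
import time


def timed_build(name, build_fn, timings):
    """Run one component build and record its wallclock time."""
    start = time.perf_counter()
    try:
        build_fn()
    finally:
        # Each thread writes a distinct key, so no lock is needed here.
        timings[name] = time.perf_counter() - start


def build_all(builders):
    """Build all components concurrently; return {component: seconds}."""
    timings = {}
    threads = [
        threading.Thread(target=timed_build, args=(name, fn, timings))
        for name, fn in builders.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return timings
```

Because the builds overlap, the per-component times will not sum to the total build time, but `max(timings, key=timings.get)` still identifies which component took the longest.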

My complaint about CaseStatus was all the extra newlines. That was a side effect from making TestStatus more readable.

@worleyph
Contributor Author

With PR #1429 the performance data provenance being collected looks like:

a) Case information:

 CaseDocs.$lid/

(CaseDocs from the case directory as well as all env_xxx files, user_nl_xxx files, case.run, Depends file, README.case, and software_environment.txt)

b) Code information:

 GIT_DESCRIBE.$lid
 SourceMods.$lid.tar

c) Environment information (build and run):

 build_environment.txt.$lid
 run_environment.txt.$lid

d) "What happened" information, including runtime per day (cpl.log) and actual PIO strides, etc. (cpl.log and acme.log)

 CaseStatus.$lid
 $jobid.OU.$lid (titan and anvil), $case.$jobid.$lid.gz (SLURM), $jobid.output.$lid and $jobid.cobaltlog.$lid (cetus and mira)
 acme.log.$lid
 cpl.log.$lid

e) Performance data:

 acme_timing.$case.$lid
 acme_timing_stats.$lid
 timing.$lid.tar

f) System status when job started (system-specific - see provenance.py for actual commands):

 Titan: qstatf_jobid.$lid, showq.$lid, xtnodestat.$lid
 SLURM: sqsf_jobid.$lid, squeuef.$lid, squeues.$lid, sinfol.$lid
 Mira/Cetus: qstatf.$lid, qstatf_jobid.$lid
 Anvil: qstatf.$lid, qstatf_jobid.$lid, qstatr.$lid

g) Then the checkpoints subdirectory captures snapshots of progress during the run (at SYSLOG_N second intervals, where frequency of checkpoint timing data is controlled by TPROF_N):

g1) Progress for each component (in terms of component-specific metrics, usually something like step) appended to the following files every SYSLOG_N seconds

 atm.log.$lid.step
 cpl.log.$lid.step
 ice.log.$lid.step
 lnd.log.$lid.step
 ocn.log.$lid.step
 rof.log.$lid.step

g2) All runtimes per day from cpl.log up to that point

  cpl.log.$lid.step-all

g3) System status information (what other jobs are running, and where). $remaining suffix is how much time is left before job exceeds requested wallclock time.

 titan: showqr.$lid.$remaining, xtnodestat.$lid.$remaining
 SLURM: squeuef.$lid.$remaining, squeues.$lid.$remaining
 mira/cetus: qstatf.$lid.$remaining
 anvil: qstatr.$lid.$remaining, qstatn.$lid.$remaining

g4) acme.log right after the job has started (on Titan, this makes sure that we get the process and thread to core mapping information, which shows up first)

 acme.log.$lid.$remaining

g5) Checkpoint timing data for the root PE of each component, and also global statistics. Copied over at SYSLOG_N intervals, but generated at a TPROF_N-derived frequency, e.g. for every day, with all components having process 0 as root:

 model_timing_$simulationtime.$processid
 model_timing_$simulationtime_stats
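
The periodic snapshots described in (g) can be illustrated with a short Python sketch. This is not the actual mach_syslog implementation. It assumes that `$remaining` is an integer number of seconds and that a snapshot is a plain file copy; all function names are hypothetical.

```python
import os
import shutil
import time


def remaining_seconds(start, wallclock_limit, now):
    """Seconds left before the job exceeds its requested wallclock time."""
    return max(0, int(wallclock_limit - (now - start)))


def snapshot_file(src, checkpoint_dir, lid, remaining):
    """Copy src to <checkpoint_dir>/<basename>.<lid>.<remaining>."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    name = f"{os.path.basename(src)}.{lid}.{remaining}"
    dest = os.path.join(checkpoint_dir, name)
    shutil.copyfile(src, dest)
    return dest


def run_snapshots(src, checkpoint_dir, lid, wallclock_limit, syslog_n,
                  clock=time.time, sleep=time.sleep, max_samples=None):
    """Take one snapshot every syslog_n seconds until time runs out."""
    start = clock()
    taken = []
    while max_samples is None or len(taken) < max_samples:
        remaining = remaining_seconds(start, wallclock_limit, clock())
        taken.append(snapshot_file(src, checkpoint_dir, lid, remaining))
        if remaining == 0:
            break
        sleep(syslog_n)
    return taken
```

The `clock` and `sleep` parameters are injectable only so the loop can be exercised without waiting in real time.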

@jgfouca
Member

jgfouca commented Apr 21, 2017

@worleyph I'm looking at this ticket. Is there anything you want me to do here? I could potentially handle (b) and (c) from your list.

@worleyph
Contributor Author

worleyph commented Apr 21, 2017

@jgfouca , this is just a discussion thread so far. Once we decide on something, if we do, we'll generate a new github issue. Thanks for asking though.

agsalin pushed a commit that referenced this issue May 1, 2017
create_test was not handling user-selected projects correctly
@rljacob
Member

rljacob commented Sep 19, 2017

@worleyph should this still be open?

@worleyph
Contributor Author

Probably not. If okay with you, I'll wait for the new CIME to be integrated and then check whether any items in the original list are missing and still might be useful. I think that most have been addressed. At that point I'll close it and open a new issue if necessary.

@jgfouca
Member

jgfouca commented Oct 16, 2017

@worleyph can I close?

@worleyph
Contributor Author

I'll take care of this by the end of the week. I'll definitely close it, but will first check if there is anything that I want to move to a new issue. Thanks for the reminder though.

@worleyph
Contributor Author

Closing this now. Created github issue #1857 for item (b) and will start working on a PR to address this.

@jgfouca jgfouca closed this as completed Oct 19, 2017
mark-petersen pushed a commit to mark-petersen/E3SM that referenced this issue Jan 19, 2021
Update needed for ACME

Framework features brought in:

* f444d0f Merge PR E3SM-Project#1418 'matthewhoffman/framework/output_record_reference_time' into develop
* 263e14f Merge PR E3SM-Project#1428 'mark-petersen/framework/couple_fixes' into develop
* bcce31d Merge PR E3SM-Project#1424 'amametjanov:az/tools/cp-prebuilt-tools' into develop
* 98cfeea Merge PR# 1349 'akturner/framework/forcing_cleanup' into develop
* 9359319 Merge PR E3SM-Project#1347 'akturner/framework/forcing_restart_timestamp' into develop
* e9ce203 Merge PR E3SM-Project#1348 'akturner/framework/forcing_at_init' into develop
* 4974284 Merge PR E3SM-Project#1368 'akturner/framework/improved_messages_in_driver' into develop
* 86d50c5 Merge PR E3SM-Project#1417 'akturner/framework/forcing_multiple_blocks' into develop
* 9116da3 Merge branch 'framework/validation-of-streams-using-interval_in-interval_out' into develop
* e466b46 Merge branch 'framework/interval_in-interval_out-support-for-streams' into develop
* 30dc955 Merge branch 'az/framework/mpas_dmpar-race-fix' into develop
* b632938 Merge branch 'framework/i8_interval_division' into develop
* 6dac06c Merge branch 'framework/log_write_IBM_error' into develop
* 960a648 Merge branch 'framework/cleanup-logging-stream-manager' into develop
* 504c282 Merge branch 'framework/make-streams-with-direction-none-inactive' into develop
* 5903748 Merge branch 'framework/correctly_remove_blk_fields' into develop
* 3565965 Merge branch 'framework/iostreams-real4dfield-bug' into develop
* 8b60591 Merge branch 'framework/missing-deallocate-nEdgesOnCellField-bootstrapping' into develop
* 70b953b Merge branch 'master' into develop