Performance Data Capture and Provenance in CIME5.2 (discussion) #1348
I had also complained about the compile time per component. I believe the splitting by component is done at the Python level, so it should be possible to time each thread (?). My complaint about CaseStatus was all the extra newlines. That was a side effect of making TestStatus more readable. |
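To illustrate the idea of timing each build thread, here is a minimal sketch. It assumes each component build can be wrapped in a zero-argument Python callable; `timed_build`, `build_all`, and the `builders` mapping are hypothetical names for illustration, not CIME's actual build code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_build(name, build_fn):
    """Run build_fn() and return (name, elapsed wallclock seconds)."""
    start = time.perf_counter()
    build_fn()
    return name, time.perf_counter() - start


def build_all(builders):
    """Build all components concurrently and time each one.

    builders: dict mapping component name -> zero-argument callable
    (hypothetical stand-ins for the real per-component build steps).
    Returns a dict mapping component name -> elapsed seconds.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(builders))) as pool:
        futures = [pool.submit(timed_build, n, fn) for n, fn in builders.items()]
        return dict(f.result() for f in futures)
```

Because the builds overlap, the per-component times will not sum to the total build time, but the largest value still shows which component is on the critical path.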
With PR #1429, the performance data provenance being collected looks like:
a) Case information:
(CaseDocs from the case directory as well as all env_xxx files, user_nl_xxx files, case.run, Depends file, README.case, and software_environment.txt)
b) Code information:
c) Environment information (build and run):
d) "What happened" information, including runtime per day (cpl.log) and actual PIO strides, etc. (cpl.log and acme.log)
e) Performance data:
f) System status when job started (system-specific - see provenance.py for actual commands):
g) Then the checkpoints subdirectory captures snapshots of progress during the run (at SYSLOG_N second intervals, where the frequency of checkpoint timing data is controlled by TPROF_N):
g1) Progress for each component (in terms of component-specific metrics, usually something like step) appended to the following files every SYSLOG_N seconds
g2) All runtimes per day from cpl.log up to that point
g3) System status information (what other jobs are running, and where). The $remaining suffix is how much time is left before the job exceeds the requested wallclock time.
g4) acme.log right after the job has started (on Titan, this makes sure that we get the process- and thread-to-core mapping information that shows up first)
g5) Checkpoint timing data for the root PE of each component, and also global statistics. These are copied over at SYSLOG_N intervals but generated at a TPROF_N-derived frequency, e.g. for every day, with all components having process 0 as root:
|
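As a rough illustration of the $remaining suffix described in (g3), the following sketch computes the time left before the requested wallclock limit and appends it to a snapshot name. The function names, the `status` prefix, and the `prefix.remaining` naming pattern are assumptions made for this example; the actual mach_syslog scripts may name files differently.

```python
import time


def remaining_seconds(job_start, wallclock_limit, now=None):
    """Whole seconds left before the job exceeds its requested wallclock time.

    job_start and now are epoch seconds; wallclock_limit is in seconds.
    Never returns a negative value.
    """
    now = time.time() if now is None else now
    return max(0, int(wallclock_limit - (now - job_start)))


def snapshot_name(prefix, job_start, wallclock_limit, now=None):
    """Build a snapshot file name with a remaining-time suffix (hypothetical pattern)."""
    return f"{prefix}.{remaining_seconds(job_start, wallclock_limit, now)}"
```

Since the suffix counts down, sorting snapshot names numerically by suffix in descending order recovers the order in which they were taken.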
@worleyph I'm looking at this ticket. Is there anything you want me to do here? I could potentially handle (b) and (c) from your list. |
@jgfouca , this is just a discussion thread so far. Once we decide on something, if we do, we'll generate a new github issue. Thanks for asking though. |
@worleyph should this still be open? |
Probably not. If okay with you, I'll wait for the new CIME to be integrated and then check whether any items in the original list are missing and still might be useful. I think that most have been addressed. At this point I'll close it and open a new issue if necessary. |
@worleyph can I close? |
I'll take care of this by the end of the week. I'll definitely close it, but will first check if there is anything that I want to move to a new issue. Thanks for the reminder though. |
Closing this now. Created github issue #1857 for item (b) and will start working on a PR to address this. |
CIME 5.2 has lost a few minor capabilities / does things a little differently than before.
a) software_environment.txt is created by case.setup, not case.build. If env_mach_specific is changed, or if the system defaults change, between setup and build, this will not be captured.
b) CaseStatus looks different (note @bmayerornl ). Some of this is an improvement - job submission information is platform-agnostic, for example. Time spent in the pre-run and post-run is missing, however, and this is probably important to track, especially if we are trying to minimize this further.
c) Compile time per component is no longer output. Measurement of this is somewhat problematic now as all components are built concurrently, but which one takes the longest might still be useful information.
d) I'll be updating the mach_syslog routines to look more like the optimizations that @amametjanov suggested (and which were implemented) for Anvil. I might also eliminate some data that seemed like it might be useful, but which we never look at. ( @bmayerornl , you should weigh in here as well.)
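To make the concern in (a) concrete, here is a minimal sketch of how an environment change between setup and build could be detected: take one snapshot at setup time, another at build time, and compare them. `snapshot_env` and `diff_env` are hypothetical helpers for illustration, not part of CIME, and the sketch assumes the relevant state is visible in environment variables.

```python
import os


def snapshot_env(environ=None):
    """Return a plain-dict copy of the environment (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    return dict(env)


def diff_env(setup_env, build_env):
    """Return {name: (setup_value, build_value)} for variables that differ.

    A value of None means the variable was absent in that snapshot.
    """
    changes = {}
    for key in sorted(set(setup_env) | set(build_env)):
        old, new = setup_env.get(key), build_env.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes
```

An empty result would indicate that the setup-time record still describes the build environment; a non-empty one lists exactly what the setup-time file would miss.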