Fixing bug that can cause job scripts on Mira to exit before finishing postprocessing #9

Merged
agsalin merged 1 commit into master from worleyph/Machines/mira-mkbatch-fix
Sep 18, 2014

Conversation

worleyph
Contributor

Fixes a bug that can cause job scripts on Mira to fail after the CESM
run completes successfully but before postprocessing has finished.

For historical (?) reasons, some of the CESM mkbatch scripts include a
'wait' command after the parallel job launch (e.g. aprun). To collect
checkpoint data (guaranteeing that some data are archived even if the
job fails), a background job is spawned before the parallel job launch.
This background job must be explicitly killed after the parallel job
completes; otherwise the job script hangs on the wait command.

This kill command was added to mkbatch.mira even though mkbatch.mira
does not have a wait after the runjob command. The background script
sometimes dies on Mira before the parallel job finishes. When that
happens, the kill command fails and the $CASE.run script dies before
finishing its postprocessing tasks.

The simplest fix would be to delete the kill of the background script,
since it is not needed on Mira, but cleaning up background jobs seems
to be good policy anyway.

Code is added to test whether the background job has already
disappeared before trying to kill it.
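
As a rough illustration of the existence test described above, here is a minimal POSIX shell sketch. The function name `kill_if_running` and the `sleep` stand-in for the background script are hypothetical, and the actual mkbatch.mira code may use a different shell and a different check.

```shell
#!/bin/sh
# Hypothetical helper (not the actual mkbatch.mira code): terminate a
# background process only if it still exists, and never report failure,
# so the calling job script can continue with its postprocessing.
kill_if_running() {
    pid="$1"
    # 'kill -0' sends no signal; it only tests whether the PID can be signalled.
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid" 2>/dev/null || true
    fi
    return 0
}

# Stand-in for the checkpoint-collection script spawned before the parallel job.
sleep 30 >/dev/null 2>&1 &
bg_pid=$!

# ... the parallel job launch (e.g. runjob) would go here ...

# Clean up the background job if it is still alive.
kill_if_running "$bg_pid"
wait "$bg_pid" 2>/dev/null || true
```

Calling `kill_if_running` a second time on the same PID is harmless, which is the property the job script needs when the background script has already exited on its own.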

Code is also added to the background job script to eliminate the
primary reason that it dies before the parallel application is
complete. (Parsing the output from qstat to determine the amount of
time remaining for the run can produce numbers with a leading 0, which
the script interprets as octal. A value such as '09' then causes an
error because 9 is not a valid octal digit.)
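
One way to avoid the octal interpretation is to strip leading zeros before doing arithmetic. The sketch below is only an illustration in POSIX shell: the function name `to_decimal` and the HH:MM:SS sample value are assumptions, since the real qstat output format and the actual change to the background script are not shown in this PR.

```shell
#!/bin/sh
# Hypothetical helper: strip leading zeros so that shell arithmetic treats
# the value as decimal rather than octal (e.g. "09" becomes "9").
to_decimal() {
    n="$1"
    # Drop one leading zero at a time, but keep the last digit so "00" becomes "0".
    while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
        n="${n#0}"
    done
    printf '%s\n' "$n"
}

# Assumed sample of a time-remaining field; not taken from real qstat output.
remaining="01:09:08"
hh=${remaining%%:*}
rest=${remaining#*:}
mm=${rest%%:*}
ss=${rest#*:}

# Using "$mm" directly inside $(( )) would be parsed as octal and fail for "09".
total=$(( $(to_decimal "$hh") * 3600 + $(to_decimal "$mm") * 60 + $(to_decimal "$ss") ))
echo "$total"
```

Stripping the zeros with parameter expansion keeps the sketch portable across POSIX shells; shells such as bash also offer an explicit base prefix in arithmetic, but that is not available everywhere.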

Note that either of these changes is sufficient to solve the problem,
but both are included in case new issues arise with the background
script in the future.

(bit-for-bit - does not touch source code or compiler options)

@worleyph worleyph assigned amametjanov and unassigned amametjanov Sep 12, 2014
agsalin added a commit that referenced this pull request Sep 18, 2014
…h-fix

Fixing bug that can cause job scripts on Mira to exit before finishing postprocessing
@agsalin agsalin merged commit ed554ed into master Sep 18, 2014
@agsalin agsalin deleted the worleyph/Machines/mira-mkbatch-fix branch September 18, 2014 16:45
douglasjacobsen pushed a commit that referenced this pull request Aug 27, 2015
Removed references to DOUT_S_GENERATE_TSERIES XML variable from cesm_…
@jgfouca jgfouca mentioned this pull request Oct 23, 2015
bishtgautam pushed a commit that referenced this pull request Nov 19, 2016
Merge #9 for this PR.

More updates for 20TR compset
@jonbob jonbob mentioned this pull request Dec 15, 2016
akturner pushed a commit that referenced this pull request Apr 17, 2018
Add a CMake build system
jgfouca pushed a commit that referenced this pull request Apr 27, 2018
…_mpas

Add test files for MPAS in E3SM's configuration
apcraig pushed a commit to apcraig/E3SM that referenced this pull request Mar 27, 2022
yunpengshan2014 pushed a commit that referenced this pull request Dec 6, 2022
…nto NGD_v3atm (PR #9)

Implement the new dust emission scheme

In addition to modifying the dust module, the changes are also made
to not update soil erodibility factor and to retune emission factor.

[non-BFB]
philipwjones pushed a commit to philipwjones/E3SM that referenced this pull request Apr 11, 2023
Adds a design document for a logging capability
* initial draft
* refine texts in Logging.md
* updates the requirement of E3SM inter-operability
* Update logging requirements per github reviews including:
* Update markdown formatting
* Add a test requirement for Omega data types

---------

Co-authored-by: Youngsung Kim <kimy@ornl.gov>
njeffery pushed a commit that referenced this pull request Apr 28, 2023
AaronDonahue pushed a commit that referenced this pull request May 9, 2023
cee/15.0.0 with GPU MPI buffers can crash in a system lib like this:

#4  0x00007fffe159e35b in (anonymous namespace)::do_free_with_callback(void*, void (*)(void*)) [clone .constprop.0] () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libtcmalloc_minimal.so.1
#5  0x00007fffe15a8f16 in tc_free () from /opt/cray/pe/cce/15.0.0/cce/x86_64/lib/libtcmalloc_minimal.so.1
#6  0x00007fffe99c2bcd in _dlerror_run () from /lib64/libdl.so.2
#7  0x00007fffe99c2481 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#8  0x00007fffea7bce42 in _ad_cray_lock_init () from /opt/cray/pe/lib64/libmpi_cray.so.12
#9  0x00007fffed7eb37a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#10 0x00007fffed7eb496 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#11 0x00007fffed7dc58a in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#12 0x0000000000000001 in ?? ()
#13 0x00007fffffff42e7 in ?? ()
#14 0x0000000000000000 in ?? ()

Work around this by using cee/14.0.3.
philipwjones pushed a commit to philipwjones/E3SM that referenced this pull request May 31, 2023