Problems with unexpected Parsl workflow shutdown #2123

Open
TomGlanzman opened this issue Sep 20, 2021 · 1 comment
@TomGlanzman

Describe the bug
After an unexpected or abnormal workflow shutdown, the workflow can hang, and incomplete bookkeeping prevents monitoring.db from providing an accurate summary of workflow status.

To Reproduce
Steps to reproduce the behavior, e.g.:

  1. Parsl version: DESC branch as of 8/28/2021, 1.1.0+desc-2021.08.28a
  2. Any simple workflow with cache and monitoring enabled, using the workQueue executor
  3. While workflow is running, send a terminating signal, e.g., SIGINT (ctrl-c, if interactive), SIGTERM, etc.
  4. Watch parsl.log, then later look at monitoring.db

Expected behavior
Prompt and proper shutdown with full bookkeeping in monitoring.db

Environment

  • OS: NERSC/Cori (Crayified suse Linux)
  • Python version 3.8.8
  • Parsl version 1.1.0+desc-2021.08.28a

Distributed Environment

  • Where are you running the Parsl script from ? Login node
  • Where do you need the workers to run ? locally or SLURM (batch)

More description.

The current Parsl shutdown scheme exhibits multiple issues when said shutdown is unexpected, e.g., SIGINT. The following symptoms are easily reproducible even in simple, test workflows. My current workflows use the workQueue executor but I have observed similar symptoms with htex.

  1. Top-level workflow script is interactive. Receipt of a SIGINT (ctrl-c) or other terminating signal, such as SIGTERM via "kill", nearly always causes a significant delay in response and often an indefinite hang. A second terminating signal usually completes the shutdown process, returning the user's shell to the command prompt.

  2. Final messages from running tasks (Parsl workers) and other Parsl agents are not received and processed into monitoring.db.

The failure to exit cleanly means that it is not possible to determine an accurate status (or fate) of the workflow after the fact. For large workflows (tens of thousands of tasks) this means digging through a very large number of log files.

These symptoms occur most of the time but not all of the time. On at least one occasion, I have observed a full shutdown resulting from a single terminating signal. However, in that single example, a lengthy (>30 second) delay occurred part-way through the shutdown.

It is not clear that checkpointing is being properly handled in these situations.

This general topic has been raised in the past, e.g., #641, #1589, #1670 and referenced issues.

@benclifford benclifford self-assigned this Sep 22, 2021
@TomGlanzman
Author

After some experimentation and discussion with @benclifford, we have laid out a possible approach to this issue.

Part I. Parsl

Within Parsl there is NO signal handling of any kind. There is an
"atexit" routine which, under normal circumstances, is called by the
main Parsl process (the process from which parsl.load() is called)
when that process performs a normal, orderly shutdown, e.g.,
sys.exit().

Interactively, typing ctrl-c causes SIGINT to be broadcast to all
Parsl processes (parent and children). Within the main Parsl process,
this raises a KeyboardInterrupt, after which the atexit routine(s) run
and the process exits. A problem arises when Parsl attempts an orderly
shutdown of child processes that have already been killed.

In batch, SLURM, for example, issues SIGTERM to all processes a
certain amount of time prior to job end. This has the same
unfortunate effect as SIGINT in that all processes receive this signal
and are killed without an orderly shutdown.

  1. Short-term: deploy SIGUSR1

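SIGUSR1 is not sent by the terminal on ctrl-c, nor by SLURM unless
requested, and "kill -USR1" reaches only the targeted process. Its
default action is to terminate a process that has no handler, so the
handler must be installed first. Thus, providing a SIGUSR1 handler for
Parsl's main python script is the recommended work-around. This
handler should only call sys.exit(exitcode), which will trigger
Parsl's atexit routine for an orderly shutdown.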
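
A minimal sketch of such a handler, assuming it is installed in the main script (from the main thread) before the workflow starts. The names `install_sigusr1_handler` and `EXIT_CODE` are hypothetical and not part of Parsl.

```python
import signal
import sys

EXIT_CODE = 1  # hypothetical exit code; pick whatever the workflow needs


def _handle_sigusr1(signum, frame):
    # Do nothing here except request a normal interpreter exit.
    # sys.exit() raises SystemExit in the main thread, so registered
    # atexit routines get a chance to run.
    sys.exit(EXIT_CODE)


def install_sigusr1_handler():
    # signal.signal() may only be called from the main thread.
    signal.signal(signal.SIGUSR1, _handle_sigusr1)
```

Calling `install_sigusr1_handler()` near the top of the workflow script should be enough for this sketch; whether the subsequent shutdown completes still depends on the executor in use.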

This solution can be made to work with SLURM using the "--signal"
option when submitting a job. This option specifies the signal and
the time before job-end for delivery.

For interactive Parsl workflows, this work-around involves manually
issuing a "kill -USR1" to the main Parsl script, i.e., do NOT type
ctrl-c to halt the workflow.
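
To issue the "kill -USR1" by hand, the PID of the main script has to be known. One possible convenience, sketched below under the assumption that a PID file is acceptable, is to record the PID at startup. The helper name `write_pid_file` and the file name `parsl_main.pid` are hypothetical.

```python
import os
from pathlib import Path


def write_pid_file(path="parsl_main.pid"):
    # Record the PID of the current (main) process so that it can
    # later be targeted with `kill -USR1`.
    pid = os.getpid()
    Path(path).write_text(f"{pid}\n")
    return pid
```

With that in place, the workflow could be halted from another shell with `kill -USR1 "$(cat parsl_main.pid)"`.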

  2. Long-term: provide SIGINT and SIGTERM handlers for all Parsl
    (child) processes that need to be shut down by the main Parsl script.

All but the main Parsl python script should handle but ignore these
signals. The main Parsl script should call sys.exit(exitcode) which
will allow for an orderly shutdown. This should be a clean solution
and will work both when the main Parsl script is interactive or
running in batch.
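
The long-term idea could look roughly like the sketch below. It is not Parsl code: both function names are hypothetical, and it assumes that "handle but ignore" can be implemented by setting the signal disposition to `SIG_IGN` in each child.

```python
import signal
import sys


def ignore_termination_signals():
    # For Parsl child processes: ignore SIGINT and SIGTERM and leave
    # the decision to shut down to the main script.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    signal.signal(signal.SIGTERM, signal.SIG_IGN)


def exit_on_termination_signals(exitcode=1):
    # For the main script: turn either signal into a normal exit so
    # that atexit routines run.
    def _handler(signum, frame):
        sys.exit(exitcode)

    signal.signal(signal.SIGINT, _handler)
    signal.signal(signal.SIGTERM, _handler)
```

In this sketch each child would call `ignore_termination_signals()` as early as possible after it starts, and the main script would call `exit_on_termination_signals()` before launching any children.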

Part II. workQueue

There is at least one recorded instance where a SIGUSR1 was delivered
to the main Parsl script, but the orderly shutdown hung. This is
thought to be an issue with the workQueue executor which will be
investigated and, hopefully, fixed.

benclifford added a commit that referenced this issue Dec 6, 2021
#2123
by tomg

i suspect there are shutdown hangs still in both monitoring and in
workqueue.

there are also "surprising" behaviours to do with the threadlocal executor.

maybe I should make a documentation section, "shutting down parsl" which
details how to do it and also unexpected behaviours?


1. with default config: no wq, no monitoring, threadlocalexecutor:

after 1 ctrl-c, the main thread is interrupted.
thread local executor will continue running any tasks it has:
the process will not exit until all of those tasks are completed,
so if they are long, process exit will not happen then.
that's an "expected" python behaviour.

parsl's "atexit" shutdown behaviour will not run *until* all of those
tasks are completed - pressing ctrl-C more times will cause that to
not happen at all, or be interrupted multiple times, perhaps.
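
a small stand-alone illustration of that python behaviour, using a plain
non-daemon threading.Thread as a stand-in for the thread executor (an
assumption, this is not parsl code): the atexit routine only runs after
the worker thread has finished.

```python
import subprocess
import sys
import textwrap

# Child script: the main thread ends immediately, but a non-daemon
# worker thread is still running a "task".
CHILD = textwrap.dedent("""
    import atexit, threading, time

    def long_task():
        time.sleep(0.5)
        print("task finished", flush=True)

    atexit.register(lambda: print("atexit ran", flush=True))
    threading.Thread(target=long_task).start()
""")


def run_child():
    # Run the child to completion and return its output lines in order.
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout.splitlines()
```

the lines returned by run_child() show the task finishing before the
atexit routine runs.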

2. with htex_local_alternate

2.1   with ctrl-c:
   this seems to shut everything down correctly, and immediately.

2.2   with kill (TERM) aimed at the main process - this marks the process as "Terminated" immediately apparently without running any atexit handling.
   this leaves at least all of the monitoring processes, the htex interchange, and the htex process workers still running.

   What happens if I send a SIGINT instead? that's different from ctrl-C: ctrl-C goes to all processes, sigint goes to just the process it's aimed at.
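
a hedged sketch for comparing the two signals outside of parsl: a plain
python child (not a parsl process) registers an atexit routine, and the
parent sends one signal to that single process. signal_child is a
hypothetical helper name; the child sets its signal dispositions
explicitly so the result does not depend on what it inherited.

```python
import signal
import subprocess
import sys
import textwrap

CHILD = textwrap.dedent("""
    import atexit, signal, time
    # Python's usual dispositions, set explicitly for the demo.
    signal.signal(signal.SIGINT, signal.default_int_handler)
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    atexit.register(lambda: print("atexit ran", flush=True))
    print("ready", flush=True)
    time.sleep(30)
""")


def signal_child(signum):
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True,
    )
    proc.stdout.readline()        # wait until the child prints "ready"
    proc.send_signal(signum)      # delivered to this one process only
    remaining, _ = proc.communicate(timeout=30)
    return proc.returncode, remaining.strip()
```

in this stand-alone setting, signal_child(signal.SIGTERM) shows the child
killed by the signal with no atexit output, while
signal_child(signal.SIGINT) shows the atexit routine running after the
KeyboardInterrupt. whether the parsl helper processes are then cleaned
up is a separate question that this sketch does not answer.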
benclifford added a commit that referenced this issue May 23, 2022