Problems with unexpected Parsl workflow shutdown #2123

TomGlanzman · 2021-09-20T19:03:07Z

Describe the bug
After an unexpected or abnormal workflow shutdown, the workflow can hang, and incomplete bookkeeping leaves monitoring.db unable to provide an accurate basis for summary status reporting.

To Reproduce
Steps to reproduce the behavior:

Parsl version: DESC branch as of 8/28/2021, 1.1.0+desc-2021.08.28a
Any simple workflow with cache and monitoring enabled, using the workQueue executor (similar symptoms have been observed with htex)
While workflow is running, send a terminating signal, e.g., SIGINT (ctrl-c, if interactive), SIGTERM, etc.
Watch parsl.log, then later look at monitoring.db

Expected behavior
Prompt and proper shutdown with full bookkeeping in monitoring.db

Environment

OS: NERSC/Cori (Crayified suse Linux)
Python version 3.8.8
Parsl version 1.1.0+desc-2021.08.28a

Distributed Environment

Where are you running the Parsl script from? Login node
Where do you need the workers to run? Locally or SLURM (batch)

More description.

The current Parsl shutdown scheme exhibits multiple issues when the shutdown is unexpected, e.g., triggered by SIGINT. The following symptoms are easily reproducible even in simple test workflows. My current workflows use the workQueue executor, but I have observed similar symptoms with htex.

The top-level workflow script is run interactively. Receipt of a SIGINT (ctrl-c) or other terminating signal, such as SIGTERM via "kill", nearly always causes a significant delay in response and often an indefinite hang. A second terminating signal usually completes the shutdown process, returning the user's shell to the command prompt.
Final messages from running tasks (Parsl workers) and other Parsl agents are not received and processed into monitoring.db.

The failure to exit cleanly means that it is not possible to determine an accurate status -- or fate -- of the workflow ex post facto. For large workflows (tens of thousands of tasks) this means digging through a very large number of log files.

These symptoms occur most of the time but not all of the time. On at least one occasion, I have observed a full shutdown resulting from a single terminating signal. However, in that single example, a lengthy (>30 second) delay occurred part-way through the shutdown.

It is not clear that checkpointing is being properly handled in these situations.

This general topic has been raised in the past, e.g., #641, #1589, #1670 and referenced issues.

TomGlanzman · 2021-09-24T16:44:19Z

After some experimentation and discussion with @benclifford, we have laid out a possible approach to this issue.

Part I. Parsl

Within Parsl there is NO signal handling of any kind. There is an
"atexit" routine which, under normal circumstances, is called by the
main Parsl process (the process from which parsl.load() is called)
when that process performs a normal, orderly shutdown, e.g.,
sys.exit().

Interactively, typing ctrl-c causes SIGINT to be broadcast to all
Parsl processes (parent and children). Within the main Parsl process,
this raises a KeyboardInterrupt which, if not caught, ends the script
and so invokes the atexit routine(s). A problem arises when Parsl then
attempts an orderly shutdown of child processes that have already
been killed.

In batch, SLURM, for example, issues SIGTERM to all processes a
certain amount of time prior to job end. This has the same
unfortunate effect as SIGINT in that all processes receive this signal
and are killed without an orderly shutdown.

Short-term: deploy SIGUSR1

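SIGUSR1 is not generated by ctrl-c, nor by SLURM unless requested,
so it reaches only the process it is explicitly sent to. Its default
action is to terminate a process that has no handler installed, so a
SIGUSR1 handler must be provided in Parsl's main python script, and
the signal sent only to that process. This is the recommended
work-around. The handler should only call sys.exit(exitcode), which
will trigger Parsl's atexit routine for an orderly shutdown.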
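
A minimal sketch of such a handler, assuming it is installed in the main
process (the one that calls parsl.load()). The names handle_usr1 and
install_usr1_handler are hypothetical and not part of Parsl, and the exit
code 0 is an arbitrary choice. The only behaviour relied on is that
sys.exit() raises SystemExit, which lets registered atexit routines run.

```python
import signal
import sys


def handle_usr1(signum, frame):
    # Do nothing here except request a normal interpreter exit.
    # sys.exit() raises SystemExit in the main thread, so registered
    # atexit routines get a chance to run during interpreter shutdown.
    sys.exit(0)


def install_usr1_handler():
    # Hypothetical helper: call it from the main thread of the main
    # workflow script, because signal.signal() only works there.
    signal.signal(signal.SIGUSR1, handle_usr1)


if __name__ == "__main__":
    install_usr1_handler()
    # ... loading the Parsl config and running the workflow would follow ...
```

With this in place, the workflow would be stopped from another shell with
"kill -USR1" aimed at the pid of the main script only.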

This solution can be made to work with SLURM using the "--signal"
option when submitting a job. This option specifies the signal and
the time before job-end for delivery.

For interactive Parsl workflows, this work-around involves manually
issuing a "kill -USR1" to the main Parsl script, i.e., do NOT type
ctrl-c to halt the workflow.

Long-term: provide SIGINT and SIGTERM handlers for all Parsl
(child) processes that need to be shut down by the main Parsl script.

All but the main Parsl python script should handle but ignore these
signals. The main Parsl script should call sys.exit(exitcode), which
will allow for an orderly shutdown. This should be a clean solution
and will work whether the main Parsl script is interactive or
running in batch.
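
A rough sketch of how the two roles could look, under the assumption that
each child process can run a small hook at start-up. None of these function
names exist in Parsl; they are placeholders for illustration only.

```python
import signal
import sys


def ignore_termination_signals():
    # Hypothetical hook for every child process (workers, monitoring,
    # interchange, ...): ignore SIGINT and SIGTERM so that only the
    # main script decides when shutdown happens.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    signal.signal(signal.SIGTERM, signal.SIG_IGN)


def exit_on_signal(signum, frame):
    # Main script only: turn the signal into a normal exit so that
    # atexit routines run. 128 + signum follows the common shell
    # convention for "terminated by signal" exit codes.
    sys.exit(128 + signum)


def install_main_handlers():
    # Call from the main thread of the main script.
    signal.signal(signal.SIGINT, exit_on_signal)
    signal.signal(signal.SIGTERM, exit_on_signal)
```

A child started with multiprocessing would call
ignore_termination_signals() as the first statement of its target function.
Children that ignore these signals rely entirely on the main script to shut
them down, so the orderly shutdown path itself must not hang.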

Part II. workQueue

There is at least one recorded instance where a SIGUSR1 was delivered
to the main Parsl script, but the orderly shutdown hung. This is
thought to be an issue with the workQueue executor which will be
investigated and, hopefully, fixed.

#2123 by tomg

i suspect there are shutdown hangs still in both monitoring and in workqueue. there are also "surprising" behaviours to do with the threadlocal executor. maybe I should make a documentation section, "shutting down parsl", which details how to do it and also unexpected behaviours?

1. with default config: no wq, no monitoring, threadlocalexecutor: after 1 ctrl-c, the main thread is interrupted. the thread local executor will continue running any tasks it has: the process will not exit until all of those tasks are completed, so if they are long, process exit will not happen promptly. that's an "expected" python behaviour. parsl's "atexit" shutdown behaviour will not run *until* all of those tasks are completed. pressing ctrl-C more times will cause that to not happen at all, or to be interrupted multiple times, perhaps.

2. with htex_local_alternate

2.1 with ctrl-c: this seems to shut everything down correctly, and immediately.

2.2 with kill (TERM) aimed at the main process: this marks the process as "Terminated" immediately, apparently without running any atexit handling. this leaves at least all of the monitoring processes, the htex interchange, and the htex process workers still running.

What happens if I send a SIGINT instead? that's different from ctrl-C: ctrl-C goes to all processes, sigint goes to just the process it is aimed at.
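
The observation in 2.2 can be reproduced outside Parsl. The following is a
stand-alone sketch, not Parsl code: it starts a small child Python process
that registers an atexit routine, sends it SIGTERM, and reports whether the
atexit routine ran. The CHILD script and the function name are made up for
this illustration, and it assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Child script: registers an atexit routine and, if asked, a SIGTERM
# handler that converts the signal into a normal sys.exit().
CHILD = r"""
import atexit, signal, sys, time
atexit.register(lambda: print("atexit ran", flush=True))
if len(sys.argv) > 1 and sys.argv[1] == "handle":
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print("ready", flush=True)
time.sleep(30)
"""


def atexit_ran_after_sigterm(install_handler):
    args = [sys.executable, "-c", CHILD]
    if install_handler:
        args.append("handle")
    with subprocess.Popen(args, stdout=subprocess.PIPE, text=True) as proc:
        proc.stdout.readline()            # wait until the child prints "ready"
        proc.send_signal(signal.SIGTERM)  # aimed at this one process only
        proc.wait(timeout=30)
        remaining_output = proc.stdout.read()
    return "atexit ran" in remaining_output


if __name__ == "__main__":
    print("no handler:  ", atexit_ran_after_sigterm(False))
    print("with handler:", atexit_ran_after_sigterm(True))
```

Without a handler the child is terminated before any atexit routine runs;
with a handler that calls sys.exit() the atexit routine runs.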

TomGlanzman added the bug label Sep 20, 2021

benclifford self-assigned this Sep 22, 2021

benclifford mentioned this issue Dec 28, 2021

Ctrl-C handling is generally unreliable #2174

Open

benclifford added the safe-exit label Apr 25, 2022

tcompa mentioned this issue May 30, 2022

parsl-visualize creates invalid monitoring.db SQL schema #2266

Open
