Problems with unexpected Parsl workflow shutdown #2123
Comments
After some experimentation and discussion with @benclifford, we have laid out a possible approach to this issue. Part I. Parsl Within Parsl there is NO signal handling of any kind. There is an Interactively, typing ctrl-c causes SIGINT to be broadcast to all In batch, SLURM, for example, issues SIGTERM to all processes a
SIGUSR1 is, by default, ignored globally. Only when there is a This solution can be made to work with SLURM using the "--signal" For interactive Parsl workflows, this work-around involves manually
All but the main Parsl python script should handle but ignore these Part II. workQueue There is at least one recorded instance where a SIGUSR1 was delivered
#2123 by tomg
I suspect there are still shutdown hangs in both monitoring and workqueue. There are also "surprising" behaviours to do with the threadlocal executor. Maybe I should write a documentation section, "shutting down parsl", which details how to do it and also the unexpected behaviours?

1. With the default config (no wq, no monitoring, threadlocal executor): after one ctrl-C, the main thread is interrupted. The threadlocal executor will continue running any tasks it has; the process will not exit until all of those tasks are completed, so if they are long, process exit will not happen promptly. That's "expected" Python behaviour. Parsl's "atexit" shutdown behaviour will not run *until* all of those tasks are completed; pressing ctrl-C more times will cause it to not happen at all, or to be interrupted multiple times, perhaps.
2. With htex_local_alternate:
   2.1. With ctrl-C: this seems to shut everything down correctly, and immediately.
   2.2. With kill (TERM) aimed at the main process: this marks the process as "Terminated" immediately, apparently without running any atexit handling. This leaves at least all of the monitoring processes, the htex interchange, and the htex process workers still running.

What happens if I send a SIGINT instead? That's different from ctrl-C: ctrl-C goes to all processes, whereas SIGINT goes to just the process it's aimed at.
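The observation that a plain `kill` (SIGTERM) skips atexit handling can be reproduced outside Parsl. The following is a minimal sketch using only the Python standard library on a POSIX system; it does not use any Parsl API, and the helper name `run_and_terminate` and the exit code 143 are illustrative choices, not anything Parsl does. It starts a child interpreter that registers an atexit hook, sends SIGTERM to that one process, and compares the default disposition with a handler that converts the signal into `SystemExit`.

```python
import signal
import subprocess
import sys
import textwrap

# Child program: registers an atexit hook, optionally installs a SIGTERM
# handler that raises SystemExit, then waits to be signalled.
CHILD = textwrap.dedent("""
    import atexit, signal, sys, time

    atexit.register(lambda: print("atexit ran", flush=True))

    if sys.argv[1] == "handle":
        # Turn SIGTERM into a normal interpreter exit so atexit hooks run.
        signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(143))

    print("ready", flush=True)
    time.sleep(30)
""")


def run_and_terminate(mode: str) -> str:
    """Start the child, send SIGTERM once it is ready, return its remaining output."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, mode],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait until the child has finished its setup before signalling it.
    assert proc.stdout.readline().strip() == "ready"
    # Like `kill <pid>`: only this one process receives the signal,
    # unlike ctrl-C, which the terminal delivers to the foreground process group.
    proc.send_signal(signal.SIGTERM)
    out, _ = proc.communicate(timeout=20)
    return out.strip()


if __name__ == "__main__":
    print("default SIGTERM:", repr(run_and_terminate("default")))
    print("handled SIGTERM:", repr(run_and_terminate("handle")))
```

With the default disposition the child is killed without printing anything further; with the handler installed, the atexit hook runs. Whether installing such a handler in the main workflow script is the right fix for Parsl is a separate question that this sketch does not answer.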
Describe the bug
After an unexpected or abnormal workflow shutdown, the workflow can hang, and incomplete bookkeeping prevents monitoring.db from serving as an accurate basis for summary status reporting.
To Reproduce
Steps to reproduce the behavior, for e.g:
Expected behavior
Prompt and proper shutdown with full bookkeeping in monitoring.db
Environment
Distributed Environment
More description.
The current Parsl shutdown scheme exhibits multiple issues when the shutdown is unexpected, e.g., triggered by SIGINT. The following symptoms are easily reproducible even in simple test workflows. My current workflows use the workQueue executor, but I have observed similar symptoms with htex.
The top-level workflow script is run interactively. Receipt of a SIGINT (ctrl-c) or another terminating signal, such as SIGTERM sent via "kill", nearly always causes a significant delay in response and often an indefinite hang. A second terminating signal usually completes the shutdown, returning the user's shell to the command prompt.
Final messages from running tasks (Parsl workers) and other Parsl agents are not received and processed into monitoring.db.
The failure to exit cleanly means that it is not possible to determine an accurate status -- or fate -- of the workflow ex post facto. For large workflows (tens of thousands of tasks) this means digging through a very large number of log files.
These symptoms occur most of the time but not all of the time. On at least one occasion, I have observed a full shutdown resulting from a single terminating signal. However, in that single example, a lengthy (>30 second) delay occurred part-way through the shutdown.
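To make the bookkeeping symptom concrete, here is a rough sketch of how the last recorded status of each task could be tallied from a SQLite monitoring database. The table name `status` and the columns `run_id`, `task_id`, `task_status_name` and `timestamp` are assumptions about the monitoring schema and should be checked against an actual monitoring.db before use; `latest_status_counts` is a hypothetical helper, not part of Parsl.

```python
import sqlite3

# ASSUMPTION: the table and column names below are one reading of the
# monitoring schema; verify them against your own monitoring.db.
LATEST_STATUS_SQL = """
    SELECT s.task_status_name, COUNT(*) AS n
    FROM status AS s
    JOIN (
        SELECT run_id, task_id, MAX(timestamp) AS latest
        FROM status
        GROUP BY run_id, task_id
    ) AS last
      ON s.run_id = last.run_id
     AND s.task_id = last.task_id
     AND s.timestamp = last.latest
    GROUP BY s.task_status_name
    ORDER BY s.task_status_name
"""


def latest_status_counts(conn: sqlite3.Connection) -> dict:
    """Count tasks by their most recently recorded status.

    A task with two status rows sharing the same latest timestamp is
    counted once per such row, so treat the totals as approximate.
    """
    return dict(conn.execute(LATEST_STATUS_SQL).fetchall())
```

Calling `latest_status_counts(sqlite3.connect(path_to_db))` after an abnormal shutdown would be expected to show tasks whose last recorded state is not a final one, which is the incomplete bookkeeping described above.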
It is not clear that checkpointing is being properly handled in these situations.
This general topic has been raised in the past, e.g., #641, #1589, #1670 and referenced issues.