Polling can incorrectly return a failed task to the running state #4513
Comments
dpmatthews added the bug and question labels on Nov 15, 2021
Ouch, yes better fix this.
Duplicate of #4516. Keeping both copies as they are tagged against different Cylc versions.
oliver-sanders removed the question label on Aug 12, 2022
No - that was a different bug.
If a task fails due to exceeding the wallclock limit it can return to the running state if it is polled before it has exited the queue.
The following workflow illustrates the problem:
(Note: err-script keeps the job in the queue until the workflow manager kills it.)
I performed this test on our HPC using PBS, but the same result should occur with, e.g., Slurm.
Relevant log messages:
This is clearly a bug.
I think this is the relevant code which causes the problem:
https://github.com/cylc/cylc-flow/blob/8.0b3/cylc/flow/task_job_mgr.py#L846
Note that this only occurs for jobs which exit with TERM (not ERR or EXIT).
The ability for a failed task to be returned to 'submitted' or 'running' as a result of polling was part of #1792.
However, I'm struggling to see why we want to allow this.
It can never be safe given that the task has already entered the failed state and potentially triggered other tasks as a result.
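To make the concern concrete, here is a toy model of the sequence described above. It is only a sketch: ToyTask, on_job_message and on_poll are hypothetical names invented for illustration and do not correspond to anything in cylc-flow. It assumes a downstream task that triggers on failure, and a poll handler that simply trusts whatever the job runner reports.

```python
# Toy model only: every name here is hypothetical and is not Cylc's API.
from dataclasses import dataclass, field


@dataclass
class ToyTask:
    state: str = "running"
    triggered: list = field(default_factory=list)

    def on_job_message(self, message: str) -> None:
        """Handle a message sent by the job itself."""
        if message == "failed":
            self.state = "failed"
            # Assumed downstream task that triggers as soon as this one fails.
            self.triggered.append("recover")

    def on_poll(self, polled_status: str) -> None:
        """Naive poll handler: trust whatever the job runner reports."""
        self.state = polled_status


task = ToyTask()
task.on_job_message("failed")  # job hits the wallclock limit and reports failure
task.on_poll("running")        # poll arrives before the job has left the queue
print(task.state, task.triggered)
```

In this model the task ends up back in the running state even though the task that depends on its failure has already been triggered, which is the inconsistency described above.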
If the idea is to support tasks which might be rerun by the workload manager, then I think we would need to modify Cylc so that it does not change state after receiving the failure message until the failure has been confirmed by a subsequent poll. This might cause undesirable delays, so I think we would need to (re)introduce an "allow resurrection" setting if we want to support this.
Alternatively, if supporting resurrection isn't considered important (for the moment at least) then we should disable/remove the relevant code.
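As a rough illustration of the two options, the guard below ignores a poll result that would move a task out of a final state unless resurrection has been explicitly enabled. This is a sketch under assumptions, not a proposed patch: resolve_poll, the allow_resurrection flag and the two state sets are hypothetical names. It covers only the guard, not the idea of delaying the state change until a poll has confirmed the failure.

```python
# Hypothetical sketch: the function, flag and state sets are illustrative,
# not Cylc's API or configuration.
FINAL_STATES = {"succeeded", "failed"}
ACTIVE_STATES = {"submitted", "running"}


def resolve_poll(current: str, polled: str, allow_resurrection: bool = False) -> str:
    """Return the state a task should hold after a poll result arrives."""
    if current in FINAL_STATES and polled in ACTIVE_STATES:
        # Only leave a final state if resurrection was explicitly enabled.
        return polled if allow_resurrection else current
    # Any other poll result is accepted as reported.
    return polled
```

With the default allow_resurrection=False this behaves as if the resurrection code had been removed; setting it to True corresponds to the opt-in setting suggested above.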
I'm confused about the changes made in #2396 (to close #1792).
The documentation change implies you need to use "cylc reset" to handle preempted tasks.
In that case, why was the code changed to allow any task to return from the failed state?
And why allow polling of succeeded and failed tasks?
We need to clarify what we're trying to support before agreeing how to address this bug.
Note that this issue affects Cylc 7 as well.