Polling can incorrectly return a failed task to the running state #4513

Open
dpmatthews opened this issue Nov 15, 2021 · 3 comments

@dpmatthews
Contributor

A task that fails because it exceeded its wallclock limit can return to the running state if it is polled before its job has exited the queue.
The following workflow illustrates the problem:

[scheduling]
    [[graph]]
        R1 = "timeout:started => poll"
[runtime]
    [[timeout]]
        script = "sleep 600"
        err-script = "sleep 180"
        platform = remote-hpc
        execution time limit = PT30S
    [[poll]]
        # Poll the task after it has failed but before it exits the queue
        script = "sleep 50; cylc poll $CYLC_SUITE_NAME timeout.1"

(Note: the err-script keeps the job in the queue until the workload manager kills it.)
I performed this test on our HPC using PBS, but the same result should occur with, e.g., Slurm.

Relevant log messages:

2021-11-15T19:15:37Z CRITICAL - [timeout.1 running job:01 flows:1] (received)failed/TERM at 2021-11-15T19:15:36Z
2021-11-15T19:15:37Z INFO - [timeout.1 running job:01 flows:1] => failed
2021-11-15T19:15:46Z DEBUG - [jobs-poll cmd] ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no hpc env CYLC_VERSION=8.0b3.dev CYLC_ENV_NAME=cylc-8.0b3.dev bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-poll --debug -- '$HOME/cylc-run/bug.polling3.c8/run8/log/job' 1/timeout/01
    [jobs-poll ret_code] 0
    [jobs-poll out] [TASK JOB SUMMARY]2021-11-15T19:15:46Z|1/timeout/01|{"job_runner_name": "pbs", "job_id": "5698928", "job_runner_exit_polled": 0, "run_status": 1, "run_signal": "TERM", "time_submit_exit": "2021-11-15T19:14:34Z", "time_run": "2021-11-15T19:14:42Z", "time_run_exit": "2021-11-15T19:15:36Z"}
2021-11-15T19:15:46Z INFO - [timeout.1 failed job:01 flows:1] (polled)started at 2021-11-15T19:14:42Z
2021-11-15T19:15:46Z INFO - [timeout.1 failed job:01 flows:1] => running
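
To make the poll result in the log above easier to read, here is a minimal Python sketch that splits the `[TASK JOB SUMMARY]` line into its fields. The helper name `parse_poll_summary` is hypothetical and not part of Cylc. Reading `job_runner_exit_polled: 0` as "the job is still known to the job runner" is an assumption based on the description of this issue.

```python
import json

SUMMARY_PREFIX = "[TASK JOB SUMMARY]"


def parse_poll_summary(line):
    """Split '[TASK JOB SUMMARY]<poll time>|<job>|<JSON>' into its parts.

    Hypothetical helper for reading the log; not a Cylc API.
    """
    if not line.startswith(SUMMARY_PREFIX):
        raise ValueError("not a task job summary line")
    # Split on the first two '|' only; the remainder is the JSON payload.
    poll_time, job, payload = line[len(SUMMARY_PREFIX):].split("|", 2)
    return poll_time, job, json.loads(payload)


# The [jobs-poll out] line from the log, copied verbatim.
line = (
    '[TASK JOB SUMMARY]2021-11-15T19:15:46Z|1/timeout/01|'
    '{"job_runner_name": "pbs", "job_id": "5698928", '
    '"job_runner_exit_polled": 0, "run_status": 1, "run_signal": "TERM", '
    '"time_submit_exit": "2021-11-15T19:14:34Z", '
    '"time_run": "2021-11-15T19:14:42Z", '
    '"time_run_exit": "2021-11-15T19:15:36Z"}'
)

poll_time, job, summary = parse_poll_summary(line)
print(job, summary["run_status"], summary["run_signal"],
      summary["job_runner_exit_polled"])
```

The summary records a non-zero `run_status`, a `TERM` signal and a `time_run_exit`, yet the poll was logged as `(polled)started` and the task moved from failed to running.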

This is clearly a bug.

I think this is the relevant code which causes the problem:
https://github.com/cylc/cylc-flow/blob/8.0b3/cylc/flow/task_job_mgr.py#L846
Note that this only occurs for jobs which exit with TERM (not ERR or EXIT).
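
The observation that only TERM is affected would be consistent with poll-handling logic along the following lines. This is a sketch inferred from the log output and the behaviour described above, not code copied from task_job_mgr.py; the function name `classify_polled_job` and its return values are hypothetical, and only the failed-job case (`run_status == 1`) is modelled.

```python
def classify_polled_job(run_status, run_signal, job_runner_exit_polled):
    """Return the message a poll might report for a failed job.

    Hypothetical reconstruction for illustration; not the Cylc implementation.
    """
    if run_status != 1:
        raise ValueError("this sketch only models jobs with run_status == 1")
    if run_signal in ("ERR", "EXIT"):
        # The job script failed in the ordinary way: report failure.
        return "failed"
    if job_runner_exit_polled == 1:
        # Killed by a signal and no longer known to the job runner.
        return "failed"
    # Killed by a signal but still in the queue: assumed to be reported
    # as "started" in case the job runner restarts the job.
    return "started"


# Values taken from the poll summary in the log above.
print(classify_polled_job(1, "TERM", 0))
```

Under these assumptions, the last branch is the one that moves an already-failed task back to running.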

The ability for a failed task to be returned to 'submitted' or 'running' as a result of polling was part of #1792.
However, I'm struggling to see why we want to allow this.
It can never be safe given that the task has already entered the failed state and potentially triggered other tasks as a result.

If the idea is to support tasks which might be rerun by the workload manager, then I think we would need to modify Cylc so that it does not change state after receiving the failure message until the failure has been confirmed by a subsequent poll. This might cause undesirable delays, so I think we would need to (re)introduce an "allow resurrection" setting if we want to support this.

Alternatively, if supporting resurrection isn't considered important (for the moment at least) then we should disable/remove the relevant code.
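
As a rough illustration of the two options, the sketch below guards the state change behind an opt-in flag. Everything here is hypothetical: `apply_polled_state`, the `allow_resurrection` parameter and the plain-string states are illustrative names, not existing Cylc code or configuration.

```python
FINISHED_STATES = {"succeeded", "failed"}
ACTIVE_STATES = {"submitted", "running"}


def apply_polled_state(current_state, polled_state, allow_resurrection=False):
    """Return the task state after a poll result is applied.

    Hypothetical sketch: a finished task only returns to an active state
    if resurrection has been explicitly allowed.
    """
    going_backwards = (
        current_state in FINISHED_STATES and polled_state in ACTIVE_STATES
    )
    if going_backwards and not allow_resurrection:
        # Keep the finished state; downstream tasks may already have
        # triggered off it.
        return current_state
    return polled_state


# Default: the poll cannot undo a failure.
print(apply_polled_state("failed", "running"))
# Opt-in: the task is allowed to return to running.
print(apply_polled_state("failed", "running", allow_resurrection=True))
```

With the flag off by default, this behaves like the "disable/remove" option; turning it on corresponds to the "allow resurrection" option, although it does not address the delay of waiting for a confirming poll.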

I'm confused about the changes made in #2396 (to close #1792).
The documentation change implies you need to use "cylc reset" to handle preempted tasks.
In that case, why was the code changed to allow any task to return from the failed state?
And why allow polling of succeeded and failed tasks?

We need to clarify what we're trying to support before agreeing how to address this bug.

Note that this issue affects Cylc 7 as well.

@dpmatthews dpmatthews added bug Something is wrong :( question Flag this as a question for the next Cylc project meeting. labels Nov 15, 2021
@dpmatthews dpmatthews added this to the cylc-8.x milestone Nov 15, 2021
@hjoliver
Member

Ouch, yes better fix this.

@oliver-sanders
Member

Duplicate of #4516

Keeping both copies as they are tagged against different Cylc versions.

@dpmatthews dpmatthews modified the milestones: cylc-8.x, cylc-8.1.0 Jul 7, 2022
@oliver-sanders oliver-sanders modified the milestones: cylc-8.1.0, cylc-8.0.1 Aug 4, 2022
@wxtim wxtim linked a pull request Aug 8, 2022 that will close this issue
@oliver-sanders oliver-sanders removed the question Flag this as a question for the next Cylc project meeting. label Aug 12, 2022
@oliver-sanders oliver-sanders modified the milestones: cylc-8.0.1, 8.0.2 Aug 16, 2022
@oliver-sanders oliver-sanders modified the milestones: cylc-8.0.2, cylc-8.0.3 Sep 12, 2022
@wxtim wxtim closed this as completed Sep 28, 2022
@oliver-sanders oliver-sanders modified the milestones: cylc-8.0.3, cylc-8.0.4 Oct 12, 2022
@wxtim wxtim assigned wxtim and unassigned wxtim Oct 13, 2022
@dpmatthews dpmatthews modified the milestones: cylc-8.0.4, cylc-8.2.0 Oct 19, 2022
@dpmatthews
Contributor Author

> Duplicate of #4516

No, that was a different bug.

@oliver-sanders oliver-sanders modified the milestones: cylc-8.2.0, cylc-8.3.0 Jul 11, 2023
@oliver-sanders oliver-sanders modified the milestones: cylc-8.3.0, cylc-8.4.0 Feb 22, 2024
@MetRonnie MetRonnie modified the milestones: cylc-8.4.0, 8.3.1 Mar 14, 2024
@oliver-sanders oliver-sanders modified the milestones: 8.3.1, 8.3.x Jun 27, 2024