Polling can incorrectly return a failed task to the running state #4513

dpmatthews · 2021-11-15T20:09:37Z

If a task fails due to exceeding the wallclock limit it can return to the running state if it is polled before it has exited the queue.
The following workflow illustrates the problem:

[scheduling]
    [[graph]]
        R1 = "timeout:started => poll"
[runtime]
    [[timeout]]
        script = "sleep 600"
        err-script = "sleep 180"
        platform = remote-hpc
        execution time limit = PT30S
    [[poll]]
        # Poll the task after it has failed but before it exits the queue
        script = "sleep 50; cylc poll $CYLC_SUITE_NAME timeout.1"

(note: err-script keeps the job in the queue until the workflow manager kills it)
I performed this test with our HPC using PBS but the same result should occur with, e.g Slurm.

Relevant log messages:

2021-11-15T19:15:37Z CRITICAL - [timeout.1 running job:01 flows:1] (received)failed/TERM at 2021-11-15T19:15:36Z
2021-11-15T19:15:37Z INFO - [timeout.1 running job:01 flows:1] => failed
2021-11-15T19:15:46Z DEBUG - [jobs-poll cmd] ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no hpc env CYLC_VERSION=8.0b3.dev CYLC_ENV_NAME=cylc-8.0b3.dev bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-poll --debug -- '$HOME/cylc-run/bug.polling3.c8/run8/log/job' 1/timeout/01
    [jobs-poll ret_code] 0
    [jobs-poll out] [TASK JOB SUMMARY]2021-11-15T19:15:46Z|1/timeout/01|{"job_runner_name": "pbs", "job_id": "5698928", "job_runner_exit_polled": 0, "run_status": 1, "run_signal": "TERM", "time_submit_exit": "2021-11-15T19:14:34Z", "time_run": "2021-11-15T19:14:42Z", "time_run_exit": "2021-11-15T19:15:36Z"}
2021-11-15T19:15:46Z INFO - [timeout.1 failed job:01 flows:1] (polled)started at 2021-11-15T19:14:42Z
2021-11-15T19:15:46Z INFO - [timeout.1 failed job:01 flows:1] => running

This is clearly a bug.

I think this is the relevant code which causes the problem:
https://github.com/cylc/cylc-flow/blob/8.0b3/cylc/flow/task_job_mgr.py#L846
Note that this only occurs for jobs which exit with TERM (not ERR or EXIT).

The ability for a failed task to be returned to 'submitted' or 'running' as a result of polling was part of #1792.
However, I'm struggling to see why we want to allow this.
It can never be safe given that the task has already entered the failed state and potentially triggered other tasks as a result.

If the idea is to support tasks which might be rerun by the workload manager then I think we would need to modify Cylc to not change state after receiving the failure message until the failure has been confirmed by a subsequent poll. This might cause undesirable delays so I think we would need to (re)introduce an "allow resurrection" setting if we want to support this.

Alternatively, if supporting resurrection isn't considered important (for the moment at least) then we should disable/remove the relevant code.

I'm confused about the changes made in #2396 (to close #1792).
The documentation change implies you need to use "cylc reset" to handle preempted tasks.
In that case, why was the code changed to allow any task to return from the failed state?
And why allow polling of succeeded and failed tasks?

We need to clarify what we're trying to support before agreeing how to address this bug.

Note that this issue affects Cylc 7 as well.

The text was updated successfully, but these errors were encountered:

hjoliver · 2021-11-15T20:16:24Z

Ouch, yes better fix this.

oliver-sanders · 2021-11-29T14:41:31Z

Duplicate of #4516

Keeping both copies as they are tagged against different Cylc versions.

dpmatthews · 2022-10-19T11:02:08Z

Duplicate of #4516

No - that was a different bug

dpmatthews added bug Something is wrong :( question Flag this as a question for the next Cylc project meeting. labels Nov 15, 2021

dpmatthews added this to the cylc-8.x milestone Nov 15, 2021

oliver-sanders mentioned this issue Nov 29, 2021

job polling broken for failed jobs after restart #4516

Closed

hjoliver mentioned this issue Feb 10, 2022

Only poll non-waiting tasks #4658

Merged

7 tasks

oliver-sanders mentioned this issue Feb 11, 2022

polling: formally define behaviour and simplify logic where possible #4678

Open

dpmatthews modified the milestones: cylc-8.x, cylc-8.1.0 Jul 7, 2022

oliver-sanders modified the milestones: cylc-8.1.0, cylc-8.0.1 Aug 4, 2022

wxtim linked a pull request Aug 8, 2022 that will close this issue

add platform on reload for SUCCEEDED and FAILED tasks #5025

Merged

8 tasks

oliver-sanders assigned wxtim Aug 8, 2022

oliver-sanders mentioned this issue Aug 8, 2022

add platform on reload for SUCCEEDED and FAILED tasks #5025

Merged

8 tasks

oliver-sanders removed the question Flag this as a question for the next Cylc project meeting. label Aug 12, 2022

oliver-sanders modified the milestones: cylc-8.0.1, 8.0.2 Aug 16, 2022

MetRonnie removed a link to a pull request Sep 1, 2022

add platform on reload for SUCCEEDED and FAILED tasks #5025

Merged

8 tasks

MetRonnie unassigned wxtim Sep 1, 2022

oliver-sanders modified the milestones: cylc-8.0.2, cylc-8.0.3 Sep 12, 2022

wxtim closed this as completed Sep 28, 2022

oliver-sanders reopened this Sep 28, 2022

oliver-sanders modified the milestones: cylc-8.0.3, cylc-8.0.4 Oct 12, 2022

wxtim assigned wxtim and unassigned wxtim Oct 13, 2022

dpmatthews modified the milestones: cylc-8.0.4, cylc-8.2.0 Oct 19, 2022

oliver-sanders modified the milestones: cylc-8.2.0, cylc-8.3.0 Jul 11, 2023

oliver-sanders modified the milestones: cylc-8.3.0, cylc-8.4.0 Feb 22, 2024

MetRonnie modified the milestones: cylc-8.4.0, 8.3.1 Mar 14, 2024

oliver-sanders modified the milestones: 8.3.1, 8.3.x Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Polling can incorrectly return a failed task to the running state #4513

Polling can incorrectly return a failed task to the running state #4513

dpmatthews commented Nov 15, 2021

hjoliver commented Nov 15, 2021

oliver-sanders commented Nov 29, 2021

dpmatthews commented Oct 19, 2022

Polling can incorrectly return a failed task to the running state #4513

Polling can incorrectly return a failed task to the running state #4513

Comments

dpmatthews commented Nov 15, 2021

hjoliver commented Nov 15, 2021

oliver-sanders commented Nov 29, 2021

dpmatthews commented Oct 19, 2022