polling/set: submitted task cannot be set to succeeded #6314

Closed
oliver-sanders opened this issue Aug 21, 2024 · 10 comments · Fixed by #6326
Labels: bug - Something is wrong :(

oliver-sanders commented Aug 21, 2024

Spotted in the wild:

INFO - + 20210103T0000Z/nemo_cice_obsoper_EN000 submitted
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted(runahead)] => submitted
INFO - + 20210103T0000Z/nemo_cice_obsoper_EN000 poll_timer
INFO - + 20210103T0000Z/nemo_cice_obsoper_EN000 ['try_timers', 'submission-retry']
INFO - + 20210103T0000Z/nemo_cice_obsoper_EN000 ['try_timers', 'execution-retry']
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] (polled)succeeded
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] setting implied output: submitted
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] submitted to xce:None[None]
WARNING - Unhandled jobs-poll output: 2024-08-03T12:18:26Z|20210103T0000Z/nemo_cice_obsoper_EN000/01|{"job_runner_name": "pbs", "job_id": "3094491", "run_status": 0, "time_submit_exit": "2024-08-03T11:17:32Z", "time_run": "2024-08-03T11:39:00Z", "time_run_exit": "2024-08-03T11:45:30Z"}
    set(flow=['all'], flow_wait=False, outputs=[], prerequisites=[], tasks=['20210103T0000Z/nemo_cice_obsoper_EN000'])
INFO - output 20210103T0000Z/nemo_cice_obsoper_EN000:succeeded completed already
    kill_tasks(tasks=['20210103T0000Z/nemo_cice_obsoper_EN000'])
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] => submitted(held)
ERROR - [jobs-kill cmd] ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no xcel01 env CYLC_VERSION=8.3.3 CYLC_ENV_NAME=cylc-8.3.3-1 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-kill -- '$HOME/cylc-run/u-di486/run1/log/job' 20210103T0000Z/nemo_cice_obsoper_EN000/01
    [jobs-kill out] 2024-08-06T08:39:43Z|20210103T0000Z/nemo_cice_obsoper_EN000/01|153
WARNING - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted(held)] job kill failed
    set(flow=['all'], flow_wait=False, outputs=[], prerequisites=[], tasks=['20210103T0000Z/nemo_cice_obsoper_EN000'])
INFO - output 20210103T0000Z/nemo_cice_obsoper_EN000:succeeded completed already

Two problems here:

  1. The job was polled as succeeded, but the task was left as submitted.
  2. The task was manually set as succeeded, but the task was left as submitted (note: orphaning unkillable jobs is a functional requirement).

And two oddities:

  • "Unhandled jobs-poll output"?
  • submitted to xce:None[None]?!

Reproducible-ish example

This example reproduces the problem fairly reliably:

[scheduling]
    [[graph]]
        R1 = """
            foo => fin
        """

[runtime]
    [[foo]]
        script = """
            # force the workflow to crash
            cylc hold "${CYLC_WORKFLOW_ID}//*"

            # run until workflow crashes
            while true; do
                if ! cylc ping "${CYLC_WORKFLOW_ID}"; then
                    exit 0
                fi
            done
        """
        platform = remote-background-platform
    [[fin]]
        script = false

You will also need the following diff:

diff --git a/cylc/flow/commands.py b/cylc/flow/commands.py
index 173984f17..fe149013a 100644
--- a/cylc/flow/commands.py
+++ b/cylc/flow/commands.py
@@ -276,6 +276,7 @@ async def hold(schd: 'Scheduler', tasks: Iterable[str]):
     """Hold specified tasks."""
     validate.is_tasks(tasks)
     yield
+    raise Exception('foo')
     yield schd.pool.hold_tasks(tasks)
 
 
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..bb86afe83 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -947,6 +947,7 @@ class Scheduler:
                 with suppress(StopAsyncIteration):
                     n_warnings = await cmd.__anext__()
             except Exception as exc:
+                raise
                 # Don't let a bad command bring the workflow down.
                 if (
                     cylc.flow.flags.verbosity > 1 or

To reproduce:

$ cylc vip -N .

 ▪ ■  Cylc Workflow Engine 8.4.0.dev
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors

...

CRITICAL - Workflow shutting down - foo
WARNING - Orphaned tasks:
    * 1/foo (submitted)
INFO - platform: xcel00-bg - remote tidy (on remote-background-platform)
INFO - DONE


$ cylc play <workflow-id> -N

 ▪ ■  Cylc Workflow Engine 8.4.0.dev
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors

...

INFO - platform: remote-background-platform - remote init (on remote-background-platform)
INFO - [1/foo/01:submitted] (polled)succeeded
INFO - [1/foo/01:submitted] setting implied output: submitted
INFO - [1/foo/01:submitted] submitted to remote-background-platform:None[None]
WARNING - Unhandled jobs-poll output: 2024-08-23T13:08:21Z|1/foo/01|{"job_runner_name": "background", "job_id": "36430",
    "run_status": 0, "time_submit_exit": "2024-08-23T13:08:09Z", "time_run": "2024-08-23T13:08:10Z", "time_run_exit":
    "2024-08-23T13:08:18Z"}
WARNING - list index out of range
INFO - platform: xcel00-bg - remote file install (on remote-background-platform)
INFO - platform: xcel00-bg - remote file install complete
oliver-sanders added the bug label Aug 21, 2024
oliver-sanders added this to the 8.3.x milestone Aug 21, 2024
hjoliver commented:

It must be something to do with the initial poll, because setting a plain old submitted task to succeeded works fine (current master):

[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]
        init-script = """
            # cause me to get stuck as submitted
            cylc__job__disable_fail_signals ERR
            exit 1
        """

Run it, and then do cylc set bug//1/foo ...

INFO - [1/foo/01:preparing] => submitted 
INFO - Command "set" received. ID=bab71101-3443-45d3-bbc1-346c990a571b
    set(flow=['all'], flow_wait=False, outputs=['succeeded'], prerequisites=[], tasks=['1/foo'])
INFO - [1/foo/01:submitted] setting implied output: started
INFO - [1/foo/01:submitted] => succeeded
INFO - Command "set" actioned. ID=bab71101-3443-45d3-bbc1-346c990a571b
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE

hjoliver commented Aug 23, 2024

Manually polling the above, instead of cylc set, works fine - it correctly sets the task to submit-failed.

Now trying with communications method = poll so I can have the job succeed before polling it... that works fine too.
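
For reference, a sketch of how that experiment can be configured in global.cylc (illustrative only - the platform name is made up and the exact keys should be double-checked against the platforms reference docs):

[platforms]
    [[my-polling-platform]]
        communication method = poll
        submission polling intervals = PT10S
        execution polling intervals = PT10S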

hjoliver commented Aug 23, 2024

INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] (polled)succeeded
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] setting implied output: submitted
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] submitted to xce:None[None]  # <----- ?!!!!

Does that imply the job got submitted again after the poll result? And what's with job runner, job ID being None, None ?

Maybe we're missing some log lines here that might help (did you grep for the task name?)

hjoliver commented:

And one oddity:

"Unhandled jobs-poll output"

        else:
            # Unhandled messages. These include:
            #  * general non-output/progress messages
            #  * poll messages that repeat previous results
            # Note that all messages are logged already at the top.
            # No state change.
            LOG.debug(f"[{itask}] unhandled: {message}")

oliver-sanders commented Aug 23, 2024

Does that imply the job got submitted again after the poll result? And what's with job runner, job ID being None, None ?

No idea, I haven't had the chance to investigate these bugs properly yet.

Looking in log/job, there is no 02 directory, and the 01 job file is correctly formatted for PBS submission.

oliver-sanders commented Aug 23, 2024

Additional context:

The DB looks fine:

sqlite> select * from task_pool where name="nemo_cice_obsoper_EN000" and cycle="20210103T0000Z";
20210103T0000Z|nemo_cice_obsoper_EN000|[1]|submitted|1
sqlite> select * from task_jobs where name="nemo_cice_obsoper_EN000" and cycle="20210103T0000Z";
20210103T0000Z|nemo_cice_obsoper_EN000|1|[1]|0|1|2024-08-03T11:17:15Z|2024-08-03T11:17:32Z|0|||||xce|pbs|3094491
sqlite> select * from task_events where name="nemo_cice_obsoper_EN000" and cycle="20210103T0000Z";
nemo_cice_obsoper_EN000|20210103T0000Z|2024-08-03T11:17:32Z|1|submitted|

The job.status file on the remote platform records the success of the job:

$ ssh xce cat ~.../log/job/20210103T0000Z/nemo_cice_obsoper_EN000/01/job.status
CYLC_JOB_RUNNER_NAME=pbs
CYLC_JOB_ID=3094491
CYLC_JOB_RUNNER_SUBMIT_TIME=2024-08-03T11:17:32Z
CYLC_JOB_PID=52312
CYLC_JOB_INIT_TIME=2024-08-03T11:39:00Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2024-08-03T11:45:30Z

And the succeeded message was sent with no error in job.err:

$ ssh xce grep 'INFO -' ~.../log/job/20210103T0000Z/nemo_cice_obsoper_EN000/01/job.out
2024-08-03T11:39:00Z INFO - started
2024-08-03T11:45:30Z INFO - succeeded

(Note the task was run on a platform with zmq comms, not polling)

The job logs were never synced locally because the task state never made it to succeeded.

oliver-sanders commented Aug 23, 2024

Series of events:

1) The job was submitted

INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:preparing] submitted to xce:p
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:preparing] => submitted

2) About 15 minutes later, the workflow crashed with the traceback reported in #6325.

CRITICAL - Workflow shutting down - 'bool' object has no attribute 'flow_nums'
WARNING - Orphaned tasks:
    ...
    * 20210103T0000Z/nemo_cice_obsoper_EN000 (submitted)

3) The workflow was restarted, the task was restored from the database:

INFO - LOADING task proxies
INFO - + 20210103T0000Z/nemo_cice_obsoper_EN000 submitted                                                 
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted(runahead)] => submitted

4) The task polled as succeeded:

INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] setting implied output: submitted
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] submitted to xce:None[None]
WARNING - Unhandled jobs-poll output: 2024-08-03T12:18:26Z|20210103T0000Z/nemo_cice_obsoper_EN000/01|{"job_runner_name": "pbs", "job_id": "3094491", "run_status": 0, "time_submit_exit": "2024-08-03T11:17:32Z", "time_run": "2024-08-03T11:39:00Z", "time_run_exit": "2024-08-03T11:45:30Z"}
WARNING - list index out of range
  • setting implied output: submitted - the submitted output should already have been generated, given the task was in the submitted state?
  • submitted to xce:None[None] - Is this just getting logged in relation to setting implied output: submitted? If so, we should prevent this?

5) The remote file-install (actioned as a result of the restart) was logged as completed immediately after the above (same timestamp to the second):

INFO - platform: xce - remote file install (on xcel00)
INFO - platform: xce - remote file install complete

6) The user then tried to set the task to succeeded which did nothing:

INFO - Command "set" received. ID=e76c532a-023e-4c40-9d80-761af0010ca2
    set(flow=['all'], flow_wait=False, outputs=[], prerequisites=[], tasks=['20210103T0000Z/nemo_cice_obsoper_EN000'])
INFO - output 20210103T0000Z/nemo_cice_obsoper_EN000:succeeded completed already  

7) The user attempted to kill the submission (which failed as there was no submission to kill):

INFO - Command "kill_tasks" received. ID=13eb922a-32fa-438f-9be3-94b8f0f0823e
    kill_tasks(tasks=['20210103T0000Z/nemo_cice_obsoper_EN000'])                   
INFO - [20210103T0000Z/nemo_cice_obsoper_EN000/01:submitted] => submitted(held)
INFO - Command "kill_tasks" actioned. ID=13eb922a-32fa-438f-9be3-94b8f0f0823e
ERROR - [jobs-kill cmd] ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no xcel01 env CYLC_VERSION=8.3.3 CYLC_ENV_NAME=cylc-8.3.3-1 bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-kill -- '$HOME/cylc-run/u-di486/run1/log/job' 20210103T0000Z/nemo_cice_obsoper_EN000/01
    [jobs-kill ret_code] 153                                                       
    [jobs-kill out] 2024-08-06T08:39:43Z|20210103T0000Z/nemo_cice_obsoper_EN000/01|153    

8) The user attempted to stop the workflow (default mode):

INFO - Command "stop" received. ID=cf64df54-c048-41c3-bd60-d142933b358e
    stop(mode=REQUEST(CLEAN))       
INFO - Command "stop" actioned. ID=cf64df54-c048-41c3-bd60-d142933b358e

Due to the submitted task, the workflow just sat there doing nothing.

9) I spotted the problem and killed the workflow :(

oliver-sanders commented Aug 23, 2024

I have now managed to reproduce this; however, I haven't yet got it to reproduce reliably.

Here's my workflow so far:

[scheduling]
    [[graph]]
        R1 = """
            foo:submitted => stop => fin
            foo => fin
        """

[runtime]
    [[foo]]
        script = """
            while true; do
                if ! cylc ping "${CYLC_WORKFLOW_ID}"; then
                    exit 0
                fi
            done
        """
        platform = xce-bg
    [[stop]]
        script = """
            cylc hold "${CYLC_WORKFLOW_ID}//*"
            sleep 5
            cylc play "${CYLC_WORKFLOW_ID}"
        """
    [[fin]]
        script = false

Combined with the following diff:

diff --git a/cylc/flow/commands.py b/cylc/flow/commands.py
index 173984f17..fe149013a 100644
--- a/cylc/flow/commands.py
+++ b/cylc/flow/commands.py
@@ -276,6 +276,7 @@ async def hold(schd: 'Scheduler', tasks: Iterable[str]):
     """Hold specified tasks."""
     validate.is_tasks(tasks)
     yield
+    raise Exception('foo')
     yield schd.pool.hold_tasks(tasks)
 
 
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..bb86afe83 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -947,6 +947,7 @@ class Scheduler:
                 with suppress(StopAsyncIteration):
                     n_warnings = await cmd.__anext__()
             except Exception as exc:
+                raise
                 # Don't let a bad command bring the workflow down.
                 if (
                     cylc.flow.flags.verbosity > 1 or

I'm not certain that the unexpected shutdown is required to reproduce this though.

The result:

$ cylc play -N tmp.v99NRD4cig/run14

 ▪ ■  Cylc Workflow Engine 8.4.0.dev
 ██   Copyright (C) 2008-2024 NIWA
▝▘    & British Crown (Met Office) & Contributors
...

INFO - [1/foo:waiting(runahead)] => waiting
INFO - [1/foo:waiting] => waiting(queued)
INFO - [1/foo:waiting(queued)] => waiting
INFO - [1/foo:waiting] => preparing
INFO - platform: xcel00-bg - remote init (on xcel00)
INFO - platform: xcel00-bg - remote file install (on xcel00)
INFO - platform: xcel00-bg - remote file install complete
INFO - [1/foo/01:preparing] submitted to xcel00-bg:background[44543]
INFO - [1/foo/01:preparing] => submitted
INFO - [1/stop:waiting(runahead)] => waiting
INFO - [1/stop:waiting] => waiting(queued)
INFO - [1/stop:waiting(queued)] => waiting
INFO - [1/stop:waiting] => preparing
INFO - [1/stop/01:preparing] submitted to localhost:background[114139]
INFO - [1/stop/01:preparing] => submitted
INFO - Command "hold" received. ID=34d0b958-10d5-4d3e-9dd7-d33ba4b85ea8
    hold(tasks=['*'])
CRITICAL - An uncaught error caused Cylc to shut down.
    If you think this was an issue in Cylc, please report the following traceback to the developers.
    https://github.com/cylc/cylc-flow/issues/new?assignees=&labels=bug&template=bug.md&title=;
ERROR - foo
...
    Exception: foo
CRITICAL - Workflow shutting down - foo
WARNING - Orphaned tasks:
    * 1/foo (submitted)
    * 1/stop (submitted)
INFO - platform: xcel00-bg - remote tidy (on xcel00)
INFO - DONE

$ cylc cat-log tmp.v99NRD4cig -m t
...
2024-08-23T13:17:32+01:00 INFO - LOADING task proxies
2024-08-23T13:17:32+01:00 INFO - + 1/foo submitted
2024-08-23T13:17:32+01:00 INFO - [1/foo/01:submitted(runahead)] => submitted
2024-08-23T13:17:32+01:00 INFO - + 1/stop submitted
2024-08-23T13:17:32+01:00 INFO - [1/stop/01:submitted(runahead)] => submitted
2024-08-23T13:17:32+01:00 INFO - LOADING job data
2024-08-23T13:17:32+01:00 INFO - LOADING task action timers
2024-08-23T13:17:32+01:00 INFO - + 1/foo poll_timer
2024-08-23T13:17:32+01:00 INFO - + 1/foo ['try_timers', 'submission-retry']
2024-08-23T13:17:32+01:00 INFO - + 1/foo ['try_timers', 'execution-retry']
2024-08-23T13:17:32+01:00 INFO - + 1/stop poll_timer
2024-08-23T13:17:32+01:00 INFO - + 1/stop ['try_timers', 'submission-retry']
2024-08-23T13:17:32+01:00 INFO - + 1/stop ['try_timers', 'execution-retry']
2024-08-23T13:17:32+01:00 INFO - Flows:
    flow: 1 (original flow from 1) 2024-08-23T13:17:15
2024-08-23T13:17:32+01:00 INFO - platform: xcel00-bg - remote init (on xcel00)
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] (polled)started
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] setting implied output: submitted
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] submitted to localhost:None[None]
2024-08-23T13:17:33+01:00 WARNING - Unhandled jobs-poll output: 2024-08-23T13:17:32+01:00|1/stop/01|{"job_runner_name": "background", "job_id": "114139", "job_runner_exit_polled": 0, "time_submit_exit": "2024-08-23T13:17:23+01:00", "time_run": "2024-08-23T13:17:24+01:00"}
2024-08-23T13:17:33+01:00 WARNING - list index out of range
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] => succeeded
2024-08-23T13:17:33+01:00 INFO - [1/fin:waiting(runahead)] => waiting
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] (polled)succeeded
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] setting implied output: submitted
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] submitted to xcel00-bg:None[None]
2024-08-23T13:17:34+01:00 WARNING - Unhandled jobs-poll output: 2024-08-23T12:17:34Z|1/foo/01|{"job_runner_name": "background", "job_id": "44543", "run_status": 0, "time_submit_exit": "2024-08-23T12:17:22Z", "time_run": "2024-08-23T12:17:23Z", "time_run_exit": "2024-08-23T12:17:28Z"}
2024-08-23T13:17:34+01:00 WARNING - list index out of range
2024-08-23T13:17:35+01:00 INFO - platform: xcel00-bg - remote file install (on xcel00)
2024-08-23T13:17:36+01:00 INFO - platform: xcel00-bg - remote file install complete

This chunk of the restart log displays the same symptoms as the OP:

2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] (polled)succeeded
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] setting implied output: submitted
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] submitted to xcel00-bg:None[None]
2024-08-23T13:17:34+01:00 WARNING - Unhandled jobs-poll output: 2024-08-23T12:17:34Z|1/foo/01|{"job_runner_name": "background", "job_id": "44543", "run_status": 0, "time_submit_exit": "2024-08-23T12:17:22Z", "time_run": "2024-08-23T12:17:23Z", "time_run_exit": "2024-08-23T12:17:28Z"}
2024-08-23T13:17:34+01:00 WARNING - list index out of range
2024-08-23T13:17:35+01:00 INFO - platform: xcel00-bg - remote file install (on xcel00)
2024-08-23T13:17:36+01:00 INFO - platform: xcel00-bg - remote file install complete
  • Job is polled as succeeded, but is left in the submitted state.
  • The submitted to xcel00-bg:None[None] message (probably harmless, but confusing?)
  • The Unhandled jobs-poll output message which must surely be erroneous?
  • The list index out of range message, possibly related to the Unhandled jobs-poll output? (edit: yes)
  • Note the remote file-install is logged moments after the poll.

Interestingly, if I request a manual poll on the task, the situation is resolved:

2024-08-23T13:27:10+01:00 INFO - Command "poll_tasks" received. ID=d305cef8-e23c-4d17-a64a-17955d54ba32
    poll_tasks(tasks=['1/foo'])
2024-08-23T13:27:11+01:00 INFO - Command "poll_tasks" actioned. ID=d305cef8-e23c-4d17-a64a-17955d54ba32
2024-08-23T13:27:13+01:00 INFO - [1/foo/01:submitted] (polled)succeeded
2024-08-23T13:27:13+01:00 INFO - [1/foo/01:submitted] setting implied output: started
2024-08-23T13:27:13+01:00 INFO - [1/foo/01:submitted] => running
2024-08-23T13:27:13+01:00 INFO - [1/foo/01:running] => succeeded

However, setting the succeeded output does not work:

2024-08-23T14:00:15+01:00 INFO - Command "set" received. ID=d3d095c8-3f2a-4587-a38e-89094e88e840
    set(flow=['all'], flow_wait=False, outputs=['succeeded'], prerequisites=[], tasks=['1/foo'])
2024-08-23T14:00:15+01:00 INFO - output 1/foo:succeeded completed already
2024-08-23T14:00:15+01:00 INFO - Command "set" actioned. ID=d3d095c8-3f2a-4587-a38e-89094e88e840

And of course the job is not killable because it is not running:

2024-08-23T14:00:45+01:00 INFO - Command "kill_tasks" received. ID=d26c89ef-3403-4cf7-a07b-139ba5b3193f
    kill_tasks(tasks=['1/foo'])
2024-08-23T14:00:46+01:00 INFO - [1/foo/01:submitted] => submitted(held)
2024-08-23T14:00:46+01:00 INFO - Command "kill_tasks" actioned. ID=d26c89ef-3403-4cf7-a07b-139ba5b3193f
2024-08-23T14:00:51+01:00 ERROR - [jobs-kill cmd] ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no xcel00 env CYLC_VERSION=8.4.0.dev CYLC_ENV_NAME=cylc-8.4.dev bash --login -c ''"'"'exec "$0" "$@"'"'"'' cylc jobs-kill -- '$HOME/cylc-run/tmp.v99NRD4cig/run22/log/job' 1/foo/01
    [jobs-kill ret_code] 1
    [jobs-kill out] 2024-08-23T13:00:50Z|1/foo/01|1
2024-08-23T14:00:51+01:00 WARNING - [1/foo/01:submitted(held)] job kill failed

Notes:

  • It seems to be important that the 1/foo:started message is not received before the workflow is killed.

oliver-sanders commented:

The restart remote-init always seems to complete after the poll result is received. I wondered if this might be part of the problem; however, it doesn't appear to be.

This diff will make the restart poll wait for the remote-init to complete:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..7962f391e 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -633,6 +633,10 @@ class Scheduler:
                 if self.pool.get_tasks():
                     # (If we're not restarting a finished workflow)
                     self.restart_remote_init()
+                    while self.incomplete_ri_map:
+                        self.proc_pool.process()
+                        self.manage_remote_init()
+                        await asyncio.sleep(0.1)
                     await commands.run_cmd(commands.poll_tasks, self, ['*/*'])
 
             self.run_event_handlers(self.EVENT_STARTUP, 'workflow starting')

However, the example reproduces in exactly the same way irrespective of the order:

INFO - platform: xcel01-bg - remote init (on xcel01)
INFO - platform: xcel01-bg - remote file install (on xcel01)
INFO - platform: xcel01-bg - remote file install complete
INFO - [1/foo/01:submitted] (polled)succeeded
INFO - [1/foo/01:submitted] setting implied output: submitted
INFO - [1/foo/01:submitted] submitted to xcel01-bg:None[None]
WARNING - Unhandled jobs-poll output: 2024-08-23T13:20:19Z|1/foo/01|{"job_runner_name": "background", "job_id": "45840",
    "run_status": 0, "time_submit_exit": "2024-08-23T13:20:04Z", "time_run": "2024-08-23T13:20:06Z", "time_run_exit":
    "2024-08-23T13:20:13Z"}
WARNING - list index out of range

oliver-sanders commented Aug 23, 2024

Progress!

This error appears to be pertinent:

WARNING - list index out of range

It is the exception that triggers the Unhandled jobs-poll output message; however, it is not benign, and it is neither of these:

  • general non-output/progress messages
  • poll messages that repeat previous results

It is actually a more serious error bubbling up from further down the call chain; a loose try/except catches it and causes it to surface as a less serious one.

It would appear that itask.jobs is empty on restart. Perhaps this is because the job info didn't get into the DB due to the emergency shutdown?

When the succeeded message comes in, it triggers the (implicit) submitted message to be processed. This ultimately triggers TaskEventsManager._insert_task_job to be called in order to insert the job.

However, this results in a traceback because the job conf is not present in itask.jobs.
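
As a rough sketch of that failure pattern (not the actual Cylc code - the names below are simplified stand-ins), a broad except around message handling turns the IndexError from the empty jobs list into harmless-looking warnings and leaves the task state untouched:

import logging

LOG = logging.getLogger(__name__)

class FakeTaskProxy:
    """Stand-in for a task proxy restored from the DB with no job entries."""
    def __init__(self):
        self.jobs = []       # job configs never made it back after the restart
        self.submit_num = 1

def insert_task_job(itask):
    # Mirrors the problem line: assumes at least one job config is present.
    return itask.jobs[-1]    # IndexError when the list is empty

def process_poll_message(itask, message):
    try:
        if message == "succeeded":
            insert_task_job(itask)
            return "task set to succeeded"
        # ... other message handlers ...
    except Exception as exc:
        # Over-broad except: the real error is downgraded to warnings and the
        # message falls through as "unhandled", so no state change happens.
        LOG.warning(exc)                                  # "list index out of range"
        LOG.warning("Unhandled jobs-poll output: %s", message)
        return None

process_poll_message(FakeTaskProxy(), "succeeded")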

This diff is enough to allow the workflow to continue (although the exact job config will presumably be lost forever?):

diff --git a/cylc/flow/task_events_mgr.py b/cylc/flow/task_events_mgr.py
index bf9c2ba3a..f295816e3 100644
--- a/cylc/flow/task_events_mgr.py
+++ b/cylc/flow/task_events_mgr.py
@@ -661,6 +661,8 @@ class TaskEventsManager():
             True: if polling is required to confirm a reversal of status.
 
         """
+        # if itask.identity == '1/foo' and message == 'succeeded':
+        #     breakpoint()
 
         # Log messages
         if event_time is None:
@@ -1533,7 +1535,10 @@ class TaskEventsManager():
         if (itask.tdef.run_mode == RunMode.SIMULATION) or forced:
             job_conf = {"submit_num": itask.submit_num}
         else:
-            job_conf = itask.jobs[-1]
+            try:
+                job_conf = itask.jobs[-1]
+            except Exception:
+                job_conf = {"submit_num": itask.submit_num}
 
         # Job status should be task status unless task is awaiting a
         # retry:

Turned this bodge into #6326

oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Aug 23, 2024
* Closes cylc#6314
* There are niche situations where the job is not stored in
  "TaskProxy.jobs".
* This handles the situation as gracefully as we are able to.
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Aug 29, 2024
* Address a code TODO to reduce the scope of a try/except to the
  individual expressions it was intended to cover
* The overzealous error catching had hidden a genuine error causing it
  to be missed for some time, see
  cylc#6314
oliver-sanders self-assigned this Sep 3, 2024
oliver-sanders modified the milestones: 8.3.x, 8.3.7 Oct 18, 2024
oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this issue Oct 23, 2024
* Closes cylc#6314
* There are niche situations where the job is not stored in
  "TaskProxy.jobs".
* This handles the situation as gracefully as we are able to.
oliver-sanders modified the milestones: 8.3.7, 8.3.6 Oct 25, 2024