polling/set: submitted task cannot be set to succeeded #6314
It must be something to do with the initial poll, because setting a plain old submitted task to succeeded works fine (current master):

[scheduling]
    [[graph]]
        R1 = foo
[runtime]
    [[foo]]
        init-script = """
            # cause me to get stuck as submitted
            cylc__job__disable_fail_signals ERR
            exit 1
        """

Run it, and then do

INFO - [1/foo/01:preparing] => submitted
INFO - Command "set" received. ID=bab71101-3443-45d3-bbc1-346c990a571b
set(flow=['all'], flow_wait=False, outputs=['succeeded'], prerequisites=[], tasks=['1/foo'])
INFO - [1/foo/01:submitted] setting implied output: started
INFO - [1/foo/01:submitted] => succeeded
INFO - Command "set" actioned. ID=bab71101-3443-45d3-bbc1-346c990a571b
INFO - Workflow shutting down - AUTOMATIC
INFO - DONE
Manually polling the above, instead of

Now trying with
Does that imply the job got submitted again after the poll result? And what's with the job runner and job ID being None, None? Maybe we're missing some log lines here that might help (did you grep for the task name?)
else:
# Unhandled messages. These include:
# * general non-output/progress messages
# * poll messages that repeat previous results
# Note that all messages are logged already at the top.
# No state change.
LOG.debug(f"[{itask}] unhandled: {message}")
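To illustrate the branch above, here is a hedged sketch (not the real Cylc dispatcher; names are made up): a status message only changes state when it differs from the task's current status, so a poll result that merely repeats what the scheduler already knows falls through to an "unhandled" branch and is only logged at debug level.

import logging

logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger("sketch")

STATUS_MESSAGES = {"submitted", "started", "succeeded", "failed"}

def process_message(current_status: str, message: str) -> str:
    """Return the (possibly updated) task status."""
    if message in STATUS_MESSAGES and message != current_status:
        LOG.info("%s => %s", current_status, message)
        return message
    # Unhandled: non-status chatter, or a poll repeating the current state.
    LOG.debug("unhandled: %s", message)
    return current_status

print(process_message("submitted", "succeeded"))   # state change
print(process_message("submitted", "submitted"))   # repeated poll result: no change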
No idea, I haven't had the chance to investigate these bugs properly yet. Looking in
Additional context:

The DB looks fine:

sqlite> select * from task_pool where name="nemo_cice_obsoper_EN000" and cycle="20210103T0000Z";
20210103T0000Z|nemo_cice_obsoper_EN000|[1]|submitted|1
sqlite> select * from task_jobs where name="nemo_cice_obsoper_EN000" and cycle="20210103T0000Z";
20210103T0000Z|nemo_cice_obsoper_EN000|1|[1]|0|1|2024-08-03T11:17:15Z|2024-08-03T11:17:32Z|0|||||xce|pbs|3094491
sqlite> select * from task_events where name="nemo_cice_obsoper_EN000" and cycle="20210103T0000Z";
nemo_cice_obsoper_EN000|20210103T0000Z|2024-08-03T11:17:32Z|1|submitted|

The job.status file on the remote platform records the success of the job:
And the succeeded message was sent with no error in job.err:
(Note the task was run on a platform with zmq comms, not polling.)

The job logs were never synced locally because the task state never made it to succeeded.
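For completeness, the DB checks above can also be run outside the sqlite3 shell. A minimal sketch; the workflow name and run-directory layout here are assumptions, so adjust the path under ~/cylc-run to your own workflow:

import sqlite3
from pathlib import Path

db = Path.home() / "cylc-run" / "my-workflow" / "runN" / "log" / "db"
conn = sqlite3.connect(f"file:{db}?mode=ro", uri=True)  # read-only; fails if the DB is missing

for table in ("task_pool", "task_jobs", "task_events"):
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE name=? AND cycle=?",
        ("nemo_cice_obsoper_EN000", "20210103T0000Z"),
    ).fetchall()
    print(table, rows)

conn.close()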
Series of events:

1) The job was submitted.
2) About 15 minutes later, the workflow crashed with this traceback #6325.
3) The workflow was restarted and the task was restored from the database:
4) The task polled as succeeded:
5) The remote file-install (actioned as a result of the restart) was logged as completed immediately after the above (same timestamp to the second):
6) The user then tried to set the task to succeeded, which did nothing:
7) The user attempted to kill the submission (which failed as there was no submission to kill):
8) The user attempted to stop the workflow (default mode):

Due to the submitted task, the workflow just sat there doing nothing.

9) I spotted the problem and killed the workflow :(
I have now managed to reproduce this; however, I haven't managed to get it to reproduce reliably yet. Here's my workflow so far:

[scheduling]
    [[graph]]
        R1 = """
            foo:submitted => stop => fin
            foo => fin
        """
[runtime]
    [[foo]]
        script = """
            while true; do
                if ! cylc ping "${CYLC_WORKFLOW_ID}"; then
                    exit 0
                fi
            done
        """
        platform = xce-bg
    [[stop]]
        script = """
            cylc hold "${CYLC_WORKFLOW_ID}//*"
            sleep 5
            cylc play "${CYLC_WORKFLOW_ID}"
        """
    [[fin]]
        script = false

Combined with the following diff:

diff --git a/cylc/flow/commands.py b/cylc/flow/commands.py
index 173984f17..fe149013a 100644
--- a/cylc/flow/commands.py
+++ b/cylc/flow/commands.py
@@ -276,6 +276,7 @@ async def hold(schd: 'Scheduler', tasks: Iterable[str]):
"""Hold specified tasks."""
validate.is_tasks(tasks)
yield
+ raise Exception('foo')
yield schd.pool.hold_tasks(tasks)
diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..bb86afe83 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -947,6 +947,7 @@ class Scheduler:
with suppress(StopAsyncIteration):
n_warnings = await cmd.__anext__()
except Exception as exc:
+ raise
# Don't let a bad command bring the workflow down.
if (
cylc.flow.flags.verbosity > 1 or

I'm not certain that the unexpected shutdown is required to reproduce this, though.

The result:

$ cylc play -N tmp.v99NRD4cig/run14
▪ ■ Cylc Workflow Engine 8.4.0.dev
██ Copyright (C) 2008-2024 NIWA
▝▘ & British Crown (Met Office) & Contributors
...
INFO - [1/foo:waiting(runahead)] => waiting
INFO - [1/foo:waiting] => waiting(queued)
INFO - [1/foo:waiting(queued)] => waiting
INFO - [1/foo:waiting] => preparing
INFO - platform: xcel00-bg - remote init (on xcel00)
INFO - platform: xcel00-bg - remote file install (on xcel00)
INFO - platform: xcel00-bg - remote file install complete
INFO - [1/foo/01:preparing] submitted to xcel00-bg:background[44543]
INFO - [1/foo/01:preparing] => submitted
INFO - [1/stop:waiting(runahead)] => waiting
INFO - [1/stop:waiting] => waiting(queued)
INFO - [1/stop:waiting(queued)] => waiting
INFO - [1/stop:waiting] => preparing
INFO - [1/stop/01:preparing] submitted to localhost:background[114139]
INFO - [1/stop/01:preparing] => submitted
INFO - Command "hold" received. ID=34d0b958-10d5-4d3e-9dd7-d33ba4b85ea8
hold(tasks=['*'])
CRITICAL - An uncaught error caused Cylc to shut down.
If you think this was an issue in Cylc, please report the following traceback to the developers.
https://github.com/cylc/cylc-flow/issues/new?assignees=&labels=bug&template=bug.md&title=;
ERROR - foo
...
Exception: foo
CRITICAL - Workflow shutting down - foo
WARNING - Orphaned tasks:
* 1/foo (submitted)
* 1/stop (submitted)
INFO - platform: xcel00-bg - remote tidy (on xcel00)
INFO - DONE
$ cylc cat-log tmp.v99NRD4cig -m t
...
2024-08-23T13:17:32+01:00 INFO - LOADING task proxies
2024-08-23T13:17:32+01:00 INFO - + 1/foo submitted
2024-08-23T13:17:32+01:00 INFO - [1/foo/01:submitted(runahead)] => submitted
2024-08-23T13:17:32+01:00 INFO - + 1/stop submitted
2024-08-23T13:17:32+01:00 INFO - [1/stop/01:submitted(runahead)] => submitted
2024-08-23T13:17:32+01:00 INFO - LOADING job data
2024-08-23T13:17:32+01:00 INFO - LOADING task action timers
2024-08-23T13:17:32+01:00 INFO - + 1/foo poll_timer
2024-08-23T13:17:32+01:00 INFO - + 1/foo ['try_timers', 'submission-retry']
2024-08-23T13:17:32+01:00 INFO - + 1/foo ['try_timers', 'execution-retry']
2024-08-23T13:17:32+01:00 INFO - + 1/stop poll_timer
2024-08-23T13:17:32+01:00 INFO - + 1/stop ['try_timers', 'submission-retry']
2024-08-23T13:17:32+01:00 INFO - + 1/stop ['try_timers', 'execution-retry']
2024-08-23T13:17:32+01:00 INFO - Flows:
flow: 1 (original flow from 1) 2024-08-23T13:17:15
2024-08-23T13:17:32+01:00 INFO - platform: xcel00-bg - remote init (on xcel00)
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] (polled)started
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] setting implied output: submitted
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] submitted to localhost:None[None]
2024-08-23T13:17:33+01:00 WARNING - Unhandled jobs-poll output: 2024-08-23T13:17:32+01:00|1/stop/01|{"job_runner_name": "background", "job_id": "114139", "job_runner_exit_polled": 0, "time_submit_exit": "2024-08-23T13:17:23+01:00", "time_run": "2024-08-23T13:17:24+01:00"}
2024-08-23T13:17:33+01:00 WARNING - list index out of range
2024-08-23T13:17:33+01:00 INFO - [1/stop/01:submitted] => succeeded
2024-08-23T13:17:33+01:00 INFO - [1/fin:waiting(runahead)] => waiting
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] (polled)succeeded
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] setting implied output: submitted
2024-08-23T13:17:34+01:00 INFO - [1/foo/01:submitted] submitted to xcel00-bg:None[None]
2024-08-23T13:17:34+01:00 WARNING - Unhandled jobs-poll output: 2024-08-23T12:17:34Z|1/foo/01|{"job_runner_name": "background", "job_id": "44543", "run_status": 0, "time_submit_exit": "2024-08-23T12:17:22Z", "time_run": "2024-08-23T12:17:23Z", "time_run_exit": "2024-08-23T12:17:28Z"}
2024-08-23T13:17:34+01:00 WARNING - list index out of range
2024-08-23T13:17:35+01:00 INFO - platform: xcel00-bg - remote file install (on xcel00)
2024-08-23T13:17:36+01:00 INFO - platform: xcel00-bg - remote file install complete

This chunk of the restart log displays the same symptoms as the OP:
Interestingly, if I request a manual poll on the task, the situation is resolved:
However, setting the succeeded output does not work:
And of course the job is not killable because it is not running:
Notes:
The restart remote-init seems to always complete after the poll result is received. I wondered if this might possibly be part of the problem; however, it doesn't appear to be. This diff will make the restart poll wait for the remote-init to complete:

diff --git a/cylc/flow/scheduler.py b/cylc/flow/scheduler.py
index 92702b0b5..7962f391e 100644
--- a/cylc/flow/scheduler.py
+++ b/cylc/flow/scheduler.py
@@ -633,6 +633,10 @@ class Scheduler:
if self.pool.get_tasks():
# (If we're not restarting a finished workflow)
self.restart_remote_init()
+ while self.incomplete_ri_map:
+ self.proc_pool.process()
+ self.manage_remote_init()
+ await asyncio.sleep(0.1)
await commands.run_cmd(commands.poll_tasks, self, ['*/*'])
self.run_event_handlers(self.EVENT_STARTUP, 'workflow starting')

However, the example reproduces in exactly the same way irrespective of the order:
Progress! This error appears to be pertinent:
It is the exception that triggers the
It is actually a more serious error bubbling up from further down the call chain and being caught by a loose try/except, causing it to surface as a less serious error.

It would appear that

When the succeeded message comes in, it triggers the (implicit) submitted message to be processed. This ultimately triggers

However, this results in a traceback because the job conf is not present in

This diff is enough to allow the workflow to continue (although the exact job config will presumably be lost forever?):

diff --git a/cylc/flow/task_events_mgr.py b/cylc/flow/task_events_mgr.py
index bf9c2ba3a..f295816e3 100644
--- a/cylc/flow/task_events_mgr.py
+++ b/cylc/flow/task_events_mgr.py
@@ -661,6 +661,8 @@ class TaskEventsManager():
True: if polling is required to confirm a reversal of status.
"""
+ # if itask.identity == '1/foo' and message == 'succeeded':
+ # breakpoint()
# Log messages
if event_time is None:
@@ -1533,7 +1535,10 @@ class TaskEventsManager():
if (itask.tdef.run_mode == RunMode.SIMULATION) or forced:
job_conf = {"submit_num": itask.submit_num}
else:
- job_conf = itask.jobs[-1]
+ try:
+ job_conf = itask.jobs[-1]
+ except Exception:
+ job_conf = {"submit_num": itask.submit_num}
# Job status should be task status unless task is awaiting a
# retry:

Turned this bodge into #6326
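For anyone following along, here is a minimal sketch (invented names, not the Cylc source) of why the underlying problem only surfaced as the bare "list index out of range" warning seen in the restart log above: the real error is an IndexError from indexing an empty jobs list, but a broad try/except reports only str(exc), with no traceback and no hint of where it came from.

import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("sketch")

class FakeTaskProxy:
    """Stand-in for a task restored from the DB with no job confs re-loaded."""
    def __init__(self):
        self.jobs = []        # empty after restart
        self.submit_num = 1

def process_submitted_message(itask):
    # equivalent of "job_conf = itask.jobs[-1]" -> IndexError on an empty list
    return itask.jobs[-1]

itask = FakeTaskProxy()
try:
    process_submitted_message(itask)
except Exception as exc:
    # A broad catch like this logs only the message text, so all that
    # appears is "WARNING - list index out of range".
    LOG.warning(exc)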
* Closes cylc#6314
* There are niche situations where the job is not stored in "TaskProxy.jobs".
* This handles the situation as gracefully as we are able to.
* Address a code TODO to reduce the scope of a try/except to the individual expressions it was intended to cover.
* The overzealous error catching had hidden a genuine error, causing it to be missed for some time, see cylc#6314.
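Purely illustrative (not the Cylc code), a sketch of the kind of scoping change described above, so the except clause guards only the expression it was written for:

def broad(conf: dict) -> None:
    try:
        handler = conf["handler"]     # the lookup the except was meant to guard
        handler(conf["payload"])      # unrelated bugs in here are silently eaten too
    except KeyError:
        pass                          # "not configured" and genuine errors look identical

def narrow(conf: dict) -> None:
    try:
        handler = conf["handler"]
    except KeyError:
        return                        # only the intended case is suppressed
    handler(conf["payload"])          # anything else now raises with a full traceback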
Spotted in the wild:
Two problems here:
And two oddities:
submitted to xce:None[None]
?!

Reproducible-ish example
This example reproduces the problem fairly reliably:
You will also need the following diff:
To reproduce: