If a task is stuck in the submitted or running state, which can happen if a job host goes down, the job cannot be killed and the task cannot be re-triggered.
Currently, I think we should be able to get around this with cylc remove and a subsequent cylc trigger; however, this isn't the cleanest solution.
A neater solution to the problem might be to add two new behaviours to cylc kill:
- Allow the job to be "killed" internally even if Cylc cannot kill it on the remote platform.
- Get Cylc to reset the task state to waiting rather than failed.
These behaviours could be represented by a single CLI option, or by separate ones if we find additional use cases for them.
When an operator kills a task with these behaviours enabled:
- The task would change from {submitted,running} to waiting.
  - Similar to the xtrigger auto-retry mechanism, reset to waiting and slap a restraint (held in this case) on the task to stop it running right away.
  - We just want to orphan the stuck job; we don't want to set task outputs.
  - The failed output could activate graph branches we don't want.
- The job would change from {submitted,running} to {submit-failed,failed}.
  - We don't want {submitted,running} jobs kicking about as this would be confusing from the UI and could cause problems with polling.
  - The user has told Cylc they aren't interested in the job any more; it is dead to them (and probably in real life too).
  - (I think this will suppress any messages coming back from the job if it turns out not to be so dead after all.)
- The task would be "held" (preventing it from re-submitting immediately).
  - This is the default behaviour of cylc kill.
  - Allows the user to broadcast/reload the workflow to change the platform before re-submission.
- When manually triggered, the task will produce its next submission and the workflow will continue as normal.
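
The transitions proposed above can be sketched roughly as follows. This is only an illustration of the intended behaviour, not Cylc's internal API: the `Task` class, the `orphan_kill` function and the enum names are all hypothetical, and the mapping of submitted to submit-failed and running to failed is an assumption about how the two job states would pair up.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskState(Enum):
    WAITING = "waiting"
    SUBMITTED = "submitted"
    RUNNING = "running"


class JobState(Enum):
    SUBMITTED = "submitted"
    RUNNING = "running"
    SUBMIT_FAILED = "submit-failed"
    FAILED = "failed"


@dataclass
class Task:
    """Hypothetical stand-in for a task and its latest job (not a Cylc class)."""
    name: str
    state: TaskState
    job_state: JobState
    is_held: bool = False
    outputs: set = field(default_factory=set)


# Assumed pairing of active job states with their "dead" equivalents.
JOB_TRANSITIONS = {
    JobState.SUBMITTED: JobState.SUBMIT_FAILED,
    JobState.RUNNING: JobState.FAILED,
}


def orphan_kill(task: Task) -> Task:
    """Orphan a stuck job and reset its task to a held, waiting state."""
    if task.job_state not in JOB_TRANSITIONS:
        raise ValueError(
            f"{task.name}: job is not active ({task.job_state.value})"
        )
    # The job is marked as dead so it no longer shows as active.
    task.job_state = JOB_TRANSITIONS[task.job_state]
    # The task goes back to waiting; no outputs (e.g. failed) are set,
    # so no downstream graph branches are activated.
    task.state = TaskState.WAITING
    # Hold the task so it does not re-submit until manually triggered.
    task.is_held = True
    return task
```

In this sketch the operator would later release or trigger the held task to get the next submission, which is where a broadcast or reload to change the platform would fit in.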
I think cylc kill is a good place to solve this problem as it is often reported as "Cylc couldn't kill this task". Kill is a natural go-to if there's something marked as active that you would like to make "less active".
In #4727 we also propose being able to deal with tasks stuck in the submitted or running state via set-outputs.
I can't see any harm in supporting both methods.
Related to:
- cylc set-outputs: use cases and trigger compatibility #4727
- cylc remove and/or cylc forget? #4728

Questions:
- Is cylc kill the right place to solve this?

Pull requests welcome!