kill: add options to orphan active jobs #5147

oliver-sanders · 2022-09-20T12:38:39Z

If a task is stuck in the submitted or running state, which can happen if a job host goes down, the job cannot be killed and the task cannot be re-triggered.

Currently I think we should be able to get around this with cylc remove and a subsequent cylc trigger; however, this isn't the cleanest solution.

A neater solution to the problem might be to add two new behaviours to cylc kill:

Allow the job to be "killed" internally even if Cylc cannot kill it on the remote platform.
Get Cylc to reset the task state to waiting rather than failed.

These behaviours could be represented by a single CLI option, or by separate ones if we find additional use cases for them.

When an operator kills a task with these behaviours enabled:

The task would change from {submitted,running} to waiting.
- Similar to the xtrigger auto-retry mechanism, reset to waiting and slap a restraint (held in this case) on the task to stop it running right away.
- We just want to orphan the stuck job, we don't want to set task outputs.
- The failed output could activate graph branches we don't want.
The job would change from {submitted,running} to {submit-failed,failed}.
- We don't want {submitted,running} jobs kicking about as this would be confusing from the UI and could cause problems with polling.
- The user has told Cylc they aren't interested in the job any more, it is dead to them (and probably in real life too).
- (I think this will suppress any messages coming back from the job if it turns out not to be so dead after all).
The task would be "held" (preventing it from re-submitting immediately).
- This is the default behaviour of cylc kill.
- Allows the user to broadcast/reload the workflow to change the platform before re-submission.
When manually triggered, the task will produce its next submission and the workflow will continue as normal.
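The transitions listed above can be sketched as a small state model. This is only an illustration of the proposal, not Cylc code: the `Task` and `Job` classes, the `orphan_kill` function and the `is_held` field are hypothetical names, and the mapping of submitted to submit-failed and running to failed is an assumed reading of "{submitted,running} to {submit-failed,failed}".

```python
from dataclasses import dataclass

# Hypothetical model of the proposal; none of these names are Cylc internals.
ACTIVE_STATES = {"submitted", "running"}

# Assumed mapping: the state an orphaned job ends up in, keyed by its prior state.
ORPHANED_JOB_STATE = {"submitted": "submit-failed", "running": "failed"}


@dataclass
class Job:
    state: str


@dataclass
class Task:
    state: str
    job: Job
    is_held: bool = False
    outputs: tuple = ()  # completed outputs; the kill must not add to these


def orphan_kill(task: Task) -> Task:
    """Apply the proposed "orphan" kill to a task stuck in an active state."""
    if task.state not in ACTIVE_STATES or task.job.state not in ACTIVE_STATES:
        raise ValueError(f"task is not active: {task.state}")
    # The job is marked as dead, whatever is really happening on the platform.
    task.job.state = ORPHANED_JOB_STATE[task.job.state]
    # The task goes back to waiting; no outputs (e.g. failed) are set.
    task.state = "waiting"
    # The task is held so that it does not re-submit immediately.
    task.is_held = True
    return task
```

A later manual trigger would then create the next submission for the waiting, held task, which is outside the scope of this sketch.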

I think cylc kill is a good place to solve this problem as it is often reported as "Cylc couldn't kill this task". Kill is a natural go-to if there's something marked as active that you would like to make "less active".

Related to:

Questions:

Is cylc kill the right place to solve this?
Combine the two behaviours into one option (easier to use) or separate them into two (if there are other valid use cases?).
Haggle over the CLI option name.

Pull requests welcome!


dpmatthews · 2022-10-10T08:25:09Z

I think these need to be separate options because sometimes you want the task state to be set to failed / submit-failed.
How about:

cylc kill --force - kill the job even if Cylc cannot kill it on the remote platform
cylc kill --forget - reset the task state to waiting (i.e. forget the job was submitted), implies --force
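To make the "implies" relationship concrete, here is a rough standalone sketch using Python's argparse. It is not Cylc's own option parser; the `kill-sketch` program name and the `parse_kill_options` helper are made up for illustration, and the two flags are only the names proposed here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with the two proposed flags (illustration only)."""
    parser = argparse.ArgumentParser(prog="kill-sketch")
    parser.add_argument(
        "--force",
        action="store_true",
        help="kill the job even if it cannot be killed on the remote platform",
    )
    parser.add_argument(
        "--forget",
        action="store_true",
        help="reset the task state to waiting; implies --force",
    )
    return parser


def parse_kill_options(argv):
    """Parse argv and apply the rule that --forget implies --force."""
    opts = build_parser().parse_args(argv)
    if opts.forget:
        opts.force = True
    return opts
```

With this, `parse_kill_options(["--forget"])` has both `forget` and `force` set, while `parse_kill_options(["--force"])` leaves `forget` unset, so the task state would still go to failed / submit-failed.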

dpmatthews · 2022-10-10T16:08:12Z

In #4727 we also propose being able to deal with tasks stuck in the submitted or running state via set-outputs.
I can't see any harm in supporting both methods.

hjoliver · 2022-10-10T20:33:07Z

Using kill for this seems reasonable to me. And I like @dpmatthews' suggested option names.

oliver-sanders added the question label Sep 20, 2022

oliver-sanders added this to the cylc-8.x milestone Sep 20, 2022

dpmatthews mentioned this issue Oct 10, 2022

How to re-run a task stuck in submitted or running? #5177

Closed
