kill: add options orphan active jobs #5147

Open
oliver-sanders opened this issue Sep 20, 2022 · 3 comments
Labels
question Flag this as a question for the next Cylc project meeting.

@oliver-sanders
Member

oliver-sanders commented Sep 20, 2022

If a task is stuck in the submitted or running state, which can happen if a job host goes down, the job cannot be killed and the task cannot be re-triggered.

Currently I think we should be able to get around this with cylc remove and a subsequent cylc trigger; however, this isn't the cleanest solution.
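For illustration, the workaround could be scripted roughly as below. This is a sketch under assumptions: the `workaround_commands` helper, the workflow name, the cycle point and the task name are all hypothetical, and the `workflow//cycle/task` ID format is assumed to be the Cylc 8 one. The commands are only built and printed, not run.

```python
import shlex
from typing import List


def workaround_commands(workflow: str, cycle: str, task: str) -> List[List[str]]:
    """Build the two commands of the workaround: remove, then trigger."""
    # Assumed Cylc 8 style ID: <workflow>//<cycle>/<task>
    task_id = f"{workflow}//{cycle}/{task}"
    return [
        ["cylc", "remove", task_id],   # drop the stuck task from the scheduler
        ["cylc", "trigger", task_id],  # then ask for a fresh submission
    ]


# Hypothetical names, for illustration only.
for cmd in workaround_commands("my-workflow", "20220920T0000Z", "stuck_task"):
    print(shlex.join(cmd))
```

Each command list could be passed to `subprocess.run` once the IDs have been checked against the real workflow.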

A neater solution to the problem might be to add two new behaviours to cylc kill:

  1. Allow the job to be "killed" internally even if Cylc cannot kill it on the remote platform.
  2. Get Cylc to reset the task state to waiting rather than failed.

These behaviours could be represented by a single CLI option, or by separate ones if we find additional use cases for them.

When an operator kills a task with these behaviours enabled:

  • The task would change from {submitted,running} to waiting.
    • Similar to the xtrigger auto-retry mechanism, reset to waiting and slap a restraint (held in this case) on the task to stop it running right away.
    • We just want to orphan the stuck job; we don't want to set task outputs.
    • The failed output could activate graph branches we don't want.
  • The job would change from {submitted,running} to {submit-failed,failed}.
    • We don't want {submitted,running} jobs kicking about as this would be confusing from the UI and could cause problems with polling.
    • The user has told Cylc they aren't interested in the job any more, it is dead to them (and probably in real life too).
    • (I think this will suppress any messages coming back from the job if it turns out not to be so dead after all).
  • The task would be "held" (preventing it from re-submitting immediately).
    • This is the default behaviour of cylc kill.
    • Allows the user to broadcast/reload the workflow to change the platform before re-submission.
  • When manually triggered, the task will produce its next submission and the workflow will continue as normal.
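The transitions listed above can be sketched as a small state model. This is only an illustration of the proposal, not Cylc's scheduler code: the `Task` and `Job` classes, their fields, and the `orphan_kill` function are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-ins for illustration; these are not Cylc's internal classes.
ACTIVE_STATES = {"submitted", "running"}


@dataclass
class Job:
    submit_num: int
    state: str  # e.g. "submitted", "running", "submit-failed", "failed"


@dataclass
class Task:
    name: str
    state: str  # e.g. "waiting", "submitted", "running"
    is_held: bool = False
    outputs: List[str] = field(default_factory=list)
    jobs: List[Job] = field(default_factory=list)


def orphan_kill(task: Task) -> None:
    """Orphan the active job and reset the task, as the proposal describes."""
    if task.state not in ACTIVE_STATES:
        raise ValueError(f"{task.name} is not active (state: {task.state})")
    job = task.jobs[-1]
    # The job is marked as dead: submitted -> submit-failed, running -> failed.
    job.state = "submit-failed" if job.state == "submitted" else "failed"
    # The task goes back to waiting; no outputs (such as "failed") are set.
    task.state = "waiting"
    # Hold the task so it does not re-submit until it is manually triggered.
    task.is_held = True


task = Task("foo", "running", jobs=[Job(1, "running")])
orphan_kill(task)
print(task.state, task.is_held, task.jobs[-1].state, task.outputs)
# prints: waiting True failed []
```

Keeping the job state and the task state separate is what lets the job end up as failed while the task itself never emits the failed output.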

I think cylc kill is a good place to solve this problem as it is often reported as "Cylc couldn't kill this task". Kill is a natural go-to if there's something marked as active that you would like to make "less active".

Related to:

Questions:

  1. Is cylc kill the right place to solve this?
  2. Combine the two behaviours into one option (easier to use) or separate them into two (if there are other valid use cases)?
  3. Haggle over the CLI option name.

Pull requests welcome!

@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Sep 20, 2022
@oliver-sanders oliver-sanders added this to the cylc-8.x milestone Sep 20, 2022
@dpmatthews
Contributor

I think these need to be separate options because sometimes you want the task state to be set to failed / submit-failed.
How about:

  • cylc kill --force - kill the job even if Cylc cannot kill it on the remote platform
  • cylc kill --forget - reset the task state to waiting (i.e. forget the job was submitted), implies --force
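As a rough sketch of how the two suggested options could relate on the command line, assuming a plain `argparse` parser: the `kill-sketch` program name and the `build_parser` and `parse` helpers are hypothetical, and neither option is claimed to exist in the real `cylc kill`.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration; not the real `cylc kill` CLI.
    parser = argparse.ArgumentParser(prog="kill-sketch")
    parser.add_argument("task_ids", nargs="+", help="tasks to kill")
    parser.add_argument(
        "--force",
        action="store_true",
        help="kill the job even if it cannot be killed on the remote platform",
    )
    parser.add_argument(
        "--forget",
        action="store_true",
        help="reset the task state to waiting; implies --force",
    )
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    opts = build_parser().parse_args(argv)
    if opts.forget:
        # --forget implies --force, as suggested in the comment above.
        opts.force = True
    return opts


opts = parse(["--forget", "my-workflow//1/foo"])
print(opts.force, opts.forget)
# prints: True True
```

With this shape, `--force` alone leaves the task state as failed or submit-failed, which covers the case where that is the desired outcome.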

@dpmatthews
Contributor

In #4727 we also propose being able to deal with tasks stuck in the submitted or running state via set-outputs.
I can't see any harm in supporting both methods.

@hjoliver
Member

Using kill for this seems reasonable to me. And I like @dpmatthews's suggested option names.
