Correct stall diagnosis #3892

Closed
hjoliver opened this issue Oct 27, 2020 · 0 comments · Fixed by #3823

@hjoliver
Member

hjoliver commented Oct 27, 2020

Better workflow completion handling (SoD Proposal)

Long story short:

Pre-SoD, stall was pragmatically rather than conceptually grounded: the scheduler had literally got stuck and didn't know what to do about it. There were no more tasks to run, yet one or more failed or unsatisfied waiting tasks remained in the pool.

The "unsatisfied waiting tasks" condition could cause normal workflow completion to be incorrectly identified as a stall, because wholly-unsatisfied waiting tasks were spawned ahead even though they might never be needed.

Post-SoD there are no wholly-unsatisfied waiting tasks, and there will soon (#3822) be no partially-satisfied ones either (just partially-satisfied prerequisites in a hidden pool; as the example in the doc section linked to above shows, these can't be used to reliably identify a stall).

What stall should mean: the scheduler can't do anything more, but it knows that the flow is not finished.

The only valid way to make that determination now is to check for unhandled failed tasks in the pool. They are, by definition, task outcomes that were not meant to happen.

So:

  • if the active pool is empty:
    • completed
  • else if the active pool contains only unhandled failed tasks:
    • stalled
  • else:
    • still running
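
A minimal sketch of the three-way decision above, for illustration only. The `Task` class, its `status` and `failure_handled` fields, and the `diagnose` function are hypothetical names invented here; they are not Cylc's actual scheduler API.

```python
from dataclasses import dataclass
from enum import Enum


class WorkflowState(Enum):
    COMPLETED = "completed"
    STALLED = "stalled"
    RUNNING = "still running"


@dataclass(frozen=True)
class Task:
    # Hypothetical stand-in for a task in the active pool (not Cylc's class).
    name: str
    status: str                    # e.g. "running", "submitted", "failed"
    failure_handled: bool = False  # assumed flag: the flow handles this failure


def diagnose(active_pool):
    """Classify the workflow from the contents of the active task pool."""
    if not active_pool:
        # Nothing left in the pool: the flow ran to completion.
        return WorkflowState.COMPLETED
    if all(t.status == "failed" and not t.failure_handled for t in active_pool):
        # Only unhandled failures remain: nothing more can run.
        return WorkflowState.STALLED
    # Anything else in the pool means the scheduler still has work to do.
    return WorkflowState.RUNNING
```

Note that the empty-pool check has to come first: `all()` over an empty pool is true, so without it an empty pool would be reported as stalled rather than completed.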

At normal shutdown or stall, log any partially satisfied prerequisites in case they point to a flow design error; in general, though, we can't assume they were "meant" to be completed.
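
The logging step could look roughly like the sketch below. The layout of the hidden pool (a mapping from task ID to `{prerequisite: satisfied?}`), the logger name, and both function names are assumptions made for this example, not Cylc's real data structures.

```python
import logging

LOG = logging.getLogger("stall-sketch")  # hypothetical logger name


def partially_satisfied(hidden_pool):
    """Return {task_id: [unsatisfied prerequisites]} for partially satisfied tasks.

    hidden_pool maps a task ID to {prerequisite: satisfied?}; this layout
    is an assumption made for the sketch.
    """
    report = {}
    for task_id, prereqs in hidden_pool.items():
        unmet = [p for p, ok in prereqs.items() if not ok]
        # Partially satisfied: at least one prerequisite met, at least one unmet.
        if unmet and len(unmet) < len(prereqs):
            report[task_id] = unmet
    return report


def log_partially_satisfied(hidden_pool):
    """Log each partially satisfied task as a warning, not as an error."""
    report = partially_satisfied(hidden_pool)
    for task_id, unmet in report.items():
        LOG.warning(
            "%s has unsatisfied prerequisites: %s", task_id, ", ".join(unmet)
        )
    return report
```

The messages are warnings rather than errors because the result is only a hint for the flow designer; it does not feed into the stall decision.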

(Note: special treatment of unhandled failed tasks is still under discussion; if it is revoked, there will be no stall concept at all anymore.)

@hjoliver hjoliver self-assigned this Oct 27, 2020
@hjoliver hjoliver changed the title from "Workflow stall vs completion" to "Correct stall diagnosis" Oct 27, 2020
@hjoliver hjoliver added this to the cylc-8.0.0 milestone Oct 27, 2020
@oliver-sanders oliver-sanders modified the milestones: cylc-8.0.0, cylc-8.0a3 Nov 10, 2020
@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021

2 participants