Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-implement the task retry state using xtriggers #3423

Merged
merged 17 commits into from
Oct 7, 2020

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Oct 24, 2019

  • Remove the following task states:
    • TASK_STATUS_RETRYING
    • TASK_STATUS_SUBMIT_RETRYING
  • Re-implement retry logic using xtriggers.
    • Create an xtirigger called cylc[_submission]_retry_<task>.<cycle>
      • Use the cylc namespace for special triggers.
    • Create only one xtrigger per task per retry type.
  • Reset "retrying" tasks to waiting.

Caveats:

  • Can no longer select tasks by retrying states on the CLI (e.g. <cyclepoint>/*:retrying)
  • Retrying tasks are now waiting so have a different position in TASK_STATUS_ORDERED
    • TODO need to check this doesn't break polling logic

These changes partially address cylc/cylc-admin#47

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • I have opened a documentation Issue at Document how retries work in Cylc 8 cylc-doc#162.

@oliver-sanders oliver-sanders self-assigned this Oct 24, 2019
@oliver-sanders oliver-sanders added this to the cylc-8.0a2 milestone Oct 24, 2019
@hjoliver
Copy link
Member

That was quick!

cylc/flow/task_events_mgr.py Show resolved Hide resolved
cylc/flow/task_pool.py Show resolved Hide resolved
cylc/flow/xtriggers/wall_clock.py Show resolved Hide resolved
@oliver-sanders oliver-sanders marked this pull request as ready for review October 25, 2019 16:53
@cylc cylc deleted a comment Oct 25, 2019
@cylc cylc deleted a comment Oct 28, 2019
@cylc cylc deleted a comment Nov 1, 2019
@hjoliver
Copy link
Member

hjoliver commented Nov 4, 2019

I'm part way through reviewing this. I think it's all good, but there's a few caveats to think through, as you've pointed out.

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, tests as working. Trying to think through the polling logic question...

@kinow
Copy link
Member

kinow commented Oct 7, 2020

bash failures are related to docker and memory, see #3856 for fix

2020-10-07T04:25:41.9392152Z    write (2, "virtual memory exhausted\n", 25);

@hjoliver hjoliver requested a review from kinow October 7, 2020 05:00
@hjoliver
Copy link
Member

hjoliver commented Oct 7, 2020

All functional and unit tests passing now.

The bash test tests/functional/broadcast/00-simple.t fails intermittently on other branches too.

@hjoliver
Copy link
Member

hjoliver commented Oct 7, 2020

(The last 3 commits just modify comments, via the GH UI "commit suggestion" button).

@hjoliver
Copy link
Member

hjoliver commented Oct 7, 2020

(All tests passing 😁 )

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few typos and pydocs issues. No major issues with rest of the code (unless that change in rundb is a legit issue). I did find a few interesting things in other parts of the code, but they were not in this pull request (which means I've got more post-its on my screen now to investigate later 😬 )

@hjoliver I tested it with Cylc UIServer, and found no exceptions in the logs. With Cylc UI, however, the complex workflow logged that error due to missing parents.

image

image

I tested master and the errors didn't occur. But given how little we know about that error, it could be that the same happens on master, but intermittently. I have no idea if complex has anything special about it with regards to the change of retry state in this PR 👍

Thanks!
B

cylc/flow/rundb.py Outdated Show resolved Hide resolved
cylc/flow/task_events_mgr.py Outdated Show resolved Hide resolved
cylc/flow/task_events_mgr.py Outdated Show resolved Hide resolved
cylc/flow/task_events_mgr.py Show resolved Hide resolved
cylc/flow/task_events_mgr.py Outdated Show resolved Hide resolved
cylc/flow/xtrigger_mgr.py Outdated Show resolved Hide resolved
cylc/flow/rundb.py Show resolved Hide resolved
@hjoliver
Copy link
Member

hjoliver commented Oct 7, 2020

I have no idea if complex has anything special about it with regards to the change of retry state in this PR

I just checked, it doesn't look like any tasks in the complex suite will retry so it must be a coincidence.

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a look at the files that GH UI showed as changed since my last review (thanks GH for such handy functionality). Looking good to me, +1

@hjoliver
Copy link
Member

hjoliver commented Oct 7, 2020

@kinow - the same flaky bash test failed again. I think you can approve and merge this if the functional tests all pass.

@kinow
Copy link
Member

kinow commented Oct 7, 2020

(@hjoliver I'm guessing CI is still a bit unstable, but approved since the conflict changed files were only 3, from a PR that I reviewed yesterday; no issues that I could see)

@kinow
Copy link
Member

kinow commented Oct 7, 2020

@kinow - the same flaky bask test failed again. I think you can approve and merge this if the functional tests all pass.

Missed this comment by 23 seconds, while I was typing. But I managed to approve before your comment 😄

@hjoliver
Copy link
Member

hjoliver commented Oct 7, 2020

(Two approvals 😁 and the bash test fail is a known flaky ... merging)

@hjoliver hjoliver merged commit f23ce67 into cylc:master Oct 7, 2020
@oliver-sanders oliver-sanders deleted the retry branch October 8, 2020 07:50
@oliver-sanders
Copy link
Member Author

Wow, did that really happen.

@oliver-sanders oliver-sanders modified the milestones: cylc-8.0a4, cylc-8.0a3 Oct 9, 2020
@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants