
Timeout on restarting completed workflows #5231

Merged (12 commits) on Jul 18, 2023

Conversation

@hjoliver (Member) commented Nov 16, 2022

Closes #5078

  • A new configurable restart timer (default PT2M) starts if the task pool is found to be empty at restart (see the sketch below).
  • The timer stops if any task gets triggered before the timeout.
  • On timeout, the scheduler shuts down again.
  • No associated event handlers are needed - if it times out, nothing has changed.
  • (Also tried to disentangle some of the touched code a little, mainly scheduler.update_data_structure().)

[Screenshot from 2022-11-17 11-19-35]
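For orientation, here is a minimal sketch of the timer behaviour described above. It is not the actual Cylc implementation; the `scheduler.task_pool.get_tasks()` and `scheduler.shutdown()` calls are illustrative assumptions only.

```python
import asyncio

RESTART_TIMEOUT = 120  # seconds, i.e. the PT2M default

async def restart_timer(scheduler, timeout: float = RESTART_TIMEOUT):
    """Shut the scheduler down again if nothing runs within the timeout.

    Only started when the task pool is found to be empty at restart.
    """
    waited = 0.0
    poll_interval = 1.0
    while waited < timeout:
        await asyncio.sleep(poll_interval)
        waited += poll_interval
        if scheduler.task_pool.get_tasks():
            # A task was triggered before the timeout: stop the timer, keep running.
            return
    # Timed out with nothing triggered: nothing has changed, so just shut down
    # again (no associated event handlers needed).
    await scheduler.shutdown("workflow already ran to completion")
```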

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users.
  • Documentation issue: How to "continue" a completed workflow (cylc-doc#525).
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@hjoliver hjoliver added this to the cylc-8.0.4 milestone Nov 16, 2022
@hjoliver hjoliver self-assigned this Nov 16, 2022
@hjoliver hjoliver modified the milestones: cylc-8.0.4, cylc-8.1.0 Nov 16, 2022
@oliver-sanders (Member):

It might be worth prising apart the restart timer work from the update data store changes included on this branch so that we can get this in sooner. I think the data store changes will require more work, reviewing & testing.

@oliver-sanders oliver-sanders modified the milestones: cylc-8.1.0, cylc-8.2.0 Dec 7, 2022
@hjoliver (Member, Author) commented Jan 6, 2023

> It might be worth prising apart the restart timer work from the update data store changes included on this branch so that we can get this in sooner. I think the data store changes will require more work, reviewing & testing.

I must have made those changes for a reason; I don't think they were entirely incidental to the main change. Unfortunately, I've forgotten by now. I'll take a look ...

@hjoliver (Member, Author):

(Rebased)

@hjoliver hjoliver marked this pull request as ready for review May 3, 2023 04:39
@hjoliver (Member, Author) commented May 3, 2023

> It might be worth prising apart the restart timer work from the update data store changes included on this branch so that we can get this in sooner. I think the data store changes will require more work, reviewing & testing.

> I must have made those changes for a reason; I don't think they were entirely incidental to the main change. Unfortunately, I've forgotten by now. I'll take a look ...

@oliver-sanders - the new timeout required tweaking the scheduler's datastore-update call slightly. The problem was that the datastore update method itself contained code to figure out whether the datastore actually needed updating. It was easier (and makes more sense) to have the method just update the datastore, and to decide whether or not to call it one level up the stack. So it's a small change - I just moved that decision code out into the caller.
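A hedged sketch of the shape of that refactor; the names (`data_store_mgr`, `updates_pending`, `main_loop_step`) are illustrative, not necessarily Cylc's real attributes:

```python
# Before: update_data_structure() decided for itself whether an update was needed.
class SchedulerBefore:
    def __init__(self, data_store_mgr):
        self.data_store_mgr = data_store_mgr

    async def update_data_structure(self):
        if not self.data_store_mgr.updates_pending:
            return  # the "should we update?" decision was buried in here
        await self.data_store_mgr.update()


# After: the caller (one level up, in the main loop) makes the decision,
# and the method just performs the update.
class SchedulerAfter:
    def __init__(self, data_store_mgr):
        self.data_store_mgr = data_store_mgr

    async def main_loop_step(self):
        if self.data_store_mgr.updates_pending:
            await self.update_data_structure()

    async def update_data_structure(self):
        await self.data_store_mgr.update()
```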

@oliver-sanders (Member) left a review comment:

LGTM.

One minor issue: after the "this workflow already ran to completion" message, we get a subsequent "no active tasks" message which buries the more important information.

[Screenshot from 2023-05-11 13-32-30]

@hjoliver (Member, Author):

> One minor issue: after the "this workflow already ran to completion" message, we get a subsequent "no active tasks" message which buries the more important information.

Fixed, by not attempting to poll tasks on restart if the task pool is empty (roughly as in the sketch below).
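A minimal sketch of that fix; the function and attribute names here are illustrative, not the actual Cylc code:

```python
def poll_tasks_on_restart(task_pool, poll_task_jobs):
    """Poll active task jobs after a restart, but only if there are any.

    Skipping the poll for an empty pool avoids logging a distracting
    "no active tasks" message right after the more important
    "this workflow already ran to completion" message.
    """
    itasks = task_pool.get_tasks()
    if not itasks:
        return
    poll_task_jobs(itasks)
```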

@hjoliver (Member, Author):

(Temporarily reverting to DRAFT while I investigate something ...)

@hjoliver hjoliver marked this pull request as draft May 15, 2023 09:04
@oliver-sanders (Member):

> (Temporarily reverting to DRAFT while I investigate something ...)

Find anything?

@oliver-sanders (Member):

Added an integration test here: hjoliver#33

@hjoliver (Member, Author) commented Jul 17, 2023

> (Temporarily reverting to DRAFT while I investigate something ...)

Sorry, I no longer recall what in hell I was talking about there. I have a vague feeling I was testing something else that I thought might affect this branch, but I guess I got distracted and didn't get back to this. I took another quick scan through this change though and I can't see anything wrong with it. Plus, it works.

Merged your side PR @oliver-sanders - plz take over this PR while I'm away...

Resolved (outdated) review threads: cylc/flow/scheduler.py, tests/integration/test_scheduler.py
@oliver-sanders oliver-sanders marked this pull request as ready for review July 18, 2023 09:58
Resolved review threads: cylc/flow/cfgspec/globalcfg.py (outdated), cylc/flow/scheduler.py (outdated, two threads), CHANGES.md
@wxtim (Member) commented Jul 18, 2023

Cancelled checks on dfde64f because the changes affect CHANGES.md only.

Previous test run: 512e96b

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
@oliver-sanders (Member):

This has the potential to interact with #5592, which was merged after this branch was forked. Test failures are likely; we should do a manual test, jamming in sleeps to make the reload appear slow (see that PR for details), to ensure that it is still working.
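One hypothetical way to "jam in sleeps" for that manual test is to wrap whatever reload coroutine the scheduler exposes; the `do_reload` name in the usage comment below is illustrative, not Cylc's real API:

```python
import asyncio
import functools

def make_slow(coro_fn, delay: float = 30.0):
    """Wrap an async function so it pauses first, simulating a slow reload."""
    @functools.wraps(coro_fn)
    async def wrapper(*args, **kwargs):
        await asyncio.sleep(delay)  # artificial delay to make the reload appear slow
        return await coro_fn(*args, **kwargs)
    return wrapper

# e.g. scheduler.do_reload = make_slow(scheduler.do_reload)
```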

@oliver-sanders (Member):

Nope, it went through fine: I've tested reload with sleeps jammed in and it appears to work well.

@oliver-sanders (Member):

(plz squash merge)

@MetRonnie (Member):

1 unresolved comment: #5231 (comment)

Co-authored-by: Ronnie Dutta <61982285+MetRonnie@users.noreply.github.com>
@oliver-sanders (Member):

(Pressed the commit button)

@MetRonnie (Member) left a review comment:

LGTM

@wxtim (Member) commented Jul 18, 2023

I have carried out the tests described, and also run the integration and functional/reload tests locally.

Linked issue (may be closed by merging this PR): keep-alive timeout for restarted completed workflows