Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Giving an invalid --stopcp on start "corrupts" database #4637

Closed
MetRonnie opened this issue Jan 31, 2022 · 9 comments · Fixed by #4827
Closed

Giving an invalid --stopcp on start "corrupts" database #4637

MetRonnie opened this issue Jan 31, 2022 · 9 comments · Fixed by #4827
Assignees
Labels
bug Something is wrong :(
Milestone

Comments

@MetRonnie
Copy link
Member

If you have a not yet run workflow and give an invalid value for --stopcp when starting it for the first time, after the workflow shuts down due to the error, you are left with a workflow database with empty tables.

Because the workflow_params table is empty, if you try to restart the workflow, Cylc thinks the database is corrupted and refuses to play ball.

ERROR - Workflow shutting down - ServiceFileError: Cannot restart - Workflow database is incompatible with Cylc 8.0rc1.dev, or is corrupted

And you can't clean the database because of #4450

$ cylc clean myflow/ --rm .service
ServiceFileError: Cannot clean - Workflow database is incompatible with Cylc 8.0rc1.dev, or is corrupted

Pull requests welcome!
This is an Open Source project - please consider contributing a bug fix
yourself (please read CONTRIBUTING.md before starting any work though).

@MetRonnie MetRonnie added the bug Something is wrong :( label Jan 31, 2022
@MetRonnie MetRonnie added this to the cylc-8.0rc2 milestone Jan 31, 2022
@wxtim wxtim self-assigned this Feb 10, 2022
@wxtim
Copy link
Member

wxtim commented Feb 10, 2022

Can confirm that I can duplicate this bug.

@hjoliver
Copy link
Member

I guess we have to make the scheduler deal with this problem at start-up (as apposed to deleting the DB on early shutdown ... or as well as, at least) because it could happen at an unclean shutdown.

@MetRonnie
Copy link
Member Author

I think it has to happen on shutdown because you could still keyboard interrupt in between DB creation and population? (Just seemed to happen to me actually)

@hjoliver
Copy link
Member

Not sure I follow you. Or if you didn't follow me 🤣 ... I should have said "I guess we have to make the scheduler deal with this at restart" (not at the original start-up). I mean, even if we tried to fix this at the source - i.e. delete the "corrupted" database just before premature shutdown, so that it does not affect a later restart - that might not be sufficient, because the scheduler (or the host it runs on) can be killed without a chance to clean up after itself.

@wxtim
Copy link
Member

wxtim commented Feb 11, 2022

I had originally been entirely unclear whether to handle this at shutdown or restart.

Why do it at shutdown

  • Seems to me like shutdown shouldn't leave the database in a state we're not happy with.
  • Risk of future accrual of a long-winded set of checks on the database as we discover other db states where this happens.
  • [Human factor, not a good reason] Implementation seems fairly clear to me.

Why do it at restart

  • Problem is caused by workflow missing a step during shutdown.
  • Scheduler might shutdown horribly and not do any cleanup.

My Verdict - shutdown (certainty: low)

I wonder whether the scheduler shutting down horribly is a different problem to the one described where the scheduler never really gets started because the config is broken. I think it's reasonable to rely on the shutdown logic to clean the database in this case.

@MetRonnie
Copy link
Member Author

MetRonnie commented Feb 11, 2022

I guess we have to make the scheduler deal with this at restart

Ah, that makes more sense! So if it finds the workflow_params table exists and is empty at start, delete the whole database (logging a warning presumably) and do a start as opposed to restart. (Need to check the table exists because in previous versions of Cylc it was called suite_params, so if the database exists and workflow_params table does not, it means it's an incompatible DB rather than corrupted)

If this functionality is in a self-contained function we could do this both on shutdown (because we ought to) and restart (because shutdown itself can be interrupted too)

@wxtim
Copy link
Member

wxtim commented Feb 11, 2022

That makes sense to me as a proposal @MetRonnie - by making the check self contained I can test it easily and call it wherever if we change our minds later. 😄

@wxtim
Copy link
Member

wxtim commented Feb 11, 2022

From team meeting discussion 20220211T1400Z

  • Does this apply to cases where the config is malformed as well as opts?
    • No
  • Does this apply to any other opts?
    • No
  • What happens to the database when you reinstall and re-start?
  • It looks like we should validate options and configs before creating the database.

@wxtim wxtim added the BLOCKED This can't happen until something else does label Mar 18, 2022
@wxtim
Copy link
Member

wxtim commented Mar 18, 2022

Blocked pending resolution of #4709

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment