Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database refactor #3872

Open
2 of 11 tasks
oliver-sanders opened this issue Oct 15, 2020 · 10 comments
Open
2 of 11 tasks

database refactor #3872

oliver-sanders opened this issue Oct 15, 2020 · 10 comments
Labels
efficiency For notable efficiency improvements
Milestone

Comments

@oliver-sanders
Copy link
Member

oliver-sanders commented Oct 15, 2020

Note: Tagged against 8.0.0 as this milestone would be a good time to introduce breaking changes, however, this is not required for 8.0.0 and the interface change is less significant now that we have the GraphQL interface.

Note: We have already made breaking changes to the DB in Cylc8.

Note: We will soon need to start building functionality into the UIS to read the DB to provide offline data.

The run database can get quite large >750MB, note we have two copies of the database so that's 1.5GB. It should be pretty easy to bring this down in size.

Suggestions:

  • Tidy the prereqs table. Store task prerequisites in their own DB table #3863
  • Drop the checkpoint tables. Remove checkpointing #3906
  • Record timestamps as unix epoch integers rather than text fields.
  • Enumerate task/job statuses to store them as integers rather than text fields.
    • Same pros/cons as above
    • Could use a single character for clarity e.g. 'w', 'r', 's', 'f'.
  • Move away from compound primary keys e.g. (name, cycle_point).
    • These text fields are duplicated across multiple tables which is inefficient.
    • E.G. create a table of tasks (id INTEGER AUTONUMBER, name TEXT, cycle_point TEXT)
    • Compound text fields could then be replaced with integers.
    • Querying by task name would still work by table relations.
    • E.G. the task pool should only be adding state information, everything else should be stored elsewhere.
  • Have a quick check to see if any fields can be removed:
    • For example I'm pretty sure we don't need the time_created field in task_states.
  • Consider refactoring task_events and task_jobs
    • These tables contain duplicate information in different formats.
    • The task_events table is functionless.
    • These tables account for 60-70% of memory usage.
    • It would be good if this information was sufficient to reconstruct the evolution of the graph after the fact.
  • Consider using "write ahead log".
  • Remove "task job" terminology - Retire the term "task jobs" cylc-doc#352

Other related matters:

Since the DB functionality is pretty self-contained and we don't require upgraders at this point these changes shouldn't be too much work. As we are now approaching 8.0.rc1 release we will need to provide full DB back-compat support for any changes.

@oliver-sanders oliver-sanders added the efficiency For notable efficiency improvements label Oct 15, 2020
@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Oct 15, 2020
@oliver-sanders oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Oct 15, 2020
@hjoliver
Copy link
Member

Most or all of the above are desirable, but maybe not on the critical path to Cylc 8 - except that, as you note, now is a good time to make breaking changes. I'd still rather get the 100% critical bits in the bag first though.

@kinow
Copy link
Member

kinow commented Oct 15, 2020

+1 to Cylc 8 or Cylc 9.

For other databases, like Postgres or Oracle, it would be simpler if the Cylc database supported multiple workflows, instead of one database per workflow.

That would require modifying the database and adding more tables, more relationships. But without doing this, it would be quite hard to implement #3360.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Oct 15, 2020

Could do that by adding a workflow table then adding a workflow id (integer) field to each entry?

Would need to check the integer value limit, if numbering job submissions for multiple workflows in one table Cylc could very quickly rack up some pretty big numbers!

@kinow
Copy link
Member

kinow commented Oct 15, 2020

Could do that by adding a workflow table then adding a workflow id (integer) field to each entry?

I think that makes sense. And probably the simplest way to implement that. Later we can add other tables/fields/indexes/etc.

Would need to check the integer value limit, if numbering job submissions for multiple workflows in one table Cylc could very quickly rack up some pretty big numbers!

👍 good idea.

@oliver-sanders
Copy link
Member Author

#864

@oliver-sanders
Copy link
Member Author

If we address this one further down the line it might be worth considering moving to a graph DB such as Dgraph. Would simplify UIS offline data and allow DB structure to more closely match the live data store.

@hjoliver
Copy link
Member

hjoliver commented Jun 3, 2021

Nice idea.

@MetRonnie
Copy link
Member

MetRonnie commented Aug 24, 2021

Record timestamps as unix epoch integers rather than text fields

We should check that we will not run into the year 2038 problem. According to the sqlite3 docs the integer datatype is capable of being 64-bit (8 bytes), so should be okay.

INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.

@oliver-sanders oliver-sanders removed the question Flag this as a question for the next Cylc project meeting. label Apr 22, 2022
@oliver-sanders
Copy link
Member Author

oliver-sanders commented May 13, 2022

We should also look at finding an efficient way to perpetual logging prerequisites/outputs into this refactor - see #4036 (comment)

[HO] note to store the state of prerequisites at trigger time (which is probably the important thing for past tasks) requires some additional information not currently available from the DB, because (for conditionally triggered tasks) that is not the same as the final state of the upstream task outputs.

@oliver-sanders
Copy link
Member Author

oliver-sanders commented Dec 22, 2022

I think we are currently missing job messages from the DB, might be a good opportunity to get them in now that messages are visible in the GUI - #2394 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
efficiency For notable efficiency improvements
Projects
None yet
Development

No branches or pull requests

4 participants