database refactor #3872

oliver-sanders · 2020-10-15T10:05:04Z

Note: Tagged against 8.0.0 as this milestone would be a good time to introduce breaking changes. However, this is not required for 8.0.0, and the interface change is less significant now that we have the GraphQL interface.

Note: We have already made breaking changes to the DB in Cylc8.

Note: We will soon need to start building functionality into the UIS to read the DB to provide offline data.

The run database can get quite large (>750MB). Note that we keep two copies of the database, so that's 1.5GB in total. It should be pretty easy to bring this down in size.

Suggestions:

Tidy the prereqs table. Store task prerequisites in their own DB table #3863
Drop the checkpoint tables. Remove checkpointing #3906
- checkpoint: re-evaluate purpose of checkpointing #3864
- These are fairly small so there is little space gain here.
- Fewer tables would make the database a bit clearer though.
Record timestamps as unix epoch integers rather than text fields.
- This should bring a significant reduction in storage requirement.
- Sqlite has date/time functions to make DB queries with integer timestamps nicer.
- We can also define our own functions for use on the Python side if we want - https://docs.python.org/3/library/sqlite3.html?highlight=sqlite3#sqlite3.Connection.create_function
Enumerate task/job statuses to store them as integers rather than text fields.
- Same pros/cons as above
- Could use a single character for clarity e.g. 'w', 'r', 's', 'f'.
Move away from compound primary keys e.g. (name, cycle_point).
- These text fields are duplicated across multiple tables which is inefficient.
- E.g. create a table of tasks (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, cycle_point TEXT)
- Compound text fields could then be replaced with integers.
- Querying by task name would still work by table relations.
- E.G. the task pool should only be adding state information, everything else should be stored elsewhere.
Have a quick check to see if any fields can be removed:
- For example I'm pretty sure we don't need the time_created field in task_states.
Consider refactoring task_events and task_jobs
- These tables contain duplicate information in different formats.
- The task_events table is functionless.
- These tables account for 60-70% of memory usage.
- It would be good if this information was sufficient to reconstruct the evolution of the graph after the fact.
  - Useful for reconstructing prerequisite objects for historical tasks.
  - data-store n > 0 window DB loads #4581 (comment)
Consider using "write ahead log".
Remove "task job" terminology - Retire the term "task jobs" cylc-doc#352
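
A few of the suggestions above (a tasks table with an integer key, enumerated statuses, integer timestamps) could look roughly like the following. This is only a sketch to make the idea concrete: the table and column names (`tasks`, `task_states`, `status`, `time_updated`), the `Status` enum and the helper functions are all made up for illustration and are not the current Cylc schema or API.

```python
import sqlite3
import time
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical status enumeration (values are illustrative only)."""
    WAITING = 0
    RUNNING = 1
    SUCCEEDED = 2
    FAILED = 3


# Hypothetical schema: text identifiers live in one place only.
SCHEMA = """
CREATE TABLE tasks (
    id INTEGER PRIMARY KEY,          -- SQLite assigns this automatically
    name TEXT NOT NULL,
    cycle_point TEXT NOT NULL,
    UNIQUE (name, cycle_point)
);
CREATE TABLE task_states (
    task_id INTEGER NOT NULL REFERENCES tasks (id),
    status INTEGER NOT NULL,         -- a Status value
    time_updated INTEGER NOT NULL    -- seconds since the unix epoch
);
"""


def get_task_id(conn, name, cycle_point):
    """Return the integer ID for a task, inserting the task if needed."""
    conn.execute(
        "INSERT OR IGNORE INTO tasks (name, cycle_point) VALUES (?, ?)",
        (name, cycle_point),
    )
    row = conn.execute(
        "SELECT id FROM tasks WHERE name = ? AND cycle_point = ?",
        (name, cycle_point),
    ).fetchone()
    return row[0]


def set_status(conn, name, cycle_point, status, now=None):
    """Record a status change using an integer key and integer timestamp."""
    now = int(time.time()) if now is None else now
    conn.execute(
        "INSERT INTO task_states (task_id, status, time_updated)"
        " VALUES (?, ?, ?)",
        (get_task_id(conn, name, cycle_point), int(status), now),
    )


def statuses_for(conn, name):
    """Query by task name via the table relation, in insertion order."""
    rows = conn.execute(
        "SELECT t.cycle_point, s.status FROM task_states s"
        " JOIN tasks t ON t.id = s.task_id"
        " WHERE t.name = ? ORDER BY s.rowid",
        (name,),
    ).fetchall()
    return [(cycle_point, Status(status)) for cycle_point, status in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
set_status(conn, "foo", "2020", Status.WAITING, now=1602756304)
set_status(conn, "foo", "2020", Status.RUNNING, now=1602756310)
set_status(conn, "foo", "2021", Status.WAITING, now=1602756311)
```

Querying by task name still works through the join in `statuses_for`, while every other table only needs to carry the integer `task_id`.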

Other related matters:

Consider the remaining points from suite database: task and job status #864
Consider suite database: more suite parameters and suite events #1032

~~Since the DB functionality is pretty self-contained and we don't require upgraders at this point these changes shouldn't be too much work.~~ As we are now approaching 8.0.rc1 release we will need to provide full DB back-compat support for any changes.


hjoliver · 2020-10-15T10:34:00Z

Most or all of the above are desirable, but maybe not on the critical path to Cylc 8 - except that, as you note, now is a good time to make breaking changes. I'd still rather get the 100% critical bits in the bag first though.

kinow · 2020-10-15T11:06:51Z

+1 to Cylc 8 or Cylc 9.

For other databases, like Postgres or Oracle, it would be simpler if the Cylc database supported multiple workflows, instead of one database per workflow.

That would require modifying the database and adding more tables, more relationships. But without doing this, it would be quite hard to implement #3360.

oliver-sanders · 2020-10-15T11:42:20Z

Could do that by adding a workflow table then adding a workflow id (integer) field to each entry?

Would need to check the integer value limit, if numbering job submissions for multiple workflows in one table Cylc could very quickly rack up some pretty big numbers!
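
On the integer limit: SQLite stores INTEGER values as signed integers of up to 8 bytes, so the ceiling is 2**63 - 1. A rough sketch of the workflow-table idea, which also checks that a very large ID survives a round trip. The table and column names (`workflows`, `job_submissions`, `workflow_id`) are hypothetical, and this says nothing about the limits of other databases such as Postgres or Oracle.

```python
import sqlite3

SQLITE_MAX_INT = 2**63 - 1  # largest value SQLite's INTEGER can hold

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE workflows (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE job_submissions (
    id INTEGER PRIMARY KEY,
    workflow_id INTEGER NOT NULL REFERENCES workflows (id),
    task_name TEXT NOT NULL
);
""")

# Two workflows sharing one database (they get IDs 1 and 2).
conn.executemany(
    "INSERT INTO workflows (name) VALUES (?)", [("one",), ("two",)]
)

# Force a very large ID to show that big counters still round-trip.
conn.execute(
    "INSERT INTO job_submissions (id, workflow_id, task_name)"
    " VALUES (?, 1, 'foo')",
    (SQLITE_MAX_INT - 1,),
)
# Let SQLite pick the next ID itself: one more than the current maximum.
cursor = conn.execute(
    "INSERT INTO job_submissions (workflow_id, task_name) VALUES (2, 'bar')"
)
next_id = cursor.lastrowid

# Rows for a single workflow are selected by the integer workflow ID.
jobs_for_two = conn.execute(
    "SELECT j.task_name FROM job_submissions j"
    " JOIN workflows w ON w.id = j.workflow_id WHERE w.name = 'two'"
).fetchall()

# Years needed to exhaust the range at one million submissions per second.
years_to_exhaust = SQLITE_MAX_INT / 1e6 / (365.25 * 24 * 3600)
```

Even at a million job submissions per second the counter would take on the order of 290,000 years to run out, so for SQLite at least the limit looks comfortable.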

kinow · 2020-10-15T12:14:22Z

> Could do that by adding a workflow table then adding a workflow id (integer) field to each entry?

I think that makes sense. And probably the simplest way to implement that. Later we can add other tables/fields/indexes/etc.

> Would need to check the integer value limit, if numbering job submissions for multiple workflows in one table Cylc could very quickly rack up some pretty big numbers!

👍 good idea.

oliver-sanders · 2020-10-29T11:22:42Z

#864

oliver-sanders · 2021-06-03T11:17:12Z

If we address this one further down the line it might be worth considering moving to a graph DB such as Dgraph. Would simplify UIS offline data and allow DB structure to more closely match the live data store.

hjoliver · 2021-06-03T21:51:31Z

Nice idea.

MetRonnie · 2021-08-24T12:05:35Z

> Record timestamps as unix epoch integers rather than text fields

We should check that we will not run into the year 2038 problem. According to the sqlite3 docs the integer datatype is capable of being 64-bit (8 bytes), so should be okay.

> INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
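
A quick way to sanity-check this, which also illustrates the SQLite date/time functions and the `create_function` hook mentioned in the issue description. The table name `events`, the column `event_time` and the helper `iso8601` are assumptions for this example only.

```python
import sqlite3
from datetime import datetime, timezone


def iso8601(epoch):
    """Format an integer unix timestamp as an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


conn = sqlite3.connect(":memory:")
# Register the Python helper so it can be called from SQL.
conn.create_function("iso8601", 1, iso8601)
conn.execute("CREATE TABLE events (event_time INTEGER NOT NULL)")

# A timestamp just beyond the signed 32-bit limit (the "2038 problem").
after_2038 = 2**31 + 1000
conn.execute("INSERT INTO events VALUES (?)", (after_2038,))

stored, builtin, custom = conn.execute(
    "SELECT event_time,"
    " datetime(event_time, 'unixepoch'),"  # built-in SQLite function
    " iso8601(event_time)"                 # user-defined Python function
    " FROM events"
).fetchone()
```

Here `stored` comes back as the same integer that was written, and both the built-in `datetime(..., 'unixepoch')` and the custom function render it as a date in January 2038.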

oliver-sanders · 2022-05-13T10:31:46Z

We should also look at finding an efficient way to build perpetual logging of prerequisites/outputs into this refactor - see #4036 (comment)

[HO] Note: storing the state of prerequisites at trigger time (which is probably the important thing for past tasks) requires some additional information not currently available from the DB, because (for conditionally triggered tasks) that is not the same as the final state of the upstream task outputs.

oliver-sanders · 2022-12-22T14:52:26Z

I think we are currently missing job messages from the DB, might be a good opportunity to get them in now that messages are visible in the GUI - #2394 (comment)

oliver-sanders added the efficiency For notable efficiency improvements label Oct 15, 2020

oliver-sanders added this to the cylc-8.0.0 milestone Oct 15, 2020

oliver-sanders added the question Flag this as a question for the next Cylc project meeting. label Oct 15, 2020

oliver-sanders mentioned this issue Jun 3, 2021

Allow cylc hold to hold future tasks that haven't spawned yet #4238

Merged

7 tasks

hjoliver modified the milestones: cylc-8.0.0, cylc-8.x Aug 4, 2021

oliver-sanders mentioned this issue Aug 13, 2021

Cylc 8 Cylc Review-Timings #4351

Closed

oliver-sanders mentioned this issue Jan 25, 2022

DB history table housekeeping option #4608

Open

oliver-sanders removed the question Flag this as a question for the next Cylc project meeting. label Apr 22, 2022

hjoliver mentioned this issue May 16, 2022

cylc show: ignore past task prereqs #4874

Merged

7 tasks

oliver-sanders mentioned this issue Dec 22, 2022

suite database: task and job status #864

Closed

34 tasks

oliver-sanders mentioned this issue Dec 22, 2022

new task state 'killed'? #2394

Open
