
Data store avoid graph re-walk #5660

Merged: 14 commits merged into cylc:8.2.x on Nov 2, 2023

Conversation

@dwsutherland (Member) commented Aug 1, 2023

Addresses #5485 and #5609.

At present we've resolved an efficiency problem at the n=1 window size by restricting the graph walk to only parents or only children (i.e. 'c' or 'p', or 'ccc' or 'ppp' for n=3). However, this excludes cousins, which becomes evident at n>1 (e.g. 'cp' or 'pc' at n=2).

This restriction was a cheap way to avoid graph re-walks (#5475).

Here the problem is actually fixed at any window size, by changing the graph walk algorithm from a recursive approach to a walk-location approach. From the method's description:

Fills in graph walk from existing walks if possible, otherwise expands
the graph front from wherever hasn't been walked.
Walk nodes are grouped into locations which are tagged according to
the parent/child path, i.e. 'cpc' would be children-parents-children away
from the active/start task. This not only provides a way to cheaply
re-walk, but also gives the edge distance from the origin.
The furthest child boundary nodes are registered as prune triggers for
the origin's parents, so when they become active the parents are
assessed for pruning eligibility.
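
To illustrate the idea, here is a minimal sketch (illustrative only, not the PR's actual data_store_mgr code) of a walk grouped into locations: each node is tagged with the child/parent path that reached it, so the edge distance is simply the tag length, and cousins ('cp', 'pc') are reached naturally:

from collections import deque

# Hypothetical sketch of the walk-location approach; all names here are
# illustrative, not the real cylc-flow implementation.
def walk_window(origin, get_children, get_parents, n_edge_distance):
    """Return {node_id: edge_depth} for all nodes within n edges of origin.

    Locations are strings over {'c', 'p'}, e.g. 'cpc' means
    children-parents-children away from the origin; their length is the
    edge distance.
    """
    walk = {origin: ''}  # node -> location tag
    frontier = deque([origin])
    while frontier:
        node = frontier.popleft()
        location = walk[node]
        if len(location) >= n_edge_distance:
            continue  # window boundary reached, don't expand further
        for tag, get_neighbours in (('c', get_children), ('p', get_parents)):
            for nbr in get_neighbours(node):
                if nbr not in walk:
                    walk[nbr] = location + tag
                    frontier.append(nbr)
    return {node: len(loc) for node, loc in walk.items()}

Because the tags record how each node was reached, a later walk from a neighbouring active task can splice in an already-computed sub-walk instead of regenerating children and parents from scratch.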

Window resizing now takes effect as soon as the new size is set (rather than as the graph progresses).

The n-window edge depth of nodes is now found/updated via a new field, graph_depth. A good example workflow:

[scheduling]
    initial cycle point = 2020
    [[graph]]
        P1Y = """
            a => b => c => d => e => f => g
            b => h => i => f
            g[-P1Y] => a
        """

[runtime]
    [[root]]
        script = sleep 2

    [[b, c, d, e, f, i]]

    [[a]]
        script = """
set +e
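
# Build a GraphQL request that widens this workflow's data-store n-window
# to 4 edge distances via the setGraphWindowExtent mutation.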

read -r -d '' gqlDoc <<_DOC_
{"request_string": "
mutation {
  setGraphWindowExtent (
    workflows: [\"${CYLC_WORKFLOW_ID}\"],
    nEdgeDistance: 4) {
    result
  }
}",
"variables": null}
_DOC_

echo "${gqlDoc}"

cylc client "$CYLC_WORKFLOW_ID" graphql <<< "${gqlDoc}" 2>/dev/null


set -e
"""

    [[g]]
        script = """
# Slow 2nd branch down
sleep 20
"""

    [[h]]
        script = """
# Slow 2nd branch down
sleep 20

"""

Gives:

{
  "data": {
    "workflows": [
      {
        "id": "~sutherlander/pug/run1",
        "taskProxies": [
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/a",
            "state": "succeeded",
            "graphDepth": 2
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/b",
            "state": "succeeded",
            "graphDepth": 1
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/h",
            "state": "running",
            "graphDepth": 0
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/c",
            "state": "running",
            "graphDepth": 0
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/i",
            "state": "waiting",
            "graphDepth": 1
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/d",
            "state": "waiting",
            "graphDepth": 1
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/f",
            "state": "waiting",
            "graphDepth": 2
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/e",
            "state": "waiting",
            "graphDepth": 2
          },
          {
            "id": "~sutherlander/pug/run1//20200101T0000Z/g",
            "state": "waiting",
            "graphDepth": 3
          },
          {
            "id": "~sutherlander/pug/run1//20210101T0000Z/a",
            "state": "waiting",
            "graphDepth": 4
          }
        ]
      }
    ]
  }
}
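
For reference, output like the above can be fetched with a query along these lines (a sketch; field and argument names as per the cylc-flow GraphQL schema):

query {
  workflows(ids: ["~sutherlander/pug/run1"]) {
    id
    taskProxies {
      id
      state
      graphDepth
    }
  }
}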

Performance
Taken from #5435

[task parameters]
    m = 0..7000
[scheduling]
    [[queues]]
        [[[default]]]
            limit = 4
    [[graph]]
        R1 = "a => b<m>"
[runtime]
    [[a, b<m>]]
$ cylc play <id> --profile --no-detach

Run this for a few minutes; when you get bored, Ctrl-C it, then:

$ pip install snakeviz
$ snakeviz ~/cylc-run/<id>/log/scheduler/profile.prof
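
Alternatively, the same profile can be summarised without a browser using the standard-library pstats module (a minimal sketch; adjust the path for your run):

import pstats

# Print the 10 most expensive calls by cumulative time from the profile
# written by `cylc play --profile`.
stats = pstats.Stats('profile.prof')  # e.g. ~/cylc-run/<id>/log/scheduler/profile.prof
stats.sort_stats('cumulative').print_stats(10)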

This PR is even faster than the current approach (#5475) at n=1:

     7001    0.156    0.000    1.738    0.000 /home/sutherlander/cylc/cylc-flow/cylc/flow/data_store_mgr.py:695(increment_graph_window)


At n=2 there's a bit more load:

     7001    0.185    0.000    4.161    0.001 /home/sutherlander/cylc/cylc-flow/cylc/flow/data_store_mgr.py:695(increment_graph_window)

but clearly not squared or cubed.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes.
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included.
• CHANGES.md entry included if this is a change that can affect users.
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.

@dwsutherland dwsutherland added the efficiency For notable efficiency improvements label Aug 1, 2023
@dwsutherland dwsutherland self-assigned this Aug 1, 2023
@dwsutherland dwsutherland force-pushed the data-store-avoid-graph-rewalk branch 3 times, most recently from f10aa7b to 65cb9fc Compare August 1, 2023 09:38
@dwsutherland (Member, Author)

Coverage should improve with an n=2 window test.

@dwsutherland dwsutherland force-pushed the data-store-avoid-graph-rewalk branch 4 times, most recently from 99a86e8 to f110f63 Compare August 3, 2023 11:49
@oliver-sanders oliver-sanders added this to the cylc-8.3.0 milestone Aug 7, 2023
@dwsutherland dwsutherland force-pushed the data-store-avoid-graph-rewalk branch 4 times, most recently from 714f758 to 58dbf6e Compare August 9, 2023 11:35
@oliver-sanders (Member)

I've been having a play with this branch with the n=1 window in cylc tui; looks good.

I've been playing with the following diff in order to track which children/parents are being walked over:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index b6a2960fb..e78cd767c 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -732,6 +732,7 @@ class DataStoreMgr:
             None
 
         """
+        print(f'# increment_graph_window({source_tokens.relative_id})')
 
         # common refrences
         active_id = source_tokens.id
diff --git a/cylc/flow/taskdef.py b/cylc/flow/taskdef.py
index a4ae681e2..d2517383b 100644
--- a/cylc/flow/taskdef.py
+++ b/cylc/flow/taskdef.py
@@ -36,6 +36,7 @@ if TYPE_CHECKING:
 
 def generate_graph_children(tdef, point):
     """Determine graph children of this task at point."""
+    print(f'# generate_graph_children({point}/{tdef.name})')
     graph_children = {}
     for seq, dout in tdef.graph_children.items():
         for output, downs in dout.items():
@@ -78,6 +79,7 @@ def generate_graph_parents(tdef, point, taskdefs):
 
     Infer parents be reversing upstream triggers that lead to point/task.
     """
+    print(f'# generate_graph_parents({point}/{tdef.name})')
     graph_parents = {}
     for seq, triggers in tdef.graph_parents.items():
         if not seq.is_valid(point):

Using the graph from your example above with the default n=1 window:

flow.cylc
[scheduler]
    cycle point format = CCYY

[scheduling]
    runahead limit = P1
    initial cycle point = 2020
   [[graph]]
      P1Y = """
         a => b => c => d => e => f => g
         b => h => i => f
         g[-P1Y] => a
      """

[runtime]
    [[root]]
        script = sleep 2
    [[b, c, d, e, f, i]]
    [[a]]
    [[g,h]]

It looks like the algorithm is looking one edge further in front (c) than behind (p).

E.g. here's a snippet from the --no-detach output:

2023-08-10T09:33:54+01:00 INFO - New flow: 1 (original flow from 2020) 2023-08-10 09:33:54
# generate_graph_children(2020/a)
# increment_graph_window(2020/a)
# generate_graph_children(2020/a)
# generate_graph_children(2020/b)    <= b is n=1 so its children would be n=2
# generate_graph_parents(2020/a)
2023-08-10T09:33:54+01:00 INFO - [2020/a waiting(runahead) job:00 flows:1] => waiting
2023-08-10T09:33:54+01:00 INFO - [2020/a waiting job:00 flows:1] => waiting(queued)
2023-08-10T09:33:55+01:00 INFO - [2020/a waiting(queued) job:00 flows:1] => waiting
2023-08-10T09:33:55+01:00 INFO - [2020/a waiting job:01 flows:1] => preparing
2023-08-10T09:33:56+01:00 INFO - [2020/a preparing job:01 flows:1] submitted to localhost:background[34318]
2023-08-10T09:33:56+01:00 INFO - [2020/a preparing job:01 flows:1] => submitted
2023-08-10T09:33:56+01:00 INFO - [2020/a submitted job:01 flows:1] health: submission timeout=P1D, polling intervals=PT1H,...
2023-08-10T09:33:58+01:00 INFO - [2020/a submitted job:01 flows:1] => running
2023-08-10T09:33:58+01:00 INFO - [2020/a running job:01 flows:1] health: execution timeout=None, polling intervals=PT1H,...
2023-08-10T09:34:00+01:00 INFO - [2020/a running job:01 flows:1] => succeeded
# generate_graph_children(2020/b)
# increment_graph_window(2020/b)
# generate_graph_children(2020/b)   <= duplicate method call from previous walk
# generate_graph_children(2020/h)
# generate_graph_children(2020/c)   <= c is n=1 so its children would be n=2
# generate_graph_parents(2020/b)

E.g. in the above output, when the n=0 window was {2020/a}, the algorithm iterated over the graph children of 2020/b, which are outside the n-window. These edges aren't needed for the n-window, but might be needed for the algorithm?

When the children of 2020/b entered the n=1 window, the iteration was repeated, which might mean the cache was removed too early?

@oliver-sanders (Member)

Moving onto the n=2 window (and dumping out the n-window after each call) with this diff:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index b6a2960fb..3a21bbbea 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -473,7 +473,7 @@ class DataStoreMgr:
         self.updated_state_families = set()
         # Update workflow state totals once more post delta application.
         self.state_update_follow_on = False
-        self.n_edge_distance = 1
+        self.n_edge_distance = 2
         self.next_n_edge_distance = None
         self.latest_state_tasks = {
             state: deque(maxlen=LATEST_STATE_TASKS_QUEUE_SIZE)
@@ -732,6 +732,7 @@ class DataStoreMgr:
             None
 
         """
+        print(f'# increment_graph_window({source_tokens.relative_id})')
 
         # common refrences
         active_id = source_tokens.id
@@ -1018,6 +1019,10 @@ class DataStoreMgr:
                 self.prune_trigger_nodes[active_id])
             del self.prune_trigger_nodes[active_id]
 
+        task_proxies = self.data[self.workflow_id][TASK_PROXIES]
+        for tp_id in sorted(list(task_proxies) + list(self.added[TASK_PROXIES])):
+            print(f'$ {tp_id}')
+
     def generate_edge(
         self,
         parent_tokens: Tokens,
diff --git a/cylc/flow/taskdef.py b/cylc/flow/taskdef.py
index a4ae681e2..d2517383b 100644
--- a/cylc/flow/taskdef.py
+++ b/cylc/flow/taskdef.py
@@ -36,6 +36,7 @@ if TYPE_CHECKING:
 
 def generate_graph_children(tdef, point):
     """Determine graph children of this task at point."""
+    print(f'# generate_graph_children({point}/{tdef.name})')
     graph_children = {}
     for seq, dout in tdef.graph_children.items():
         for output, downs in dout.items():
@@ -78,6 +79,7 @@ def generate_graph_parents(tdef, point, taskdefs):
 
     Infer parents be reversing upstream triggers that lead to point/task.
     """
+    print(f'# generate_graph_parents({point}/{tdef.name})')
     graph_parents = {}
     for seq, triggers in tdef.graph_parents.items():
         if not seq.is_valid(point):

Here's the log:

2023-08-10T10:17:30+01:00 INFO - New flow: 1 (original flow from 2020) 2023-08-10 10:17:30
# generate_graph_children(2020/a)
# increment_graph_window(2020/a)
# generate_graph_children(2020/a)
# generate_graph_children(2020/b)
# generate_graph_parents(2020/a)
# generate_graph_children(2020/b)   <= duplicate call
# generate_graph_children(2020/h)   <= h is n=2 so its children would be n=3
# generate_graph_children(2020/c)   <= c is n=2 so its children would be n=3
# generate_graph_parents(2020/b)
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/a
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/b
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/c
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/h
2023-08-10T10:17:30+01:00 INFO - [2020/a waiting(runahead) job:00 flows:1] => waiting
2023-08-10T10:17:30+01:00 INFO - [2020/a waiting job:00 flows:1] => waiting(queued)
2023-08-10T10:17:30+01:00 INFO - [2020/a waiting(queued) job:00 flows:1] => waiting
2023-08-10T10:17:30+01:00 INFO - [2020/a waiting job:01 flows:1] => preparing
2023-08-10T10:17:32+01:00 INFO - [2020/a preparing job:01 flows:1] submitted to localhost:background[71918]
2023-08-10T10:17:32+01:00 INFO - [2020/a preparing job:01 flows:1] => submitted
2023-08-10T10:17:32+01:00 INFO - [2020/a submitted job:01 flows:1] health: submission timeout=P1D, polling intervals=PT1H,...
2023-08-10T10:17:34+01:00 INFO - [2020/a submitted job:01 flows:1] => running
2023-08-10T10:17:34+01:00 INFO - [2020/a running job:01 flows:1] health: execution timeout=None, polling intervals=PT1H,...
2023-08-10T10:17:36+01:00 INFO - [2020/a running job:01 flows:1] => succeeded
# generate_graph_children(2020/b)
# increment_graph_window(2020/b)
# generate_graph_children(2020/h)   <= duplicate call
# generate_graph_children(2020/i)   <= i is n=2 so its children would be n=3
# generate_graph_parents(2020/h)
# generate_graph_children(2020/c)   <= duplicate call
# generate_graph_children(2020/d)   <= d is n=2 so its children would be n=3
# generate_graph_parents(2020/c)   <= duplicate call
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/a
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/b
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/c
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/d
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/h
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/i
2023-08-10T10:17:36+01:00 INFO - [2020/b waiting(runahead) job:00 flows:1] => waiting
2023-08-10T10:17:36+01:00 INFO - [2020/b waiting job:00 flows:1] => waiting(queued)
2023-08-10T10:17:37+01:00 INFO - [2020/b waiting(queued) job:00 flows:1] => waiting
2023-08-10T10:17:37+01:00 INFO - [2020/b waiting job:01 flows:1] => preparing
2023-08-10T10:17:38+01:00 INFO - [2020/b preparing job:01 flows:1] submitted to localhost:background[72234]
2023-08-10T10:17:38+01:00 INFO - [2020/b preparing job:01 flows:1] => submitted
2023-08-10T10:17:38+01:00 INFO - [2020/b submitted job:01 flows:1] health: submission timeout=P1D, polling intervals=PT1H,...
2023-08-10T10:17:40+01:00 INFO - [2020/b submitted job:01 flows:1] => running
2023-08-10T10:17:40+01:00 INFO - [2020/b running job:01 flows:1] health: execution timeout=None, polling intervals=PT1H,...
2023-08-10T10:17:42+01:00 INFO - [2020/b running job:01 flows:1] => succeeded
# generate_graph_children(2020/h)
# increment_graph_window(2020/h)
# generate_graph_children(2020/d)   <= duplicate call
# generate_graph_children(2020/e)   <= e is n=2 so its children would be n=3
# generate_graph_parents(2020/d)
# generate_graph_children(2020/i)   <= duplicate call
# generate_graph_children(2020/f)   <= f is n=2 so its children would be n=3
# generate_graph_parents(2020/i)
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/a
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/b
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/c
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/d
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/e
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/f
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/h
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/i
2023-08-10T10:17:42+01:00 INFO - [2020/h waiting(runahead) job:00 flows:1] => waiting
2023-08-10T10:17:42+01:00 INFO - [2020/h waiting job:00 flows:1] => waiting(queued)
# generate_graph_children(2020/c)
# increment_graph_window(2020/c)   <= fully cached graph walk, nice!
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/a
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/b
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/c
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/d
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/e
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/f
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/h
$ ~osanders/tmp.h3r2rW6eFX/run12//2020/i

Looks good and gives the right result; there are some possible savings to be made by avoiding duplicate generate_graph_children calls.

@oliver-sanders (Member)

There are duplicate generate_ghost_task calls, but this isn't a problem because the method returns early if the task already exists in the store.

I spotted a [very] small optimisation moving that return a little higher up in the code:

--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -1108,11 +1108,15 @@ class DataStoreMgr:
         Orphan tasks with no children return (True, None) respectively.
 
         """
+        tp_id = tokens.id
+        task_proxies = self.data[self.workflow_id][TASK_PROXIES]
+        if tp_id in task_proxies or tp_id in self.added[TASK_PROXIES]:
+            # ghost task already created
+            return
+
         name = tokens['task']
         point_string = tokens['cycle']
         t_id = self.definition_id(name)
-        tp_id = tokens.id
-        task_proxies = self.data[self.workflow_id][TASK_PROXIES]
 
         is_orphan = False
         if name not in self.schd.config.taskdefs:
@@ -1120,8 +1124,6 @@ class DataStoreMgr:
 
         if itask is None:
             itask = self.schd.pool.get_task(point_string, name)
-        if tp_id in task_proxies or tp_id in self.added[TASK_PROXIES]:
-            return
 
         if itask is None:
             itask = TaskProxy(

@oliver-sanders (Member) commented Aug 10, 2023

Tried out the new graphDepth field; the numbers came out as expected, so this should close #5609, which will allow us to unlock the higher n-window views in the GUI!

Is the graph window that the data store is currently configured to track available in the schema? That would be handy for the UI.

@oliver-sanders oliver-sanders linked an issue Aug 10, 2023 that may be closed by this pull request
@dwsutherland (Member, Author) commented Aug 10, 2023

It looks like the algorithm is looking one edge further in front (c) than behind (p).

Yes, I rely on all but the first node in a branch having c and p already generated (because the first walk generates the walk stub for all walked nodes in its graph). Then the nodes of c and p are used to look up the "one edge(s) further".
Sorry, I misunderstood the question. I think that method is used elsewhere:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index 03c77eb94..a085f2f3f 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -754,6 +754,7 @@ class DataStoreMgr:
         #   walk filling, that may not have been the entire walk).
         # If walk already completed, must have gone from non-active to active
         # again.. So redo walk (as walk nodes may be pruned).
+        print(f'# Active ID: {active_id}')
         if (
             active_id not in all_walks
             or active_id in self.n_window_completed_walks
@@ -899,6 +900,7 @@ class DataStoreMgr:
                             )
                         continue
 
+                    print(f'# Generate Children/Parents of {node_tokens.relative_id}')
                     # Children/downstream nodes
                     # TODO: xtrigger is workflow_state edges too
                     # Reference set for workflow relations
diff --git a/cylc/flow/taskdef.py b/cylc/flow/taskdef.py
index a4ae681e2..daae255f4 100644
--- a/cylc/flow/taskdef.py
+++ b/cylc/flow/taskdef.py
@@ -15,6 +15,7 @@
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
 
 """Task definition."""
+import inspect
 
 from collections import deque
 from typing import TYPE_CHECKING
@@ -36,6 +37,7 @@ if TYPE_CHECKING:
 
 def generate_graph_children(tdef, point):
     """Determine graph children of this task at point."""
+    print(f'# Children of {point}/{tdef.name} - Called by {inspect.currentframe().f_back.f_code.co_name}')
     graph_children = {}
     for seq, dout in tdef.graph_children.items():
         for output, downs in dout.items():
@@ -78,6 +80,7 @@ def generate_graph_parents(tdef, point, taskdefs):
 
     Infer parents be reversing upstream triggers that lead to point/task.
     """
+    print(f'# parents of {point}/{tdef.name} - Called by {inspect.currentframe().f_back.f_code.co_name}')
     graph_parents = {}
     for seq, triggers in tdef.graph_parents.items():
         if not seq.is_valid(point):

2023-08-11T16:46:42+12:00 INFO - New flow: 1 (original flow from 2020) 2023-08-11 16:46:42
# Children of 2020/a - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/a
# Generate Children/Parents of 2020/a
# Children of 2020/a - Called by increment_graph_window
# Children of 2020/b - Called by __init__
# parents of 2020/a - Called by increment_graph_window
2023-08-11T16:46:42+12:00 INFO - [2020/a waiting(runahead) job:00 flows:1] => waiting
2023-08-11T16:46:42+12:00 INFO - [2020/a waiting job:00 flows:1] => waiting(queued)
2023-08-11T16:46:42+12:00 INFO - [2020/a waiting(queued) job:00 flows:1] => waiting
2023-08-11T16:46:42+12:00 INFO - [2020/a waiting job:01 flows:1] => preparing
2023-08-11T16:46:43+12:00 INFO - [2020/a preparing job:01 flows:1] submitted to localhost:background[16340]
2023-08-11T16:46:43+12:00 INFO - [2020/a preparing job:01 flows:1] => submitted
2023-08-11T16:46:43+12:00 INFO - [2020/a submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
2023-08-11T16:46:44+12:00 INFO - [2020/a submitted job:01 flows:1] => running
2023-08-11T16:46:44+12:00 INFO - [2020/a running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-08-11T16:46:46+12:00 INFO - [2020/a running job:01 flows:1] => succeeded
# Children of 2020/b - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/b
# Generate Children/Parents of 2020/b
# Children of 2020/b - Called by increment_graph_window
# Children of 2020/h - Called by __init__
# Children of 2020/c - Called by __init__
# parents of 2020/b - Called by increment_graph_window
2023-08-11T16:46:46+12:00 INFO - [2020/b waiting(runahead) job:00 flows:1] => waiting
2023-08-11T16:46:46+12:00 INFO - [2020/b waiting job:00 flows:1] => waiting(queued)
2023-08-11T16:46:47+12:00 INFO - [2020/b waiting(queued) job:00 flows:1] => waiting
2023-08-11T16:46:47+12:00 INFO - [2020/b waiting job:01 flows:1] => preparing
2023-08-11T16:46:48+12:00 INFO - [2020/b preparing job:01 flows:1] submitted to localhost:background[16365]
2023-08-11T16:46:48+12:00 INFO - [2020/b preparing job:01 flows:1] => submitted
2023-08-11T16:46:48+12:00 INFO - [2020/b submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
2023-08-11T16:46:49+12:00 INFO - [2020/b submitted job:01 flows:1] => running
2023-08-11T16:46:49+12:00 INFO - [2020/b running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-08-11T16:46:51+12:00 INFO - [2020/b running job:01 flows:1] => succeeded
# Children of 2020/h - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/h
# Generate Children/Parents of 2020/h
# Children of 2020/h - Called by increment_graph_window
# Children of 2020/i - Called by __init__
# parents of 2020/h - Called by increment_graph_window
2023-08-11T16:46:51+12:00 INFO - [2020/h waiting(runahead) job:00 flows:1] => waiting
2023-08-11T16:46:51+12:00 INFO - [2020/h waiting job:00 flows:1] => waiting(queued)
# Children of 2020/c - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/c
# Generate Children/Parents of 2020/c
# Children of 2020/c - Called by increment_graph_window
# Children of 2020/d - Called by __init__
# parents of 2020/c - Called by increment_graph_window
2023-08-11T16:46:51+12:00 INFO - [2020/c waiting(runahead) job:00 flows:1] => waiting

There are duplicate generate_ghost_task calls, but this isn't a problem because the method returns early if the task already exists in the store.

As you've spotted, I've been relying on generate_ghost_task to return early (it has the same block at the top), but to be honest, the fact that a walk stub exists would be enough to avoid that initial call too (if it's that expensive).

Is the graph window that the data store is currently configured to track available in the schema? That would be handy for the UI.

I think we can add that as a workflow field (if it's not already).
Also, instead of changing the window size for a request (as that would change all views), I think we need those min/max depth args, so we can just filter the request/subscription by depth.

I've also thought of an optimization for the depth finder, so will put that up in the morning.
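
Until such arguments exist, a client could filter on the new field itself; a minimal sketch, assuming a decoded response shaped like the JSON above:

def within_depth(task_proxies, max_depth):
    # Keep only task proxies within max_depth graph edges of the n=0 window.
    return [
        tp for tp in task_proxies
        if tp.get('graphDepth', 0) <= max_depth
    ]

# e.g. restrict a response fetched at n=4 down to an n=2 view:
# visible = within_depth(response['data']['workflows'][0]['taskProxies'], 2)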

@dwsutherland (Member, Author)

I changed it to reuse the TaskProxy call to generate_graph_children:

2023-08-11T17:00:37+12:00 INFO - New flow: 1 (original flow from 2020) 2023-08-11 17:00:37
# Children of 2020/a - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/a
# Generate Children/Parents of 2020/a
# Children of 2020/b - Called by __init__
# parents of 2020/a - Called by increment_graph_window
2023-08-11T17:00:37+12:00 INFO - [2020/a waiting(runahead) job:00 flows:1] => waiting
2023-08-11T17:00:37+12:00 INFO - [2020/a waiting job:00 flows:1] => waiting(queued)
2023-08-11T17:00:37+12:00 INFO - [2020/a waiting(queued) job:00 flows:1] => waiting
2023-08-11T17:00:37+12:00 INFO - [2020/a waiting job:01 flows:1] => preparing
2023-08-11T17:00:38+12:00 INFO - [2020/a preparing job:01 flows:1] submitted to localhost:background[16606]
2023-08-11T17:00:38+12:00 INFO - [2020/a preparing job:01 flows:1] => submitted
2023-08-11T17:00:38+12:00 INFO - [2020/a submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
2023-08-11T17:00:39+12:00 INFO - [2020/a submitted job:01 flows:1] => running
2023-08-11T17:00:39+12:00 INFO - [2020/a running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-08-11T17:00:41+12:00 INFO - [2020/a running job:01 flows:1] => succeeded
# Children of 2020/b - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/b
# Generate Children/Parents of 2020/b
# Children of 2020/c - Called by __init__
# Children of 2020/h - Called by __init__
# parents of 2020/b - Called by increment_graph_window
2023-08-11T17:00:41+12:00 INFO - [2020/b waiting(runahead) job:00 flows:1] => waiting
2023-08-11T17:00:41+12:00 INFO - [2020/b waiting job:00 flows:1] => waiting(queued)
2023-08-11T17:00:42+12:00 INFO - [2020/b waiting(queued) job:00 flows:1] => waiting
2023-08-11T17:00:42+12:00 INFO - [2020/b waiting job:01 flows:1] => preparing
2023-08-11T17:00:43+12:00 INFO - [2020/b preparing job:01 flows:1] submitted to localhost:background[16631]
2023-08-11T17:00:43+12:00 INFO - [2020/b preparing job:01 flows:1] => submitted
2023-08-11T17:00:43+12:00 INFO - [2020/b submitted job:01 flows:1] health: submission timeout=None, polling intervals=PT15M,...
2023-08-11T17:00:44+12:00 INFO - [2020/b submitted job:01 flows:1] => running
2023-08-11T17:00:44+12:00 INFO - [2020/b running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2023-08-11T17:00:46+12:00 INFO - [2020/b running job:01 flows:1] => succeeded
# Children of 2020/c - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/c
# Generate Children/Parents of 2020/c
# Children of 2020/d - Called by __init__
# parents of 2020/c - Called by increment_graph_window
2023-08-11T17:00:46+12:00 INFO - [2020/c waiting(runahead) job:00 flows:1] => waiting
2023-08-11T17:00:46+12:00 INFO - [2020/c waiting job:00 flows:1] => waiting(queued)
# Children of 2020/h - Called by __init__
# Active ID: ~sutherlander/zig/run1//2020/h
# Generate Children/Parents of 2020/h
# Children of 2020/i - Called by __init__
# parents of 2020/h - Called by increment_graph_window
2023-08-11T17:00:46+12:00 INFO - [2020/h waiting(runahead) job:00 flows:1] => waiting

@dwsutherland dwsutherland force-pushed the data-store-avoid-graph-rewalk branch 3 times, most recently from 471ebc4 to 529c3ca Compare August 11, 2023 05:48
@dwsutherland (Member, Author) commented Aug 11, 2023

There were still some duplicate generate_graph_children calls (as shown above) from the creation of TaskProxy objects at n > 0 (which are used for data store info). I've excluded generating children for these (re-using the data_mode option).

dwsutherland pushed a commit to dwsutherland/cylc-flow that referenced this pull request Oct 31, 2023
* Sometimes tasks may appear in the n-window, then disappear only to
  reappear at a later time.
* This is a difficult case for the `increment_graph_window` to handle.
* See cylc#5660
@dwsutherland (Member, Author) commented Oct 31, 2023

The task "c" should be n=2 for the first 6 cycles, but I actually got:

This one turned out to be an easy fix:

--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -1788,6 +1788,7 @@ class DataStoreMgr:
                 if k in self.all_task_pool
             ))
         )
+        self.update_window_depths = True
 
     def window_depth_finder(self):
         """Recalculate window depths, creating depth deltas."""

(needed to flag depth to be recalculated after window size change)
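
For reference, recomputing depths after a resize is essentially a breadth-first search outward from the n=0 (active task pool) nodes; a minimal sketch of the idea (illustrative only, not the actual window_depth_finder code):

from collections import deque

def recompute_depths(active_ids, neighbours_of):
    # Breadth-first edge depth of every window node from the active (n=0)
    # set; neighbours_of(node) yields graph children and parents alike.
    depths = {node: 0 for node in active_ids}
    frontier = deque(active_ids)
    while frontier:
        node = frontier.popleft()
        for nbr in neighbours_of(node):
            if nbr not in depths:
                depths[nbr] = depths[node] + 1
                frontier.append(nbr)
    return depths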


dwsutherland and others added 14 commits November 1, 2023 10:14
* Sometimes tasks may appear in the n-window, then disappear only to
  reappear at a later time.
* This is a difficult case for the `increment_graph_window` to handle.
* See cylc#5660
Co-authored-by: Oliver Sanders <oliver.sanders@metoffice.gov.uk>
* Test the added, updated and pruned deltas created
  by increment_graph_window as the window changes.
* Test window resizing with n=1,2,3
@MetRonnie (Member) left a review:

Worked well in my testing

@oliver-sanders (Member)

Been back through the test workflow above from n=1 to n=4, no problems!

Re-ran the one which showed the longest-path issues before, perfect (and kinda pretty):

[screenshot]

Tried a bunch of workflows from my cylc-src dir, the algorithm is now producing the correct results for all of them 🎉!

@oliver-sanders (Member)

Profiling results for the following workflow in simulation mode:

[task parameters]
    x = 1..100
    y = 1..100
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    final cycle point = 1
    [[graph]]
        P1 = d[-P1] => a => b<x> => c<y> => d
  • The number of increment_graph_window calls is down from 50501 to 201!
  • The total time taken by increment_graph_window is down from 7.6s to 4.3s (10% of total runtime down to 6%).
  • The number of generate_graph_(parents|children) calls is down from 10301 to 201.
  • Total Cylc runtime is down by 5%.
  • The number of calls to Prerequisite.api_dump (now the biggest offender) is unchanged.

Good positive change, increases functionality whilst improving performance 💯!

@oliver-sanders (Member) left a review:

Thanks for all your hard work @dwsutherland, this looked like a tricky one!

@oliver-sanders oliver-sanders merged commit fd1b820 into cylc:8.2.x Nov 2, 2023
23 of 25 checks passed
MetRonnie added a commit to MetRonnie/cylc-flow that referenced this pull request Nov 2, 2023
oliver-sanders added a commit that referenced this pull request Nov 2, 2023
@MetRonnie MetRonnie added the schema change Change to the Cylc GraphQL schema label Nov 12, 2024
Labels: efficiency (For notable efficiency improvements) · schema change (Change to the Cylc GraphQL schema)

Successfully merging this pull request may close these issues:

  • data_store_mgr: n-window depth field

3 participants