efficiency: increment_graph_window #5435

Closed · oliver-sanders opened this issue Mar 28, 2023 · 8 comments · Fixed by #5475

Labels: bug ("Something is wrong :("), efficiency ("For notable efficiency improvements")
Milestone: cylc-8.2.0

Comments

@oliver-sanders (Member)

Re-issue of: #5315

The scaling of increment_graph_window has improved, but it is still severe enough to prevent some Cylc 7 workflows from being able to run with Cylc 8.

See Element for discussion:

https://matrix.to/#/!gMuCeOJBhYqgWDGwnR:matrix.org/$1679999387176196Xelev:matrix.org?via=matrix.org

Description

For the workflow a => b<x> there are x + 1 tasks and x dependencies.

However, for x=700 the increment_graph_window function is called in excess of 3'000'000 times.

Reproducible Example

[task parameters]
   m = 0..7000
[scheduling]
   [[queues]]
      [[[default]]]
         limit = 4
   [[graph]]
      R1 = "a => b<m>"
[runtime]
   [[a, b<m>]]
$ cylc play <id> --profile --no-detach

Run this for a few minutes; when you get bored, Ctrl+C it, then:

$ pip install snakeviz
$ snakeviz ~/cylc-run/<id>/log/scheduler/profile.prof
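
If you'd rather stay in the terminal, the standard-library pstats module can read the same file (a minimal sketch; swap the real workflow ID in for <id>):

import os.path
import pstats

# the same profile file that snakeviz reads above
prof = os.path.expanduser('~/cylc-run/<id>/log/scheduler/profile.prof')
stats = pstats.Stats(prof)
# most expensive calls, filtered to increment_graph_window
stats.sort_stats('cumulative').print_stats('increment_graph_window')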

Expected Behaviour

It should take seconds, not minutes, to spawn the 700 tasks in this workflow.

@oliver-sanders added the bug and efficiency labels on Mar 28, 2023
@oliver-sanders added this to the cylc-8.2.0 milestone on Mar 28, 2023
@oliver-sanders (Member, Author) commented Mar 28, 2023

I left this example running for ~20 mins, then ctrl+c'ed it. The increment_graph_window function was called 30'276'650 times in this period.

[Screenshot from 2023-03-28 15-41-03]

Naively, I would guess that this should be ok?

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index f47cfcbbb..8d13cf598 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -741,6 +741,15 @@ class DataStoreMgr:
                 source_tokens.id
             )
 
+        if self.n_window_nodes.get(active_id):
+            # we've already walked this node
+            if edge_distance == 0:
+                if active_id in self.prune_trigger_nodes:
+                    self.prune_flagged_nodes.update(
+                        self.prune_trigger_nodes[active_id])
+                    del self.prune_trigger_nodes[active_id]
+            return
+
         # Setup and check if active node is another's boundary node
         # to flag its paths for pruning.
         if edge_distance == 0:

With this change the tasks spawned much faster, with 14002 calls.

The GUI seems ok, but there is a housekeeping problem with some nodes not getting removed.

In this screenshot, the graph isolates are erroneous:

[Screenshot from 2023-03-28 15-46-01]

@oliver-sanders (Member, Author)

Here's another failed attempt, which merges add_pool_node into increment_graph_window and uses self.all_task_pool to track which nodes have been visited:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index f47cfcbbb..da462e193 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -496,7 +496,7 @@ class DataStoreMgr:
         self.delta_queues = {self.workflow_id: {}}
         self.publish_deltas = []
         # internal n-window
-        self.all_task_pool = set()
+        self.all_task_pool = {}
         self.n_window_nodes = {}
         self.n_window_edges = {}
         self.n_window_boundary_nodes = {}
@@ -735,6 +735,14 @@ class DataStoreMgr:
             source_tokens = self.id_.duplicate(source_tokens)
             active_id = source_tokens.id
 
+        visited = False
+        if edge_distance==0 and active_id not in self.all_task_pool:
+            self.all_task_pool[active_id] = False
+        elif self.all_task_pool[active_id]:
+            visited = True
+
+        self.all_task_pool[active_id] = True
+
         # flag manual triggers for pruning on deletion.
         if is_manual_submit:
             self.prune_trigger_nodes.setdefault(active_id, set()).add(
@@ -761,6 +769,9 @@ class DataStoreMgr:
                 ].setdefault(edge_distance, set()).add(source_tokens.id)
             return
 
+        if visited:
+            return
+
         # Generate task proxy node
         is_orphan, graph_children = self.generate_ghost_task(
             source_tokens,
@@ -904,7 +915,7 @@ class DataStoreMgr:
             task=name,
         ).id
         if tp_id in self.all_task_pool:
-            self.all_task_pool.remove(tp_id)
+            self.all_task_pool.pop(tp_id)
             self.updates_pending = True
         # flagged isolates/end-of-branch nodes for pruning on removal
         if (
@@ -921,14 +932,6 @@ class DataStoreMgr:
             self.prune_flagged_nodes.add(tp_id)
             self.updates_pending = True
 
-    def add_pool_node(self, name, point):
-        """Add external ID reference for internal task pool node."""
-        tp_id = self.id_.duplicate(
-            cycle=str(point),
-            task=name,
-        ).id
-        self.all_task_pool.add(tp_id)
-
     def generate_ghost_task(
         self,
         tokens: Tokens,
diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index 8bfa1d923..0ad9c90fd 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -239,8 +239,6 @@ class TaskPool:
 
     def create_data_store_elements(self, itask):
         """Create the node window elements about given task proxy."""
-        # Register pool node reference
-        self.data_store_mgr.add_pool_node(itask.tdef.name, itask.point)
         # Create new data-store n-distance graph window about this task
         self.data_store_mgr.increment_graph_window(
             itask.tokens,

It almost works, but the n=-1 nodes don't show and there's a housekeeping problem with items not getting removed from the nodes/edges.

@dwsutherland I'm just hacking here, I'm not quite sure how this works. We need to stop this code from re-walking the same bit of graph over and over somehow. Any pointers?

@dwsutherland (Member) commented Apr 20, 2023

Makes sense that it would get that large... As the window walks back n+1 edges, 7000 tasks in that workflow means every b<m> visits every other b<m>, so 7000^2 = 49,000,000...

I'll have a look at your solution now; however (and it might be similar), we can reduce computation by the following (rough sketch after the list):

  • Not revisiting nodes:
    • we can do this by storing graph relations in an easily referenced way (if we don't already)
    • adopting graph sections already visited, and skipping the walk to the head of an adopted section (if required)
  • Perhaps restricting the walk to direct relatives (up only and down only) of active tasks, which would also solve this issue (but may mean a less branchy/leafy look for a large n window)... i.e. stopping b1 from walking to a and then on to b2 in the walk. This should be a substantial efficiency gain.
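
For illustration only (window_walk, graph_children and graph_parents are made-up names, not the data_store_mgr API), a minimal sketch of those two ideas combined: track visited nodes, and walk children-only and parents-only out from the active task so cousins are never pulled in:

from collections import deque

def window_walk(active_id, graph_children, graph_parents, n_edge_max):
    """Collect the n-window around active_id without revisiting nodes.

    graph_children / graph_parents map a node ID to the set of IDs of its
    children / parents.  The walk goes strictly down, then strictly up,
    so cousins of the active node are never visited.
    """
    window = {active_id}
    for neighbours in (graph_children, graph_parents):
        visited = {active_id}
        queue = deque([(active_id, 0)])
        while queue:
            node, dist = queue.popleft()
            if dist >= n_edge_max:
                # reached the window boundary in this direction
                continue
            for nxt in neighbours.get(node, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    window.add(nxt)
                    queue.append((nxt, dist + 1))
    return window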

Anyway, I'll have a look at what you've done first.

@dwsutherland (Member) commented Apr 20, 2023

> It almost works, but the n=-1 nodes don't show and there's a housekeeping problem with items not getting removed from the nodes/edges.
>
> @dwsutherland I'm just hacking here, I'm not quite sure how this works. We need to stop this code from re-walking the same bit of graph over and over somehow. Any pointers?

The main reason why we do so much walking is to have a set of nodes associated with each active node.. consider

a => b => c => d

if b becomes active we want {a, b, c} to be associated (and the edges) or a will get pruned.. So we need to keep the associations, and I think your changes probably return before making these.

Also, we walk one further in order to register d as a "boundary" node to a (so if d is active we check if a needs pruning)..

What we should do (in order not to re-walk) is register all the associations of skipped sections as if we had walked them.. And if we need to walk further, pick up from where we left off.. This means we need to store the parents and children (IDs) of every node/edge that has been walked, until they are pruned.
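
Loosely, that bookkeeping could look something like the sketch below (WalkCache and its methods are made-up names to illustrate the idea, not what the PR will actually add):

class WalkCache:
    """Remember the graph relations of every node already walked.

    Entries live until the node is pruned, so a later walk that reaches an
    already-walked node can adopt its stored relations (registering the same
    associations as a real walk would) and pick up from where the earlier
    walk left off, instead of re-walking that section.
    """

    def __init__(self):
        # node ID -> {'children': set of IDs, 'parents': set of IDs}
        self.relations = {}

    def record(self, node_id, children, parents):
        """Store the relations of a node as it is walked."""
        self.relations[node_id] = {
            'children': set(children),
            'parents': set(parents),
        }

    def adopt(self, node_id):
        """Return stored relations, or None if a real walk is still needed."""
        return self.relations.get(node_id)

    def prune(self, node_id):
        """Drop the entry when the node is pruned from the store."""
        self.relations.pop(node_id, None)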

I'll put up a PR.. With both the above changes.. This should nail it.

@dwsutherland (Member) commented Apr 20, 2023

2.82sec .. this is what we want (7001 tasks):

14002/7001 0.174 0.000 2.823 0.000 /home/sutherlander/cylc/cylc-flow/cylc/flow/data_store_mgr.py:693(increment_graph_window)

This is only changing the walk to move up or down the edges (not both), i.e. don't include cousins...
We still want to avoid re-walking, but this is the first commit...

Will create the PR and see if I broke anything else.

@oliver-sanders (Member, Author)

14002/7001 🎉

@oliver-sanders (Member, Author)

Happy that #5475 closes this one; I've raised #5485 to document the remembering-nodes suggestion above.

@hjoliver (Member)

Closed by #5475
