efficiency: increment_graph_window #5435

Closed · oliver-sanders opened this issue Mar 28, 2023 · 8 comments · Fixed by #5475

Labels: bug ("Something is wrong :("), efficiency ("For notable efficiency improvements")
Milestone: cylc-8.2.0

Comments

@oliver-sanders (Member)

Re-issue of: #5315

The scaling of increment_graph_window has improved, but it is still severe enough to prevent some Cylc 7 workflows from being able to run with Cylc 8.

See Element for discussion:

https://matrix.to/#/!gMuCeOJBhYqgWDGwnR:matrix.org/$1679999387176196Xelev:matrix.org?via=matrix.org

Description

For the workflow a => b<x> there are x + 1 tasks and x dependencies.

However, for x=700 the increment_graph_window function is called in excess of 3'000'000 times.

Reproducible Example

[task parameters]
   m = 0..7000
[scheduling]
   [[queues]]
      [[[default]]]
         limit = 4
   [[graph]]
      R1 = "a => b<m>"
[runtime]
   [[a, b<m>]]
$ cylc play <id> --profile --no-detach

Run this for a few minutes; when you get bored, Ctrl+C it, then:

$ pip install snakeviz
$ snakeviz ~/cylc-run/<id>/log/scheduler/profile.prof
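
If you'd rather stay in the terminal, the standard-library pstats module can read the same file (a minimal sketch; swap the real workflow ID in for <id>):

import os.path
import pstats

# the same profile file that snakeviz reads above
prof = os.path.expanduser('~/cylc-run/<id>/log/scheduler/profile.prof')
stats = pstats.Stats(prof)
# most expensive calls, filtered to increment_graph_window
stats.sort_stats('cumulative').print_stats('increment_graph_window')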

Expected Behaviour

It should take seconds, not minutes, to spawn the 700 tasks in this workflow.

@oliver-sanders added the bug and efficiency labels on Mar 28, 2023
@oliver-sanders added this to the cylc-8.2.0 milestone on Mar 28, 2023
@oliver-sanders (Member, Author) commented Mar 28, 2023

I left this example running for ~20 mins, then ctrl+c'ed it. The increment_graph_window function was called 30'276'650 times in this period.

[Screenshot from 2023-03-28 15-41-03]

Naively, I would guess that this should be ok?

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index f47cfcbbb..8d13cf598 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -741,6 +741,15 @@ class DataStoreMgr:
                 source_tokens.id
             )
 
+        if self.n_window_nodes.get(active_id):
+            # we've already walked this node
+            if edge_distance == 0:
+                if active_id in self.prune_trigger_nodes:
+                    self.prune_flagged_nodes.update(
+                        self.prune_trigger_nodes[active_id])
+                    del self.prune_trigger_nodes[active_id]
+            return
+
         # Setup and check if active node is another's boundary node
         # to flag its paths for pruning.
         if edge_distance == 0:

With this change the tasks spawned much faster, with 14002 calls.

The GUI seems ok, but there is a housekeeping problem with some nodes not getting removed.

In this screenshot, the graph isolates are erroneous:

[Screenshot from 2023-03-28 15-46-01]

@oliver-sanders (Member, Author)

Here's another failed attempt, which merges add_pool_node into increment_graph_window and uses self.all_task_pool to track which nodes have been visited:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index f47cfcbbb..da462e193 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -496,7 +496,7 @@ class DataStoreMgr:
         self.delta_queues = {self.workflow_id: {}}
         self.publish_deltas = []
         # internal n-window
-        self.all_task_pool = set()
+        self.all_task_pool = {}
         self.n_window_nodes = {}
         self.n_window_edges = {}
         self.n_window_boundary_nodes = {}
@@ -735,6 +735,14 @@ class DataStoreMgr:
             source_tokens = self.id_.duplicate(source_tokens)
             active_id = source_tokens.id
 
+        visited = False
+        if edge_distance==0 and active_id not in self.all_task_pool:
+            self.all_task_pool[active_id] = False
+        elif self.all_task_pool[active_id]:
+            visited = True
+
+        self.all_task_pool[active_id] = True
+
         # flag manual triggers for pruning on deletion.
         if is_manual_submit:
             self.prune_trigger_nodes.setdefault(active_id, set()).add(
@@ -761,6 +769,9 @@ class DataStoreMgr:
                 ].setdefault(edge_distance, set()).add(source_tokens.id)
             return
 
+        if visited:
+            return
+
         # Generate task proxy node
         is_orphan, graph_children = self.generate_ghost_task(
             source_tokens,
@@ -904,7 +915,7 @@ class DataStoreMgr:
             task=name,
         ).id
         if tp_id in self.all_task_pool:
-            self.all_task_pool.remove(tp_id)
+            self.all_task_pool.pop(tp_id)
             self.updates_pending = True
         # flagged isolates/end-of-branch nodes for pruning on removal
         if (
@@ -921,14 +932,6 @@ class DataStoreMgr:
             self.prune_flagged_nodes.add(tp_id)
             self.updates_pending = True
 
-    def add_pool_node(self, name, point):
-        """Add external ID reference for internal task pool node."""
-        tp_id = self.id_.duplicate(
-            cycle=str(point),
-            task=name,
-        ).id
-        self.all_task_pool.add(tp_id)
-
     def generate_ghost_task(
         self,
         tokens: Tokens,
diff --git a/cylc/flow/task_pool.py b/cylc/flow/task_pool.py
index 8bfa1d923..0ad9c90fd 100644
--- a/cylc/flow/task_pool.py
+++ b/cylc/flow/task_pool.py
@@ -239,8 +239,6 @@ class TaskPool:
 
     def create_data_store_elements(self, itask):
         """Create the node window elements about given task proxy."""
-        # Register pool node reference
-        self.data_store_mgr.add_pool_node(itask.tdef.name, itask.point)
         # Create new data-store n-distance graph window about this task
         self.data_store_mgr.increment_graph_window(
             itask.tokens,

It almost works, but the n=-1 nodes don't show and there's a housekeeping problem with items not getting removed from the nodes/edges.

@dwsutherland I'm just hacking here, I'm not quite sure how this works. We need to stop this code from re-walking the same bit of graph over and over somehow. Any pointers?

@dwsutherland (Member) commented Apr 20, 2023

Makes sense that it would get that large... As the window walks back n+1 edges, 7000 tasks in that workflow means every b<m> visits every other b<m>, so 7000^2 = 49,000,000...

I'll have a look at your solution now; however (and it might be similar), we can reduce computation by the following (rough sketch after the list):

  • Not revisiting nodes:
    • we can do this by storing graph relations in an easily referenced way (if we don't already)
    • adopting graph sections already visited, and skipping the walk to the head of an adopted section (if required)
  • Perhaps restricting the walk to direct relatives (up only and down only) of active tasks, which would also solve this issue (but may mean a less branchy/leafy look for a large n window)... i.e. stopping b1 from walking to a and then on to b2 in the walk. This should be a substantial efficiency gain.
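
For illustration only (window_walk, graph_children and graph_parents are made-up names, not the data_store_mgr API), a minimal sketch of those two ideas combined: track visited nodes, and walk children-only and parents-only out from the active task so cousins are never pulled in:

from collections import deque

def window_walk(active_id, graph_children, graph_parents, n_edge_max):
    """Collect the n-window around active_id without revisiting nodes.

    graph_children / graph_parents map a node ID to the set of IDs of its
    children / parents.  The walk goes strictly down, then strictly up,
    so cousins of the active node are never visited.
    """
    window = {active_id}
    for neighbours in (graph_children, graph_parents):
        visited = {active_id}
        queue = deque([(active_id, 0)])
        while queue:
            node, dist = queue.popleft()
            if dist >= n_edge_max:
                # reached the window boundary in this direction
                continue
            for nxt in neighbours.get(node, ()):
                if nxt not in visited:
                    visited.add(nxt)
                    window.add(nxt)
                    queue.append((nxt, dist + 1))
    return window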

Anyway, I'll have a look at what you've done first.

@dwsutherland (Member) commented Apr 20, 2023

> It almost works, but the n=-1 nodes don't show and there's a housekeeping problem with items not getting removed from the nodes/edges.
>
> @dwsutherland I'm just hacking here, I'm not quite sure how this works. We need to stop this code from re-walking the same bit of graph over and over somehow. Any pointers?

The main reason why we do so much walking is to have a set of nodes associated with each active node.. consider

a => b => c => d

if b becomes active we want {a, b, c} to be associated (and the edges) or a will get pruned.. So we need to keep the associations, and I think your changes probably return before making these.

Also, we walk one further in order to register d as a "boundary" node to a (so if d is active we check if a needs pruning)..

What we should do (in order not to re-walk) is register all the associations of skipped sections as if we had walked them.. And if we need to walk further, pick up from where we left off.. This means we need to store the parents and children (IDs) of every node/edge that has been walked, until they are pruned.
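
Loosely, that bookkeeping could look something like the sketch below (WalkCache and its methods are made-up names to illustrate the idea, not what the PR will actually add):

class WalkCache:
    """Remember the graph relations of every node already walked.

    Entries live until the node is pruned, so a later walk that reaches an
    already-walked node can adopt its stored relations (registering the same
    associations as a real walk would) and pick up from where the earlier
    walk left off, instead of re-walking that section.
    """

    def __init__(self):
        # node ID -> {'children': set of IDs, 'parents': set of IDs}
        self.relations = {}

    def record(self, node_id, children, parents):
        """Store the relations of a node as it is walked."""
        self.relations[node_id] = {
            'children': set(children),
            'parents': set(parents),
        }

    def adopt(self, node_id):
        """Return stored relations, or None if a real walk is still needed."""
        return self.relations.get(node_id)

    def prune(self, node_id):
        """Drop the entry when the node is pruned from the store."""
        self.relations.pop(node_id, None)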

I'll put up a PR.. With both the above changes.. This should nail it.

@dwsutherland (Member) commented Apr 20, 2023

2.82sec .. this is what we want (7001 tasks):

14002/7001 0.174 0.000 2.823 0.000 /home/sutherlander/cylc/cylc-flow/cylc/flow/data_store_mgr.py:693(increment_graph_window)

This is only changing the walk to move up or down the edges (not both), i.e. don't include cousins...
We still want to avoid re-walking, but this is the first commit...

Will create the PR and see if I broke anything else.

@oliver-sanders (Member, Author)

14002/7001 🎉

@oliver-sanders (Member, Author)

Happy that #5475 closes this one; I've raised #5485 to document the remembering-nodes suggestion above.

@hjoliver (Member)

Closed by #5475
