
n-distance data-store graph window generation/pruning #3811

Merged (8 commits) Nov 11, 2020

Conversation

@dwsutherland (Member) commented Sep 13, 2020

These changes close #3747

Closes cylc/cylc-ui#496

N-Distance node/edge generation:
The cycle point node/edge (i.e. task/family proxy) generation is triggered
individually on transition from the staging to the active task pool. Each
active task is generated along with any children and parents, recursively out
to a specified maximum graph distance (n_edge_distance), which can be
externally altered (via the API). Collectively this forms the
N-Distance-Window on the workflow graph.

  • n-distance nodes/edges generation.
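
For illustration only, a minimal Python sketch of that outward walk (the
get_children and get_parents callables are hypothetical stand-ins for the real
graph lookups, not cylc-flow API):

def generate_window(active_task, n_edge_distance, get_children, get_parents):
    """Collect nodes out to n_edge_distance edges from an active (n=0) task."""
    window = {active_task}
    frontier = {active_task}
    for _ in range(n_edge_distance):
        next_frontier = set()
        for node in frontier:
            # Walk both directions: children and parents.
            next_frontier.update(get_children(node))
            next_frontier.update(get_parents(node))
        frontier = next_frontier - window
        window.update(frontier)
    return window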

N-Distance nodes/edges pruning:
Pruning of data-store elements is done using both the collection/set of nodes
generated through the associated graph paths of the active nodes and the
tracking of the boundary nodes (n_edge_distance+1) of those active nodes.
Once a boundary node becomes active, the original node is flagged for pruning.
Set operations are used to take the difference between the nodes of active
paths (paths whose node is in the active task pool) and the nodes of flagged
paths (whose boundary node(s) have become active).

example one

a* => b => c => d

(* marks the active task). With an n=1 window and a active we have
the following paths:

{a: {a, b}}

and boundary info at n+1:

{c: {a}}

When c becomes active we have:

{a: {a, b}, b: {a, b, c}, c: {b, c, d}}
{c: {a}, d: {b, c, d}} 

(yes, d is the prune trigger for itself and c, because it is the furthest distance before n+1 ... this works for isolates too)
and since c became active we flag the now out-of-window a for pruning.
The pruning mechanism will then take all the active path nodes:
{b, c, d}
and do a diff with all the inactive path nodes:
{a, b} + {a, b, c}
so we are cleared to prune {a}, resulting in:

{b: {a, b, c}, c: {b, c, d}}
{d: {b, c, d}} 
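
A rough Python rendering of the set arithmetic in example one (variable names
are illustrative, not the actual data-store manager attributes):

# Path membership per node once c has become active (n=1 window).
paths = {
    'a': {'a', 'b'},
    'b': {'a', 'b', 'c'},
    'c': {'b', 'c', 'd'},
}
active = {'c'}                    # nodes still in the active task pool
inactive = set(paths) - active    # {'a', 'b'}

# Nodes still needed by some active path ...
active_path_nodes = set().union(*(paths[n] for n in active))      # {'b', 'c', 'd'}
# ... versus nodes reachable only through inactive paths.
inactive_path_nodes = set().union(*(paths[n] for n in inactive))  # {'a', 'b', 'c'}

print(inactive_path_nodes - active_path_nodes)  # {'a'} -- cleared to prune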

example two

a => b* => c => d
e => f => g
b => g*

(* as active).. With n=1 window
When b & g turn active, e is pruned but not a, as a is in the path of active b (even though g flagged it); then when c becomes active, a is pruned.

example three

a => b => c* => d
a:failed => f

Pruning works here because when c becomes active, the paths of a include f, so f is pruned (as long as it's not in the active task pool).

(other convoluted graphs with conditionals and absolute triggers are also handled by this scheme)

So when we generate the window about an active task, we also find the conditions for its pruning and associate them with it.
This ensures a change of window size won't orphan the task from being pruned, which would trip up alternative pruning schemes; i.e. reducing the window size using a scheme based on flagging the ancestors/parents of active tasks would miss nodes formerly in the window.

  • n-distance nodes/edges pruning.

Animated examples:

Given the following graph with simple chain, and isolate branch:

[scheduling]
    initial cycle point = 20190101T00
    [[special tasks]]
        sequential = bin
    [[dependencies]]
        [[[P1M]]]
            graph = """
bin => foo => bar
baa => qux
"""
[runtime]
    [[root]]
        script = sleep 5

    [[baa, bin]]

    [[foo, bar, qux]]
        script = sleep 3

At n=0 we have:
n0_window_tui

And n=1:
n1_window_tui

But we can also alter this with a mutation:
n_window_wui

Need to:

  • work this option into TUI
  • n=1 is default, but perhaps we can set n=0 dynamically when the active pool gets beyond some configurable limit.

(future PRs?)

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation required (alpha release candidate; visual but not (yet) interactive change).
  • No dependency changes.

@dwsutherland added the labels "efficiency" (for notable efficiency improvements) and "POC" (Proof of Concept) Sep 13, 2020
@dwsutherland dwsutherland added this to the cylc-8.0a3 milestone Sep 13, 2020
@dwsutherland dwsutherland self-assigned this Sep 13, 2020
@oliver-sanders (Member)

Cylc Tui is looking much better 👍

I noticed that waiting tasks that are ready to run are considered to be active, which confused me slightly at first, but kinda makes sense as they are quite transitory.

@dwsutherland (Member, Author) commented Sep 14, 2020

I noticed that waiting tasks that are ready to run are considered to be active

Yeah, I call the window creation on any task added to the runahead pool ... But this could be done elsewhere.

We can actually make n_max configurable with a mutation.. n=0 is definitely more minimalist, and would be better for gigantic workflows. We could also make it a subscription argument (choosing the largest n_max to decide the data-store size), and filter independently..

It will mean the front end can affect the data-store size on the back-end workflow server, but perhaps we can set limits according to authorization (i.e. read only n=0, ... , full control n=?)

@hjoliver (Member)

I noticed that waiting tasks that are ready to run are considered to be active

Yeah, I call the window creation on any task added to the runahead pool ... But this could be done elsewhere.

I think n=0 (if that's what you're talking about) should include all tasks in the main scheduler pool, but not the runahead pool.

That will include "active waiting" tasks, as I think I called them in the SoD proposal: tasks that have at least one, but not all, prerequisites satisfied (they are spawned by the first upstream output that they depend on). Note these aren't necessarily transitory @oliver-sanders. As also discussed in the proposal, we could decide these waiting tasks are not "active" (i.e. not in n=0) but I don't think it matters too much right now.

@hjoliver (Member)

We can actually make n_max configurable with a mutation.. n=0 is definitely more minimalist, and would be better for gigantic workflows. We could also make it a subscription argument (choosing the largest n_max to decide the data-store size), and filter independently..

It will mean the front end can affect the data-store size on the back-end workflow server, ...

Remember the scheduler only needs to store n=0, but we decided it can keep up to n=1 for the convenience of cylc tui and other direct-connecting clients.

So the front end should determine the n-size of the UIS datastore, but not the scheduler datastore.

We could optionally allow the scheduler to store arbitrary n windows, up to some max, if you want to see loads of tasks in cylc tui without going through the UIS, but I don't see that as a primary use case. We can come back and do that later if there's a need for it. (I suppose if the UIS and scheduler code will be the same, for windowing, there's no harm in playing with it, or developing it, in the scheduler, but keep in mind the n=1 scheduler end game).

@dwsutherland (Member, Author)

I think n=0 (if that's what you're talking about) should include all tasks in the main scheduler pool, but not the runahead pool.

Done.

@oliver-sanders (Member) commented Sep 16, 2020

That will include "active waiting" tasks

we could decide these waiting tasks are not "active" (i.e. not in n=0) but I don't think it matters too much right now.

It might be a little bit difficult explaining the concept of "active waiting" tasks to users; it's easier to explain the n=x window as being x edges out from an active task, where active is {preparing, submitted, running}. However, it's not that big an issue.

@hjoliver (Member)

That will include "active waiting" tasks

we could decide these waiting tasks are not "active" (i.e. not in n=0) but I don't think it matters too much right now.

It might be a little bit difficult explaining the concept of "active waiting" tasks to users; it's easier to explain the n=x window as being x edges out from an active task, where active is {preparing, submitted, running}. However, it's not that big an issue.

Yeah I agree. As argued in the SoD proposal, spawning of these as waiting task proxies doesn't matter - users will see a "waiting" task out front whether or not it is backed by a task proxy in the scheduler; but they probably shouldn't appear in n=0.

New Issue: #3822

@dwsutherland (Member, Author) commented Sep 17, 2020

Yeah I agree. As argued in the SoD proposal, spawning of these as waiting task proxies doesn't matter - users will see a "waiting" task out front whether or not it is backed by a task proxy in the scheduler; but they probably shouldn't appear in n=0.

We can place the self.data_store_mgr.increment_graph_window(itask) call anywhere you like (itask being an n=0 node). At the moment it's when a task gets released from the runahead pool.

@dwsutherland (Member, Author)

The hard challenge I have at the moment is working out an efficient pruning scheme.. (I have a couple of things in mind, bound to evolve as I attempt)

@hjoliver (Member)

The hard challenge I have at the moment is working out an efficient pruning scheme.. (I have a couple of things in mind, bound to evolve as I attempt)

We can have a brainstorming session about that, so it's not all on you.

@dwsutherland (Member, Author)

I've added a pruning mechanism that satisfies the need to remove potential future workflow paths that are not taken
(i.e.

foo => bar
foo:failed => qux

).
I'll write up more in the description soon.

There is still a little buggy behaviour, so I'll need to tidy things up somewhat.

@dwsutherland (Member, Author) commented Sep 27, 2020

The other issue is: I've had to revert to triggering the n=0 window (node/edge creation) on tasks being added to the runahead pool. The reason is that tasks get removed before their children get released from the runahead pool, so creating nodes on release from runahead creates a gap in the data-store (i.e. children get removed as a potential workflow path that was never followed, and have to get recreated again)...

So the only other options are:

  1. Not removing from the active task pool until a child is released from the runahead pool (which undermines SoD a bit).
  2. Not removing from the data-store until all children are outside n_max, or something like this.
  3. Having a node/edge field to record edge distance, and pruning according to that (which would be annoying to increment, and edge lookup is better for queries so it probably wouldn't have another use).

I'm leaning towards option 2, as it's just a set operation/lookup, and nodes will soon know about parents out to n_max anyway... Watch this space...

@hjoliver (Member) commented Oct 5, 2020

@dwsutherland - not sure if you got any further with this yet, perhaps we can have a chat tomorrow?

My initial thought was the first order attempt might be something like this:

  • if a task becomes active (i.e. if it enters n=0) follow its edges out to n=N and add those tasks to the pool (if not already in it)
  • if a task leaves n=0, recompute the window from scratch, by following edges out from remaining n=0 tasks, and "prune" by differencing the old and new windows.

This would be super-easy and it might be enough, at least for low-ish n windows, because following edges is pretty easy.

Then the challenge would be to beat this with an efficient incremental pruning scheme (i.e. if a task leaves n=0, can we decide what tasks to remove from the n=N window without recomputing from scratch again). But in the meantime, we would have a working system 😁 - what do you think? (Maybe there's a flaw in my logic, or you already have a better scheme lined up).
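
If it helps, a hedged sketch of that recompute-and-diff idea (walk_window is a
placeholder for the edge-following step; none of these names are real
cylc-flow code):

def recompute_window(n0_tasks, n_max, walk_window):
    """Union of the n-windows of all current n=0 tasks."""
    window = set()
    for task in n0_tasks:
        window |= walk_window(task, n_max)  # follow edges out to n=N
    return window

def prune_after_change(old_window, n0_tasks, n_max, walk_window):
    """On a task leaving n=0: rebuild the window and diff to find prunables."""
    new_window = recompute_window(n0_tasks, n_max, walk_window)
    return new_window, old_window - new_window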

@dwsutherland (Member, Author)

if a task leaves n=0, recompute the window from scratch, by following edges out from remaining n=0 tasks, and "prune" by differencing the old and new windows.

@hjoliver - Yeah, I already have something more efficient in place, but I've got another idea in the works which avoids needing to know when something has left the pool (i.e. it knows when a child at n=N becomes n=0), and does a diff to prune (which should also hit paths not taken)... Anyway, I'm just about to implement this.

@dwsutherland force-pushed the n-window-data-store branch 2 times, most recently from a7bab88 to 851015a, October 14, 2020 10:13
@dwsutherland force-pushed the n-window-data-store branch 5 times, most recently from 161cd41 to 193bfba, October 16, 2020 02:36
@dwsutherland force-pushed the n-window-data-store branch 3 times, most recently from 2e475b0 to ef40e92, November 10, 2020 00:23
@dwsutherland (Member, Author)

n=0 tasks are now treated as isolates (pruned as soon as finished):

[scheduling]
  [[graph]]
    R1 = foo => bar & baz => qux
[runtime]
  [[root]]
    script = true
  [[baz]]
    script = sleep 10

n_0_isolates

@hjoliver (Member) left a review comment

I have played with various workflow graphs at several n values, and this seems to do the right thing - a nice improvement 👍

Noticed scheduler reload wipes out jobs from the datastore (probably not just on this branch though; fix with a follow-up?)

Some follow-ups already planned (e.g. waiting-but-held-back tasks in n=1 not n=0) and some other tweaks might be needed in due course, but I think this is a good basis for n-windows in the scheduler.

I did a pretty once-over-lightly code review, but it looked fine at that level ... so let's keep the ball rolling here.

@oliver-sanders (Member) commented Nov 10, 2020

I've not done an especially deep code review; I think I have a vague idea of what's going on. I know @dwsutherland and @hjoliver have debated the implementation to death, so I'm happy to pass over that.

I've had a play with a few graphs and it seems to be giving me the expected results.

I've found a few things which were unexpected to me, though I may just be demonstrating my ignorance of the discussions over recent weeks.

  1. I wasn't expecting the scheduler n-window to be configurable on the fly like that.
    • Not a problem, however, I guess we will need to add some warning notices advising users not to play
      with this as it might have the potential to mess up the UI server?
  2. Pre-initial tasks now show in the n=1 window.
    • We should block these from the view as they are potentially quite confusing, but happy to see that in a follow up.
    • Again apologies if I've missed a discussion.
  3. I think that I'm getting the n+1 window (e.g. I ask for n=1 but get n=2).

This capture shows (2) [task t.0] and (3) [two waiting tasks in the n=1 window]:

P1 = z[-P1] => a => b => c => d ...

Kapture 2020-11-10 at 16 33 29

If these things are known about, or if I'm just really out of date please merge!

@hjoliver (Member) commented Nov 10, 2020

I wasn't expecting the scheduler n-window to be configurable on the fly like that.

  • Not a problem, however, I guess we will need to add some warning notices advising users not to play with this as it might have the potential to mess up the UI server?

Huh, I was expecting that, and that we'd just have to make it work with the UIS. If I'm viewing a flow in the UI with the default n=1 window, and want to see more tasks (n=4 say), surely I shouldn't have to restart my UIS to achieve that?

Pre-initial tasks now show in the n=1 window.

  • We should block these from the view as they are potentially quite confusing, but happy to see that in a follow up.

I also noticed this but forgot to comment on it (definitely not intended from my perspective, but you didn't miss a discussion on it). Fix as a follow-up is fine.

I think that I'm getting the n+1 window (e.g. I ask for n=1 but get n=2).

@dwsutherland can confirm but I think that's because you're holding task b, and a ready-to-run-but-held task still looks like n=0 even though it isn't active. The datastore is currently assuming that the scheduler task pool content is all n=0/active. Fixing this is another SoD follow-up (if we take those tasks out of the scheduler task pool) or a datastore follow-up (if we get the datastore manager to deliberately exclude those tasks).

@hjoliver (Member)

So, no merge-blockers there I think, but will wait for @dwsutherland to respond as well.

@dwsutherland (Member, Author) commented Nov 10, 2020

I wasn't expecting the scheduler n-window to be configurable on the fly like that.

  • Not a problem, however, I guess we will need to add some warning notices advising users not to play with this as it might have the potential to mess up the UI server?

Huh, I was expecting that, and that we'd just have to make it work with the UIS. If I'm viewing a flow in the UI with the default n=1 window, and want to see more tasks (n=4 say), surely I shouldn't have to restart my UIS to achieve that?

Actually, I don't think it will mess up the UIS, because the boundary pruning isn't affected by it (only new nodes register the new boundary, so in effect each node "knows" when it should be pruned). And even if they get out of sync, the UIS will reconcile (via the stamp). (Though in the future they may be independent.)

Pre-initial tasks now show in the n=1 window.

  • We should block these from the view as they are potentially quite confusing, but happy to see that in a follow up.

I also noticed this but forgot to comment on it (definitely not intended from my perspective, but you didn't miss a discussion on it). Fix as a follow-up is fine.

Huh? tasks before the initial cycle point? I haven't seen this (maybe I'm not running the right flows)

I think that I'm getting the n+1 window (e.g. I ask for n=1 but get n=2).

@dwsutherland can confirm but I think that's because you're holding task b, and a ready-to-run-but-held task still looks like n=0 even though it isn't active. The datastore is currently assuming that the scheduler task pool content is all n=0/active. Fixing this is another SoD follow-up (if we take those tasks out of the scheduler task pool) or a datastore follow-up (if we get the datastore manager to deliberately exclude those tasks).

Correct, if it's in the pool then it's n=0. However, we may also get that boundary-pruning gap smoothing with n>0, where tasks aren't pruned until a child (at n+1) becomes active.

@hjoliver (Member)

Huh? tasks before the initial cycle point? I haven't seen this (maybe I'm not running the right flows)

[scheduling]
  cycling mode = integer
  initial cycle point = 1
  final cycle point = 6
  [[graph]]
    P1 = "foo[-P1] => foo"
[runtime]
  [[root]]
    script = sleep 20

Screenshot from 2020-11-11 09-51-57

@dwsutherland (Member, Author)

Oh, it's the integer cycling, as the following doesn't have that issue:

[scheduling]
  #cycling mode = integer
  initial cycle point = 20200101
  final cycle point = 20200107
  [[graph]]
    P1D = "foo[-P1D] => foo"
[runtime]
  [[root]]
    script = sleep 20

Will have a look.

@dwsutherland (Member, Author)

Hmm.. This works fine:

[scheduling]
  cycling mode = integer
  initial cycle point = 1
  final cycle point = 6
  [[special tasks]]
    sequential = foo
  [[graph]]
    P1 = """
#foo[-P1] => foo
foo
"""
[runtime]
  [[root]]
    script = sleep 20

So it might be the trigger logic.

@hjoliver (Member)

Yes it seems to only happen with intercycle dependence. But the pre-initial task proxies aren't appearing in the scheduler task pool, right? So is it something to do with cycling sequence methods accessed by the datastore?

@dwsutherland (Member, Author) commented Nov 10, 2020

The cycle point needed to be checked to ensure it is both on-sequence and valid. For ISO8601 sequences the is_on_sequence method already includes a bounds check, whereas integer bounds checking is only done in is_valid (the ISO8601 is_valid just references the is_on_sequence result).

I have switched to is_valid for point checking while generating children and parents.
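
Roughly what that fix amounts to, as a sketch (is_valid and is_on_sequence are
the method names discussed above; the surrounding helper is hypothetical):

def point_in_bounds(sequence, point):
    """Check a candidate child/parent cycle point while generating the window.

    is_on_sequence(point) bounds-checks for ISO8601 sequences but not for
    integer ones, so pre-initial points could slip into the n=1 window;
    is_valid(point) bounds-checks in both cases.
    """
    return sequence.is_valid(point)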

@dwsutherland (Member, Author)

@oliver-sanders (BTW changing the window size doesn't require a reload.. Reloads will wipe and restart the whole data-store)

@hjoliver (Member)

(My similar fix here was also for integer cycling: #3756).

Posted an issue to check up on these cycling sequence methods: #3934

@hjoliver (Member)

Checked it works. Thanks @dwsutherland 🎉 💐

@hjoliver hjoliver merged commit 9b157fa into cylc:master Nov 11, 2020
@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
Successfully merging this pull request may close these issues:

  • tree: showing more cycle points than expected n=1 scheduler data-store