
n-distance data-store graph window generation/pruning #3811

Merged (8 commits) Nov 11, 2020

Conversation

@dwsutherland (Member) commented Sep 13, 2020

These changes close #3747

Closes cylc/cylc-ui#496

N-Distance node/edge generation:
The cycle point node/edge (i.e. task/family proxy) generation is triggered
individually on transition from the staging to the active task pool. Each
active task is generated along with any children and parents, recursively out
to a specified maximum graph distance (n_edge_distance), which can be
externally altered (via the API). Collectively this forms the
N-Distance-Window on the workflow graph.

  • n-distance nodes/edges generation.
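
For illustration only, a minimal Python sketch of that outward walk (the
get_children and get_parents callables are hypothetical stand-ins for the real
graph lookups, not cylc-flow API):

def generate_window(active_task, n_edge_distance, get_children, get_parents):
    """Collect nodes out to n_edge_distance edges from an active (n=0) task."""
    window = {active_task}
    frontier = {active_task}
    for _ in range(n_edge_distance):
        next_frontier = set()
        for node in frontier:
            # Walk both directions: children and parents.
            next_frontier.update(get_children(node))
            next_frontier.update(get_parents(node))
        frontier = next_frontier - window
        window.update(frontier)
    return window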

N-Distance nodes/edges pruning:
Pruning of data-store elements is done using both the collection/set of nodes
generated through the associated graph paths of the active nodes and the
tracking of the boundary nodes (n_edge_distance+1) of those active nodes.
Once a boundary node becomes active, the original node is flagged for pruning.
Set operations are used to take the difference between the nodes of active
paths (paths whose node is in the active task pool) and the nodes of flagged
paths (whose boundary node(s) have become active).

example one

a* => b => c => d

(* marks the active task). With an n=1 window and a active we have
the following paths:

{a: {a, b}}

and boundary info at n+1:

{c: {a}}

When c becomes active we have:

{a: {a, b}, b: {a, b, c}, c: {b, c, d}}
{c: {a}, d: {b, c, d}} 

(yes, d is the prune trigger for itself and c, because it is the furthest distance before n+1 ... this works for isolates too)
and since c became active we flag the now out-of-window a for pruning.
The pruning mechanism will then take all the active path nodes:
{b, c, d}
and do a diff with all the inactive path nodes:
{a, b} + {a, b, c}
so we are cleared to prune {a}, resulting in:

{b: {a, b, c}, c: {b, c, d}}
{d: {b, c, d}} 
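
A rough Python rendering of the set arithmetic in example one (variable names
are illustrative, not the actual data-store manager attributes):

# Path membership per node once c has become active (n=1 window).
paths = {
    'a': {'a', 'b'},
    'b': {'a', 'b', 'c'},
    'c': {'b', 'c', 'd'},
}
active = {'c'}                    # nodes still in the active task pool
inactive = set(paths) - active    # {'a', 'b'}

# Nodes still needed by some active path ...
active_path_nodes = set().union(*(paths[n] for n in active))      # {'b', 'c', 'd'}
# ... versus nodes reachable only through inactive paths.
inactive_path_nodes = set().union(*(paths[n] for n in inactive))  # {'a', 'b', 'c'}

print(inactive_path_nodes - active_path_nodes)  # {'a'} -- cleared to prune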

example two

a => b* => c => d
e => f => g
b => g*

(* as active).. With n=1 window
When b & g turn active, e is pruned but not a, as a is in the path of active b (even though g flagged it); then when c becomes active, a is pruned.

example three

a => b => c* => d
a:failed => f

Pruning works here because when c becomes active, the paths of a include f, so f is pruned (as long as it's not in the active task pool).

(other convoluted graphs with conditionals and absolute triggers are also handled by this scheme)

So when we generate the window about an active task, we also find the conditions for its pruning and associate them with it.
This ensures a change of window size won't orphan the task from being pruned, which would trip up alternative pruning schemes; i.e. reducing the window size using a scheme based on flagging the ancestors/parents of active tasks would miss nodes formerly in the window.

  • n-distance nodes/edges pruning.

Animated examples:

Given the following graph with simple chain, and isolate branch:

[scheduling]
    initial cycle point = 20190101T00
    [[special tasks]]
        sequential = bin
    [[dependencies]]
        [[[P1M]]]
            graph = """
bin => foo => bar
baa => qux
"""
[runtime]
    [[root]]
        script = sleep 5

    [[baa, bin]]

    [[foo, bar, qux]]
        script = sleep 3

At n=0 we have:
n0_window_tui

And n=1:
n1_window_tui

But we can also alter this with a mutation:
n_window_wui

Need to:

  • work this option into TUI
  • n=1 is default, but perhaps we can set n=0 dynamically when the active pool gets beyond some configurable limit.

(future PRs?)

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation required (alpha release candidate; visual but not (yet) interactive change).
  • No dependency changes.

@dwsutherland added the labels "efficiency" (for notable efficiency improvements) and "POC" (Proof of Concept) Sep 13, 2020
@dwsutherland dwsutherland added this to the cylc-8.0a3 milestone Sep 13, 2020
@dwsutherland dwsutherland self-assigned this Sep 13, 2020
@oliver-sanders (Member)

Cylc Tui is looking much better 👍

I noticed that waiting tasks that are ready to run are considered to be active, which confused me slightly at first, but kinda makes sense as they are quite transitory.

@dwsutherland (Member, Author) commented Sep 14, 2020

I noticed that waiting tasks that are ready to run are considered to be active

Yeah, I call the window creation on any task added to the runahead pool ... But this could be done elsewhere.

We can actually make n_max configurable with a mutation.. n=0 is definitely more minimalist, and would be better for gigantic workflows. We could also make it a subscription argument (choosing the largest n_max to decide the data-store size), and filter independently..

It will mean the front end can affect the data-store size on the back-end workflow server, but perhaps we can set limits according to authorization (i.e. read only n=0, ... , full control n=?)

@hjoliver (Member)

I noticed that waiting tasks that are ready to run are considered to be active

Yeah, I call the window creation on any task added to the runahead pool ... But this could be done elsewhere.

I think n=0 (if that's what you're talking about) should include all tasks in the main scheduler pool, but not the runahead pool.

That will include "active waiting" tasks, as I think I called them in the SoD proposal: tasks that have at least one, but not all, prerequisites satisfied (they are spawned by the first upstream output that they depend on). Note these aren't necessarily transitory @oliver-sanders. As also discussed in the proposal, we could decide these waiting tasks are not "active" (i.e. not in n=0) but I don't think it matters too much right now.

@hjoliver (Member)

We can actually make n_max configurable with a mutation.. n=0 is definitely more minimalist, and would be better for gigantic workflows. We could also make it a subscription argument (choosing the largest n_max to decide the data-store size), and filter independently..

It will mean the front end can affect the data-store size on the back-end workflow server, ...

Remember the scheduler only needs to store n=0, but we decided it can keep up to n=1 for the convenience of cylc tui and other direct-connecting clients.

So the front end should determine the n-size of the UIS datastore, but not the scheduler datastore.

We could optionally allow the scheduler to store arbitrary n windows, up to some max, if you want to see loads of tasks in cylc tui without going through the UIS, but I don't see that as a primary use case. We can come back and do that later if there's a need for it. (I suppose if the UIS and scheduler code will be the same, for windowing, there's no harm in playing with it, or developing it, in the scheduler, but keep in mind the n=1 scheduler end game).

@dwsutherland (Member, Author)

I think n=0 (if that's what you're talking about) should include all tasks in the main scheduler pool, but not the runahead pool.

Done.

@oliver-sanders (Member) commented Sep 16, 2020

That will include "active waiting" tasks

we could decide these waiting tasks are not "active" (i.e. not in n=0) but I don't think it matters too much right now.

It might be a little bit difficult explaining the concept of "active waiting" tasks to users; it's easier to explain the n=x window as being x edges out from an active task, where active is {preparing, submitted, running}. However, it's not that big an issue.

@hjoliver (Member)

That will include "active waiting" tasks

we could decide these waiting tasks are not "active" (i.e. not in n=0) but I don't think it matters too much right now.

It might be a little bit difficult explaining the concept of "active waiting" tasks to users; it's easier to explain the n=x window as being x edges out from an active task, where active is {preparing, submitted, running}. However, it's not that big an issue.

Yeah I agree. As argued in the SoD proposal, spawning of these as waiting task proxies doesn't matter - users will see a "waiting" task out front whether or not it is backed by a task proxy in the scheduler; but they probably shouldn't appear in n=0.

New Issue: #3822

@dwsutherland (Member, Author) commented Sep 17, 2020

Yeah I agree. As argued in the SoD proposal, spawning of these as waiting task proxies doesn't matter - users will see a "waiting" task out front whether or not it is backed by a task proxy in the scheduler; but they probably shouldn't appear in n=0.

We can place the self.data_store_mgr.increment_graph_window(itask) call anywhere you like (itask being an n=0 node). At the moment it's when a task gets released from the runahead pool.

@dwsutherland (Member, Author)

The hard challenge I have at the moment is working out an efficient pruning scheme.. (I have a couple of things in mind, bound to evolve as I attempt)

@hjoliver (Member)

The hard challenge I have at the moment is working out an efficient pruning scheme.. (I have a couple of things in mind, bound to evolve as I attempt)

We can have a brainstorming session about that, so it's not all on you.

@dwsutherland (Member, Author)

I've added a pruning mechanism that satisfies the need to remove potential future workflow paths that are not taken
(i.e.

foo => bar
foo:failed => qux

).
I'll write up more in the description soon.

There is still a little buggy behaviour, so I'll need to tidy things up somewhat.

@dwsutherland (Member, Author) commented Sep 27, 2020

The other issue is: I've had to revert to triggering the n=0 window (node/edge creation) on tasks being added to the runahead pool. The reason is that tasks get removed before their children get released from the runahead pool, so creating nodes on release from runahead creates a gap in the data-store (i.e. children get removed as a potential workflow path that was never followed, and have to get recreated again)...

So the only other options are:

  1. Not removing from the active task pool until a child is released from the runahead pool (which undermines SoD a bit).
  2. Not removing from the data-store until all children are outside n_max, or something like this.
  3. Having a node/edge field to record edge distance, and pruning according to that (which would be annoying to increment, and edge lookup is better for queries so it probably wouldn't have another use).

I'm leaning towards option 2, as it's just a set operation/lookup, and nodes will soon know about parents out to n_max anyway... Watch this space...

@hjoliver (Member) commented Oct 5, 2020

@dwsutherland - not sure if you got any further with this yet, perhaps we can have a chat tomorrow?

My initial thought was the first order attempt might be something like this:

  • if a task becomes active (i.e. if it enters n=0) follow its edges out to n=N and add those tasks to the pool (if not already in it)
  • if a task leaves n=0, recompute the window from scratch, by following edges out from remaining n=0 tasks, and "prune" by differencing the old and new windows.

This would be super-easy and it might be enough, at least for low-ish n windows, because following edges is pretty easy.

Then the challenge would be to beat this with an efficient incremental pruning scheme (i.e. if a task leaves n=0, can we decide what tasks to remove from the n=N window without recomputing from scratch again). But in the meantime, we would have a working system 😁 - what do you think? (Maybe there's a flaw in my logic, or you already have a better scheme lined up).
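
If it helps, a hedged sketch of that recompute-and-diff idea (walk_window is a
placeholder for the edge-following step; none of these names are real
cylc-flow code):

def recompute_window(n0_tasks, n_max, walk_window):
    """Union of the n-windows of all current n=0 tasks."""
    window = set()
    for task in n0_tasks:
        window |= walk_window(task, n_max)  # follow edges out to n=N
    return window

def prune_after_change(old_window, n0_tasks, n_max, walk_window):
    """On a task leaving n=0: rebuild the window and diff to find prunables."""
    new_window = recompute_window(n0_tasks, n_max, walk_window)
    return new_window, old_window - new_window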

@dwsutherland (Member, Author)

if a task leaves n=0, recompute the window from scratch, by following edges out from remaining n=0 tasks, and "prune" by differencing the old and new windows.

@hjoliver - Yeah, I already have something more efficient in place, but I've got another idea in the works which avoids needing to know when something has left the pool (i.e. it knows when a child at n=N becomes n=0), and does a diff to prune (which should also hit paths not taken)... Anyway, I'm just about to implement this.

@dwsutherland force-pushed the n-window-data-store branch 2 times, most recently from a7bab88 to 851015a, October 14, 2020 10:13
@dwsutherland force-pushed the n-window-data-store branch 5 times, most recently from 161cd41 to 193bfba, October 16, 2020 02:36
@dwsutherland force-pushed the n-window-data-store branch 3 times, most recently from 2e475b0 to ef40e92, November 10, 2020 00:23
@dwsutherland (Member, Author)

n=0 tasks are now treated as isolates (pruned as soon as finished):

[scheduling]
  [[graph]]
    R1 = foo => bar & baz => qux
[runtime]
  [[root]]
    script = true
  [[baz]]
    script = sleep 10

n_0_isolates

@hjoliver (Member) left a review comment

I have played with various workflow graphs at several n values, and this seems to do the right thing - a nice improvement 👍

Noticed scheduler reload wipes out jobs from the datastore (probably not just on this branch though; fix with a follow-up?)

Some follow-ups already planned (e.g. waiting-but-held-back tasks in n=1 not n=0) and some other tweaks might be needed in due course, but I think this is a good basis for n-windows in the scheduler.

I did a pretty once-over-lightly code review, but it looked fine at that level ... so let's keep the ball rolling here.

@oliver-sanders (Member) commented Nov 10, 2020

I've not done an especially deep code review; I think I have a vague idea of what's going on. I know @dwsutherland and @hjoliver have debated the implementation to death, so I'm happy to pass over that.

I've had a play with a few graphs and it seems to be giving me the expected results.

I've found a few things which were unexpected to me, though I may just be demonstrating my ignorance of the discussions over recent weeks.

  1. I wasn't expecting the scheduler n-window to be configurable on the fly like that.
    • Not a problem, however, I guess we will need to add some warning notices advising users not to play
      with this as it might have the potential to mess up the UI server?
  2. Pre-initial tasks now show in the n=1 window.
    • We should block these from the view as they are potentially quite confusing, but happy to see that in a follow up.
    • Again apologies if I've missed a discussion.
  3. I think that I'm getting the n+1 window (e.g. I ask for n=1 but get n=2).

This capture shows (2) [task t.0] and (3) [two waiting tasks in the n=1 window]:

P1 = z[-P1] => a => b => c => d ...

Kapture 2020-11-10 at 16 33 29

If these things are known about, or if I'm just really out of date please merge!

@hjoliver (Member) commented Nov 10, 2020

I wasn't expecting the scheduler n-window to be configurable on the fly like that.

  • Not a problem, however, I guess we will need to add some warning notices advising users not to play with this as it might have the potential to mess up the UI server?

Huh, I was expecting that, and that we'd just have to make it work with the UIS. If I'm viewing a flow in the UI with the default n=1 window, and want to see more tasks (n=4 say), surely I shouldn't have to restart my UIS to achieve that?

Pre-initial tasks now show in the n=1 window.

  • We should block these from the view as they are potentially quite confusing, but happy to see that in a follow up.

I also noticed this but forgot to comment on it (definitely not intended from my perspective, but you didn't miss a discussion on it). Fix as a follow-up is fine.

I think that I'm getting the n+1 window (e.g. I ask for n=1 but get n=2).

@dwsutherland can confirm but I think that's because you're holding task b, and a ready-to-run-but-held task still looks like n=0 even though it isn't active. The datastore is currently assuming that the scheduler task pool content is all n=0/active. Fixing this is another SoD follow-up (if we take those tasks out of the scheduler task pool) or a datastore follow-up (if we get the datastore manager to deliberately exclude those tasks).

@hjoliver (Member)

So, no merge-blockers there I think, but will wait for @dwsutherland to respond as well.

@dwsutherland (Member, Author) commented Nov 10, 2020

I wasn't expecting the scheduler n-window to be configurable on the fly like that.

  • Not a problem, however, I guess we will need to add some warning notices advising users not to play with this as it might have the potential to mess up the UI server?

Huh, I was expecting that, and that we'd just have to make it work with the UIS. If I'm viewing a flow in the UI with the default n=1 window, and want to see more tasks (n=4 say), surely I shouldn't have to restart my UIS to achieve that?

Actually, I don't think it will mess up the UIS, because the boundary pruning isn't affected by it (only new nodes register the new boundary, so in effect each node "knows" when it should be pruned). And even if they get out of sync, the UIS will reconcile (via the stamp). (Though in the future they may be independent.)

Pre-initial tasks now show in the n=1 window.

  • We should block these from the view as they are potentially quite confusing, but happy to see that in a follow up.

I also noticed this but forgot to comment on it (definitely not intended from my perspective, but you didn't miss a discussion on it). Fix as a follow-up is fine.

Huh? tasks before the initial cycle point? I haven't seen this (maybe I'm not running the right flows)

I think that I'm getting the n+1 window (e.g. I ask for n=1 but get n=2).

@dwsutherland can confirm but I think that's because you're holding task b, and a ready-to-run-but-held task still looks like n=0 even though it isn't active. The datastore is currently assuming that the scheduler task pool content is all n=0/active. Fixing this is another SoD follow-up (if we take those tasks out of the scheduler task pool) or a datastore follow-up (if we get the datastore manager to deliberately exclude those tasks).

Correct, if it's in the pool then it's n=0. However, we may also get that boundary-pruning gap smoothing with n>0, where tasks aren't pruned until a child (at n+1) becomes active.

@hjoliver (Member)

Huh? tasks before the initial cycle point? I haven't seen this (maybe I'm not running the right flows)

[scheduling]
  cycling mode = integer
  initial cycle point = 1
  final cycle point = 6
  [[graph]]
    P1 = "foo[-P1] => foo"
[runtime]
  [[root]]
    script = sleep 20

Screenshot from 2020-11-11 09-51-57

@dwsutherland (Member, Author)

Oh, it's the integer cycling, as the following doesn't have that issue:

[scheduling]
  #cycling mode = integer
  initial cycle point = 20200101
  final cycle point = 20200107
  [[graph]]
    P1D = "foo[-P1D] => foo"
[runtime]
  [[root]]
    script = sleep 20

Will have a look.

@dwsutherland (Member, Author)

Hmm.. This works fine:

[scheduling]
  cycling mode = integer
  initial cycle point = 1
  final cycle point = 6
  [[special tasks]]
    sequential = foo
  [[graph]]
    P1 = """
#foo[-P1] => foo
foo
"""
[runtime]
  [[root]]
    script = sleep 20

So it might be the trigger logic.

@hjoliver (Member)

Yes it seems to only happen with intercycle dependence. But the pre-initial task proxies aren't appearing in the scheduler task pool, right? So is it something to do with cycling sequence methods accessed by the datastore?

@dwsutherland (Member, Author) commented Nov 10, 2020

The cycle point needed to be checked to ensure it is both on-sequence and valid. For ISO8601 sequences the is_on_sequence method already includes a bounds check, whereas integer bounds checking is only done in is_valid (the ISO8601 is_valid just references the is_on_sequence result).

I have switched to is_valid for point checking while generating children and parents.
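
Roughly what that fix amounts to, as a sketch (is_valid and is_on_sequence are
the method names discussed above; the surrounding helper is hypothetical):

def point_in_bounds(sequence, point):
    """Check a candidate child/parent cycle point while generating the window.

    is_on_sequence(point) bounds-checks for ISO8601 sequences but not for
    integer ones, so pre-initial points could slip into the n=1 window;
    is_valid(point) bounds-checks in both cases.
    """
    return sequence.is_valid(point)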

@dwsutherland (Member, Author)

@oliver-sanders (BTW changing the window size doesn't require a reload.. Reloads will wipe and restart the whole data-store)

@hjoliver (Member)

(My similar fix here was also for integer cycling: #3756).

Posted an issue to check up on these cycling sequence methods: #3934

@hjoliver (Member)

Checked it works. Thanks @dwsutherland 🎉 💐

@hjoliver hjoliver merged commit 9b157fa into cylc:master Nov 11, 2020
@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
Successfully merging this pull request may close these issues:

  • tree: showing more cycle points than expected n=1 scheduler data-store