
pool: task from previous run not retrieved from database #5952

Closed
oliver-sanders opened this issue Jan 31, 2024 · 8 comments
Labels
could be better Not exactly a bug, but not ideal. superseded

Comments

@oliver-sanders
Member

A niche situation where the outputs of a task which was previously run are not injected into the pool on restart.

This is a cut-down version of the complex example in #5947:

#!Jinja2

{% set stop_cycle = stop_cycle | default(4) %}

[scheduler]
    allow implicit tasks = True

[scheduling]
    initial cycle point = 1
    stop after cycle point = {{ stop_cycle }}
    cycling mode = integer
    [[graph]]
        # main sequence
        P1 = a
        # transition
        R/{{ stop_cycle - 2 }} = a => b
        # spin-down sequence
        R2/P1/{{ stop_cycle }} = """
            a => b
            b[-P1] => b
        """

To replicate:

# run to the end of cycle 4
cylc vip -s stop_cycle=4

# then, extend on to cycle 6
cylc play -s stop_cycle=6

The workflow will stall on the unsatisfied prerequisite 4/b:succeeded:

WARNING - Partially satisfied prerequisites:
      * 5/b is waiting on ['4/b:succeeded']
      * 6/b is waiting on ['5/b:succeeded']

However, this task had succeeded in the initial run:

INFO - [4/b running job:01 flows:1] => succeeded

It should have been re-inserted into the pool on restart, which would have resulted in its succeeded output satisfying its downstream task 5/b.

Overview in diagrams:

cylc vip -s stop_cycle=4

[screenshot: workflow graph for the run to stop_cycle=4]

cylc play -s stop_cycle=6

[screenshot: workflow graph after restarting with stop_cycle=6]

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Jan 31, 2024
@oliver-sanders oliver-sanders added this to the 8.2.x milestone Jan 31, 2024
@oliver-sanders
Member Author

Suggest investigating with #5658 as there have been a lot of changes in this part of the code.

@hjoliver
Member

hjoliver commented Feb 15, 2024

This isn't a bug, although we might still want to handle it automatically for convenience.

The reason for "pool: task from previous run not retrieved from database" is that, from an event-driven spawn-on-demand perspective, under normal circumstances (i.e., if you don't change the structure of the graph in the active window mid-run) we should never have to do that.

> However, this task had succeeded in the initial run:
> INFO - [4/b running job:01 flows:1] => succeeded
> It should have been re-inserted into the pool on restart, which would have resulted in its succeeded output satisfying its downstream task 5/b.

No it shouldn't - it got legitimately removed from the pool as a completed task.

First, all the green tasks finish complete, and we shut down with only 5/a waiting in the task pool:
[graph image: first-run graph at shutdown - succeeded tasks shown green, only 5/a still waiting]

Then we restart (so, load 5/a from the DB, ready to run), but the later stop cycle changes the graph to this:
[graph image: the graph as redefined by stop_cycle=6]

That is a very different graph, and the change directly impacts the active window - dangerous!
Specifically, the dependency 4/b => 5/b did not exist when 4/b completed and its children (at the time, none) were spawned.

So this is not a bug.

  • a restart, by definition, means to initialize a run from the previous workflow state, and
  • that state here does not naturally lead to the modified post-restart graph

... so it's perfectly reasonable to expect that some manual intervention might be required:

$ cylc set --pre=4/b <wflow-id>//5/b

i.e. manually satisfy the NEW prerequisite, which did not exist when the output was generated.

@hjoliver hjoliver added bug? Not sure if this is a bug or not and removed bug Something is wrong :( labels Feb 15, 2024
@hjoliver
Member

hjoliver commented Feb 15, 2024

> * 5/b is waiting on ['4/b:succeeded']

Unfortunately the simple stall message makes this look like a rather trivial bug, but it should be interpreted like this:

* the dependence of 5/b on 4/b:succeeded is not satisfied
THEREFORE either 4/b did not succeed, OR this dependence did not exist when it did succeed

We need to educate users that Cylc 8 scheduling is event-driven anyway: dependencies get satisfied automatically when the corresponding outputs are generated.

Once an event has passed you shouldn't really expect to be able to trigger new stuff off of it. So I don't think it would be unreasonable to explain that if you change the graph mid-run, you may need to manually satisfy any NEW dependence on old tasks, and leave it at that.

However, I suppose if we can do it automatically and efficiently, even better.

Automatic solutions?

Unfortunately this problem can't be detected by simply examining the task pool at restart. NEW dependencies can connect tasks that pre-date and post-date the restart task pool.

Looking up all parent outputs in the DB every time a task is spawned, just in case of this, sounds like a potential performance hit. At the moment the only past outputs we need to retrieve at runtime are those of incomplete flow-wait tasks that need to be re-run when the flow encounters them.

However, I guess we could check the DB for the upstream outputs of the partially satisfied tasks that cause a stall.
That would automatically un-stall the example without causing gazillions of additional unnecessary DB look-ups.
The new prerequisite would not get satisfied until the workflow stalls, but that's a lesser evil.
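For illustration, a minimal sketch of that idea (nothing below is cylc-flow code; Prereq, WaitingTask and db_outputs_for are invented stand-ins): on stall, walk the partially satisfied tasks, look each unsatisfied prerequisite up in the DB, and satisfy any whose output was recorded by an earlier run.

# Hypothetical sketch only -- invented data structures, not cylc-flow internals.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Prereq:
    point: str    # cycle point of the upstream task, e.g. "4"
    task: str     # upstream task name, e.g. "b"
    output: str   # required output, e.g. "succeeded"

@dataclass
class WaitingTask:
    id: str                                    # e.g. "5/b"
    unsatisfied: set = field(default_factory=set)

def unstall_from_db(stalled, db_outputs_for):
    """Satisfy prerequisites that were already met in a previous run.

    db_outputs_for(point, task) stands in for a DB query (e.g. via
    select_task_outputs) returning the outputs recorded for that task.
    """
    for itask in stalled:
        for prereq in set(itask.unsatisfied):
            if prereq.output in db_outputs_for(prereq.point, prereq.task):
                itask.unsatisfied.discard(prereq)  # satisfied retrospectively
    return [t for t in stalled if t.unsatisfied]   # genuine stalls, if any

# The example from this issue: 4/b succeeded in the first run, so 5/b's
# new prerequisite can be satisfied from the DB and the stall clears.
outputs = {("4", "b"): {"submitted", "started", "succeeded"}}
stalled = [WaitingTask("5/b", {Prereq("4", "b", "succeeded")})]
assert unstall_from_db(stalled, lambda p, t: outputs.get((p, t), set())) == []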

Any other ideas?

@hjoliver hjoliver added could be better Not exactly a bug, but not ideal. and removed bug? Not sure if this is a bug or not labels Feb 15, 2024
@oliver-sanders
Member Author

oliver-sanders commented Feb 15, 2024

> The reason for "pool: task from previous run not retrieved from database" is that, from an event-driven spawn-on-demand perspective, under normal circumstances (i.e., if you don't change the structure of the graph in the active window mid-run) we should never have to do that.

I understand what you are saying from the perspective of the SoD implementation, but I don't accept this argument from the perspective of the Cylc model. Note, if this were true, then we wouldn't perform DB checks when we insert tasks:

# Get submit number by flow_nums {flow_nums: submit_num, ...}
snums = self.workflow_db_mgr.pri_dao.select_submit_nums(
    name, str(point)
)

# ... and look up previously recorded outputs:
for outputs_str, fnums in (
    self.workflow_db_mgr.pri_dao.select_task_outputs(
        itask.tdef.name, str(itask.point))
).items():
    ...

Moreover, Cylc tells the user that a task which has previously run and succeeded has not, both in the log:

WARNING - Partially satisfied prerequisites:
      * 5/b is waiting on ['4/b:succeeded']

The cylc show command also displays incorrect information.

And that is unarguably a bug, even if it's a failure of the SoD model rather than a bug in its implementation!

Note also that this is not actually a graph change. The "stop after cycle point" is merely a flag that tells the scheduler when to shut down; the recurrence is not terminated until the "final cycle point", so this dependency does conceptually exist in the first run. The problem occurs because the workflow doesn't spawn tasks beyond this point, so the satisfied prereq is not recorded in the DB. I.e. this behaviour is an SoD implementation choice and not really "correct" even by SoD logic.

> Any other ideas?

One possibility might be to spawn tasks beyond the "stop after cycle point" but use the "runahead limit" to prevent them from being run.

By this approach, this issue would only occur when new dependencies are added to the graph on restart/reload.

Still an irritation, as the log and cylc show would continue to display incorrect information in these cases, but at least these cases are rarer.
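Purely to illustrate the shape of this proposal (hypothetical names throughout; this is not cylc-flow code), a toy sketch of spawning past the stop point while holding the new tasks:

# Hypothetical sketch only -- a toy model of the proposal, not cylc-flow code.
def spawn_children_on_output(parent_id, output, children, stop_point,
                             runahead_limit, pool, record_prereq):
    """Spawn downstream tasks even beyond the stop point, but hold them.

    children is an iterable of (cycle_point, name) pairs for this output;
    record_prereq stands in for writing the satisfied prerequisite to the
    DB so a later restart (with a larger stop point) still sees it.
    """
    for point, name in children:
        child_id = f"{point}/{name}"
        record_prereq(child_id, parent_id, output)
        held = point > stop_point or point > runahead_limit
        pool.append((child_id, held))   # held tasks are never submitted

# Toy usage for this issue's example: when 4/b succeeds with stop_cycle=4,
# 5/b is spawned held (and its prerequisite recorded) rather than not at all.
pool, recorded = [], []
spawn_children_on_output(
    "4/b", "succeeded", [(5, "b")], stop_point=4, runahead_limit=6,
    pool=pool, record_prereq=lambda c, p, o: recorded.append((c, p, o)))
assert pool == [("5/b", True)]
assert recorded == [("5/b", "4/b", "succeeded")]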

@oliver-sanders oliver-sanders added bug Something is wrong :( and removed could be better Not exactly a bug, but not ideal. labels Feb 15, 2024
@hjoliver
Member

hjoliver commented Feb 15, 2024

> I understand what you are saying from the perspective of the SoD implementation, but I don't accept this argument from the perspective of the Cylc model

The current behaviour is to be expected under our current event-driven conceptual model; it's not merely implementation: when an event occurs, dependent stuff happens. A system that can automatically handle dynamically added new dependence on past events may well be desirable, but that's something additional that is not "event driven" in the usual sense of the concept.

Anyhow, there's no point in arguing about the semantics of that because you disagree that there is a graph change at all, and you claim that there is a genuine bug in how we spawn tasks beyond a stop point - so let's sort that out.

> Note, if this were true, then we wouldn't perform DB checks when we insert tasks:

(That's a different sort of check - of the spawned task, not the outputs of its parents - but let's get to the claimed bug).

> Moreover, Cylc tells the user that a task which has previously run and succeeded, has not, both in the log:

I do agree it will look that way to users, and hence (again) that we should do this if we can. But as a point of fact, the wording does not literally say that - it just says that the task is waiting on that output (and why is it waiting? - I've explained that).

> Note also that this is not actually a graph change.

Sorry, that's inarguably false - just look at my graphs above!! Your original graphs at the top are wrong - they don't show the critical inter-cycle dependencies or what happens after the stop cycle in each case.

> The "stop after cycle point" is merely a flag that tells the scheduler when to shut down, the recurrence is not terminated until the "final cycle point" so this dependency does conceptually exist in the first run,

Ah, no - the graph structure here is literally determined (via Jinja2 code) by the current value of the stop point. Here is the complete graph for the initial run, with stop 4 and final point 10 (note the glaring absence of the problem dependency):

[graph image: complete graph of the initial run (stop point 4, final point 10) - no 4/b => 5/b dependency]

> the problem occurs because the workflow doesn't spawn tasks beyond this point so the satisfied prereq is not recorded in the DB. I.E. this behaviour is an SoD implementation choice and not really "correct" even by SoD logic.

You must not have the correct graph in mind?? The scheduler DOES spawn beyond the stop point exactly as it should: 5/a ends up in the task pool table ready for restart. There is nothing else that should be spawned for the graph as it is at that time.

You seem to be arguing that the following recurrence does not actually terminate at {{stop_cycle}} and therefore we should spawn 5/b before stopping???

        # for {% set stop_cycle = 4 %}
        R2/P1/4 = """
            b[-P1] => b
        """

But it does terminate at 4. Spawning 5/b is simply wrong, and in many cases it would actually cause bad things to happen.
I'm sure you know that, which is why I'm wondering if you have the wrong graphs in your head? We can't infer that the point 4 in that recurrence is supposed to be a movable stop point, because it might not be. It's just a templated number.

@oliver-sanders
Member Author

If we agree to go ahead with checking prereqs for all tasks on spawn, then this issue will be superseded by #6143.

@oliver-sanders
Member Author

We are now in agreement that we should handle this situation, so marking this issue as superseded by #6143, which will check prereqs on task spawn, ensuring the DB and task pool remain in sync.
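A minimal sketch of that spawn-time check, under the assumption that the DB can be queried for a parent's recorded outputs (all names below are invented; this is not the #6143 implementation):

# Hypothetical sketch only -- invented names, not the #6143 implementation.
def spawn_task(name, point, prereqs, db_outputs_for):
    """Spawn a task, pre-satisfying prerequisites already met in the DB.

    prereqs is an iterable of (parent_point, parent_name, output) triples;
    db_outputs_for(point, name) stands in for a DB query such as
    select_task_outputs, returning the outputs recorded in previous runs.
    """
    unsatisfied = {
        (p_point, p_name, output)
        for p_point, p_name, output in prereqs
        if output not in db_outputs_for(p_point, p_name)
    }
    return {"id": f"{point}/{name}", "unsatisfied": unsatisfied}

# In this issue's example, 5/b spawned on restart immediately sees
# 4/b:succeeded in the DB and comes up with its prerequisite satisfied.
db = {("4", "b"): {"succeeded"}}
itask = spawn_task("b", "5", [("4", "b", "succeeded")],
                   lambda p, n: db.get((p, n), set()))
assert itask["unsatisfied"] == set()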

@oliver-sanders oliver-sanders removed this from the 8.3.x milestone Oct 18, 2024
@hjoliver
Member

All good, although for the record I never disagreed that we should handle this, I just disagreed that it was a bug. In other words I stand by my initial comment above on this issue:

> This isn't a bug, although we might still want to handle it automatically for convenience.

Supporting automatic satisfaction of newly added dependence on past events is a choice. It probably is what users would expect (all the time?) but it may have performance consequences (which also affect users).

@hjoliver hjoliver added could be better Not exactly a bug, but not ideal. and removed bug Something is wrong :( labels Oct 21, 2024