
pool: task from previous run not retrieved from database #5952

Closed
oliver-sanders opened this issue Jan 31, 2024 · 8 comments
Labels
could be better Not exactly a bug, but not ideal. superseded

Comments

@oliver-sanders
Member

A niche situation where the outputs of a task which was previously run are not injected into the pool on restart.

This is a cut-down version of the complex example in #5947:

#!Jinja2

{% set stop_cycle = stop_cycle | default(4) %}

[scheduler]
    allow implicit tasks = True

[scheduling]
    initial cycle point = 1
    stop after cycle point = {{ stop_cycle }}
    cycling mode = integer
    [[graph]]
        # main sequence
        P1 = a
        # transition
        R/{{ stop_cycle - 2 }} = a => b
        # spin-down sequence
        R2/P1/{{ stop_cycle }} = """
            a => b
            b[-P1] => b
        """

To replicate:

# run to the end of cycle 4
cylc vip -s stop_cycle=4

# then, extend on to cycle 6
cylc play -s stop_cycle=6

The workflow will stall on the unsatisfied prerequisite 4/b:succeeded:

WARNING - Partially satisfied prerequisites:
      * 5/b is waiting on ['4/b:succeeded']
      * 6/b is waiting on ['5/b:succeeded']

However, this task had succeeded in the initial run:

INFO - [4/b running job:01 flows:1] => succeeded

It should have been re-inserted into the pool on restart, which would have resulted in its succeeded output satisfying its downstream task 5/b.

Overview in diagrams:

cylc vip -s stop_cycle=4

[screenshot: workflow graph for the run to stop_cycle=4]

cylc play -s stop_cycle=6

[screenshot: workflow graph after restarting with stop_cycle=6]

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Jan 31, 2024
@oliver-sanders oliver-sanders added this to the 8.2.x milestone Jan 31, 2024
@oliver-sanders
Member Author

Suggest investigating with #5658 as there have been a lot of changes in this part of the code.

@hjoliver
Member

hjoliver commented Feb 15, 2024

This isn't a bug, although we might still want to handle it automatically for convenience.

The reason for "pool: task from previous run not retrieved from database" is that, from an event-driven spawn-on-demand perspective, under normal circumstances (i.e., if you don't change the structure of the graph in the active window mid-run) we should never have to do that.

> However, this task had succeeded in the initial run:
> INFO - [4/b running job:01 flows:1] => succeeded
> It should have been re-inserted into the pool on restart, which would have resulted in its succeeded output satisfying its downstream task 5/b.

No it shouldn't - it got legitimately removed from the pool as a completed task.

First, all the green tasks finish complete, and we shut down with only 5/a waiting in the task pool:
[graph image: first-run graph at shutdown - succeeded tasks shown green, only 5/a still waiting]

Then we restart (so, load 5/a from the DB, ready to run), but the later stop cycle changes the graph to this:
[graph image: the graph as redefined by stop_cycle=6]

That is a very different graph, and the change directly impacts the active window - dangerous!
Specifically, the dependency 4/b => 5/b did not exist when 4/b completed and its children (at the time, none) were spawned.

So this is not a bug.

  • a restart, by definition, means to initialize a run from the previous workflow state, and
  • that state here does not naturally lead to the modified post-restart graph

... so it's perfectly reasonable to expect that some manual intervention might be required:

$ cylc set --pre=4/b <wflow-id>//5/b

i.e. manually satisfy the NEW prerequisite, which did not exist when the output was generated.

@hjoliver hjoliver added bug? Not sure if this is a bug or not and removed bug Something is wrong :( labels Feb 15, 2024
@hjoliver
Member

hjoliver commented Feb 15, 2024

> * 5/b is waiting on ['4/b:succeeded']

Unfortunately the simple stall message makes this look like a rather trivial bug, but it should be interpreted like this:

* the dependence of 5/b on 4/b:succeeded is not satisfied
THEREFORE either 4/b did not succeed, OR this dependence did not exist when it did succeed

We need to educate users that Cylc 8 scheduling is event-driven anyway: dependencies get satisfied automatically when the corresponding outputs are generated.

Once an event has passed you shouldn't really expect to be able to trigger new stuff off of it. So I don't think it would be unreasonable to explain that if you change the graph mid-run, you may need to manually satisfy any NEW dependence on old tasks, and leave it at that.

However, I suppose if we can do it automatically and efficiently, even better.

Automatic solutions?

Unfortunately this problem can't be detected by simply examining the task pool at restart. NEW dependencies can connect tasks that pre-date and post-date the restart task pool.

Looking up all parent outputs in the DB every time a task is spawned, just in case of this, sounds like a potential performance hit. At the moment the only past outputs we need to retrieve at runtime are those of incomplete flow-wait tasks that need to be re-run when the flow encounters them.

However, I guess we could check the DB for the upstream outputs of the partially satisfied tasks that cause a stall.
That would automatically un-stall the example without causing gazillions of additional unnecessary DB look-ups.
The new prerequisite would not get satisfied until the workflow stalls, but that's a lesser evil.
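For illustration, a minimal sketch of that idea (nothing below is cylc-flow code; Prereq, WaitingTask and db_outputs_for are invented stand-ins): on stall, walk the partially satisfied tasks, look each unsatisfied prerequisite up in the DB, and satisfy any whose output was recorded by an earlier run.

# Hypothetical sketch only -- invented data structures, not cylc-flow internals.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Prereq:
    point: str    # cycle point of the upstream task, e.g. "4"
    task: str     # upstream task name, e.g. "b"
    output: str   # required output, e.g. "succeeded"

@dataclass
class WaitingTask:
    id: str                                    # e.g. "5/b"
    unsatisfied: set = field(default_factory=set)

def unstall_from_db(stalled, db_outputs_for):
    """Satisfy prerequisites that were already met in a previous run.

    db_outputs_for(point, task) stands in for a DB query (e.g. via
    select_task_outputs) returning the outputs recorded for that task.
    """
    for itask in stalled:
        for prereq in set(itask.unsatisfied):
            if prereq.output in db_outputs_for(prereq.point, prereq.task):
                itask.unsatisfied.discard(prereq)  # satisfied retrospectively
    return [t for t in stalled if t.unsatisfied]   # genuine stalls, if any

# The example from this issue: 4/b succeeded in the first run, so 5/b's
# new prerequisite can be satisfied from the DB and the stall clears.
outputs = {("4", "b"): {"submitted", "started", "succeeded"}}
stalled = [WaitingTask("5/b", {Prereq("4", "b", "succeeded")})]
assert unstall_from_db(stalled, lambda p, t: outputs.get((p, t), set())) == []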

Any other ideas?

@hjoliver hjoliver added could be better Not exactly a bug, but not ideal. and removed bug? Not sure if this is a bug or not labels Feb 15, 2024
@oliver-sanders
Member Author

oliver-sanders commented Feb 15, 2024

> The reason for "pool: task from previous run not retrieved from database" is that, from an event-driven spawn-on-demand perspective, under normal circumstances (i.e., if you don't change the structure of the graph in the active window mid-run) we should never have to do that.

I understand what you are saying from the perspective of the SoD implementation, but I don't accept this argument from the perspective of the Cylc model. Note, if this were true, then we wouldn't perform DB checks when we insert tasks:

# Get submit number by flow_nums {flow_nums: submit_num, ...}
snums = self.workflow_db_mgr.pri_dao.select_submit_nums(
    name, str(point)
)

# ... and look up previously recorded outputs:
for outputs_str, fnums in (
    self.workflow_db_mgr.pri_dao.select_task_outputs(
        itask.tdef.name, str(itask.point))
).items():
    ...

Moreover, Cylc tells the user that a task which has previously run and succeeded has not, both in the log:

WARNING - Partially satisfied prerequisites:
      * 5/b is waiting on ['4/b:succeeded']

The cylc show command also displays incorrect information.

And that is unarguably a bug, even if it's a failure of the SoD model rather than a bug in its implementation!

Note also that this is not actually a graph change. The "stop after cycle point" is merely a flag that tells the scheduler when to shut down; the recurrence is not terminated until the "final cycle point", so this dependency does conceptually exist in the first run. The problem occurs because the workflow doesn't spawn tasks beyond this point, so the satisfied prereq is not recorded in the DB. I.e. this behaviour is an SoD implementation choice and not really "correct" even by SoD logic.

> Any other ideas?

One possibility might be to spawn tasks beyond the "stop after cycle point" but use the "runahead limit" to prevent them from being run.

By this approach, this issue would only occur when new dependencies are added to the graph on restart/reload.

Still an irritation, as the log and cylc show would continue to display incorrect information in these cases, but at least these cases are rarer.
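Purely to illustrate the shape of this proposal (hypothetical names throughout; this is not cylc-flow code), a toy sketch of spawning past the stop point while holding the new tasks:

# Hypothetical sketch only -- a toy model of the proposal, not cylc-flow code.
def spawn_children_on_output(parent_id, output, children, stop_point,
                             runahead_limit, pool, record_prereq):
    """Spawn downstream tasks even beyond the stop point, but hold them.

    children is an iterable of (cycle_point, name) pairs for this output;
    record_prereq stands in for writing the satisfied prerequisite to the
    DB so a later restart (with a larger stop point) still sees it.
    """
    for point, name in children:
        child_id = f"{point}/{name}"
        record_prereq(child_id, parent_id, output)
        held = point > stop_point or point > runahead_limit
        pool.append((child_id, held))   # held tasks are never submitted

# Toy usage for this issue's example: when 4/b succeeds with stop_cycle=4,
# 5/b is spawned held (and its prerequisite recorded) rather than not at all.
pool, recorded = [], []
spawn_children_on_output(
    "4/b", "succeeded", [(5, "b")], stop_point=4, runahead_limit=6,
    pool=pool, record_prereq=lambda c, p, o: recorded.append((c, p, o)))
assert pool == [("5/b", True)]
assert recorded == [("5/b", "4/b", "succeeded")]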

@oliver-sanders oliver-sanders added bug Something is wrong :( and removed could be better Not exactly a bug, but not ideal. labels Feb 15, 2024
@hjoliver
Member

hjoliver commented Feb 15, 2024

> I understand what you are saying from the perspective of the SoD implementation, but I don't accept this argument from the perspective of the Cylc model

The current behaviour is to be expected under our current event-driven conceptual model; it's not merely implementation: when an event occurs, dependent stuff happens. A system that can automatically handle dynamically added new dependence on past events may well be desirable, but that's something additional that is not "event driven" in the usual sense of the concept.

Anyhow, there's no point in arguing about the semantics of that because you disagree that there is a graph change at all, and you claim that there is a genuine bug in how we spawn tasks beyond a stop point - so let's sort that out.

> Note, if this were true, then we wouldn't perform DB checks when we insert tasks:

(That's a different sort of check - of the spawned task, not the outputs of its parents - but let's get to the claimed bug).

> Moreover, Cylc tells the user that a task which has previously run and succeeded, has not, both in the log:

I do agree it will look that way to users, and hence (again) that we should do this if we can. But as a point of fact, the wording does not literally say that - it just says that the task is waiting on that output (and why is it waiting? - I've explained that).

> Note also that this is not actually a graph change.

Sorry, that's inarguably false - just look at my graphs above!! Your original graphs at the top are wrong - they don't show the critical inter-cycle dependencies or what happens after the stop cycle in each case.

> The "stop after cycle point" is merely a flag that tells the scheduler when to shut down, the recurrence is not terminated until the "final cycle point" so this dependency does conceptually exist in the first run,

Ah, no - the graph structure here is literally determined (via Jinja2 code) by the current value of the stop point. Here is the complete graph for the initial run, with stop 4 and final point 10 (note the glaring absence of the problem dependency):

[graph image: complete graph of the initial run (stop point 4, final point 10) - no 4/b => 5/b dependency]

> the problem occurs because the workflow doesn't spawn tasks beyond this point so the satisfied prereq is not recorded in the DB. I.E. this behaviour is an SoD implementation choice and not really "correct" even by SoD logic.

You must not have the correct graph in mind?? The scheduler DOES spawn beyond the stop point exactly as it should: 5/a ends up in the task pool table ready for restart. There is nothing else that should be spawned for the graph as it is at that time.

You seem to be arguing that the following recurrence does not actually terminate at {{stop_cycle}} and therefore we should spawn 5/b before stopping???

        # for {% set stop_cycle = 4 %}
        R2/P1/4 = """
            b[-P1] => b
        """

But it does terminate at 4. Spawning 5/b is simply wrong, and in many cases it would actually cause bad things to happen.
I'm sure you know that, which is why I'm wondering if you have the wrong graphs in your head? We can't infer that the point 4 in that recurrence is supposed to be a movable stop point, because it might not be. It's just a templated number.

@oliver-sanders
Member Author

If we agree to go ahead with checking prereqs for all tasks on spawn, then this issue will be superseded by #6143.

@oliver-sanders
Member Author

We are now in agreement that we should handle this situation, so marking this issue as superseded by #6143, which will check prereqs on task spawn, ensuring the DB and task pool remain in sync.
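A minimal sketch of that spawn-time check, under the assumption that the DB can be queried for a parent's recorded outputs (all names below are invented; this is not the #6143 implementation):

# Hypothetical sketch only -- invented names, not the #6143 implementation.
def spawn_task(name, point, prereqs, db_outputs_for):
    """Spawn a task, pre-satisfying prerequisites already met in the DB.

    prereqs is an iterable of (parent_point, parent_name, output) triples;
    db_outputs_for(point, name) stands in for a DB query such as
    select_task_outputs, returning the outputs recorded in previous runs.
    """
    unsatisfied = {
        (p_point, p_name, output)
        for p_point, p_name, output in prereqs
        if output not in db_outputs_for(p_point, p_name)
    }
    return {"id": f"{point}/{name}", "unsatisfied": unsatisfied}

# In this issue's example, 5/b spawned on restart immediately sees
# 4/b:succeeded in the DB and comes up with its prerequisite satisfied.
db = {("4", "b"): {"succeeded"}}
itask = spawn_task("b", "5", [("4", "b", "succeeded")],
                   lambda p, n: db.get((p, n), set()))
assert itask["unsatisfied"] == set()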

@oliver-sanders oliver-sanders removed this from the 8.3.x milestone Oct 18, 2024
@hjoliver
Member

All good, although for the record I never disagreed that we should handle this, I just disagreed that it was a bug. In other words I stand by my initial comment above on this issue:

> This isn't a bug, although we might still want to handle it automatically for convenience.

Supporting automatic satisfaction of newly added dependence on past events is a choice. It probably is what users would expect (all the time?) but it may have performance consequences (which also affect users).

@hjoliver hjoliver added could be better Not exactly a bug, but not ideal. and removed bug Something is wrong :( labels Oct 21, 2024