pool: task from previous run not retrieved from database #5952
Suggest investigating with #5658, as there have been a lot of changes in this part of the code.
Unfortunately the simple stall message makes this look like a rather trivial bug, but it should be interpreted like this:
We need to educate users that Cylc 8 scheduling is event-driven anyway: dependencies get satisfied automatically when the corresponding outputs are generated. Once an event has passed you shouldn't really expect to be able to trigger new stuff off of it. So I don't think it would be unreasonable to explain that if you change the graph mid-run, you may need to manually satisfy any NEW dependence on old tasks, and leave it at that. However, I suppose if we can do it automatically and efficiently, even better.

Automatic solutions? Unfortunately this problem can't be detected by simply examining the task pool at restart: NEW dependencies can connect tasks that pre-date and post-date the restart task pool. Looking up all parent outputs in the DB every time a task is spawned, just in case of this, sounds like a potential performance hit. At the moment the only past outputs we need to retrieve at runtime are those of incomplete flow-wait tasks that need to be re-run when the flow encounters them. However, I guess we could check the DB for the upstream outputs of the partially satisfied tasks that cause a stall. Any other ideas?
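For concreteness, here is a minimal sketch of that last idea: on stall, check the run database for the upstream outputs of the partially satisfied tasks. This assumes direct sqlite access and a `task_outputs` table keyed by cycle and name with a JSON `outputs` column; the function and argument names are hypothetical, not Cylc API.

```python
import json
import sqlite3


def outputs_satisfied_in_db(db_path, unsatisfied_prereqs):
    """Return the subset of (cycle, task, output) prerequisites that a
    previous run already generated, according to the workflow database.

    Sketch only: assumes a task_outputs table with (cycle, name, outputs)
    columns, where outputs is a JSON collection of generated outputs.
    """
    found = set()
    with sqlite3.connect(db_path) as conn:
        for cycle, name, output in unsatisfied_prereqs:
            row = conn.execute(
                "SELECT outputs FROM task_outputs"
                " WHERE cycle = ? AND name = ?",
                (str(cycle), name),
            ).fetchone()
            if row and output in json.loads(row[0]):
                found.add((cycle, name, output))
    return found
```

Because this would only run when a stall has already happened, it avoids the per-spawn lookup cost worried about above.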
I understand what you are saying from the perspective of the SoD implementation, but I don't accept this argument from the perspective of the Cylc model. Note, if this were true, then we wouldn't perform DB checks when we insert tasks:

- cylc-flow/cylc/flow/task_pool.py, lines 1503 to 1506 at 8ab57ad
- cylc-flow/cylc/flow/task_pool.py, lines 1580 to 1582 at 8ab57ad
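Those lines aren't reproduced in this copy, but schematically the kind of check being referenced is something like the following paraphrase against the DB's `task_states` table (this is not the actual task_pool.py code; the attribute and method names are assumptions):

```python
def restore_inserted_task(conn, itask):
    """Paraphrase of the referenced insertion check: look the task up
    in the run DB so a previously-run task is restored with its
    recorded state rather than treated as brand new (sketch only)."""
    row = conn.execute(
        "SELECT status FROM task_states WHERE cycle = ? AND name = ?",
        (str(itask.point), itask.tdef.name),  # attribute names assumed
    ).fetchone()
    if row:
        itask.state.reset(row[0])  # hypothetical restore call
```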
Moreover, Cylc tells the user that a task which has previously run and succeeded has not, both in the log and in the scheduler's own output (the embedded excerpts are not preserved in this copy).

And that is unarguably a bug, even if it's a failure of the SoD model rather than a bug in its implementation! Note also that this is not actually a graph change. The "stop after cycle point" is merely a flag that tells the scheduler when to shut down; the recurrence is not terminated until the "final cycle point", so this dependency does conceptually exist in the first run. The problem occurs because the workflow doesn't spawn tasks beyond this point, so the satisfied prerequisite is not recorded in the DB. I.e. this behaviour is an SoD implementation choice and not really "correct" even by SoD logic.
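To make that distinction concrete, here is an illustrative scheduling section (integer cycling assumed; this is not the workflow from this issue):

```
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    # The recurrence runs to here; this is what bounds the graph.
    final cycle point = 10
    # This is only a flag telling the scheduler when to shut down;
    # it does NOT terminate the recurrence.
    stop after cycle point = 4
```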
One possibility might be to spawn tasks beyond the "stop after cycle point" but use the "runahead limit" to prevent them from being run. By this approach, this issue would only occur when new dependencies are added to the graph on restart/reload. Still an irritation, as the log and …
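(The comment above is cut off in this copy.) Mechanically, the proposal could look something like this toy gate; all names here are hypothetical rather than Cylc internals:

```python
def can_run(task_point, stop_point, runahead_point):
    """Toy gate for the proposal above: tasks past the stop point are
    still spawned, so their satisfied prerequisites reach the DB, but
    they are held from running, like tasks past the runahead limit."""
    if stop_point is not None and task_point > stop_point:
        return False  # spawned and recorded, but held
    return task_point <= runahead_point
```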
The current behaviour is to be expected under our current event-driven conceptual model, it's not merely implementation: when an event occurs, dependent stuff happens. A system that can automatically handle dynamically added new dependence on past events may well be desirable, but that's something additional that is not "event driven" in the usual sense of the concept. Anyhow, there's no point in arguing about the semantics of that because you disagree that there is a graph change at all, and you claim that there is a genuine bug in how we spawn tasks beyond a stop point - so let's sort that out.
(That's a different sort of check - of the spawned task, not the outputs of its parents - but let's get to the claimed bug).
I do agree it will look that way to users, and hence (again) that we should do this if we can. But as a point of fact, the wording does not literally say that; it just says that the task is waiting on that output (and why is it waiting? I've explained that).
Sorry, that's inarguably false - just look at my graphs above!! Your original graphs at the top are wrong: they don't show the critical inter-cycle dependencies or what happens after the stop cycle in each case.
Ah, no - the graph structure here is literally determined (via Jinja2 code) by the current value of the stop point. Here is the complete graph for the initial run, with stop 4 and final point 10 (note the glaring absence of the problem dependency):
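(The embedded graph image is not preserved in this copy.) Purely for illustration, and consistent with the reconstruction sketched near the bottom of this issue, a first run with stop 4 and final point 10 would contain something like:

```
# a => b in every cycle from 1 to 10:
1/a => 1/b, 2/a => 2/b, ... 10/a => 10/b
# inter-cycle dependence on b only up to the stop point:
1/b => 2/b => 3/b => 4/b
# crucially absent: 4/b => 5/b (the problem dependency)
```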
You must not have the correct graph in mind?? The scheduler DOES spawn beyond the stop point exactly as it should. You seem to be arguing that the following recurrence (not preserved in this copy) does not actually terminate at 4.

But it does terminate at 4. Spawning …
If we agree to go ahead with checking prereqs for all tasks on spawn, then this issue will be superseded by #6143.
We are now in agreement that we should handle this situation, so marking this issue as superseded by #6143, which will check prereqs on task spawn, ensuring the DB and task pool remain in sync.
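A sketch of what that spawn-time check might look like, reusing the hypothetical `outputs_satisfied_in_db` helper from earlier in this thread (none of these names are real Cylc API):

```python
def spawn_with_db_check(pool, db_path, point, name):
    """Spawn a task, then pre-satisfy any prerequisites that a previous
    run already met, keeping the pool and the DB in sync (sketch of
    the #6143 idea; all attribute and method names are hypothetical)."""
    itask = pool.spawn(point, name)
    unmet = [
        (pre.cycle, pre.task, pre.output)
        for pre in itask.prerequisites
        if not pre.satisfied
    ]
    for prereq in outputs_satisfied_in_db(db_path, unmet):
        itask.satisfy(prereq)
    return itask
```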
All good, although for the record I never disagreed that we should handle this, I just disagreed that it was a bug. In other words I stand by my initial comment above on this issue:
Supporting automatic satisfaction of newly added dependence on past events is a choice. It probably is what users would expect (all the time?) but it may have performance consequences (which also affect users).
A niche situation where the outputs of a previously-run task are not injected into the pool on restart.
This is a cut-down version of the complex example in #5947:
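(The cut-down workflow definition itself is not preserved in this copy.) The following is a plausible minimal reconstruction, inferred from the `stop_cycle` template variable in the commands below, the `a`/`b` task names in the stall message, and the graphs discussed above; the exact recurrence form is an assumption:

```
#!Jinja2
[scheduling]
    cycling mode = integer
    initial cycle point = 1
    final cycle point = 10
    stop after cycle point = {{ stop_cycle }}
    [[graph]]
        P1 = "a => b"
        # assumed: the inter-cycle dependence is written so that it
        # only exists up to the stop point in the first run
        R{{ stop_cycle }}/^/P1 = "b[-P1] => b"
[runtime]
    [[a, b]]
        script = true
```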
To replicate:

```
cylc vip -s stop_cycle=4
# after the scheduler shuts down at the stop point:
cylc play -s stop_cycle=6
```

The workflow will stall due to `4/b:succeeded`. However, this task had succeeded in the initial run: it should have been re-inserted into the pool on restart, which would have resulted in its succeeded output satisfying its downstream task `5/b`.

Overview in diagrams (initial run with `stop_cycle=4`, restart with `stop_cycle=6`; the diagrams themselves are not preserved in this copy).