Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flow merge into no-flow task. #4645

Merged
merged 9 commits into from
Feb 14, 2022
Merged

Conversation

hjoliver
Copy link
Member

@hjoliver hjoliver commented Feb 3, 2022

These changes close #4644

  • parented tasks
  • parentless tassk

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg and conda-environment.yml.
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • No documentation update required.

@hjoliver hjoliver added the bug Something is wrong :( label Feb 3, 2022
@hjoliver hjoliver added this to the cylc-8.0rc2 milestone Feb 3, 2022
@hjoliver hjoliver self-assigned this Feb 3, 2022
@hjoliver
Copy link
Member Author

hjoliver commented Feb 3, 2022

This wasn't difficult to fix once I understood the problem, so bumping up to 8.0rc1.

Copy link
Member

@MetRonnie MetRonnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Note this is not currently compatible with #4640): Edit: now resolved

  • unheld -> not-held in cylc dump
  • spawn_successor -> spawn_successor_if_parentless

Comment on lines +1272 to +1301
if completed_only:
c_task.state.satisfy_me({
(str(itask.point), itask.tdef.name, output)
})
self.data_store_mgr.delta_task_prerequisite(c_task)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch isn't covered, need to manually test...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've run the test from #4651 against a flow.cylc and a suite.rc variants and god the same result, happy.

Copy link
Member Author

@hjoliver hjoliver Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something wrong with code coverage here. The new functional test here should hit this...

... brute force test: I put a print statement in there and it does get executed. 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea what could cause coverage to miss this?

Copy link
Member

@MetRonnie MetRonnie Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the coverage is accurate in this case. I put a LOG.critical("Here") at L1273 and it didn't show up in the workflow log for tests/functional/spawn-on-demand/14-trigger-flow-blocker.t

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, completed_only is True for retroactive spawning on completed outputs (when an ongoing flow merges with a one-off manually-triggered task (thus making it part of the ongoing flow) ... i.e. what this PR is about). In back-compat mode, completed_only is False and we spawn on all outputs ahead of time (before they are completed) to approximate Cylc 7 spawning of tasks before they are needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the coverage is accurate in this case. I put a LOG.critical("Here") at L1273 and it didn't show up in the workflow log for tests/functional/spawn-on-demand/14-trigger-flow-blocker.t

Strange. I just tried the same, to double check, and it does show up in my workflow log. (I did have to disable the purge in the reftest shell function, to keep the log for inspection).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Maybe the test is not actually working correctly in some environments, namely yours and GitHub actions ... I'll investigate later...)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had a play but I can't get this branch to activate either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test fixed. It was flaky.

@oliver-sanders
Copy link
Member

oliver-sanders commented Feb 3, 2022

I think this functionality is broken when merged with #4640, see #4651

Ignore me, I think I messed up the merge.

@oliver-sanders
Copy link
Member

Fixes the reported bug, however, is missing a DB check which causes case (4) to fail in the #4651 tests.

@hjoliver
Copy link
Member Author

hjoliver commented Feb 4, 2022

Rebased and deconflicted, post merge of #4640

@hjoliver
Copy link
Member Author

hjoliver commented Feb 4, 2022

Fixes the reported bug, however, is missing a DB check which causes case (4) to fail in the #4651 tests.

Not needed - see #4651 (comment)

@hjoliver
Copy link
Member Author

hjoliver commented Feb 4, 2022

OK, with my linked explanation of the controversial 4/x case, hopefully this is good to go as-is.

(Note also: cylc trigger --help rewrite, to clarify the basics of force triggering, for users)

cylc/flow/scripts/trigger.py Outdated Show resolved Hide resolved
Comment on lines +1272 to +1301
if completed_only:
c_task.state.satisfy_me({
(str(itask.point), itask.tdef.name, output)
})
self.data_store_mgr.delta_task_prerequisite(c_task)
Copy link
Member

@MetRonnie MetRonnie Feb 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the coverage is accurate in this case. I put a LOG.critical("Here") at L1273 and it didn't show up in the workflow log for tests/functional/spawn-on-demand/14-trigger-flow-blocker.t

@oliver-sanders
Copy link
Member

oliver-sanders commented Feb 4, 2022

Fixes the reported bug, however, is missing a DB check which causes case (4) to fail in the #4651 tests.

Not needed - see #4651 (comment)

I think this is needed to maintain some consistencies / logic see #4651 (comment)

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not work across a restart:

[scheduling]                                                  
    initial cycle point = 1                                   
    final cycle point = 5                                     
    cycling mode = integer                                    
    runahead limit = P1                                       
    [[graph]]                                                 
        P1 = x                                                
                                                              
[runtime]                                                     
    [[x]]                                                     
        script = """                                          
            if (( CYLC_TASK_CYCLE_POINT == 1 )); then         
                cylc trigger "${CYLC_WORKFLOW_ID}//4/x"       
            elif (( CYLC_TASK_CYCLE_POINT == 3 )); then       
                cylc stop "${CYLC_WORKFLOW_ID}" --now --now    
                contact="${HOME}/cylc-run/${CYLC_WORKFLOW_ID}/.service/contact"    
                while [[ -f "${contact}" ]]; do    
                    sleep 0.1                                                        
                done                                          
                cylc play "${CYLC_WORKFLOW_ID}"               
            fi                                                
        """  

Log from the second run:

2022-02-04T12:20:38Z INFO - Scheduler: url=tcp://...:43024/ pid=50836
2022-02-04T12:20:38Z INFO - Workflow publisher: url=tcp://vld601.cmpd1.metoffice.gov.uk:43020
2022-02-04T12:20:38Z INFO - Run: (re)start=1 log=1
2022-02-04T12:20:38Z INFO - Cylc version: 8.0rc1.dev
2022-02-04T12:20:38Z INFO - Run mode: live
2022-02-04T12:20:38Z INFO - Initial point: 1
2022-02-04T12:20:38Z INFO - Final point: 5
2022-02-04T12:20:38Z INFO - [3/x running(runahead) job:01 flows:1] => running
2022-02-04T12:20:40Z INFO - [3/x running job:01 flows:1] (polled)started at 2022-02-04T12:20:35Z
2022-02-04T12:20:40Z INFO - [3/x running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2022-02-04T12:20:40Z INFO - [4/x waiting(runahead) job:01 flows:1] (polled)succeeded at 2022-02-04T12:20:28Z
2022-02-04T12:20:40Z INFO - [4/x waiting(runahead) job:01 flows:1] => succeeded(runahead)
2022-02-04T12:20:41Z INFO - [3/x running job:01 flows:1] (received)succeeded at 2022-02-04T12:20:40Z
2022-02-04T12:20:41Z INFO - [3/x running job:01 flows:1] => succeeded
2022-02-04T12:20:41Z INFO - Workflow shutting down - AUTOMATIC
2022-02-04T12:20:41Z INFO - DONE

cylc/flow/scripts/trigger.py Outdated Show resolved Hide resolved
cylc/flow/scripts/trigger.py Outdated Show resolved Hide resolved
@hjoliver
Copy link
Member Author

hjoliver commented Feb 4, 2022

This does not work across a restart:

That's a bug, but it's not in scope for this PR (and not really specific to reflow either).

  1. submit no. does not increment until the preparing stage, so a) polling of a waiting(runahead) task at restart sees the previous job; and b) we should not be polling waiting tasks anyway
  2. in the context of your example, this PR is about triggering 3/x so that the main flow catches up with it while it is still running; not 4/x which is a future task that the scheduler doesn't catch up with (in the sense of requiring a merge)

@oliver-sanders
Copy link
Member

future task

If I subtract one from the trigger point I get the same result:

[scheduling]
    initial cycle point = 1
    final cycle point = 5
    cycling mode = integer
    runahead limit = P1
    [[graph]]
        P1 = x

[runtime]
    [[x]]
        script = """
            if (( CYLC_TASK_CYCLE_POINT == 1 )); then
                cylc trigger "${CYLC_WORKFLOW_ID}//3/x"
            elif (( CYLC_TASK_CYCLE_POINT == 2 )); then
                cylc stop "${CYLC_WORKFLOW_ID}" --now --now
                contact="${HOME}/cylc-run/${CYLC_WORKFLOW_ID}/.service/contact"
                while [[ -f "${contact}" ]]; do
                    sleep 0.1
                done
                cylc play "${CYLC_WORKFLOW_ID}"
            fi
        """
2022-02-04T15:30:57Z INFO - Initial point: 1
2022-02-04T15:30:57Z INFO - Final point: 5
2022-02-04T15:30:57Z INFO - [2/x running(runahead) job:01 flows:1] => running
2022-02-04T15:30:57Z WARNING - Task 3/x already spawned in {1}
2022-02-04T15:30:58Z INFO - [2/x running job:01 flows:1] (polled)started at 2022-02-04T15:30:51Z
2022-02-04T15:30:58Z INFO - [2/x running job:01 flows:1] health: execution timeout=None, polling intervals=PT15M,...
2022-02-04T15:30:59Z INFO - [2/x running job:01 flows:1] (received)succeeded at 2022-02-04T15:30:59Z
2022-02-04T15:30:59Z INFO - [2/x running job:01 flows:1] => succeeded
2022-02-04T15:31:00Z INFO - Workflow shutting down - AUTOMATIC
2022-02-04T15:31:00Z INFO - DONE

Whereas with this diff, the workflow runs on to cycle 5 correctly:

             elif (( CYLC_TASK_CYCLE_POINT == 2 )); then
+                return
                 cylc stop "${CYLC_WORKFLOW_ID}" --now --now

@hjoliver hjoliver mentioned this pull request Feb 5, 2022
7 tasks
@hjoliver
Copy link
Member Author

hjoliver commented Feb 5, 2022

That's a bug, but it's not in scope for this PR (and not really specific to reflow either).
...
future task

If I subtract one from the trigger point I get the same result:

So:
-2. now your test does do something that this PR addresses (flow merge) ✔️
-1. but the reason it fails is still off-topic ❌ ...

... see #4658

Both variants run to completion with that change in.

Full credit, you're hammering this issue so hard that we're flushing other bugs out of the system, which is great 😁

@hjoliver
Copy link
Member Author

hjoliver commented Feb 5, 2022

(Added some logging and rebased again)

@hjoliver hjoliver force-pushed the fix-4644 branch 2 times, most recently from aafb0e3 to 4232627 Compare February 8, 2022 00:43
@hjoliver
Copy link
Member Author

hjoliver commented Feb 8, 2022

(Rebased again to pick up unit test fixes).

cylc/flow/task_pool.py Outdated Show resolved Hide resolved
@oliver-sanders
Copy link
Member

Full credit, you're hammering this issue so hard that we're flushing other bugs out of the system

So much so I think I got a new one #4665

@oliver-sanders
Copy link
Member

oliver-sanders commented Feb 8, 2022

I'm ok with this approach, due to the number of trigger-related bugs uncovered recently I'd like to test/understand more tomorrow (still need to get other PRs merged so not blocking). Would be good to generalise the trigger tests to make sure we are covering all cases correctly.

@hjoliver
Copy link
Member Author

hjoliver commented Feb 9, 2022

Full credit, you're hammering this issue so hard that we're flushing other bugs out of the system

So much so I think I got a new one #4665

Fixed! (And not really spawn-on-demand related, for the record).

@oliver-sanders
Copy link
Member

One remaining issue with the coverage thing #4645 (comment)

@hjoliver hjoliver force-pushed the fix-4644 branch 3 times, most recently from ed7d780 to 7e5f7ca Compare February 14, 2022 05:09
@hjoliver
Copy link
Member Author

hjoliver commented Feb 14, 2022

Fast test fail needs #4684 cherry picked it

@hjoliver
Copy link
Member Author

One remaining issue with the coverage thing #4645 (comment)

The test was flaky, so sometimes the merge didn't occur. Should be good now.

@hjoliver
Copy link
Member Author

Coverage 100% 🎉

cylc/flow/scripts/trigger.py Show resolved Hide resolved
cylc/flow/scripts/trigger.py Show resolved Hide resolved
@oliver-sanders oliver-sanders merged commit 81ca304 into cylc:master Feb 14, 2022
@hjoliver hjoliver deleted the fix-4644 branch February 14, 2022 18:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

A triggered task can block a flow.
3 participants