Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix suite freeze on non-existent xtrigger #3056

Merged
merged 4 commits into from
Apr 1, 2019

Conversation

dwsutherland
Copy link
Member

As described in this issue:
#3054

Problem
The suite freezes completely when a xtrigger is declared with a non-existent function (i.e. a misspell of wall_clock as wallclock).. This can be devastating, as the only course of action is to kill the suite program/process via the OS...

Solution
This change will:

  • Catch non-existent function errors on suite configure, by attempting to load each xtrigger function before passing it to the manager.
  • Catch the Type error exception causing the suite to freeze (in case the python scripts of a running suite are deleted/moved)

Given the following suite section:

    [[xtriggers]]
        oopsie = not_a_function()
        clock_pt1h = wall_clock(offset=PT1H)
        jin_state = suite_state(jin, baz, %(point)s, cylc_run_dir=/home/sutherlander/cylc-run)
    [[dependencies]]
        [[[P1M]]]
            graph = """
@jin_state => baa
baa => qaz
@clock_pt1h => qux
@oopsie => qux
"""

You can expect the following on error:

(cylc8proto) sutherlander@cortex-vbox:baz$ cylc validate suite.rc 
'ERROR, xtrigger function not found: not_a_function'
(cylc8proto) sutherlander@cortex-vbox:baz$ cylc run --hold baz
            ._.                                                       
            | |                 The Cylc Suite Engine [8.0a0]         
._____._. ._| |_____.           Copyright (C) 2008-2019 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY;
      .___! |          see `cylc warranty`.  It is free software, you 
      !_____!           are welcome to redistribute it under certain  
2019-04-01T17:39:40+13:00 ERROR - 'ERROR, xtrigger function not found: not_a_function'
	Traceback (most recent call last):
	  File "/home/sutherlander/repos/cylc8/lib/cylc/subprocpool.py", line 52, in get_func
	    mod_by_name = __import__(mod_name, fromlist=[mod_name])
	ModuleNotFoundError: No module named 'not_a_function'
	
	During handling of the above exception, another exception occurred:
	
	Traceback (most recent call last):
	  File "/home/sutherlander/repos/cylc8/lib/cylc/config.py", line 2101, in load_graph
	    xfunc = get_func(xtrig.func_name, self.fdir)
	  File "/home/sutherlander/repos/cylc8/lib/cylc/subprocpool.py", line 57, in get_func
	    mod_by_name = __import__(mod_name, fromlist=[mod_name])
	ModuleNotFoundError: No module named 'cylc.xtriggers.not_a_function'
	
	During handling of the above exception, another exception occurred:
	
	Traceback (most recent call last):
	  File "/home/sutherlander/repos/cylc8/lib/cylc/scheduler.py", line 255, in start
	    self.configure()
	  File "/home/sutherlander/repos/cylc8/lib/cylc/scheduler.py", line 400, in configure
	    self.load_suiterc()
	  File "/home/sutherlander/repos/cylc8/lib/cylc/scheduler.py", line 990, in load_suiterc
	    share_dir=self.suite_share_dir,
	  File "/home/sutherlander/repos/cylc8/lib/cylc/config.py", line 561, in __init__
	    self.load_graph()
	  File "/home/sutherlander/repos/cylc8/lib/cylc/config.py", line 2104, in load_graph
	    f"ERROR, "
	cylc.config.SuiteConfigError: 'ERROR, xtrigger function not found: not_a_function'
2019-04-01T17:39:40+13:00 ERROR - error caught: cleaning up before exit
2019-04-01T17:39:40+13:00 INFO - Suite shutting down - ERROR: 'ERROR, xtrigger function not found: not_a_function'

@dwsutherland dwsutherland self-assigned this Apr 1, 2019
@dwsutherland dwsutherland requested review from matthewrmshin, hjoliver and kinow and removed request for hjoliver April 1, 2019 05:02
@cylc cylc deleted a comment Apr 1, 2019
@matthewrmshin matthewrmshin added this to the cylc-8.0a1 milestone Apr 1, 2019
@matthewrmshin matthewrmshin added the bug Something is wrong :( label Apr 1, 2019
@matthewrmshin
Copy link
Contributor

Looks like CI test failure is expected after this change? Is it possible to add the required dependency so the test can pass?

@kinow
Copy link
Member

kinow commented Apr 1, 2019

Funny, @dwsutherland didn't add any Kafka dependency in his commits. I wonder why Travis is complaining about it now :S

@matthewrmshin
Copy link
Contributor

@dwsutherland can correct me if I am wrong.

It probably wasn't checking for it before? If I understand this change correctly, it now checks that the xtrigger function is correctly imported and callable. So if Kafka is not installed, this will cause validation to fail?

@dwsutherland
Copy link
Member Author

@dwsutherland can correct me if I am wrong.

It probably wasn't checking for it before? If I understand this change correctly, it now checks that the xtrigger function is correctly imported and callable. So if Kafka is not installed, this will cause validation to fail?

That's correct, it wasn't checking those before. I'll have a closer look at the failures..

@dwsutherland
Copy link
Member Author

dwsutherland commented Apr 1, 2019

So that 04-builtin-suites.t, gathers all suites and validates them... But it also includes that kafka suite:
https://github.com/cylc/cylc/blob/master/etc/examples/xtrigger/kafka/consumer/suite.rc

So why do we include test suites where the software isn't available?
I can understand having it as an example... But as the xtriggers are run by the suite, they need to be in the scope of validation.

So should certain suites be excluded from tests, or the tests include more software?

@hjoliver
Copy link
Member

hjoliver commented Apr 1, 2019

Ah, that's exactly why the kafka xtrigger example was kept under etc/dev-suites/ until very recently. I was expecting an examples test to fail when we moved it to etc/examples/ and deleted etc/dev-suites/, but it didn't and I didn't look deeper at the time. Sorry! But your change has explained why, which is good (we only validate example suites in test battery, not run them ... but now validation looks for the trigger function). Lemme just think about what to do here...

@dwsutherland
Copy link
Member Author

Ah, that's exactly why the kafka xtrigger was kept under etc/dev-suites/ until very recently. I was expecting an examples test to fail when we moved it to etc/examples/ and deleted etc/dev-suites/, but it didn't and I didn't look deeper at the time.

I guess if it only validated them (not ran them), then this wouldn't have been the cause of hanging CI tests.

@hjoliver
Copy link
Member

hjoliver commented Apr 1, 2019

(have been asked to attend to late timesheets first, grrr...)

@kinow
Copy link
Member

kinow commented Apr 1, 2019

@hjoliver maybe we can simply ignore kafka suites for now in the validate test, adding a TODO to revisit it later and decide what to do about it. Would be nice to test it later if possible I think.

This diff makes the tests pass in my environment. It ignores the Kafka almost in the same manner it ignores already empy tests if no dependency installed (tests which are broken by the way, sending a pull request for it in a bit).

diff --git a/tests/validate/04-builtin-suites.t b/tests/validate/04-builtin-suites.t
index dd773fbb4..2622189fc 100755
--- a/tests/validate/04-builtin-suites.t
+++ b/tests/validate/04-builtin-suites.t
@@ -53,6 +53,11 @@ for suite in ${SUITES[@]}; do
         skip 2 "${TEST_NAME}: EmPy not installed"
         continue
     fi
+    # ignore kafka suites TODO: revisit it later (kafka dependency)
+    if [[ $suite == *"kafka"* ]]; then
+        skip 2 "${TEST_NAME}: Skipping suites using Kafka dependencies"
+        continue
+    fi
     run_ok "${TEST_NAME}" cylc validate "${suite}" -v
     filter_warnings "${TEST_NAME}.stderr"
     cmp_ok "${TEST_NAME}.stderr.processed" /dev/null

Running:

$ cylc test-battery ./tests/validate/04-builtin-suites.t
./tests/validate/04-builtin-suites.t .. ok       
All tests successful.
Files=1, Tests=126, 22 wallclock secs ( 0.06 usr  0.00 sys + 20.22 cusr  2.03 csys = 22.31 CPU)
Result: PASS

@hjoliver
Copy link
Member

hjoliver commented Apr 1, 2019

@kinow - yes, that's exactly what I was thinking too.

@dwsutherland - can you implement the "ignorage" on this branch?

@dwsutherland
Copy link
Member Author

@kinow - yes, that's exactly what I was thinking too.

@dwsutherland - can you implement the "ignorage" on this branch?

Done. Thanks @kinow!
(let's see how it does)

@kinow
Copy link
Member

kinow commented Apr 1, 2019

(tests which are broken by the way, sending a pull request for it in a bit).

Funny, did a pip uninstall empy after EmPy tests failed, then after pip install empy re-running the test, the issue went away :s so no pull request, unless I see that test failing again

Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @dwsutherland.

(Approving pending Travis CI result...)

@hjoliver
Copy link
Member

hjoliver commented Apr 1, 2019

(Note that SuiteConfigError is changed by #2995 ... but this PR will probably go in first).

@dwsutherland
Copy link
Member Author

Yay .. chunk 4 passed.

Copy link
Member

@kinow kinow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ! 💥

@kinow kinow merged commit a07dcfe into cylc:master Apr 1, 2019
Copy link
Member

@hjoliver hjoliver left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we should have tests for this. An integration test would be easy enough (a suite that defines a non-existent trigger function should fail validation) but does this fall into the category of stuff we should really be using unit tests for? - can you advise @kinow?

@kinow
Copy link
Member

kinow commented Apr 1, 2019

Oh, just realized @matthewrmshin was also in the list of reviewers. Sorry if I merged it too quickly!

@kinow
Copy link
Member

kinow commented Apr 1, 2019

Actually we should have tests for this. An integration test would be easy enough (a suite that defines a non-existent trigger function should fail validation) but does this fall into the category of stuff we should really be using unit tests for? - can you advise @kinow?

Too late, just merged it, sorry! 😞

I think it could be tested with unit tests. We can verify that xtriggers work with the current tests, and do the negative tests with subunit + mocking IMHO.

oliver-sanders added a commit to oliver-sanders/cylc-flow that referenced this pull request Apr 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants