Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

custom xtriggers: undesired behaviour w/ illegal dict #3237

Open
sadielbartholomew opened this issue Jul 24, 2019 · 8 comments
Open

custom xtriggers: undesired behaviour w/ illegal dict #3237

sadielbartholomew opened this issue Jul 24, 2019 · 8 comments
Labels
bug Something is wrong :(
Milestone

Comments

@sadielbartholomew
Copy link
Collaborator

In the results dictionary returned in the satisfied signature (True, results) from a custom external trigger function, certain "bad" dictionary values will cause the suite to shut down.

I will at some point survey the nature of what constitutes a "bad" value; possibly they are just values which are other structures such as (further) dictionaries (forming an overall nested results dict), as in the example used here.

We can document that results must be in a certain form, but can we also catch such errors for invalid return values from xtriggers so that they do not propagate & cause strange task states & suite shutdown, as described below? Validating the environment variables and their values produced by the results dict to ensure they are sound before passing them to downstream tasks seems like the ideal solution, but that may be too tricky (or inefficient, etc.) to set up.

Release version(s) and/or repository branch(es) affected?
master branch i.e. 8.0a0 (+ earlier versions).

Steps to reproduce the bug

Example with dictionary values in results

Using the xtrigger given in cylc/cylc-doc#41 in this example suite, which with the "nested" argument returns results dict values that are themselves dictionaries:

[scheduling]
    initial cycle point = now
    [[xtriggers]]
        breakable_xtrigger = breakable_script("nested")
    [[dependencies]]
        [[[PT1M]]]
            graph = """
                    @breakable_xtrigger => wait_around
                    to_get_some_cycles_going =>  wait_around
                    """

produces the logically wrong task state even for the independent task to_get_some_cycles_going, as shown in the screenshot below, & a KeyError shuts down the suite, with the following in the log/suite/log:

2019-07-24T13:48:09+01:00 INFO - [to_get_some_cycles_going.20190724T1348+01] status=running: (received)succeeded at 2019-07-24T13:48:09+01:00 for job(01)
2019-07-24T13:48:10+01:00 ERROR - 'breakable_xtrigger_dictA'
	Traceback (most recent call last):
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/scheduler.py", line 254, in start
	    self.run()
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/scheduler.py", line 1532, in run
	    self.process_task_pool()
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/scheduler.py", line 1169, in process_task_pool
	    self.suite, itasks, self.config.run_mode('simulation'))
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/task_job_mgr.py", line 199, in submit_task_jobs
	    prepared_tasks, bad_tasks = self.prep_submit_task_jobs(suite, itasks)
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/task_job_mgr.py", line 175, in prep_submit_task_jobs
	    check_syntax=check_syntax)
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/task_job_mgr.py", line 758, in _prep_submit_task_job
	    poverride(rtconfig, overrides, prepend=True)
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/parsec/util.py", line 225, in poverride
	    poverride(target[key], val, prepend)
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/parsec/util.py", line 225, in poverride
	    poverride(target[key], val, prepend)
	  File "/net/home/h06/sbarth/cylc.git/cylc/flow/parsec/OrderedDict.py", line 42, in __getitem__
	    return OrderedDict.__getitem__(self, key)
	KeyError: 'breakable_xtrigger_dictA'
2019-07-24T13:48:10+01:00 CRITICAL - Suite shutting down - 'breakable_xtrigger_dictA'

Expected behavior
A above, I'm not sure the best way to manage this, but there should be way to prevent bad xtrigger return values from making the suite behave strangely in runtime (not just for downstream tasks) &/or shutting down.

Screenshots

xtrigger_nested

@hjoliver
Copy link
Member

hjoliver commented Jul 25, 2019

@sadielbartholomew - see my comments in #3232

If an xtrigger function returns an illegal result, it is technically broken and your suite is presumably going to stall with tasks waiting on it. (Because we - correctly - don't treat a broken xtrigger as "satisfied").

So what to do about that? We could:

  1. have the suite shut down cleanly, with warnings/errors about the broken function
  2. or keep going, even if stalled, but issue prominent warnings/errors (in the suite log at Cylc 7; hopefully more in the Cylc 8 UI)

Number 2. may be preferable so long as you can fix the trigger function in the live suite, but I don't think that 1. is "strange behaviour" 😁

@hjoliver
Copy link
Member

(Sorry, I see your "strange behaviour" comment doesn't relate to the suite shutdown).

@hjoliver
Copy link
Member

hjoliver commented Jul 25, 2019

Re-reading more carefully (sorry)! I guess we need to distinguish between "broken" xtrigger functions that return totally illegal results (i.e. anything but (T/F, dict)) and ones that return success (T, dict) but with an illegal results dict.

In the latter case, it is correct that the downstream task triggered (because the trigger was satisfied) but the illegal results dict caused problems.

@trwhitcomb
Copy link
Collaborator

I ran into this as well while trying to write some custom xtrigger functions. In my case, the suite didn't stall because the tasks were waiting on it (as @hjoliver) but rather crashed in a json decode that led to the suite controller itself becoming unresponsive - probably not the intended behavior. I'm OK with the second option above, but having an xtrigger function return totally illegal results shouldn't freeze things up.

@hjoliver hjoliver added bug Something is wrong :( and removed bug? Not sure if this is a bug or not labels Aug 9, 2019
@hjoliver hjoliver modified the milestones: cylc-8.0.0, next-release Aug 9, 2019
@sadielbartholomew
Copy link
Collaborator Author

Thanks for reporting your observations @trwhitcomb. That is definitely a prompt for us to investigate further. In general I agree that a bad dict in (T, dict) should not cause an unclean suite shut down, & certainly not a freeze, but @hjoliver's outlined case:

  1. have the suite shut down cleanly, with warnings/errors about the broken function

is correct, so would not be a bug. Actually when I wrote up the Issue (probably only very clearly, sorry!) as well as the shutdown it appeared that they were tasks emerging or persisting in states that were not right, as depicted somewhat in the Cylc Review screenshot, but I still need to the bottom of the 'strangeness' occurring in that case.

Also, for the clean shutdown, I think a specific error regarding an illegal xtrigger return dict would be much better than the less-transparent ultimate error, e.g. the KeyError in the case from my opening comment.

To go forward, I'll try out various cases for an illegal dict & report here on the resulting suite behaviour in each case. Then we can discuss what is correct/acceptable & what needs changing. I'll try to recreate the freeze as you described, which will need a fix.

(Sorry @hjoliver for the delay in responding to your previous comments, & those on the other xtrigger Issue I raised at a similar time, #3232, when I went back to look into these after some priority assigned work, I ran into #3275 which seemed the most pressing xtrigger aspect to address!)

@sadielbartholomew sadielbartholomew changed the title custom xtriggers: bad results values kill suite custom xtriggers: undesired behaviour w/ illegal dict Aug 10, 2019
@hjoliver
Copy link
Member

hjoliver commented Aug 11, 2019

@sadielbartholomew @trwhitcomb - yes, agreed that:

  • illegal xtrigger return values should not cause an unclean shutdown or the server to become unresponsive
  • in case of an illegal return value being detected, we should either:
    • shut down cleanly, with a clear error message
    • OR stay up, but log a clear error message (in this case, we should perhaps highlight the xtrigger node in the UI in red (say) to indicate its brokenness ... but that might be a longer term enhancement)

@matthewrmshin
Copy link
Contributor

I vote for suite staying up and remain responsive in all cases. The events should be logged as errors like task failures.

@matthewrmshin matthewrmshin modified the milestones: cylc-7.8.4, cylc-8.0.0 Aug 28, 2019
@oliver-sanders
Copy link
Member

Should be closed with #3497

@hjoliver hjoliver modified the milestones: cylc-8.0.0, cylc-8.x Aug 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something is wrong :(
Projects
None yet
Development

No branches or pull requests

5 participants