Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Catch CancelledError in scheduler_cli run function #3685

Closed
wants to merge 1 commit into from

Conversation

kinow
Copy link
Member

@kinow kinow commented Jul 14, 2020

This is a small change with no associated Issue.

I am using the workflow five with Cylc UI, on Python 3.8. And it's really inconvenient having to press CTRL+C twice (and often I forget it). So I did a quick debugging before lunch to see what was going on.

Apparently, the scheduler_cli run function now sees a CancelledError when I hit the CTRL+C (in PyCharm you just need to debug with the option to emulate a terminal, so that the signals are sent and trigger the debugger).

image

(image above is where the exception was [re]raised, in the image below I added a global except + traceback to see what was going on, as the first time I tried it, the finally block got ret=0, even though there was an error).

image

I think the change in Python 3.8 that is related to this issue is bpo-32528. Implemented in this commit.

The CancelledError would be caught in the except Exception block before Python 3.8. But now it passes through the scheduler_cli try/except error handling. Actually any BaseException error will do the same, and ret will be 0, which I think is not desirable (not sure, but not the main issue here I guess)?

I am not sure if this is the right fix. Even though this works, it's not clear to me why I'm getting the CancelledError when hitting CTRL+C instead of KeyboardInterrupt. Just thought it could be a good starting point for others to troubleshoot and investigate this issue 👍 (and works for me as a temporary local-patch so I can continue working on Cylc UI)

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Already covered by existing tests.
  • No change log entry required (why? e.g. invisible to users).
  • No documentation update required.

@kinow
Copy link
Member Author

kinow commented Jul 14, 2020

Here's the output, when you have to press CTRL+C twice in the terminal window.

2020-07-15T11:43:55+12:00 INFO - Final point: None
2020-07-15T11:43:55+12:00 INFO - [prep.20130808T0000+12] -submit-num=01, owner@host=ranma
2020-07-15T11:43:55+12:00 INFO - [prep.20130808T0000+12] -triggered off []
^CTraceback (most recent call last):
  File "/home/kinow/Development/python/workspace/cylc-flow/venv/bin/cylc-run", line 33, in <module>
    sys.exit(load_entry_point('cylc-flow', 'console_scripts', 'cylc-run')())
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/scripts/cylc_run.py", line 22, in main
    scheduler_main(is_restart=False)
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/scheduler_cli.py", line 446, in main
    return run()  # pylint: disable=E1120
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/terminal.py", line 209, in wrapper
    wrapped_function(*wrapped_args, **wrapped_kwargs)
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/scheduler_cli.py", line 470, in run
    return scheduler_cli(parser, options, args, is_restart=False)
  File "/home/kinow/Development/python/workspace/cylc-flow/cylc/flow/scheduler_cli.py", line 362, in scheduler_cli
    ret = asyncio.run(
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/asyncio/base_events.py", line 603, in run_until_complete
    self.run_forever()
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/asyncio/base_events.py", line 1823, in _run_once
    event_list = self._selector.select(timeout)
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2020-07-15T11:43:57+12:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
^CException ignored in: <module 'threading' from '/home/kinow/Development/python/anaconda3/lib/python3.8/threading.py'>
Traceback (most recent call last):
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/threading.py", line 1388, in _shutdown
    lock.acquire()
KeyboardInterrupt:

@hjoliver
Copy link
Member

I see CancelledError appears like this in the Python 3.7 case:

...
        KeyboardInterrupt
        During handling of the above exception, another exception occurred:
        Traceback (most recent call last):
          File "/home/oliverh/cylc/cylc-flow/cylc/flow/scheduler.py", line 589, in start_scheduler
            await self.main_loop()
          File "/home/oliverh/cylc/cylc-flow/cylc/flow/scheduler.py", line 1639, in main_loop
            await asyncio.sleep(duration)
          File "/home/oliverh/miniconda3/envs/cylc37/lib/python3.7/asyncio/tasks.py", line 595, in sleep
            return await future
        concurrent.futures._base.CancelledError
...

@hjoliver
Copy link
Member

hjoliver commented Jul 15, 2020

Trying to reproduce this in a simple asyncio test program. I get the same result for both Python 3.7 and 3.8, which is (I think): KeyboardInterrupt gets caught at a low level, then coroutines get "cancelled" and CancelledError gets raised instead - so with asyncio we can't catch KeyboardInterrupt anymore (at a high level).

The only obvious difference between 3.7 and 3.8 is:

  • (3.7) concurrent.futures._base.CancelledError
  • (3.8) asyncio.exceptions.CancelledError

However this doesn't quite gel with what I'm seeing in Cylc ...

@hjoliver
Copy link
Member

hjoliver commented Jul 15, 2020

Hang on, I'm confused. I can catch KeyboardInterrupt at the highest level (outside the event loop). This works:

try:
    asyncio.run(main())
except KeyboardInterrupt:
    ...

@dwsutherland
Copy link
Member

Hang on, I'm confused. I can catch KeyboardInterrupt at the highest level (outside the event loop). This works:

try:
    asyncio.run(main())
except KeyboardInterrupt:
    ...

Yeah we do similar in cylc subscribe script

@hjoliver
Copy link
Member

Interesting Issue on the (now-archived) python/asyncio repo:

python/asyncio#341 ("Confusing error handling for KeyboardInterrupt")

Near the end, GVM himself says This thread is running off the rails... 🤣

@hjoliver
Copy link
Member

hjoliver commented Jul 15, 2020

Yeah we do similar in cylc subscribe script

But why/how does this work at all:

try:
    asyncio.run(main())
except KeyboardInterrupt:
    ...

when you can't catch KeyboardInterrupt inside async def run() ?

(Although, in Cylc's scheculer_cli.py we essentially are doing what I've just said (on the basis of a small test program) that you can't do ... hence 😕 )

@hjoliver
Copy link
Member

KeyboardInterrupt might be "complicated" in asyncio? E.g.: ethereum/py-evm#385 (comment)

@dwsutherland
Copy link
Member

I suppose it hits the async loop and not the coroutine(s) it's running (?)

@hjoliver
Copy link
Member

Summary of "Graceful Shutdown" section of this presentation https://asyncio-wrong.herokuapp.com/#/


TL;DR: Graceful Shutdown:
- try/except/finally isn't enough
- Use signal handlers
- Listen for the appropriate signals

@kinow
Copy link
Member Author

kinow commented Jul 15, 2020

@hjoliver I think this is closer to what is happening in our case:

#!/usr/bin/env python

import asyncio


# same as scheduler_cli#_run() here
async def _run():
    try:
        # this actually be scheduler.run(), but we just loop+sleep here
        while True:
            print("oi")
            await asyncio.sleep(1)
    except KeyboardInterrupt:
        print("BYE! We got the right interrupt")
        return
    except:
        print("HUH! Where's the KeyboardInterrupt?")
        raise


def main():
    # same as done in scheduler_cli() here
    try:
        # there's no try/except in our code here actually, only in the _run function I think
        # adding just for testing
        asyncio.run(_run(), debug=True) 
    except asyncio.exceptions.CancelledError:
        print('Got a CancelledError!')
    except KeyboardInterrupt:
        print("Here's the KeyboardInterrupt")
    except:
        print('OH NO!')


if __name__ == '__main__':
    main()

I think the problem is that KeyboardInterrupt is being caught, but within the task/future running in the loop. And that appears to generate a KeyboardInterrupt wrapped in a CancelledError (not sure the nomenclature, in Java it would be a wrapped exception, or nested exception, etc).

Here's what I get running it with Python 3.8:

oi
oi
^CHUH! Where's the KeyboardInterrupt?
Here's the KeyboardInterrupt

@oliver-sanders
Copy link
Member

Hmmm,

Sadly I don't think this is the right solution, and it looks like it's breaking cylc run in some strange way.

I can shine a little light on the traceback:

  File "/home/kinow/Development/python/anaconda3/lib/python3.8/asyncio/base_events.py", line 1823, in _run_once
    event_list = self._selector.select(timeout)
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/selectors.py", line 468, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2020-07-15T11:43:57+12:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
^CException ignored in: <module 'threading' from '/home/kinow/Development/python/anaconda3/lib/python3.8/threading.py'>
Traceback (most recent call last):
  File "/home/kinow/Development/python/anaconda3/lib/python3.8/threading.py", line 1388, in _shutdown
    lock.acquire()
KeyboardInterrupt:

This occurs when you have a thread which includes a sleep statement (There's a cpython bug report for this out there somewhere). This caused me problems in Python3.7 but the asyncio.run implementation fixed the issues, sadly it looks like the problems resurfaced in Python3.8 (did the asyncio.run implementation change?).

So I think this is due to the threads we start in the scheduler which are:

  • server
  • publisher
  • zmq curve authenticator

(any I've missed?)

The Proper Solution?

I think we probably shouldn't be casually spawning threads in the async world, we should probably run synchronous code via a thread pool executor thinggy. This would involve removing the threading logic from the server and publisher and switching to a non-threaded version of the zmq curve authenticator.

The Hacky Solution?

If this is too much of a headache there might be a temporary workaround by rearranging the logic so that the KeyboardInterrupt gets caught at the higher level then do asyncio.run(scheduler.shutdown()).

@hjoliver
Copy link
Member

@oliver-sanders you may be right, but I'm still curious to understand the seemingly weird behaviour of KeyboardInterrupt that I was talking about above (and is illustrated by @kinow 's example too) which involves no additional threads. Not sure about @kinow 's explanation though, as Python doesn't have "wrapped exceptions" (I think?), and even if it did why wouldn't the exception be the same at both levels in the code?

@hjoliver
Copy link
Member

hjoliver commented Jul 15, 2020

A variation on the example above:

import asyncio
import sys

async def run():
    try:
        while True:
            print("x", end='')
            sys.stdout.flush()
            await asyncio.sleep(1)
    except KeyboardInterrupt:
        print("KeyboardInterrupt")
        raise
    except asyncio.exceptions.CancelledError:
        print("CancelledError")
        raise
    except:
        print("Exception")
        raise

def main():
    try:
        asyncio.run(run(), debug=True) 
    except KeyboardInterrupt:
        print("Got KeyboardInterrupt")
    except asyncio.exceptions.CancelledError:
        print('Got a CancelledError!')
    except:
        print('Got Exception')

if __name__ == '__main__':
    main()

If I run this (Python 3.8) I get:

xxxxxxxx^CCancelledError
Got KeyboardInterrupt

My guess is something like this: KeyboardInterrupt is caught by the event loop which then cancels coroutines with CancelledError ... and then it (the event loop) kills itself by raising KeyboardInterrupt. (This might be what @dwsutherland meant above by "I suppose it hits the async loop and not the coroutine") If so, we can't catch KeyboardInterrupt inside a coroutine - which it looks like we are trying to do in scheduler_cli.py at the moment??

I guess the example above looks weird to the non-asyncio eye, but only because the hidden event loop is doing hidden things with KeyboardInterrupt ... seems like that sort of thing should be highlighted in Python docs...

@@ -417,7 +417,7 @@ async def _run(parser, options, reg, is_restart, scheduler):
# stop cylc stop
except SchedulerError:
ret = 1
except KeyboardInterrupt as exc:
except (KeyboardInterrupt, asyncio.exceptions.CancelledError) as exc:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of example code and discussion on this issue, we can't catch KeyboardInterrupt here at all (whether or not catching CancelledError is the right solution).

@oliver-sanders
Copy link
Member

See the related #3564 (comment).

@oliver-sanders oliver-sanders added this to the cylc-8.0a3 milestone Jul 15, 2020
@kinow
Copy link
Member Author

kinow commented Jul 15, 2020

See the related #3564 (comment).

I think it was from that comment, or from that issue, that I remembered we had threads somewhere. When I hit CTRL+C, there are 2 threads running, one with zmq publisher I think. The other thread the IDE was not able to open/display (could be an IDE thread maybe?).

And hitting CTRL+C again, there's a threading related message. So @oliver-sanders might be on the right track. Happy to close this one once we have another solution (for the moment I'm using this patch just so I don't forget to close the workflow when looking at Cylc UI)

@hjoliver
Copy link
Member

I think @oliver-sanders is on the right track in terms of the other threads messing with shutdown, but we're also on the right track in that the current catching of KeyboardInterrupt doesn't work (and that's regardless of threads). I wonder if catching CancelledError is OK as an interim solution?

@kinow
Copy link
Member Author

kinow commented Jul 15, 2020

but we're also on the right track in that the current catching of KeyboardInterrupt doesn't work (and that's regardless of

It looks really strange that that exception is happening and not being caught. At least with Python 3.8. Though it's confusing for me how it was working with Python 3.7.

I wonder if catching CancelledError is OK as an interim solution?

+1


(digressing I think)

@hjoliver from your presentation linked yesterday in Element, there was a slide saying something how asyncio.gather would swallow exceptions by default. I see it has a return_exceptions=False by default, and we use that in Cylc Flow:

If intentional we can add a comment to the code using it, and/or make it explicit with return_exceptions=False.

@hjoliver
Copy link
Member

hjoliver commented Jul 16, 2020

It looks really strange that that exception is happening and not being caught.

I think the reason for that must be as I've described above: the event loop catches KeyboardInterrupt and raises it to the top level code (that starts the event loop) but it just "cancels" the coroutines and they see CancelledError instead.

@hjoliver
Copy link
Member

@hjoliver from your presentation linked yesterday in Element, there was a slide saying something how asyncio.gather would swallow exceptions by default. I see it has a return_exceptions=False by default, and we use that in Cylc Flow:

Yes that's one of the things that we'll need to be aware of and use properly...

@dwsutherland
Copy link
Member

Read through that presentation (it's great!), need to have a play with it 🚀

I think we probably shouldn't be casually spawning threads in the async world, we should probably run synchronous code via a thread pool executor thinggy. This would involve removing the threading logic from the server and publisher and switching to a non-threaded version of the zmq curve authenticator.

Agreed.
However, and I think I mentioned somewhere else, we can probably drop running the publisher in a separate thread as I don't think there's any blocking code in there now (?)

@kinow
Copy link
Member Author

kinow commented Nov 20, 2020

While this might help with #3946 , I suspect there is a better fix. Closing 👍

@kinow kinow closed this Nov 20, 2020
@hjoliver hjoliver modified the milestones: cylc-8.0a3, cylc-8.0b0 Feb 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants