Skip to content
This repository was archived by the owner on Nov 23, 2017. It is now read-only.

Confusing error handling for KeyboardInterrupt #341

Open
sametmax opened this issue May 4, 2016 · 19 comments
Open

Confusing error handling for KeyboardInterrupt #341

sametmax opened this issue May 4, 2016 · 19 comments

Comments

@sametmax
Copy link

sametmax commented May 4, 2016

I reported this issue on the Python tracker but got no answer.

If you trigger KeyboardInterrupt in a coroutine and catch it, the program terminates cleanly:

import asyncio

async def bar():
    raise KeyboardInterrupt

loop = asyncio.get_event_loop()

try:
    loop.run_until_complete(bar())
except KeyboardInterrupt:
    print("It's ok")
finally:
    loop.stop()
    loop.close()

This outputs:

It's ok

However, if you wrap the coroutine in a Task, you will get a mixed behavior:


try:
    task = asyncio.ensure_future(bar())
    loop.run_until_complete(task)
except KeyboardInterrupt:
    print("It's ok")

This outputs:


It's ok
Task exception was never retrieved
future: <Task finished coro=<bar() done, defined at ki_bug.py:4> exception=KeyboardInterrupt()>
Traceback (most recent call last):
  File "ki_bug.py", line 10, in <module>
    loop.run_until_complete(main_future)
  File "/usr/lib/python3.5/asyncio/base_events.py", line 325, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 295, in run_forever
    self._run_once()
  File "/usr/lib/python3.5/asyncio/base_events.py", line 1258, in _run_once
    handle._run()
  File "/usr/lib/python3.5/asyncio/events.py", line 125, in _run
    self._callback(*self._args)
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "ki_bug.py", line 5, in bar
    raise KeyboardInterrupt
KeyboardInterrupt

We have several contradictory behaviors: the KeyboardInterrupt is raised, and captured by the future (since your can do task.exception() to suppress the stracktrace) but also catched by except while the program is allowed to continue, yet still the stack trace is displayed and eventually the program return code will be 0.

It's very confusing.

I believe it's because exceptions don't break out the event loop, but KeyboardInterrupt is having some kind special treatment.

@vxgmichel
Copy link

vxgmichel commented Jul 28, 2016

There is a bit of explanation in BaseEventLoop.run_until_complete:

        try:
            self.run_forever()
        except:
            if new_task and future.done() and not future.cancelled():
                # The coroutine raised a BaseException. Consume the exception
                # to not log a warning, the caller doesn't have access to the
                # local task.
                future.exception()
            raise

If the exception is an instance of BaseException but not Exception (such as KeyboardInterrupt), it bubbles up and reaches this try-except. If the task has been created inside run_until_complete (new_task is True), then the exception is consumed since there is no way to access the task before it is garbage collected.

The problem you reported could easily be fixed by removing new_task in the test. However, the raised BaseException might be different from the exception set in the task, so I guess it is more of a feature than a bug. Another way to address this issue would be to compare the exceptions:

        try:
            self.run_forever()
        except BaseException as exc:
            if future._exception is exc \
               # Is this line still necessary?
               or new_task and future.done() and not future.cancelled():
                # The coroutine raised a BaseException. Consume the exception
                # to not log a warning, the caller doesn't have access to the
                # local task.
                future.exception()
            raise

However, this part of the code is very central, so I guess any change here is a pretty big deal.

@sametmax
Copy link
Author

Thank you.

We manage to live with it, it's just debugging will confuse the hell out of new comers. Asyncio is pretty hard to understand, and even more to debug, so I believe we should do everything to make it easier. I have been working on an asyncio project for some month now, and I must say there is no way an average programmer would have the patience to dig into all the debugging sessions we went through.

@gvanrossum
Copy link
Member

gvanrossum commented Jul 30, 2016 via email

@vxgmichel
Copy link

@gvanrossum But doesn't it make sense for a KeyboardInterrupt to bubble up to loop.run_forever and stop the loop regardless of where the exception has been raised?

I'm thinking about how the PR #305 might break the servers relying on Ctrl+C to stop their execution. Consider the TCP echo server using streams from the asyncio documentation. If a KeyboardInterrupt is raised while waiting for a client to connect (in loop._run_once), the execution will stop as expected. But if the KeyboardInterrupt is raised while dealing with a client request (in the handle_echo coroutine), the loop will not stop ( and a Task exception was never retrieved message will be logged). This inconsistent behavior seems like a pretty serious issue.

@gvanrossum
Copy link
Member

That's the kind of thing we need to solve before accepting the PR. Do you
have a suggestion for a fix?

On Sunday, July 31, 2016, Vincent Michel notifications@github.com wrote:

@gvanrossum https://github.com/gvanrossum But doesn't it make sense for
a KeyboardInterrupt to bubble up to loop.run_forever and stop the loop
regardless of where the exception has been raised?

I'm thinking about how the PR #305
#305 might break the servers
relying on Ctrl+C to stop their execution. Consider the TCP echo server
using streams
https://docs.python.org/3.4/library/asyncio-stream.html#tcp-echo-server-using-streams
from the asyncio documentation. If a KeyboardInterrupt is raised while
waiting for a client to connect (in loop._run_once), the execution will
stop as expected. But if the KeyboardInterrupt is raised while dealing
with a client request (in the handle_echo coroutine), the loop will not
stop ( and a Task exception was never retrieved message will be logged).
This inconsistent behavior seems like a pretty serious issue.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#341 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/ACwrMgNweN2dEIArfWBzze_StbYddxSjks5qbIa_gaJpZM4IWvsV
.

--Guido (mobile)

@vxgmichel
Copy link

vxgmichel commented Jul 31, 2016

@gvanrossum
I like @Haypo's idea of setting a handler for SIGINT. A keyboard interrupt is an asynchronous event, and it makes a lot of sense to let the event loop deal with it. It also enforces the idea that the code between two await statements is safe and will never be interrupted.

The default SIGINT handler could stop the loop, and make sure that a KeyboardInterrupt exception is raised in loop.run_forever so the application can react and let the running coroutines terminate properly.

You mentioned the fact that a SIGINT won't be able to interrupt a tight CPU loop, but I don't really see that as a problem.

@asvetlov
Copy link

@gvanrossum I don't see any harm from #305
aiohttp (and all other libraries supported by me) definitely doesn't change own behavior after PR appliance.

@1st1
Copy link
Member

1st1 commented Jul 31, 2016

You mentioned that a SIGINT won't be able to interrupt a tight CPU loop, but I don't really see that as a problem.

@vxgmichel This is a big no-no. In the first version of uvloop I did exactly this -- handle SIGINT and let the loop to handle it asynchronously. It was completely unusable. Turns out people write tight loops quite frequently, and inability to stop your Python program with Ctrl-C is something they aren't prepared to handle at all. Now I'm using PyErr_SetInterrupt to actually interrupt any Python code on SIGINT to replicate asyncio behaviour.

@gvanrossum
Copy link
Member

gvanrossum commented Aug 1, 2016 via email

@vxgmichel
Copy link

I crafted another example to demonstrate that some programs might get non-interruptible after PR #305. In this example, the background task is most likely to catch and ignore the first Ctrl-C (the second one will interrupt the program though).

@1st1

[...] inability to stop your Python program with Ctrl-C is something they aren't prepared to handle at all.

That is also what gets me worried about PR #305. I'd sum it up this way:

Ctrl-C stops the loop, ...

@gvanrossum
Copy link
Member

gvanrossum commented Aug 1, 2016 via email

@sametmax
Copy link
Author

sametmax commented Aug 2, 2016

There is an alternative possibility:

  • catch all exceptions;
  • add a system to stop the loop and reraise some exceptions;
  • add a configuration system to chose the exceptions to catch and reraise. We should have hooks to create custom handlers for thoose;
  • make the default conf values to be catching keyboard interrupt and reraise it;
  • then document that and let the user choose the behavior;

Having a generic middleware system to handle exception and decide which one stop the loop, which one you log and which one you silent would be a nice thing anyway.

@1st1
Copy link
Member

1st1 commented Aug 2, 2016

catch all exceptions;
add a system to stop the loop and reraise some exceptions;

This sounds right.

add a configuration system to chose the exceptions to catch and reraise. We should have hooks to create custom handlers for thoose;

I'm not sure I like this. What's the use case? What if you have two asyncio libs requiring different configs?

@sametmax
Copy link
Author

sametmax commented Aug 2, 2016

Two async libs can already change the event loop, the policy and some the signal handlers. There is nothing to stop conflicting libs to screw with each others except documentation and good will.

The use case for this is the creation of a framework, and the definition of the error handling policy for this framework : logging, recovery, debugging, etc.

E.G:

On the project we are working now, we want to make debugging in dev mode very easy, especially for beginners. So we go to great lengths to detect common or weird errors and ensure the user can easily spot them, understand the problem and choose a fix.

Unfortunately, with the current architecture, this is complicated, we had to:

And yet, while I think 30% of our unit tests are for those (https://github.com/Tygs/tygs/tree/master/tests), we have still some problems such as some errors being silent in unit tests with pytest.

And after all that, and 100% code coverage, we are maybe at 40% of the feature we expect for v0.1, as this took us many more iterations than we expected to get right. Not complaining, I'm glad we have asyncio in Python, and I'm super excited about being able to code this, but let's use the opportunity to make the next people's life easier.

Error handling, life cycle and debugging are really a key component to get right in asyncio, especially since the handling of the loop is manual, explicit and open to wind. NodeJS dev don't have to worry about half of that.

@1st1
Copy link
Member

1st1 commented Aug 2, 2016

@sametmax Thanks for the explanation. I'll take a closer look at #305 (probably will have to be rewritten) in a few days.

And yet we have still some problems such as some errors being silent in unit tests with pytest.

Yeah, I saw that with pytest. Was wondering what's going on but didn't have time to investigate in detail.

@dbivolaru
Copy link

I ran into similar issues as @sametmax.

The principles I used, in short:

  • Two types of tasks: dependent and independent; the independent ones (started by an ensure_future wrapper) are appended to a list; dependent tasks started either with gather() or coming from main are used for synchronous jobs (eg REQ-REP scenarios); independent ones are used for asynchronous communications (XREQ-XREP, PUB-SUB)
  • An exception handler which checks every 1 sec if there are any done() tasks in the tracked list and displays the exception (does not raise it) and then removes that task from the list
  • Use a state machine with { running, soft-stopping, communication-stopping, hard-stopping, stopped }
    • running is the default when starting the loop
    • signal-handler for SIGTERM, SIGINT transitions into soft-stop and sets a flag to notify all tasks they need to terminate gracefully pretty soon (including the exception handler and socket close)
    • soft-stopping transitions (sleep) into communication-stopping after a TIMEOUT (here we destroy the ZMQ Context)
    • communication-stopping transitions (sleep) into hard-stopping after a different TIMEOUT (here we cancel all independent tasks which are not yet done())
    • hard-stopping transitions (call_later) into stopped after a different TIMEOUT (by now the dependent tasks should have finished properly and we call loop.stop())
  • The main coroutine starts the exception handler, gathers() one co-routine passed as parameter to be run, then gathers() the exception handler in case we get an exception in the exception handler and finally if still in running mode, starts the whole unwinding procedure above (without the loop.stop)
  • The main coroutine is ran by loop.run_until_complete() inside a try-catch-finally block where we finally display any exceptions that the exception handler might have missed since the soft-stopping started until the end of the winding down; the catch pass'es on any Cancelled exceptions
  • At the end i do a loop.close() and sys.exit(0) just in case there's something still living

@sametmax
Copy link
Author

The more I think about it, the more I realise the event loop is kinda half a state machine. And each of us a reinventing the rest, including defined states, event propagation, error handling, etc., so we can have a clean use of asyncio.

@gvanrossum
Copy link
Member

This thread is running off the rails. Asyncio is a much lower level than nodejs, like it or not. Somebody should probably write an opinionated framework for writing servers on top of it, but such a framework does not belong in the stdlib -- just like a web framework does not belong in the stdlib even though sockets are. Different groups may write, promote and adopt different frameworks, just like for web frameworks we have Pyramid, Django, Flask, and many others. (And if you want to discuss this more, please start a new issue so it doesn't add noise to this issue. Or go to python-ideas.)

Re-focusing on KeyboardInterrupt, I think the deep problem is due to the way signals work and are expected to work. When you have a callback that somehow doesn't return, e.g. because it contains while True: pass, the signal is the only way to break through that (and note that Python itself already does some hackery here to ensure the exception is raised at the boundary between two opcodes). Here loop.add_signal_handler() would prevent desirable functionality, since it would make such a while-True loop essentially unstoppable.

But when the event loop or some other thing (e.g. a future or queue) is updating its own state we probably don't want the signal to interrupt, and there the behavior of loop.add_signal_handler() is desirable. (Which, BTW, at the scale of callbacks, does the same thing that Python itself does at the scale of opcodes -- delaying the "user" code to handle the signal until a clean breaking point.)

Maybe we should mask SIGINT except when callbacks are running or when we're blocked in the selector? And maybe Task._setup() should not call set_exception() when it catches BaseException? (Perhaps instead it should cancel the task?) Then SIGINT will still interrupt callbacks, and it will also interrupt I/O waits, and it can then just be raised out of run_forever() or run_until_complete() without changing any internal state.

That's still thinking aloud, but with a focus on the known use cases for ^C.

@pior
Copy link

pior commented Aug 3, 2017

Would it be possible to get an update on this issue?
Did some of the described behaviours changed since then?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants