Eventloop scheduling improvements for stop_on_error_timeout and schedule_next #1212

jdranczewski · 2024-02-13T17:55:49Z

Fixes #1202 by introducing a timed exit to the Tk (on Linux) and Qt (on Windows and Linux) eventloops so schedule_stop_aborting can fire after stop_on_error_timeout in the kernel's main io_loop.

This only affects the eventloops listed: the others periodically hand control back to the kernel, whereas Tk (on Linux) and Qt normally wait for a ZMQ socket event before returning to the kernel. The stop_on_error_timeout expiring does not produce a socket event, so the scheduled callback never gets to run, and subsequent requests are aborted when they should not be.

The proposed solution introduces an optional _schedule_exit method to eventloops that need it, and makes sure it is called if an eventloop exit is needed for the stop_on_error_timeout to fire.
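To illustrate the shape of this opt-in hook, here is a minimal sketch. `SketchKernel`, `loop_blocking`, `loop_polling` and `exit_requests` are hypothetical names invented for the example, and the io_loop is reduced to a list of delays. It only shows the idea that the kernel looks up an optional `_schedule_exit` attribute on the active eventloop and calls it with the timeout; it is not the actual ipykernel code.

```python
from typing import Callable, List, Optional


class SketchKernel:
    """Hypothetical, heavily simplified stand-in for ipykernel's kernel.

    Models only the decision in _abort_queues: schedule the stop-aborting
    callback, and if the active eventloop defines the optional
    _schedule_exit hook, ask it to hand control back to the io_loop.
    """

    def __init__(self, eventloop: Optional[Callable] = None,
                 stop_on_error_timeout: float = 0.0):
        self.eventloop = eventloop
        self.stop_on_error_timeout = stop_on_error_timeout
        # Delays at which schedule_stop_aborting would run (0.0 = immediately).
        self.scheduled: List[float] = []

    def _abort_queues(self) -> None:
        if not self.stop_on_error_timeout:
            self.scheduled.append(0.0)
            return
        # Stand-in for io_loop.call_later(timeout, schedule_stop_aborting).
        self.scheduled.append(self.stop_on_error_timeout)
        # The hook is optional: only loops that block until a ZMQ event need it.
        schedule_exit = getattr(self.eventloop, "_schedule_exit", None)
        if schedule_exit is not None:
            schedule_exit(self.stop_on_error_timeout)


exit_requests: List[float] = []


def loop_blocking(kernel):
    """Hypothetical loop that only returns on a socket event (like Qt)."""


def loop_polling(kernel):
    """Hypothetical loop that periodically returns to the kernel by itself."""


# Opt in to the hook by attaching it as a function attribute.
loop_blocking._schedule_exit = exit_requests.append

SketchKernel(loop_blocking, stop_on_error_timeout=0.5)._abort_queues()
SketchKernel(loop_polling, stop_on_error_timeout=0.5)._abort_queues()
print(exit_requests)  # [0.5]
```

Only the blocking loop receives an exit request; loops without the attribute are left untouched, which keeps the change local to the eventloops that need it.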

See issue ipython#1202 for the _schedule_exit reasoning. Replaced `stream` as an argument with a direct reference to `kernel.shell_stream`, simplifying the code. `stream` was an argument to support multiple `kernel.shell_streams`, which has since been deprecated (see e719892).

The default QTimer type is coarse, so it can fire anywhere within ±5% of the specified interval. We _need_ it to fire after the specified delay so that we exit _after_ schedule_stop_aborting is due. The QTimer therefore needs to be a PreciseTimer. For small values of stop_on_error_timeout I found the QTimer _still_ fired up to 5 ms too early, so added a 10 ms delay offset. The singleShot signature is inconsistent between PySide and PyQt, so we store a timer object instead, similar to ipython#990.
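The arithmetic behind the PreciseTimer and the offset can be sketched without Qt. For a 1 s timeout, a ±5% coarse timer may fire as early as 950 ms, before the io_loop's call_later is due, so the eventloop would return and then block again with nothing to wake it. The helpers below (`coarse_fire_window_ms`, `exit_delay_ms`, `EXIT_OFFSET_MS`) are hypothetical and only model the numbers described in the commit message; rounding up with `math.ceil` is a choice made for this sketch so float truncation cannot shave a millisecond off the delay.

```python
import math

# Hypothetical constant mirroring the 10 ms offset described above.
EXIT_OFFSET_MS = 10


def coarse_fire_window_ms(interval_ms: int) -> tuple:
    """Range in which a coarse timer may fire: up to 5% early or late."""
    slack = interval_ms / 20  # 5% of the interval
    return (interval_ms - slack, interval_ms + slack)


def exit_delay_ms(stop_on_error_timeout: float) -> int:
    """Hypothetical helper: interval (ms) to start a precise timer with.

    Rounds up so truncation never makes the timer fire early, then adds
    the safety offset so the eventloop exits only once the
    stop_on_error_timeout call_later is already due.
    """
    return math.ceil(stop_on_error_timeout * 1000) + EXIT_OFFSET_MS


print(coarse_fire_window_ms(1000))  # (950.0, 1050.0)
print(exit_delay_ms(0.5))           # 510
```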

blink1073

This looks great, thank you!

blink1073 · 2024-02-15T02:20:23Z

@ccordoba12 did you want to look at this as well?

ccordoba12 · 2024-02-15T16:19:14Z

Yes, thanks for the ping @blink1073! I'll run our test suite against this PR and report back any issues in a couple of days.

ccordoba12 · 2024-02-19T12:53:50Z

@blink1073, just so you know, @jdranczewski is helping us to try to solve an additional issue in Spyder (referenced above), so this could take a bit longer than expected.

blink1073 · 2024-02-19T12:54:38Z

Understood, thank you both!

jdranczewski · 2024-02-24T17:44:52Z

Hi @blink1073, while investigating spyder-ide/spyder#21299 with @ccordoba12 I found that it's not caused by #1202 but likely a slightly different issue with how ipykernel schedules the next eventloop run.

schedule_next in kernelbase.py schedules advance_eventloop with io_loop.call_later, which means it will be called soon, but usually after the msg_queue has been dealt with. I've found that if the ZMQ socket event for a dispatch_shell arrives before we enter the eventloop, there is sometimes a race condition between the advance_eventloop call and process_one running the dispatch_shell call.

In a situation like this, schedule_dispatch is called and the dispatch_shell call is put on the msg_queue. Normally, advance_eventloop guards against entering the eventloop while there are unprocessed messages (see https://github.com/ipython/ipykernel/blob/v6.29.2/ipykernel/kernelbase.py#L480-L484). Unfortunately, since process_one has already registered a getter on the queue, the event is consumed immediately, the queue size is reported as zero, and the eventloop is entered before dispatch_shell can run. As a result the kernel appears to hang: it has entered the eventloop and won't process the shell request until the next ZMQ event is received.
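The effect of an already-registered getter can be reproduced with a small model. `HandoffQueue` below is a hypothetical, minimal queue written for this example; it imitates the behaviour of `tornado.queues.Queue.put_nowait`, which hands an item directly to a waiting getter instead of storing it. The guard check is run between the put and the consumer resuming, which is the window described above.

```python
import asyncio
from collections import deque


class HandoffQueue:
    """Hypothetical minimal queue modelled on tornado.queues.Queue.put_nowait.

    If a consumer is already waiting in get(), the item is handed straight
    to that consumer's future and never sits in the queue, so qsize() is 0.
    """

    def __init__(self):
        self._items = deque()
        self._getters = deque()

    def qsize(self) -> int:
        return len(self._items)

    def put_nowait(self, item) -> None:
        if self._getters:
            self._getters.popleft().set_result(item)  # direct handoff
        else:
            self._items.append(item)

    async def get(self):
        if self._items:
            return self._items.popleft()
        fut = asyncio.get_running_loop().create_future()
        self._getters.append(fut)
        return await fut


async def race() -> list:
    msg_queue = HandoffQueue()
    log = []

    async def process_one():
        # Like the kernel's dispatch loop, this is already parked in get().
        msg = await msg_queue.get()
        log.append(f"dispatch {msg}")

    consumer = asyncio.create_task(process_one())
    await asyncio.sleep(0)  # let process_one register its getter

    msg_queue.put_nowait("execute_request")  # the shell message arrives
    # A qsize()-based guard running now cannot see the pending message.
    log.append(f"guard sees qsize={msg_queue.qsize()}")
    await consumer
    return log


print(asyncio.run(race()))
# ['guard sees qsize=0', 'dispatch execute_request']
```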

I have not been able to meaningfully reproduce this in Jupyter, so this race condition may be fairly rare, but it can be reproduced in Spyder by spamming the run button until a request arrives at just the right time, and for some users it apparently arises naturally. A fix I could suggest is putting the call that advances the eventloop on the message queue, so that it is guaranteed to run only after any pending dispatch events have been processed: jdranczewski@30da44a. What do you think of doing it this way? Do you think it should be a separate PR (as it solves a different problem than stop_on_error_timeout not working right), or would it fall under the umbrella of this one as 'eventloop scheduling'?
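A sketch of why queueing the advance closes the race, under simplifying assumptions: a plain FIFO `asyncio.Queue` stands in for the kernel's msg_queue, a single loop stands in for process_one, and the handler names are illustrative rather than taken from the linked commit. Because `advance_eventloop` now runs inside the consumer, no getter is parked on the queue while it checks `qsize()`, so a message that has arrived is always counted and the advance is deferred until after it.

```python
import asyncio


async def demo() -> list:
    msg_queue: asyncio.Queue = asyncio.Queue()
    log = []

    def dispatch_shell():
        log.append("dispatch_shell")

    def advance_eventloop():
        # Runs inside the consumer, so no getter is parked on the queue and
        # every message that has arrived is still counted by qsize().
        if msg_queue.qsize():
            log.append("advance deferred")
            schedule_next()  # retry after the pending messages
            return
        log.append("enter eventloop")

    def schedule_next():
        # Sketch of the idea: queue the advance instead of io_loop.call_later.
        msg_queue.put_nowait(advance_eventloop)

    schedule_next()
    msg_queue.put_nowait(dispatch_shell)  # request lands before the advance runs

    # Single consumer, playing the role of process_one in a loop.
    while "enter eventloop" not in log:
        handler = await msg_queue.get()
        handler()
    return log


result = asyncio.run(demo())
print(result)  # ['advance deferred', 'dispatch_shell', 'enter eventloop']
```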

blink1073 · 2024-02-24T21:22:14Z

I'd say if it is easier for you to combine them in this PR, that is fine.

jdranczewski · 2024-02-24T23:50:38Z

Ok, in that case I think it will be easier to just continue with this PR.

@ccordoba12 I've pushed the fix to this branch if you would like to run the Spyder test suite against it.

I'm not sure why the linting test started failing; I have not touched the line it's now failing at...

blink1073 · 2024-02-25T01:08:43Z

The lint failure is unrelated, from a typings change in ipython.

ccordoba12 · 2024-02-25T16:11:30Z

Thanks for the update @jdranczewski! I opened spyder-ide/spyder#21834 to test your work on our side.

ccordoba12 · 2024-02-25T18:49:10Z

@blink1073, good news! Our tests are passing without issues, so this should be ready to be merged.

Also, we'd appreciate it if you could release a new IPykernel version with this fix so we can depend on it in our next Spyder version. Thanks!

blink1073 · 2024-02-25T18:57:04Z

Excellent! I'll make a bug release tomorrow.

jdranczewski added 5 commits February 3, 2024 22:37

_abort_queues check for timeout and schedule eventloop exit

c9fd098

See issue ipython#1202 - this implements the base logic, eventloops need to implement `_schedule_exit` themselves.

Implement _schedule_exit for Qt eventloop

17947b3

See issue ipython#1202, allows for returning to the kernel io_loop on a timeout

Replace QTimer object with static method

eb4db2a

The QTimer was introduced in d4755f6 and then changed in issue ipython#990 to close a memory leak. This change makes the code more concise by not instancing a QTimer object and calling the static singleShot method instead.

blink1073 added the enhancement label Feb 15, 2024

Merge branch 'main' into main

a8aa24b

blink1073 approved these changes Feb 15, 2024

ccordoba12 mentioned this pull request Feb 15, 2024

IPython kernel hangs when using other graphics backend than "inline" spyder-ide/spyder#21299

Closed

Clean up Qt objects

1d43903

If the qt loop is disabled through `%gui` and then enabled again, _qt_notifier and _qt_timer stay as references to dead Qt objects, causing an error
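The failure mode is a stale cache: helpers created lazily and stored on the kernel outlive the Qt objects they wrap. The sketch below shows one way to avoid that, by dropping the cached attributes when the loop is disabled so the next enable rebuilds them. `FakeQtObject`, `KernelStub`, `enable_qt` and `disable_qt` are hypothetical stand-ins (no Qt is imported), and this is not necessarily how the commit implements the cleanup.

```python
class FakeQtObject:
    """Hypothetical stand-in for a QSocketNotifier or QTimer."""

    def __init__(self):
        self.alive = True


class KernelStub:
    """Hypothetical kernel object that caches Qt helpers as attributes."""


def enable_qt(kernel) -> None:
    # Helpers are created lazily and cached on the kernel.
    if not hasattr(kernel, "_qt_notifier"):
        kernel._qt_notifier = FakeQtObject()
    if not hasattr(kernel, "_qt_timer"):
        kernel._qt_timer = FakeQtObject()


def disable_qt(kernel) -> None:
    # When the Qt side shuts down, the wrapped objects are destroyed.
    # Dropping the cached references forces enable_qt to rebuild them
    # instead of reusing dead wrappers.
    for name in ("_qt_notifier", "_qt_timer"):
        obj = getattr(kernel, name, None)
        if obj is not None:
            obj.alive = False
            delattr(kernel, name)


kernel = KernelStub()
enable_qt(kernel)
first_timer = kernel._qt_timer
disable_qt(kernel)
enable_qt(kernel)
print(kernel._qt_timer is first_timer, kernel._qt_timer.alive)  # False True
```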

blink1073 mentioned this pull request Feb 19, 2024

Replace Tornado with AnyIO #1079

Merged

jdranczewski added 2 commits February 24, 2024 23:33

Use msg_queue for scheduling eventloop advances

a9db81e

This ensures that the msg_queue is truly empty before entering the eventloop, fixing a possible race condition where process_one has consumed a dispatch but not executed it yet. See spyder-ide/spyder#21299 and ipython#1212

Merge branch 'main' into main

73f004f

jdranczewski changed the title ~~Schedule eventloop exits for stop_on_error_timeout~~ Eventloop scheduling improvements for stop_on_error_timeout and schedule_next Feb 24, 2024

ccordoba12 mentioned this pull request Feb 25, 2024

PR: Fix hangs when using Matplotlib interactive backends (IPython console) spyder-ide/spyder#21834

Merged

blink1073 merged commit de2221c into ipython:main Feb 25, 2024
31 of 33 checks passed

jdranczewski mentioned this pull request Feb 27, 2024

Next cell does not execute after an exception is thrown jupyter/notebook#6526

Closed

