
Navigating in the JupyterLab UI can prevent idle kernels from being culled. #1360

Closed
ojarjur opened this issue Nov 14, 2023 · 11 comments · Fixed by #1429

@ojarjur
Contributor

ojarjur commented Nov 14, 2023

Description

When a Jupyter server is configured to cull idle kernels even when they have connections, JupyterLab's interactions with the Jupyter server can prevent this culling from actually happening.

The simplest such case is when the user navigates between different tabs within JupyterLab, but I've also seen this when refreshing the browser tab that the JupyterLab UI is in.

The root issue seems to be an unexpected interaction between the behavior of the JupyterLab UI and the behavior of the kernel activity tracking.

Specifically, the JupyterLab UI will (on certain events) send a debug_request on the control channel to the running kernel, which results in both a debug_reply on the control channel from the kernel and multiple status messages (in my case, I keep seeing two) from the kernel on the iopub channel.

This causes problems because the server appears to update the kernel's last_activity timestamp on every IOPub message, rather than just on execute_input/execute_result messages.

That, in turn, prevents the kernel from being considered idle, which then prevents it from being culled.

I am very surprised to see that status messages update the last_activity timestamp. I would have expected those to be no-ops from the standpoint of activity tracking.

Reproduce

  1. Set the following config options, either in a config file or on the jupyter lab ... command line (a config-file sketch follows these steps):

    1. MappingKernelManager.cull_idle_timeout = 900
    2. MappingKernelManager.cull_connected = True
    3. MappingKernelManager.cull_interval = 10
  2. Open the JupyterLab UI, open a new notebook, then close the JupyterLab UI and instead open the .../api/kernels page in your browser.

  3. After 15 minutes refresh the .../api/kernels page and see that the idle kernel has been culled.

  4. Then re-open the JupyterLab UI, open a new notebook, and also open a new launcher tab, so that both tabs are open at the same time.

  5. Without ever executing anything in the notebook, switch between the two tabs in JupyterLab once every minute.

  6. Note that the kernel is never culled.

  7. Open the .../api/kernels page and note that every time you switch tabs, the kernel's last_activity timestamp is updated.
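
For reference, here is step 1 written as a jupyter_server_config.py file; the traits are taken directly from the list above, and the equivalent command-line flags are --MappingKernelManager.cull_idle_timeout=900 and so on:

```python
# jupyter_server_config.py -- the culling settings from step 1 of the repro.
c = get_config()  # noqa: F821 -- provided by Jupyter when it loads this file

c.MappingKernelManager.cull_idle_timeout = 900  # consider kernels cullable after 15 idle minutes
c.MappingKernelManager.cull_connected = True    # cull even if a client is still connected
c.MappingKernelManager.cull_interval = 10       # check for culling candidates every 10 seconds
```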

Expected behavior

I would expect the last_activity timestamp not to be updated by the status messages sent by the kernel.

Context

  • Operating System and version: Debian 13
  • Browser and version: Chrome 118
  • Jupyter Server version: 2.9.1 (JupyterLab version 4.0.7)
@ojarjur
Contributor Author

ojarjur commented Nov 16, 2023

We discussed this in the weekly meeting today.

To make the activity tracking work as expected, we need to keep track of the last time we knew the kernel was active in response to a user request.

@Zsailer suggested that the message types that should be categorized as user requests are:

  • execute_request
  • complete_request
  • execute_input
  • execute_reply
  • inspect_request

On top of that, status messages are important when they are used to track whether or not a previous user request is still running.

Note that this seems to imply that, somewhere, we need to keep track of whether or not the kernel was previously running a user request, so that we can continue to treat the kernel as active until we get confirmation from the kernel that it has switched back to idle.

My current thinking is to add a flag (so that users can preserve the existing behavior), and then change the MappingKernelManager code to not update the last activity timestamp if:

  1. The new behavior is being used,
  2. The iopub message was not one of those types listed above,
  3. The parent of the iopub message was not one of those types, and
  4. We know that the kernel has been idle since the last such user message was sent.

For the last check we would err on the side of updating the activity timestamp rather than erring on the side of not updating it.

This last-activity logic currently exists in the MappingKernelManager, but the tracking of whether or not the kernel has been idle probably belongs in the ServerKernelManager. I don't know yet whether it makes sense to move all of the logic into ServerKernelManager, or whether it would be better to split the logic between the two places.
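
To make that concrete, here is a hypothetical sketch of the check described above; the names (USER_MESSAGE_TYPES, should_update_last_activity, and the track_user_activity_only flag) are illustrative only, not the actual jupyter_server API:

```python
# Hypothetical sketch only -- not the actual MappingKernelManager code.
USER_MESSAGE_TYPES = {
    "execute_request",
    "complete_request",
    "execute_input",
    "execute_reply",
    "inspect_request",
}


def should_update_last_activity(msg_type, parent_msg_type, kernel_was_idle, track_user_activity_only):
    """Decide whether an iopub message should bump the kernel's last_activity."""
    if not track_user_activity_only:           # condition 1: old behavior requested
        return True
    if msg_type in USER_MESSAGE_TYPES:         # condition 2: the message itself is user activity
        return True
    if parent_msg_type in USER_MESSAGE_TYPES:  # condition 3: e.g. a status spawned by an execute_request
        return True
    # Condition 4: only skip the update when we are sure the kernel has been idle
    # since the last user message -- i.e. err on the side of updating.
    return not kernel_was_idle
```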

@ojarjur
Contributor Author

ojarjur commented Nov 16, 2023

I'm now leaning towards not having a flag for preserving the old behavior.

It seems that the old behavior was simply incorrect rather than just representing a differing opinion on what constitutes "activity".

In particular, the status message contents are used to update the execution_state of the kernel regardless of the parent message.

This means control messages would cause the execution_state to switch to idle even if there is a long-running code cell executing.

I'm going to put together a PR with my understanding of the right way to fix this, and share it here for early feedback. The fix itself looks relatively simple, but there will be a lot of work required for testing it.

@ojarjur
Contributor Author

ojarjur commented Nov 17, 2023

@Zsailer thanks for your offer during today's community meeting to discuss the details about how to fix this issue.

I've put together what I am now relatively confident is the right way to resolve this issue and would like to solicit feedback on it. For now this is just in a commit here.

Once I get some more time to work on it, I'll add some tests for the various edge-case scenarios that I can think of and then send it out in a pull request.

@jasongrout
Contributor

@Zsailer suggested that the message types that should be categorized as user requests are:

I would also add comm_open, comm_msg, and comm_close messages. For example, a user might have an ipywidgets gui up and just be manipulating the widgets, which will send comm messages, but not any of those other types of messages.

@jasongrout
Contributor

I think there are two issues being conflated in the busy/idle message problem brought up here.

For activity tracking, we are trying to distinguish between regular system-spawned messages, and messages that convey user activity intent. In execute_request, we have the notion of a "silent" execution that sort of tries to do this - a system can send a "silent" execution to slip an execution in between other user activity, and not have the user activity (like the execution counter) be disrupted by the system execution. Perhaps we could use some flag like that for all messages? If the flag was in the header metadata, then it would be carried into all responses as well. That would let the originator of a message indicate whether it indicated user activity or not. In this example, some debug messages do indicate user activity (for example, stepping through a function), whereas others may not (like an inspect_variables request to auto-update a variable listing). We see the same for comm messages - some comm messages may be system-generated from frontend components, while others may indicate actual user activity, and you can't tell just by looking at the message type.
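
For reference, this is roughly what the existing "silent" escape hatch looks like from a client using jupyter_client; the connection-file name and the code being run are made up for the example, but silent and store_history are real execute_request fields:

```python
from jupyter_client import BlockingKernelClient

client = BlockingKernelClient(connection_file="kernel-1234.json")  # hypothetical connection file
client.load_connection_file()
client.start_channels()

# A "silent" execution: run as quietly as possible, without disturbing the
# user's execution counter or history -- intended for system-generated code
# rather than user activity.
client.execute("refresh_variable_listing()", silent=True, store_history=False)
```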

The other issue is that it is not easy to determine exactly what is busy/idle from the busy/idle message. One of the issues here is that messages to the control channel spawn busy/idle messages on the iopub channel. However, it is not clear from the busy/idle message that the control thread is busy, rather than the shell thread. So unless you maintain a mapping of messages that can be sent to either channel (and the frontend respects that mapping), you don't know if the shell channel is actually busy or not.
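
To illustrate the ambiguity, here is roughly the shape of such a status message (field values are placeholders); nothing in the content says which thread is busy, and only the parent_header hints at where the triggering request came from:

```python
status_msg = {
    "header": {"msg_type": "status", "msg_id": "<uuid>", "session": "<session-id>"},
    # The parent_header echoes the request that triggered this status update --
    # here, a debug_request that was sent on the control channel.
    "parent_header": {"msg_type": "debug_request", "msg_id": "<uuid-of-request>"},
    "metadata": {},
    "content": {"execution_state": "busy"},  # or "idle" / "starting"
}
```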

Should we have busy/idle messages for activity on the non-shell channel? Or should busy/idle messages carry clear info about which thread is going to be busy/idle? This becomes even more important to solve if we make progress on the subshell JEP, where messages on the shell channel may cause only one subshell to be busy.

Note that only the kernel can really tell if the shell thread is going to be busy. For example, old ipykernel versions had a single thread managing both control and shell, while newer versions have separate threads for control and shell. In the old ipykernel, a control message arguably should issue a busy/idle message for the shell channel, while in new ipykernel, control messages do not block the shell channel.

@jasongrout
Contributor

That said, both of the points I brought up require more fundamental changes to the protocol, or at least to the conventions we use within it. Inferring user activity from the message type may be the best temporary fix, though I'd be wary of ramifications such as:

  1. if a user is stepping through a function slowly, looking and thinking about the code for a long time, can things time out?
  2. If a user is just working with an ipywidgets gui (e.g., sending ipywidgets comm messages), can things time out?

@ojarjur
Contributor Author

ojarjur commented Nov 17, 2023

@jasongrout Thanks for the detailed and informative responses!

You've convinced me that there are two separate issues in play here, so I've updated my change to try to address them separately:

  1. I think the existing status tracking is just a bug, because unrelated busy/idle messages are conflated. My first version did that too, but I've now switched to tracking the parent message of any busy status and matching the corresponding idle status against it; the status only switches back to idle once every previously tracked busy status has been followed by a matching idle status (see the sketch after this list).
  2. I see now that the decision about which messages to include in activity tracking is far more subtle than I initially thought. The existing logic treats all messages as user activity, whereas my initial change treated only a hard-coded list of message types as user activity. I see what you mean about neither approach being 100% correct. As a result, I've kept an allowlist of message types to treat as user activity, but made it configurable so that users can adjust it as they see fit, including reverting to the previous behavior (by allowlisting all possible message types).
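
Here is a rough sketch of the parent-tracking idea from point 1; the class and method names are illustrative only, not the actual code in the linked commit:

```python
class ExecutionStateTracker:
    """Track busy/idle status messages per parent request."""

    def __init__(self):
        self._busy_parents = set()    # parent msg_ids with an outstanding "busy" status
        self.execution_state = "idle"

    def on_status(self, msg):
        parent_id = msg["parent_header"].get("msg_id")
        state = msg["content"]["execution_state"]
        if state == "busy":
            self._busy_parents.add(parent_id)
        elif state == "idle":
            self._busy_parents.discard(parent_id)
        # Only report idle once every tracked "busy" has been matched by an "idle",
        # so a control-channel debug_request can no longer mask a long-running cell.
        self.execution_state = "busy" if self._busy_parents else "idle"
```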

As before, any additional feedback/comments are much appreciated.

Thanks!

@ojarjur
Contributor Author

ojarjur commented Nov 29, 2023

I would also add comm_open, comm_msg, and comm_close messages. 

@jasongrout Thanks for the suggestion. I've been digging into this a bit more and have a follow-up question:

Are you sure that comm_open and comm_close messages should be treated as user actions? Reading through the message docs it seems that the comm_open and comm_close messages are meant to set up the communication channel, but that actual user interaction would correspond to comm_msg messages.

Is my understanding about that incorrect?

@miloaissatu

Related issue I raised (on the wrong project :)): jupyter/notebook#7260

I looked at @ojarjur's changes; they look mostly the same as what I attempted to achieve in my example logic.

My two cents: I think having a configurable list of message types is a decent middle ground; if it is not specified, the existing behaviour is maintained until there's a better distinction between user activity and control activity.

With cull_connected=True, I think the current list of message types that @ojarjur has is probably fair game in regard to idle culling.

  1. if a user is stepping through a function slowly, looking and thinking about the code for a long time, can things time out?
  2. If a user is just working with an ipywidgets gui (e.g., sending ipywidgets comm messages), can things time out?

In terms of ramifications such as these: if used in conjunction with longer cull_idle_timeout values (several hours or more), they would be unlikely to affect a user's experience; and again, if the list of message types is configurable, then it's up to the respective Jupyter Server admins to tailor it to their infrastructure and user expectations.

@oliver-sanders
Contributor

oliver-sanders commented Apr 10, 2024

To make the activity tracking work as expected, we need to keep track of the last time we knew the kernel was active in response to a user request.

@Zsailer suggested that the message types that should be categorized as user requests are:

execute_request
complete_request
execute_input
execute_reply
inspect_request

A kernel could be "active" whilst performing a long-lived operation, e.g. some heavy processing, watching files for changes, monitoring a socket, etc. So it might be possible for a kernel to appear inactive but still be in use, i.e. the user might still have the terminal open in the UI.

It depends on what we consider to be an active kernel:

  • A kernel that is generating output or receiving user input (within a timeframe).
  • OR, a kernel that the user has open in the UI (within a timeframe).

It might be worth mentioning that websockets have a ping-pong mechanism which can be used to detect broken websocket connections. This ping can identify broken/closed websocket connections even when the connection is quiet (because pings are sent according to a set interval rather than when there is data to send). Currently Jupyter Server does not configure a default ping interval, so there is no ping at all. I don't think there's any early detection of client-severed websocket connections because of this.

The default ping interval is now exposed as a Jupyter Server config: #1391

So a valid approach to determining whether a kernel is "active" or "idle" might be to configure a default ping interval (to ensure client-severed kernel connections are closed from the server end) and count the number of websockets connected to the kernel. If no websockets have been open on a kernel within a configured time window, then no user has the kernel open in the UI.
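
A hypothetical sketch of that approach; the registry and helper names are illustrative, not real jupyter_server structures:

```python
from datetime import datetime, timedelta, timezone

CULL_WINDOW = timedelta(minutes=15)


def kernel_looks_abandoned(kernel_id, websockets_by_kernel, last_disconnect_time):
    """Return True if no client has had the kernel open in the UI within the window."""
    if websockets_by_kernel.get(kernel_id):
        # At least one live websocket (kept honest by ping/pong) is still attached.
        return False
    last_seen = last_disconnect_time.get(kernel_id)
    return last_seen is not None and datetime.now(timezone.utc) - last_seen > CULL_WINDOW
```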

In a Jupyter Server extension that I maintain, we use this mechanism to handle the same issue. In our case, the websockets are connected to GraphQL subscriptions rather than kernels, so it's a different wss subprotocol, but the principle is the same, i.e. some long-lived server-side process that sends data back to the client when events occur. We kill our "subscriptions" when the last connected websocket connection is closed (i.e. when the user isn't using it any more).

@Zsailer
Member

Zsailer commented Jun 6, 2024

It might be worth mentioning that websockets have a ping-pong mechanism which can be used to detect broken websocket connections. This ping can identify broken/closed websocket connections even when the connection is quiet (because pings are sent according to a set interval rather than when there is data to send). Currently Jupyter Server does not configure a default ping interval, so there is no ping at all. I don't think there's any early detection of client-severed websocket connections because of this.

The default ping interval is now exposed as a Jupyter Server config: #1391

Thanks for adding this PR, but I think this was always configurable through the tornado_settings trait, by passing ws_ping_interval in that dict. The default ping interval is 30 seconds.
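
For anyone landing here later, a sketch of that configuration; the ws_ping_interval key comes from this thread, and the units are assumed to be milliseconds, so check both against your jupyter_server version:

```python
# jupyter_server_config.py -- enable periodic websocket pings so that
# client-severed connections get detected and closed from the server side.
c.ServerApp.tornado_settings = {
    "ws_ping_interval": 30000,  # assumed milliseconds, i.e. a ping every 30 seconds
}
```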
