
Large notebook issues with saving #17017

Open · mlucool opened this issue Nov 29, 2024 · 5 comments

@mlucool (Contributor) commented Nov 29, 2024

Description

Large notebooks take a long time to save, both making the heap very large temporarily and blocking the main thread. While this may be considered OK when a user explicitly requests a save, autosave means it happens periodically and outside the user's control, freezing the UI.

Reproduce

Create a large notebook with a lot of strings (or any notebook that ends up being a few hundred MB). This happens both with and without ydoc.

[Screenshot: profiler capture of a save operation]

Expected behavior

Ideally the following is true:

  1. Notebooks don't autosave when nothing has changed.
  2. The main thread is minimally blocked. Maybe we can send diffs when using ydoc, or maybe JSON.stringify could be streamed or moved to a worker thread (a rough sketch follows this list).
  3. Browser memory doesn't roughly double during a save. Today, saving creates one big allocation that is only cleaned up later.
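
A minimal sketch of the worker idea, assuming a standalone Web Worker built from a Blob; none of this is existing JupyterLab code, and note that postMessage still structured-clones the model once, so peak memory may not improve:

// Hypothetical sketch: move JSON.stringify off the main thread.
const workerSource = `
  self.onmessage = (event) => {
    // Serialization happens off the main thread.
    self.postMessage(JSON.stringify(event.data));
  };
`;
const worker = new Worker(
  URL.createObjectURL(new Blob([workerSource], { type: 'text/javascript' }))
);

function stringifyInWorker(model: unknown): Promise<string> {
  return new Promise((resolve, reject) => {
    worker.onmessage = (event: MessageEvent<string>) => resolve(event.data);
    worker.onerror = reject;
    // postMessage structured-clones the model, so it is copied once here.
    worker.postMessage(model);
  });
}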

Context

  • Operating System and version: Windows
  • Browser and version: Chrome
  • JupyterLab version: Version 4.3.0b3
@mlucool mlucool added the bug label Nov 29, 2024
@jupyterlab-probot jupyterlab-probot bot added the status:Needs Triage Applied to new issues that need triage label Nov 29, 2024
@krassowski krassowski self-assigned this Dec 3, 2024
@JasonWeill JasonWeill removed the status:Needs Triage Applied to new issues that need triage label Dec 3, 2024
@krassowski (Member) commented

Reproducer from the screenshot:

for i in range(10**8):
    print("qwertyuiopasdfghjklzxcvbnm")

Note: on my machine, generating the data with the reproducer requires increasing the iopub limits. Even at 10**6 iterations I see the server throttling the outputs:

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)

To effectively disable the iopub limits I restarted the server with --ServerApp.iopub_data_rate_limit=1000000000.0, but then the websocket connection crashed when I tried to run the reproducer.

@krassowski (Member) commented

I was able to reproduce this with a notebook of 100 MB. Indeed, significant time is spent just serializing the notebook content to JSON in:

async save(
  localPath: string,
  options: Partial<Contents.IModel> = {}
): Promise<Contents.IModel> {
  const settings = this.serverSettings;
  const url = this._getUrl(localPath);
  const init = {
    method: 'PUT',
    body: JSON.stringify(options)
  };

With RTC enabled this save is effectively a no-op, yet it still wastes a lot of time on the main thread. #16900 could solve it for the RTC case.
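
The "send diffs" idea could, for example, build on Yjs incremental updates instead of full snapshots. A sketch assuming a plain Y.Doc, not JupyterLab's actual RTC plumbing; the endpoint URL is made up:

import * as Y from 'yjs';

// Hypothetical sketch: ship small incremental updates instead of
// re-serializing the whole document on every autosave.
const doc = new Y.Doc();

doc.on('update', (update: Uint8Array) => {
  // `update` encodes only the changes since the last event, so the
  // payload stays proportional to the edit, not the notebook size.
  void fetch('/api/collaboration/doc', { method: 'PUT', body: update });
});

// Given a peer's state vector, encode only the changes it is missing.
declare const remoteStateVector: Uint8Array; // received from the peer
const diffForPeer = Y.encodeStateAsUpdate(doc, remoteStateVector);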

@krassowski (Member) commented Dec 4, 2024

“or maybe JSON.stringify can be stream”

Technically, the standards have allowed using a ReadableStream as a fetch request body since whatwg/fetch#425. There are some obstacles to adoption:

We could still implement it with the jupyverse stack, but it would require many defensive conditions to avoid failing on the most popular stack.
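
For illustration, a sketch of what an upload-streaming save could look like. It assumes a browser that supports request-body streaming (support is still limited) and uses the duplex: 'half' member that streaming uploads require; the function names are made up:

// Hypothetical sketch of a streaming PUT per whatwg/fetch#425.
function makeBodyStream(chunks: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let i = 0;
  return new ReadableStream({
    pull(controller) {
      if (i < chunks.length) {
        controller.enqueue(encoder.encode(chunks[i++]));
      } else {
        controller.close();
      }
    }
  });
}

async function saveStreaming(url: string, jsonChunks: string[]): Promise<void> {
  const init = {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: makeBodyStream(jsonChunks),
    // Required for streaming request bodies; not yet in all TS DOM types.
    duplex: 'half'
  };
  await fetch(url, init as RequestInit);
}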

@krassowski (Member) commented

We can break up the JSON.stringify call by serializing one cell at a time and yielding to the main thread (or by using a web worker), but it looks like this will only somewhat reduce the blocking time: for a ~100 MB notebook, twice as much time is spent in the Request constructor and the browser's fetch call:

[Screenshot: profiler trace showing time spent in the Request constructor and fetch]
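
For reference, a sketch of the per-cell serialization idea (not JupyterLab's actual code; the notebook shape follows nbformat's top-level keys):

// Hypothetical sketch: serialize one cell at a time and yield between
// cells so long saves do not freeze the UI.
async function stringifyNotebookChunked(nb: {
  cells: unknown[];
  metadata: unknown;
  nbformat: number;
  nbformat_minor: number;
}): Promise<string> {
  const parts: string[] = [];
  for (const cell of nb.cells) {
    parts.push(JSON.stringify(cell));
    // Yield to the event loop so input handling and rendering can run.
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  return (
    `{"cells":[${parts.join(',')}],` +
    `"metadata":${JSON.stringify(nb.metadata)},` +
    `"nbformat":${nb.nbformat},"nbformat_minor":${nb.nbformat_minor}}`
  );
}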

handleRequest makes a copy of the init options, which might contribute to the peak memory usage:

const request = new settings.Request(url, { ...settings.init, ...init });

@krassowski (Member) commented

  “1. Notebooks don't autosave when nothing has changed.”

In principle this sounds right. However, saving has some side effects:

  • a) the modification date is updated
  • b) if the content changed on disk, this is detected and the user is given an option to reload from disk

One could argue that these side effects should only happen when the user triggers a save manually. I think this is right, though for (b) we may want to poll in the background so that users are not left in the dark and later presented with a conflict if they worked on an outdated version of the file. A sketch of skipping unchanged autosaves follows.
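
For illustration only, a sketch of the skip-when-clean idea, assuming a context shaped roughly like DocumentRegistry.Context; the background-poll comment is a placeholder, not an existing API:

// Hypothetical sketch: skip the periodic autosave when nothing changed.
async function autosaveTick(context: {
  model: { dirty: boolean };
  save(): Promise<void>;
}): Promise<void> {
  if (!context.model.dirty) {
    // Nothing changed locally; avoid the costly serialize + PUT entirely.
    // A lightweight poll of the file's mtime could still detect on-disk
    // changes here and prompt the user to reload.
    return;
  }
  await context.save();
}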
