
Large notebook issues with saving #17017

Open · mlucool opened this issue Nov 29, 2024 · 5 comments

@mlucool (Contributor) commented Nov 29, 2024

Description

Large notebooks take a long time to save, both making the heap very large temporarily and blocking the main thread. While this may be considered OK when a user explicitly requests a save, autosave means it happens periodically and outside the user's control, freezing the UI.

Reproduce

Create a large notebook with a lot of strings (or any notebook that ends up being a few hundred MB). This happens both with and without ydoc.

[Screenshot: profiler capture of a save operation]

Expected behavior

Ideally the following is true:

  1. Notebooks don't autosave when nothing has changed.
  2. The main thread is minimally blocked. Maybe we can send diffs when using ydoc, or maybe JSON.stringify could be streamed or moved to a worker thread (a rough sketch follows this list).
  3. Browser memory doesn't roughly double during a save. Today, saving creates one big allocation that is only cleaned up later.
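
A minimal sketch of the worker idea, assuming a standalone Web Worker built from a Blob; none of this is existing JupyterLab code, and note that postMessage still structured-clones the model once, so peak memory may not improve:

// Hypothetical sketch: move JSON.stringify off the main thread.
const workerSource = `
  self.onmessage = (event) => {
    // Serialization happens off the main thread.
    self.postMessage(JSON.stringify(event.data));
  };
`;
const worker = new Worker(
  URL.createObjectURL(new Blob([workerSource], { type: 'text/javascript' }))
);

function stringifyInWorker(model: unknown): Promise<string> {
  return new Promise((resolve, reject) => {
    worker.onmessage = (event: MessageEvent<string>) => resolve(event.data);
    worker.onerror = reject;
    // postMessage structured-clones the model, so it is copied once here.
    worker.postMessage(model);
  });
}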

Context

  • Operating System and version: Windows
  • Browser and version: Chrome
  • JupyterLab version: Version 4.3.0b3
@mlucool mlucool added the bug label Nov 29, 2024
@jupyterlab-probot jupyterlab-probot bot added the status:Needs Triage Applied to new issues that need triage label Nov 29, 2024
@krassowski krassowski self-assigned this Dec 3, 2024
@JasonWeill JasonWeill removed the status:Needs Triage Applied to new issues that need triage label Dec 3, 2024
@krassowski (Member) commented

Reproducer from the screenshot:

for i in range(10**8):
    print("qwertyuiopasdfghjklzxcvbnm")

Note: on my machine, generating the data with the reproducer requires increasing the iopub limits. Even at 10**6 iterations I see the server throttling the outputs:

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_data_rate_limit`.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)

To effectively disable the iopub limits I restarted the server with --ServerApp.iopub_data_rate_limit=1000000000.0, but then the websocket connection crashed when I tried to run the reproducer.

@krassowski (Member) commented

I was able to reproduce this with a notebook of 100 MB. Indeed, significant time is spent just serializing the notebook content to JSON in:

async save(
  localPath: string,
  options: Partial<Contents.IModel> = {}
): Promise<Contents.IModel> {
  const settings = this.serverSettings;
  const url = this._getUrl(localPath);
  const init = {
    method: 'PUT',
    body: JSON.stringify(options)
  };

With RTC enabled this save is effectively a no-op, yet it still wastes a lot of time on the main thread. #16900 could solve it for the RTC case.
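
The "send diffs" idea could, for example, build on Yjs incremental updates instead of full snapshots. A sketch assuming a plain Y.Doc, not JupyterLab's actual RTC plumbing; the endpoint URL is made up:

import * as Y from 'yjs';

// Hypothetical sketch: ship small incremental updates instead of
// re-serializing the whole document on every autosave.
const doc = new Y.Doc();

doc.on('update', (update: Uint8Array) => {
  // `update` encodes only the changes since the last event, so the
  // payload stays proportional to the edit, not the notebook size.
  void fetch('/api/collaboration/doc', { method: 'PUT', body: update });
});

// Given a peer's state vector, encode only the changes it is missing.
declare const remoteStateVector: Uint8Array; // received from the peer
const diffForPeer = Y.encodeStateAsUpdate(doc, remoteStateVector);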

@krassowski (Member) commented Dec 4, 2024

“or maybe JSON.stringify can be stream”

Technically, the standards have allowed using a ReadableStream as a fetch request body since whatwg/fetch#425. There are some obstacles to adoption:

We could still implement it with the jupyverse stack, but it would require many defensive conditions to avoid failing on the most popular stack.
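
For illustration, a sketch of what an upload-streaming save could look like. It assumes a browser that supports request-body streaming (support is still limited) and uses the duplex: 'half' member that streaming uploads require; the function names are made up:

// Hypothetical sketch of a streaming PUT per whatwg/fetch#425.
function makeBodyStream(chunks: string[]): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  let i = 0;
  return new ReadableStream({
    pull(controller) {
      if (i < chunks.length) {
        controller.enqueue(encoder.encode(chunks[i++]));
      } else {
        controller.close();
      }
    }
  });
}

async function saveStreaming(url: string, jsonChunks: string[]): Promise<void> {
  const init = {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: makeBodyStream(jsonChunks),
    // Required for streaming request bodies; not yet in all TS DOM types.
    duplex: 'half'
  };
  await fetch(url, init as RequestInit);
}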

@krassowski (Member) commented

We can break up the JSON.stringify call by serializing one cell at a time and yielding to the main thread (or by using a web worker), but it looks like this will only somewhat reduce the blocking time: for a ~100 MB notebook, twice as much time is spent in the Request constructor and the browser's fetch call:

[Screenshot: profiler trace showing time spent in the Request constructor and fetch]
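
For reference, a sketch of the per-cell serialization idea (not JupyterLab's actual code; the notebook shape follows nbformat's top-level keys):

// Hypothetical sketch: serialize one cell at a time and yield between
// cells so long saves do not freeze the UI.
async function stringifyNotebookChunked(nb: {
  cells: unknown[];
  metadata: unknown;
  nbformat: number;
  nbformat_minor: number;
}): Promise<string> {
  const parts: string[] = [];
  for (const cell of nb.cells) {
    parts.push(JSON.stringify(cell));
    // Yield to the event loop so input handling and rendering can run.
    await new Promise(resolve => setTimeout(resolve, 0));
  }
  return (
    `{"cells":[${parts.join(',')}],` +
    `"metadata":${JSON.stringify(nb.metadata)},` +
    `"nbformat":${nb.nbformat},"nbformat_minor":${nb.nbformat_minor}}`
  );
}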

handleRequest makes a copy of the init options, which might contribute to the peak memory usage:

const request = new settings.Request(url, { ...settings.init, ...init });

@krassowski (Member) commented

  “1. Notebooks don't autosave when nothing has changed.”

In principle this sounds right. However, saving has some side effects:

  • a) the modification date is updated
  • b) if the content changed on disk, this is detected and the user is given an option to reload from disk

One could argue that these side effects should only happen when the user triggers a save manually. I think this is right, though for (b) we may want to poll in the background so that users are not left in the dark and later presented with a conflict if they worked on an outdated version of the file. A sketch of skipping unchanged autosaves follows.
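
For illustration only, a sketch of the skip-when-clean idea, assuming a context shaped roughly like DocumentRegistry.Context; the background-poll comment is a placeholder, not an existing API:

// Hypothetical sketch: skip the periodic autosave when nothing changed.
async function autosaveTick(context: {
  model: { dirty: boolean };
  save(): Promise<void>;
}): Promise<void> {
  if (!context.model.dirty) {
    // Nothing changed locally; avoid the costly serialize + PUT entirely.
    // A lightweight poll of the file's mtime could still detect on-disk
    // changes here and prompt the user to reload.
    return;
  }
  await context.save();
}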
